What is Root Cause Analysis?
Root cause analysis (RCA) is a method of problem-solving used to investigate known problems and identify their antecedent and underlying causes. While the term root cause analysis seems to imply that issues have a singular cause, this is not always the case. Problems may have a single cause, or multiple causes stemming from deficiencies in products, people, processes or other factors.
When is Root Cause Analysis Used?
Root cause analysis is implemented as an investigative tool in a variety of industries. Engineers and product designers use an RCA technique known as failure analysis to proactively evaluate what conditions might cause a product or project to fail. RCA also has applications in healthcare. Doctors may implement RCA as an investigatory tool to help with a diagnosis, and epidemiologists can use RCA to trace the source of an infectious disease outbreak.
For IT organizations, root cause analysis is a key aspect of the cyber security incident response process. When a security breach occurs, SecOps teams must collaborate quickly to determine where the breach originated, isolate the vulnerability that caused the breach and initiate corrective and preventive actions to ensure the vulnerability cannot be exploited again.
What are the Types of Root Causes?
Root causes can be divided into three types.
- Physical causes - These are root causes that arise when a physical part of a system breaks down. They include hardware failures, system errors during boot, tools that do not function, and other tangible components breaking down.
- Human causes - These are root causes that arise from human errors or mistakes. For example, a person may lack the skills to operate systems properly, lack knowledge of the tools, introduce a programming error, or try to perform a task with the wrong tools.
- Organizational causes - These are root causes that arise from organizational issues. For example, a team lead may give incorrect instructions to team members, an organization may select the wrong people to perform tasks, or an organization may not manage or support its staff correctly.
Benefits of Root Cause Analysis
Conducting a root cause analysis has several benefits. The primary ones are:
- Clearly defining the problem that occurred.
- Reducing the number of errors that recur from the same root causes.
- Implementing tools and solutions to address future issues.
- Implementing logging and monitoring to catch potential future issues early.
- Putting processes in place for unforeseen issues, so your team can address them more quickly than before.
How to Do Root Cause Analysis
When investigating a cyber security incident, security operations teams must act quickly to identify and isolate the root cause of the event. The basic outline of the RCA process is identical across industries, regardless of the tools that individual practitioners choose to implement.
What is the 4-step process of root cause analysis?
A process for root cause analysis is described in the following four steps:
- Identification and description - The first step to a successful root cause analysis is the accurate identification and description of the problem. If the problem is poorly understood, it may prove difficult to correctly isolate its underlying causes. For IT operators responding to an automated alert from a security analytics tool, an initial problem statement could be "Our security system sent an alert". Accurate event descriptions also play an important role in RCA. The starting point for a successful analysis should be a collection of accurate event descriptions detailing everything that happened in connection with the problem.
- Chronology - Once IT operators have identified the problem and its associated events, they should arrange those events in chronological order, as in a timeline or sequence of events. This makes it easier to establish causal relationships between events connected to the problem. Organizations that leverage security analytics software can automate the collection of event logs and the integration of logs from multiple sources into a single, standardized format and platform. This streamlines the RCA process and helps these organizations reach step three of RCA sooner.
- Differentiation - Differentiation is the third step of the RCA process. Here, investigators incorporate additional contextual data surrounding the events to understand how events are correlated. When a cyber security event is detected, security operators must analyze dependencies between events to distinguish between root causes, causal factors and non-causal factors within the system. Using a data analysis technique called event correlation, enterprise security analytics tools can filter through high volumes of computer logs from a variety of different sources and pinpoint the ones that are most likely to be connected to the problem.
- Causal graphing - In the final step of the RCA process, investigators are encouraged to produce a causal graph, diagram or another visual interpretation of the result of the RCA process. Causal graphing illustrates a sequence of key events that begins with the root causes and ends with the problem. This exercise demonstrates the logical pathway that was followed to determine how the problem occurred.
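The chronology and causal graphing steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular security analytics product works: the two log sources, their field names (`ts`, `epoch`, `msg`), the sample events, and the helper names `normalize` and `root_causes` are all hypothetical. The edges of the causal graph are assumed to come from an analyst's judgment during the differentiation step; the code only records them and walks them.

```python
from datetime import datetime, timezone

# Hypothetical events from two log sources that use different timestamp formats.
firewall_events = [
    {"ts": "2024-03-01T10:05:00+00:00", "msg": "outbound connection to unknown host"},
    {"ts": "2024-03-01T09:58:00+00:00", "msg": "inbound connection on port 22"},
]
auth_events = [
    {"epoch": 1709287200, "msg": "root login succeeded"},  # 2024-03-01 10:00 UTC
]


def normalize(source, event):
    """Convert a source-specific event into one standardized record."""
    if "epoch" in event:
        when = datetime.fromtimestamp(event["epoch"], tz=timezone.utc)
    else:
        when = datetime.fromisoformat(event["ts"])
    return {"time": when, "source": source, "msg": event["msg"]}


# Chronology: merge all sources and sort them into a single timeline.
records = [normalize("firewall", e) for e in firewall_events]
records += [normalize("auth", e) for e in auth_events]
timeline = sorted(records, key=lambda record: record["time"])

# Causal graphing: each effect maps to the causes an analyst assigned to it.
# These edges are an assumption made for the example, not derived by the code.
causal_graph = {
    "outbound connection to unknown host": ["root login succeeded"],
    "root login succeeded": ["inbound connection on port 22"],
    "inbound connection on port 22": [],
}


def root_causes(graph, problem):
    """Walk backwards from the problem to every node with no recorded cause."""
    roots, stack, seen = [], [problem], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        causes = graph.get(node, [])
        if not causes:
            roots.append(node)
        stack.extend(causes)
    return roots


for record in timeline:
    print(record["time"].isoformat(), record["source"], record["msg"])
print(root_causes(causal_graph, "outbound connection to unknown host"))
```

Sorting on a normalized timestamp is what makes events from different sources comparable. The backwards walk mirrors the way a causal graph is read: start at the problem and follow the edges until no earlier cause is recorded.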
Root Cause Analysis Tools and Techniques
While the general process for root cause analysis remains consistent across industries, investigators differ in the tools and techniques that they use to get to the underlying source of a problem. Even security operators who can automate much of the RCA process with security analytics applications must be familiar with methodologies of root cause analysis to accurately interpret the causes of security events. Here are the two most important tools and methods for RCA in cloud computing environments:
Five Whys of Root Cause Analysis
The "Five Whys" method of root cause analysis is an investigative technique that encourages the practitioner to repeatedly ask "Why?" to get to the deepest chain of causation that leads to an incident, event or problem. When a problem is observed, we can rarely get to the root cause after a single iteration of asking "Why did this happen?" We may have to go through several layers of questioning to understand the root cause of an event and identify an opportunity for corrective actions. Use this example as a template for conducting Five Whys RCA:
Problem Statement: The company data server was infected with malware.
- Why? The server was not updated with the latest malware definitions for our anti-malware application.
- Why? The automated server that deploys the updates is not operational.
- Why? The automated server broke last month and it hasn't been repaired or replaced.
- Why? The person responsible for approving the repair or replacement is on vacation and there was inadequate communication about who should cover change approvals.
- Why? Lack of process.
Solution: Create a process to ensure that repairs can be approved, even when the normal approving person is away.
This simple example illustrates the depth of questioning that can be required to isolate the root cause.
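For teams that keep incident records in code or structured data, the Five Whys chain above could be captured roughly as follows. This is only a sketch: the `FiveWhys` class and its method names are hypothetical, and the answers are copied from the example above. It treats the last answer in the chain as the root cause, which is a simplification of the technique.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FiveWhys:
    """A problem statement plus the ordered answers to each 'Why?'."""
    problem: str
    answers: List[str] = field(default_factory=list)

    def ask_why(self, answer: str) -> None:
        """Record the answer to the next 'Why?' in the chain."""
        self.answers.append(answer)

    def root_cause(self) -> Optional[str]:
        """Treat the deepest answer recorded so far as the root cause."""
        return self.answers[-1] if self.answers else None

    def render(self) -> str:
        """Format the chain in the same layout as the written example."""
        lines = [f"Problem Statement: {self.problem}"]
        lines += [f"- Why? {answer}" for answer in self.answers]
        return "\n".join(lines)


analysis = FiveWhys("The company data server was infected with malware.")
analysis.ask_why("The server was not updated with the latest malware definitions.")
analysis.ask_why("The automated server that deploys the updates is not operational.")
analysis.ask_why("The automated server broke last month and has not been repaired or replaced.")
analysis.ask_why("The approver is on vacation and nobody was assigned to cover change approvals.")
analysis.ask_why("Lack of process.")

print(analysis.render())
print("Root cause:", analysis.root_cause())
```

Keeping the answers in order preserves the causal chain, so a reviewer can check each link and not only the final conclusion.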
Fishbone/Ishikawa Diagram Analysis
A fishbone diagram, also called an Ishikawa diagram, is a visual graphing tool that encourages the investigator to identify potential causes for a problem from a variety of sources. By prompting investigators to consider different types of causes that could have produced the problem condition, it helps them reach the root cause more quickly. A widely used framework for fishbone diagrams is the 5 Ms, where investigators look at:
- Man: Human factors that could have caused the problem
- Machine: Hardware or technical causal factors
- Material: Causal factors stemming from material issues, including consumables and information
- Method: Causal factors stemming from breakdowns in process or methodology
- Measurement: Causal factors stemming from inaccuracies in measurement tools or inspections
Environmental causal factors are also frequently investigated as part of a Fishbone/Ishikawa diagram analysis.
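A fishbone diagram is usually drawn by hand or in a diagramming tool, but its structure can also be held as plain data. The sketch below is an assumption-laden illustration: the function names `build_fishbone` and `unexplored` are hypothetical, and the candidate causes are borrowed from the malware example earlier in this article. It groups candidate causes under the 5 Ms plus Environment and reports which categories have not yet been examined.

```python
# The 5 Ms plus the environmental category mentioned above.
CATEGORIES = ("Man", "Machine", "Material", "Method", "Measurement", "Environment")


def build_fishbone(problem, causes):
    """Group (category, cause) pairs under each bone of the diagram."""
    bones = {category: [] for category in CATEGORIES}
    for category, cause in causes:
        if category not in bones:
            raise ValueError(f"Unknown category: {category}")
        bones[category].append(cause)
    return {"problem": problem, "bones": bones}


def unexplored(fishbone):
    """List the categories for which no candidate cause has been recorded."""
    return [name for name, causes in fishbone["bones"].items() if not causes]


fishbone = build_fishbone(
    "The company data server was infected with malware.",
    [
        ("Machine", "The automated update server is not operational."),
        ("Man", "The approver is on vacation with no one covering approvals."),
        ("Method", "There is no process for approving repairs in the approver's absence."),
    ],
)

for name, causes in fishbone["bones"].items():
    print(f"{name}: {causes}")
print("Not yet explored:", unexplored(fishbone))
```

Listing the empty categories serves the same purpose as the empty bones on a drawn diagram: it reminds the investigator which sources of causation have not been considered yet.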