Performing a Root Cause Analysis (RCA) in IT

To perform a Root Cause Analysis (RCA) in IT, you must systematically isolate the underlying technical or process failure that caused an incident, rather than just treating the visible symptoms.

Following a structured IT service management framework helps you fix the issue at its source and reduces the chance of it recurring.


1. Define the Incident and Its Impact

Clearly articulate what went wrong using specific, technical terms. Avoid vague descriptions.

  • Draft a precise problem statement: Specify the exact error message, system component, and affected user base.
  • Quantify the impact: Note the financial cost, operational downtime, or number of disrupted transactions.
  • Establish containment: Ensure short-term workarounds are active to protect users while you investigate.
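As a rough illustration of the points above, a problem statement can be captured as a small record so that the impact figures are computed rather than guessed. This is only a sketch: the `ProblemStatement` class, its field names, the sample incident, and the cost-per-minute figure are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ProblemStatement:
    """Hypothetical record of what failed and who was affected."""
    error_message: str
    component: str
    affected_users: int
    started: datetime
    contained: datetime  # when the short-term workaround took effect

    def downtime_minutes(self) -> float:
        """Minutes between the start of the incident and containment."""
        return (self.contained - self.started).total_seconds() / 60

    def estimated_cost(self, cost_per_minute: float) -> float:
        """Simple linear estimate: downtime multiplied by an assumed rate."""
        return self.downtime_minutes() * cost_per_minute


# Invented sample incident, for illustration only.
statement = ProblemStatement(
    error_message="HTTP 503 from checkout API",
    component="checkout-service",
    affected_users=1200,
    started=datetime(2024, 3, 5, 14, 2),
    contained=datetime(2024, 3, 5, 14, 47),
)

print(statement.downtime_minutes())                    # 45.0
print(statement.estimated_cost(cost_per_minute=80.0))  # 3600.0
```

A linear cost model is a simplification; a real estimate would likely also account for lost transactions and any service-level penalties.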

2. Gather Evidence and Timeline

Collect empirical data from your IT environment to reconstruct the exact order of events.

  • Pull system logs: Review application logs, server telemetry, database queries, and network traffic captures.
  • Check the change management registry: Cross-reference the exact time of failure against recent code deployments, infrastructure modifications, or patch updates.
  • Map out the sequence: Build a chronological timeline from the last known stable state to the moment of failure.
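The timeline step above can be sketched in a few lines: merge events from every source, sort them by timestamp, and flag any change made shortly before the failure. The sample events, the tuple layout, and the 24-hour window are assumptions made for this example, not output from a real system.

```python
from datetime import datetime, timedelta

# Hypothetical sample data: (timestamp, source, description)
log_events = [
    (datetime(2024, 3, 5, 14, 2), "app-log", "connection pool exhausted"),
    (datetime(2024, 3, 5, 13, 58), "db-log", "connection count above 90%"),
]
change_records = [
    (datetime(2024, 3, 5, 13, 30), "change", "deploy checkout-service v2.4"),
    (datetime(2024, 3, 1, 9, 0), "change", "OS patch on db host"),
]


def build_timeline(*sources):
    """Merge event lists from several sources into one chronological list."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda event: event[0])


def changes_before(failure_time, changes, window):
    """Return the changes made within `window` before the failure."""
    return [c for c in changes if failure_time - window <= c[0] <= failure_time]


timeline = build_timeline(log_events, change_records)
failure_time = datetime(2024, 3, 5, 14, 2)
suspects = changes_before(failure_time, change_records, timedelta(hours=24))

for timestamp, source, description in timeline:
    print(timestamp.isoformat(), source, description)
print("Changes to review first:", [c[2] for c in suspects])
```

A change that falls inside the window is only a lead to investigate, not proof of cause.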

3. Identify Potential Causal Factors

List every plausible technical and human factor that could have triggered the event.

  • Brainstorm with a cross-functional team: Involve developers, system administrators, and network engineers to get different perspectives.
  • Categorize via Fishbone (Ishikawa) Diagrams: Separate potential culprits into categories like Code, Hardware, Processes, People, and Third-Party Vendors.
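A Fishbone diagram is, in data terms, a grouping of candidate causes under fixed categories. The sketch below uses the categories named above; the candidate causes are invented examples, and `build_fishbone` is a hypothetical helper.

```python
from collections import defaultdict

# Categories taken from the list above.
CATEGORIES = ("Code", "Hardware", "Processes", "People", "Third-Party Vendors")

# Hypothetical brainstorm output: (category, candidate cause)
candidates = [
    ("Code", "connections not closed after query"),
    ("Processes", "no review checklist for new framework"),
    ("Hardware", "database host undersized"),
    ("Code", "retry loop without backoff"),
]


def build_fishbone(items):
    """Group candidate causes by category, rejecting unknown categories."""
    fishbone = defaultdict(list)
    for category, cause in items:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        fishbone[category].append(cause)
    return dict(fishbone)


fishbone = build_fishbone(candidates)
for category in CATEGORIES:
    for cause in fishbone.get(category, []):
        print(f"{category}: {cause}")
```

Categories that end up empty (here People and Third-Party Vendors) are worth a second look, since they may simply not have been discussed yet.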

4. Isolate the Root Cause

Use structured analytical methods to narrow your broad list of potential causes down to the underlying failure. Some incidents have more than one contributing root cause.

  • Apply the 5 Whys technique: Ask “Why?” repeatedly to drill past surface symptoms. For example:
    1. Why did the application crash? The database server ran out of memory.
    2. Why did it run out of memory? Open connections accumulated, and each one held memory on the server.
    3. Why did connections accumulate? A recent code change did not close database connections.
    4. Why were connections left open? The developer missed the disposal pattern in the new framework.
    5. Why was it missed? There was no automated lint rule or peer review check for this framework (Root Cause).
  • Utilize Fault Tree Analysis (FTA): Use Boolean logic (AND and OR gates) to map visually how combinations of lower-level faults lead to the top-level failure.
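To make the Boolean logic behind a fault tree concrete, here is a minimal sketch in which each gate is a tuple and each leaf is the name of a basic fault. The tree itself, the fault names, and the helper functions `and_gate`, `or_gate`, and `occurs` are assumptions made for illustration; dedicated FTA tools also handle probabilities, which this example leaves out.

```python
def and_gate(*children):
    """Event occurs only if all child events occur."""
    return ("AND", children)


def or_gate(*children):
    """Event occurs if any child event occurs."""
    return ("OR", children)


def occurs(node, active_faults):
    """Evaluate a tree node against the set of basic faults that are present."""
    if isinstance(node, str):  # leaf: a basic fault
        return node in active_faults
    gate, children = node
    results = [occurs(child, active_faults) for child in children]
    return all(results) if gate == "AND" else any(results)


# Hypothetical tree for the top event "application crash".
crash = or_gate(
    and_gate("connection leak in code", "no connection pool limit"),
    "database host failure",
)

print(occurs(crash, {"connection leak in code"}))                              # False
print(occurs(crash, {"connection leak in code", "no connection pool limit"}))  # True
print(occurs(crash, {"database host failure"}))                                # True
```

The first result shows why the AND gate matters: in this invented tree, the leak alone does not bring the application down, so a pool limit would be a valid second line of defence.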

5. Develop and Implement Preventive Solutions

Design a permanent fix that targets the root cause, so the same failure is unlikely to recur.

  • Deploy technical remediation: Patch code, reconfigure infrastructure, or scale resources.
  • Fix the process gap: Update documentation, add automated testing pipelines, or adjust alert thresholds.
  • Assign clear ownership: Appoint explicit owners and deadlines for each action item.
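Ownership and deadlines are easier to enforce when action items are tracked as data. The following is a sketch only: the `ActionItem` fields, the team names, and the dates are hypothetical, and most teams would keep this in their ticketing system rather than in a script.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    """Hypothetical remediation task with an explicit owner and deadline."""
    description: str
    owner: str
    due: date
    done: bool = False


def overdue(items, today):
    """Return open action items whose deadline has passed."""
    return [item for item in items if not item.done and item.due < today]


# Invented sample items, for illustration only.
items = [
    ActionItem("Patch connection handling", "backend-team", date(2024, 3, 8), done=True),
    ActionItem("Add lint rule for unclosed connections", "platform-team", date(2024, 3, 15)),
    ActionItem("Lower alert threshold on connection count", "ops-team", date(2024, 3, 29)),
]

for item in overdue(items, today=date(2024, 3, 20)):
    print(f"OVERDUE: {item.description} (owner: {item.owner}, due {item.due})")
```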

6. Document and Practice Blameless Reviews

Foster transparency to improve future infrastructure resilience.

  • Conduct a blameless post-mortem: Focus entirely on how the system allowed the failure to occur, not who made the mistake.
  • Publish an internal RCA report: Document the timeline, data points, root cause, and remediation steps in a searchable knowledge base.
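A consistent report layout makes past incidents easier to search. Below is a minimal sketch of a plain-text report generator; the `render_report` function, the section names, and the sample content are assumptions. In keeping with the blameless approach, the template deliberately has no field for an individual's name.

```python
def render_report(title, timeline, root_cause, remediation):
    """Render a plain-text RCA report from its main sections."""
    lines = [title, "", "Timeline:"]
    lines += [f"  {when}  {what}" for when, what in timeline]
    lines += ["", f"Root cause: {root_cause}", "", "Remediation:"]
    lines += [f"  - {step}" for step in remediation]
    return "\n".join(lines)


# Invented sample content, for illustration only.
report = render_report(
    title="RCA: checkout-service outage",
    timeline=[
        ("13:30", "checkout-service v2.4 deployed"),
        ("14:02", "connection pool exhausted, HTTP 503 returned"),
        ("14:47", "rollback completed, service restored"),
    ],
    root_cause="No lint rule or review check for unclosed database connections",
    remediation=[
        "Patch connection handling",
        "Add lint rule for unclosed connections",
    ],
)
print(report)
```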

For a visual breakdown of how to execute these problem-solving techniques in practice, watch this tutorial on conducting a root cause analysis:

How to Do Root Cause Analysis (RCA) the Right Way | Lean Six Sigma Tools (YouTube, InfiniLean)



Author: Mark Whitfield

Welcome to my site! After graduating in Computing in 1990, I accepted a position as a programmer at a Runcorn-based software house specialising in electronic banking software, namely sp/ARCHITECT-BANK on Tandem Computers (now HPE NonStop). This was before the internet became prevalent, so the notion of enabling desktop access to company accounts for inter-account transfers and bookkeeping was still quite a cutting-edge idea (and smartphones were only ever hinted at in Space: 1999). The company was called The Software Partnership (which was taken over by Deluxe Data in 1994). I spent 5 years in Runcorn developing code for SP/ARCHITECT for various banks such as TSB, Bank of Scotland, Rabobank and Girofon (Denmark), to name but a few. I then moved on to a software house in Salford Quays for further bank-facing projects. After a further 23 years in the IT industry, and now a Senior IT Project Manager (both Agile and Waterfall delivery), I thought I would echo out my Career Profile in this corner of the internet for quick and easy access.
