Frequently Asked Questions

Post Incident Analysis and Reporting
Last Updated 3 hours ago

Post Incident Analysis is an important part of the support process. It provides a review of the incident itself, the scenario, the engineering response and knowledge applied, and overall performance. It is an automatic function of the HelpDesk: every ticket that is closed passes through this process.

Two analyses are produced, one technical and one executive, each written for its respective audience, and both are archived along with the ticket.
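As a sketch of how such a post-close hook could be structured (all names here are hypothetical, not the HelpDesk's actual API):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Ticket:
    number: str
    subject: str
    created: datetime
    closed: datetime

def run_post_incident_analysis(ticket: Ticket) -> dict:
    # On ticket close, produce one technical and one executive report
    # and return both for archival alongside the ticket.
    technical = f"# Technical Performance Review: {ticket.number} - {ticket.subject}"
    executive = f"# Executive Summary: {ticket.number} - {ticket.subject}"
    return {"technical": technical, "executive": executive}
```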

Below are two genuine analysis reports for a real ticket, with identifying references redacted.

Technical Analysis

## Performance statistics
- Time to first response (business hours): 11 minutes (697s)
- Time to first response (elapsed): 11 minutes (697s)
- Time to solution (closed): 1 day 0 hours 0 minutes 
- SLA: SLA8 (grace 1h), first response due: 2026-02-17 08:05:56, within SLA: Yes

# Technical Performance Review: GXXX-XXX - PVE Down

## Overview

This ticket, identified as GXXX-XXX with the subject "PVE Down", was opened on 17 February 2026 at 07:05:56 by XXXXXXXXXX of XXXXXXXXXX. The issue involved a complete outage of the Proxmox Virtual Environment (PVE) cluster, which was reported as urgent. The ticket was closed on 18 February 2026 at 07:06:00 due to inactivity. The resolution was handled by a technical engineer within the Support - Proxmox team, with internal coordination via ticket TXXX-XXX.

## Performance Statistics

- **Ticket Number**: GXXX-XXX 
- **Subject**: PVE Down  
- **Created**: 17 February 2026, 07:05:56  
- **Closed**: 18 February 2026, 07:06:00  
- **Time to First Response**: 11 minutes (697 seconds)  
- **Time to Solution (Closed)**: 1 day 0 hours 0 minutes (automatic)
- **SLA**: SLA8 (24/7), Grace period: 1 hour  
- **First Response Due**: 17 February 2026, 08:05:56  
- **First Response Within SLA**: Yes
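The SLA figures above can be reproduced with simple datetime arithmetic. A minimal sketch (the function name is hypothetical; under a 24/7 SLA such as SLA8, business-hours and elapsed times coincide):

```python
from datetime import datetime, timedelta

def first_response_within_sla(created: datetime, first_response: datetime,
                              grace: timedelta = timedelta(hours=1)):
    """Return (time to first response, whether it landed inside the grace window)."""
    elapsed = first_response - created
    return elapsed, first_response <= created + grace

created = datetime(2026, 2, 17, 7, 5, 56)    # ticket opened
responded = datetime(2026, 2, 17, 7, 17, 33)  # first engineer response
elapsed, ok = first_response_within_sla(created, responded)
# elapsed is 697 seconds (about 11 minutes); ok is True
```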
  
## Triage and Initial Response

The ticket was triaged as priority 1, the highest priority, because the customer explicitly described the situation as "URGENT" and stated that "No systems are working in the business". The initial response was sent at 07:17:33, 11 minutes after the ticket was created. 

## Investigations Performed

The technical team conducted a thorough investigation. The root cause was identified as a volume failure on the node named XXXXXXXXXX, which led to a cascade failure in pvesm (Proxmox Virtual Environment storage manager). The team took appropriate actions, including bringing compute nodes back online to re-establish quorum and verify that fixes were propagated. The internal ticket TXXX-XXX was used to coordinate the resolution, and the team provided regular updates to the customer. The technical analysis appears sound, with a clear understanding of the cluster's failure mode and the steps taken to restore functionality.

## Technical Quality

The technical response was of high quality. The team correctly identified the root cause and implemented a structured recovery process. The use of an internal ticket (TXXX-XXX) to manage the incident suggests a well-organised incident response process. The communication from the engineering team was technically accurate, explaining the failure of a volume and the subsequent cascade effect. The team also provided updates on the status of the recovery, including the bringing online of compute nodes and the verification of fixes. However, the lack of a clear timeline for resolution may have impacted the customer experience.

## Communication

Communication with the customer was inconsistent. The first response was prompt (11 minutes), but subsequent updates were delayed. The customer requested an update at 07:37:20 and received a response at 07:43:03, which was acceptable. However, the next communication from the team did not arrive until 08:06:23, roughly 23 minutes later. 

## Risk Handling

The primary risk was the prolonged downtime of the customer's business systems. The team managed this risk by taking the compute nodes offline to prevent further damage and by initiating a recovery process. However, the risk was not fully mitigated due to the lack of a clear recovery timeline and the absence of regular updates to the customer. When a clear recovery timeline is not available, an estimate could have been offered to manage the customer's expectations; however, directive 7 states that an estimate should only be provided when a clear timeline is available.

## What Went Well

- The initial response was prompt and accurate, identifying the root cause and the internal ticket number.
- The technical team correctly diagnosed the issue and implemented a structured recovery process.
- The team provided regular updates during the initial phase of the incident.
- The use of an internal ticket (TXXX-XXX) to coordinate the resolution demonstrates effective internal processes, leveraging multiple engineers to effect a swift resolution.

## What to Improve

- Communication with the customer should have been more frequent and proactive, especially during the long recovery period.


Executive Analysis

## Performance statistics

- Time to first response (business hours): 11 minutes (697s)
- Time to first response (elapsed): 11 minutes (697s)
- Time to solution (closed): 1 day 0 hours 0 minutes (automatic)
- SLA: SLA8 (grace 1h), first response due: 2026-02-17 08:05:56, within SLA: Yes

# Executive Summary: GXXX-XXX - PVE Down  

A critical incident was reported on 17 February 2026, when XXXXXXXXXX's Proxmox Virtual Environment (PVE) cluster experienced a complete outage, rendering all systems inoperable. The issue was identified at 05:16 as a cascade failure caused by a storage volume failure on the node named XXXXXXXXXX, which disrupted the storage management system (pvesm). The team responded promptly, isolating the compute node and initiating recovery procedures. The cluster was restored by bringing compute nodes back online and re-establishing quorum. The ticket was closed on 18 February 2026 due to inactivity, with no further updates from the client.  

**Business Impact**: The outage resulted in a complete loss of system availability for XXXXXXXXXX, affecting business operations. The client expressed urgency and concern over the lack of visibility into the resolution process.  

**Current Status/Outcome**: The incident was resolved internally, with the cluster restored to operational status. The ticket was automatically closed due to inactivity after 1 day 0 hours. No further client engagement occurred post-resolution. 
 
**Performance Statistics**:

- Time to first response: 11 minutes
- Time to solution (closed): 1 day 0 hours 0 minutes

**Decisions/Asks**:

- Implement proactive monitoring and alerting for PVE cluster health to prevent future outages.

Both reports are generated in Markdown, as can be seen, because this is an ideal format to convert easily into ODT, DOCX, PDF, or whatever format the customer might require. 
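One common way to perform such a conversion is with a tool like pandoc, which infers the output format from the file extension. A minimal sketch (the helper and file names are hypothetical; actually running the command assumes pandoc is installed):

```python
import subprocess

def convert_report(md_path: str, out_path: str) -> list[str]:
    # Build a pandoc command converting a Markdown report; the output
    # format (odt, docx, pdf, ...) is inferred from the extension.
    return ["pandoc", md_path, "-o", out_path]

# To actually convert, assuming pandoc is on the PATH:
# subprocess.run(convert_report("GXXX-XXX-technical.md", "GXXX-XXX-technical.docx"), check=True)
```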

Additionally, given the way we store these documents, compound analysis and collective reporting are generated internally and can be made available externally on a monthly basis. 
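As an illustration of how archived analyses might be grouped for such a monthly compound report (the function and data shape are hypothetical, not the actual internal schema):

```python
from collections import defaultdict
from datetime import datetime

def monthly_summary(tickets):
    """Group archived ticket analyses by closing month, e.g. for a compound report."""
    buckets = defaultdict(list)
    for number, closed in tickets:
        buckets[closed.strftime("%Y-%m")].append(number)
    return dict(buckets)
```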
