This article was cross-posted with permission from Domenico Luciani, Senior Software Engineer at VMware.
What's an incident? What should you do in case of an incident? How can you be a great incident commander? Are you a good incident responder? This is what I discovered during my Incident Management workshop at @ThoughtWorks this week. Let's take a look!
Universally, we can say that an incident is an occurrence that requires action or support by an emergency squad to prevent or minimize loss of life or damage to property and/or natural resources.
In the IT field, we can say that it is an unplanned interruption of an IT service or a reduction in its quality.
But at the end of the day, an incident is nothing more than its cost: everything that having that incident entails.
In general, we can have 3 roles during an incident:
- Incident Commander: It's an authoritative role, responsible for leading the efforts to mitigate the incident.
- Communication Manager: It's a secondary role, often merged with the IC; responsible for handling communications with stakeholders and executives.
- Incident Responder: It's a member of the technical staff with clearly defined responsibilities over a specific service or group of services during an incident.
You, the Incident Commander, have been woken up during the night for an incident. What do you do then?
- Gain situational awareness.
- Acknowledge the situation with your stakeholders.
- Establish the communication channels for the response team.
- Gather the response team.
- Evaluate the situation.
Every incident has a severity, and depending on it you should act differently. Use incident levels as guidance, and use your judgment when assessing the severity. When you are unsure, select a higher incident level and downgrade it later; just be explicit when you do that.
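The "pick the higher level when unsure, downgrade explicitly later" rule can be sketched in a few lines of Python. This is only an illustration: the `Severity` levels, the `Incident` record, and the `open_incident` and `downgrade` helpers are hypothetical names I made up, not part of any real incident tool.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical levels: a lower number means a more severe incident.
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


@dataclass
class Incident:
    title: str
    severity: Severity
    history: list = field(default_factory=list)  # explicit record of level changes


def open_incident(title, candidates):
    """When unsure between several levels, start at the most severe one."""
    chosen = min(candidates)
    return Incident(title, chosen, [f"opened at {chosen.name}"])


def downgrade(incident, new_level, reason):
    """Lower the severity, and force the change to be explicit."""
    if new_level <= incident.severity:
        raise ValueError("new level is not a downgrade")
    if not reason.strip():
        raise ValueError("a downgrade needs an explicit reason")
    incident.history.append(
        f"{incident.severity.name} -> {new_level.name}: {reason}"
    )
    incident.severity = new_level
```

For example, opening an incident with `[Severity.SEV2, Severity.SEV1]` as candidates starts it at `SEV1`, and a later `downgrade(...)` call leaves a written trace of why the level changed.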
O captain! My captain!
Now you are in charge of leading the team to solve the incident as fast as possible. What should you do to be a great incident commander?
- Stay calm: take your time to think and use your judgment.
- Focus: leading the incident is your only priority, drop the rest.
- Create the right environment: shut down the blame talk and funnel all communications to the outside world. Remember, it's not about blaming people, it's about solving the problem.
- Lead and command: be direct and assertive. Use your great team. You make the decisions, but you don't need to have all the technical knowledge to be an IC; you just need to gather enough data to make informed decisions.
- Prioritize recovery: work towards recovering service. Always prioritize recovery over the correctness of the solution.
- Document the incident: keep track of who did what and when, and what the outcome was. As IC you should document the business and technical impact of the incident.
- Run a tight shop: rely on experts, but beware of heroes and lone wolves.
- Don't make assumptions: ask questions and gather data to confirm your hypotheses. Try to be as impartial as possible.
- Communicate with the stakeholders even if there aren't updates.
- Inform stakeholders: it's important to specify whether the incident has been fixed or only mitigated.
- Follow up the next day: permanent fixes might be needed, so make sure to loop in the necessary teams.
- Write up the incident report: remove the names, and include business and technical details.
- Send out an impact analysis document.
- Run the incident review.
- Prioritize Action Items: be reasonable, don't put too many action items on the backlog, and prioritize them with your BA.
The post-incident document is also a good resource for the onboarding process, because it explains what happened, why, and how you handled it to solve the problem. It's a learning opportunity for the people of the organization.
It's time to write our analysis, but what do we need to put in it?
- People involved (roles and duties).
- Impact analysis.
- Qualitative impact.
- Team impact.
- Incident trigger.
- Incident and recovery timeline: use the logs you wrote during the incident to build a better timeline.
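As a rough sketch of how the incident log can feed the timeline, here is a small Python example. It assumes each log entry records when something happened, who did it (a role rather than a name, since names are removed from the report), what was done, and the outcome. `LogEntry` and `build_timeline` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class LogEntry:
    at: datetime   # when it happened
    who: str       # role, not a person's name
    action: str    # what was done
    outcome: str   # what the result was


def build_timeline(entries):
    """Return the log entries as timeline lines, oldest first."""
    ordered = sorted(entries, key=lambda e: e.at)
    return [
        f"{e.at:%H:%M} {e.who}: {e.action} -> {e.outcome}"
        for e in ordered
    ]
```

Entries can be logged in any order during the incident; sorting by timestamp when the report is written keeps the timeline consistent.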
An incident is a good opportunity to learn new things. Try to think about:
- What went well.
- What went not-so-well.
- Where we got lucky.
Just like in a retrospective.
The action items table documents all of the actions that you took to mitigate the problem, each with a related ticket, an owner, a type and a priority.
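To make the shape of that table concrete, here is a minimal Python sketch. The columns follow the description above (ticket, owner, type, priority); the `ActionItem` class, the `render_table` helper, and the convention that priority 1 is the highest are my own assumptions.

```python
from dataclasses import dataclass


@dataclass
class ActionItem:
    description: str
    ticket: str    # related ticket identifier
    owner: str
    type: str      # hypothetical categories, e.g. "mitigate" or "prevent"
    priority: int  # assumption: 1 is the highest priority


def render_table(items):
    """Render the action items as a plain-text table, highest priority first."""
    header = "Priority | Ticket | Owner | Type | Action item"
    rows = [
        f"{i.priority} | {i.ticket} | {i.owner} | {i.type} | {i.description}"
        for i in sorted(items, key=lambda i: i.priority)
    ]
    return "\n".join([header, *rows])
```

Sorting by priority keeps the most important follow-ups at the top, which helps when you review the list with the team.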
Well done! The incident is solved, we learned from it, our stakeholders have been informed, and the team now deserves a big glass of beer to celebrate this victory. But remember: we won the battle, not the war.