Handling the Pressure of an Outage

You are a developer on a team going about your day when, out of nowhere, you find out there is an ongoing outage. This news flips the day from calm to frenzied in the blink of an eye. In this post, we will discuss some techniques to make an outage more manageable for all parties.
Before jumping into technology, let's be selfish and think about ourselves for a moment. What can we do to make the outage easier on ourselves and on everyone else processing it? First and foremost, temper your emotions. In the middle of an outage, becoming emotional is a distraction for you and for those around you. In these moments we do not all have to agree on everything; what we do have to do is resolve the problem for our customers. Another struggle for many technical minds is leaving old decisions in the past. While the team is trying to resolve a major issue, it is not uncommon to hear about how features were built incorrectly. Comments such as "This would have never been an issue if we had just built this in the way that..." have no place in a high-pressure situation. Those comments and feelings are not helpful at these times; keep them to yourself. If the software truly is that bad, build your case afterward for prioritizing work that removes the problem areas. Complaining about old decisions at this point in the game only demotivates the people working to solve the problem causing the outage.
Communication deserves its own mention. I often hear developers communicate unclearly, forcing the other investigators to ask for clarification or to waste time deciphering the message instead of working the problem. Maintain precise communication when relaying information inward to the investigation team and outward to stakeholders. The same goes for unnecessary communication: more is not always better, and excess detracts from the focus required to find the root cause. Lack of precision consumes the most precious resource in solving complex problems: brain processing power.
Now that we have ourselves out of the way, let's dive into the investigation, which should start with collecting information. As technical people, our instinct may be to go straight to the error logs. However, I suggest starting elsewhere. Somehow we were informed of this outage. If the information came from a person we can call or walk over to, start with them! Get them in person or on a phone call and have them show you everything they can about what is happening, or not happening. Determine whether what they are showing you is really an outage or just something they alone are experiencing. The work at this point is to sort fact from fiction by following the same actions as the reporter on your own machine. While listening and following along, capture any ideas you have about where the issue may live in the system. Before leaving the conversation, you should have a better idea of where to begin digging into the logs and what you are searching for.
With the information gathered, what can we do to solve the problem faster? Using your gut reactions from the information gathering, identify experiments that narrow the issue down further. Share the information you have gathered and the proposed experiments with the rest of your investigation team. With options in hand and a supporting team, divide the tasks among the team members. While investigating, take deliberate actions in the software. It may be tempting to adjust a configuration in one location and a setting in another, none of which end up resolving the issue. These small changes have the potential to resolve the issue; however, left in place, they have the same potential to make matters worse. Taking deliberate actions means running the experiment, testing whether the problem is resolved, and reverting the change if the problem persists.
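To make the revert step harder to forget, it can help to wrap each experiment in a small helper. Below is a minimal Python sketch of that idea; the apply_change, revert_change, and problem_resolved callables are hypothetical stand-ins for whatever scripts or tooling your team actually uses.
```python
# A minimal sketch of a "deliberate action": apply one change, verify the
# outcome, and revert the change if the problem persists. The callables
# passed in are hypothetical placeholders for your own scripts or tooling.
from typing import Callable


def run_experiment(
    name: str,
    apply_change: Callable[[], None],
    revert_change: Callable[[], None],
    problem_resolved: Callable[[], bool],
) -> bool:
    """Run a single experiment and undo it if it does not fix the issue."""
    print(f"Applying experiment: {name}")
    apply_change()
    if problem_resolved():
        print(f"'{name}' appears to resolve the issue; keeping the change.")
        return True
    print(f"'{name}' did not help; reverting the change.")
    revert_change()
    return False


# Example usage with throwaway stand-ins for real apply/revert logic.
if __name__ == "__main__":
    run_experiment(
        "raise connection pool size",
        apply_change=lambda: print("pool size set to 50"),
        revert_change=lambda: print("pool size restored to 20"),
        problem_resolved=lambda: False,  # pretend the health check still fails
    )
```
The point is not the helper itself but the discipline it encodes: one change at a time, with the revert path decided before the change is made.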
As the team works to resolve the outage, continue to check in periodically with each investigator. Without those check-ins, time can vanish into a never-ending list of experiments that do not drive toward the root cause. In situations where time is of the essence, we want to reduce waste as much as possible. One technique I have seen teams use is to identify a communication controller. Their job is to poll the investigating members for updates, synthesize what they hear, and relay the necessary information to all parties. This technique works quite well with large investigation groups, maintaining a single point of contact for stakeholders and a consistent messaging cadence.
Now that the investigation team has located the issue, how should we fix it? The obvious answer is to write the code and ship it to production, but that is not all. First, the person implementing the solution should be the person who understands the problem best. Next, write tests that exercise the problem area in isolation before writing code to solve the problem. Often a single test is not enough to cover the entire problem area; consider writing a suite of tests for better coverage. These tests also act as a communication mechanism for future team members. Once the test case(s) are in place, solve the problem in the software. Once the software is ready for deployment, if possible, set up test data in lower-level environments to prove the solution works before deploying to production. With the fix complete, use the collected knowledge of the investigation team to log work items for the other risk areas found during the outage. Resolving those identified risks should then be a priority in the coming weeks, resulting in a more robust product.
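As an illustration of tests-before-fix, here is a hedged Python sketch using pytest. The billing module, the parse_order_total function, and the missing-currency scenario are all hypothetical; substitute the code path your investigation actually pointed to.
```python
# A sketch of regression tests written before the fix, assuming the
# investigation pointed at a hypothetical parse_order_total() helper in a
# hypothetical billing module that breaks when an order has no currency field.
import pytest

from billing import parse_order_total  # hypothetical module under test


def test_order_with_missing_currency_defaults_to_usd():
    # The exact shape of data that triggered the outage.
    order = {"total": "19.99"}  # no "currency" key
    assert parse_order_total(order) == ("USD", pytest.approx(19.99))


def test_order_with_explicit_currency_still_works():
    # A companion test that guards the existing, working behavior.
    order = {"total": "19.99", "currency": "EUR"}
    assert parse_order_total(order) == ("EUR", pytest.approx(19.99))
```
The first test should fail before the fix and pass afterward; together the pair documents the incident for anyone who touches this code later.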
Like everything, handling outages takes focus and practice to do well. During your practice opportunities, remember to remain calm, focus on facts, and maintain precise communication. The rest, the technology and the best practices, will fall into place once a solution is found.