Handling an Outage

Andrew Schwabe

2021-07-15

You are a developer on a team going about your day when, out of nowhere, you find out there is an ongoing outage. This news flips the day from calm to frenzied in the blink of an eye. In this post, we will discuss some techniques to make the outage more manageable for all parties.

Before jumping into technology, let us be selfish and think about ourselves. What can we do to make this easier? First and foremost, temper emotions. In the middle of an outage, becoming emotional is a distraction. In these moments we do not all have to agree; what we have to do is resolve the problem. One struggle many technical minds have is leaving old decisions in the past. While the team is trying to resolve a major issue, it is not uncommon to hear how features were built incorrectly, or how this would never have been an issue if we had just built it the way you wanted. Those comments and feelings are not helpful at these times; keep them to yourself. If the software is that bad, prioritize making those fixes to remove the issues you are seeing. The middle of an outage is the wrong time to complain. Next, I often hear developers being unclear, which causes the other investigators to spin or to ask for clarification. When communicating outward, be precise; clarity is key. Unnecessary communication detracts from the goal the group has set out to accomplish. Lack of precision draws attention away from the problem and toward deciphering the message, wasting valuable brain cycles for no reason.

When you do dive into the investigation, start by collecting information. Being technical, it may seem natural to go straight to the error logs. However, somehow we were informed of this outage. If the information came from a person, start with whoever holds the information about the outage. Get them in person or on a phone call and have them show you all they can about what is happening, or not happening. Sort facts from fiction. Reproduce the actions on your own machine. Capture any ideas you have about where the issue may lie. By the end of the conversation, you should have a better idea of where to begin digging in the logs and possibly what to search for, narrowing the problem from a whole system to components of that system.
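Once the conversation has given you a suspect component and a rough time window, the first pass through the logs can be narrowed accordingly. The following is a minimal sketch, not a prescription: the log format (timestamp, level, component, message), the component names, and the narrow function are all made up for illustration, and your logs will almost certainly look different.

```python
from datetime import datetime

# Hypothetical log excerpt. The "timestamp level component message" layout
# is an assumption made for this example only.
SAMPLE_LOG = """\
2021-07-15T09:58:12 INFO checkout order accepted
2021-07-15T10:02:41 ERROR payments gateway timeout
2021-07-15T10:03:05 ERROR payments gateway timeout
2021-07-15T10:03:30 INFO search query served
2021-07-15T10:45:00 ERROR payments card declined
"""


def narrow(log_text, component, start, end, level="ERROR"):
    """Return the lines for one component and level inside a time window."""
    matches = []
    for line in log_text.splitlines():
        parts = line.split(" ", 3)
        if len(parts) < 4:
            continue  # skip lines that do not follow the assumed layout
        stamp, line_level, line_component, _message = parts
        try:
            when = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # skip lines whose timestamp cannot be parsed
        if (
            line_level == level
            and line_component == component
            and start <= when <= end
        ):
            matches.append(line)
    return matches


# The reporter said payments started failing shortly after 10:00.
hits = narrow(
    SAMPLE_LOG,
    "payments",
    datetime(2021, 7, 15, 10, 0),
    datetime(2021, 7, 15, 10, 30),
)
for hit in hits:
    print(hit)
```

The point is less the code than the habit: let what you learned from the reporter decide which component and which minutes you look at first.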

Now, what can we do to solve the problem more quickly? Identify experiments to narrow down the issue further. Share the facts you have with the rest of your investigation team. With options in hand and a supporting team, divide the tasks among the team and begin investigating. While investigating, take deliberate actions. It may seem easy to try a little change here and a small change there. These may stack up to resolve the issue; however, little changes left in place can also make the problem worse. Taking deliberate actions allows you to experiment, test, and roll back if the problem persists. As the team learns, do not wait too long to check in with each other. When time is of the essence, it can vanish into a never-ending list of experiments. I have seen teams designate a communication controller, whose job is to poll the investigating members for updates, synthesize what they learn, and relay the necessary information to all parties. This works quite well with large investigation groups, since it maintains a single point of contact with stakeholders.
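To make "experiment, test, and roll back" concrete, here is a small sketch of one way to keep changes deliberate: apply one change at a time, check whether the problem persists, and undo the change before moving on if it did not help. Everything named here (Experiment, run_experiments, the config dictionary, and the pretend fault) is hypothetical and stands in for whatever your real system and health check look like.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Experiment:
    """One deliberate change, paired with the way to undo it."""
    name: str
    apply: Callable[[], None]
    rollback: Callable[[], None]


def run_experiments(
    experiments: List[Experiment],
    problem_persists: Callable[[], bool],
    journal: List[str],
) -> Optional[str]:
    """Try one change at a time; roll back each change that does not help."""
    for experiment in experiments:
        experiment.apply()
        journal.append(f"applied {experiment.name}")
        if not problem_persists():
            journal.append(f"resolved by {experiment.name}")
            return experiment.name
        # The change did not help, so do not leave it in place.
        experiment.rollback()
        journal.append(f"rolled back {experiment.name}")
    return None


# Hypothetical system under investigation: a settings dictionary.
config = {"timeout_seconds": 1, "pool_size": 5}


def problem_persists() -> bool:
    # Pretend the outage is caused by a connection pool that is too small.
    return config["pool_size"] < 10


experiments = [
    Experiment(
        "raise timeout",
        apply=lambda: config.update(timeout_seconds=30),
        rollback=lambda: config.update(timeout_seconds=1),
    ),
    Experiment(
        "grow pool",
        apply=lambda: config.update(pool_size=20),
        rollback=lambda: config.update(pool_size=5),
    ),
]

journal: List[str] = []
fix = run_experiments(experiments, problem_persists, journal)
print(fix)
print(config)
print("\n".join(journal))
```

The journal doubles as the update you hand to whoever is relaying information to stakeholders: it records what was tried, what was undone, and what finally worked.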

Now that the investigation team has located the issue, how should we fix it? It seems obvious: write the code, then ship it. But that is not all. First, the person who implements the solution should be the person who understands the problem best. Next, write tests to exercise the problem area in isolation. Often a single test is not enough; consider writing a suite of tests to better describe the issue to future team members. If possible, set up test data in lower-level environments to prove the solution works before going to production. With the fix complete, use the collected knowledge of the investigation team to log work items for the other risk areas found during the outage.
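As an illustration of a suite that describes the issue to future team members, suppose (purely hypothetically) the outage came from a function that crashed when a header was missing or malformed. The function parse_retry_after, its default of 5 seconds, and the test names below are invented for this sketch; the idea is that each test names one way the problem area can be hit.

```python
import unittest


def parse_retry_after(value, default=5):
    """Hypothetical fixed function: seconds to wait, or a safe default."""
    if value is None:
        return default
    try:
        seconds = int(value)
    except ValueError:
        return default
    return seconds if seconds >= 0 else default


class ParseRetryAfterTests(unittest.TestCase):
    """Each test documents one input seen or suspected during the outage."""

    def test_missing_header_uses_default(self):
        self.assertEqual(parse_retry_after(None), 5)

    def test_blank_header_uses_default(self):
        self.assertEqual(parse_retry_after(""), 5)

    def test_non_numeric_header_uses_default(self):
        self.assertEqual(parse_retry_after("soon"), 5)

    def test_negative_value_uses_default(self):
        self.assertEqual(parse_retry_after("-3"), 5)

    def test_valid_value_is_parsed(self):
        self.assertEqual(parse_retry_after("30"), 30)

    def test_surrounding_whitespace_is_tolerated(self):
        self.assertEqual(parse_retry_after(" 12 "), 12)
```

Saved in a test module, a suite like this runs with python -m unittest. The test names themselves become a short record of what went wrong and what the fix is expected to handle.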

Like most everything, it takes focus and practice to excel at handling outages. During your practice opportunities, remember to remain calm, focus on facts, and maintain precise communication. The rest, from the technical training to the best practices to apply once a solution is found, will fall into place.