How to Troubleshoot Urgent Production Issues

By Jean-Philippe Boudreault / August 5, 2020

If you work with software, sooner or later you will run into a production issue. Such issues happen to the best developers and engineers; what matters most is how you address them once they arise.

As a recent article explains, "One of the biggest misconceptions about troubleshooting systems is that it requires deep, specific technical knowledge to locate and solve production issues. This assumption can often result in extending the time between the discovery and resolution of a problem."

I’ve been involved in my fair share of "all hands on deck" production issues, and I’ve seen this problem first-hand. To drive better results faster, I wanted to share some tips on how to act in these unique situations.

Houston, We Have a Problem

You just joined an all-hands-on-deck issue. You’re now part of an ad hoc crew to resolve an urgent matter that is impacting the business. Hopefully, you can contribute to the resolution of the problem, but first you need to get up to speed.

Investigation Notes

  • "What is happening?"
  • "Where are we at?"
  • "What did we try?"

I bet that you’ll ask those questions within the first few minutes. Investigation notes allow newcomers to start contributing rapidly: they can double-check what has been assessed and bring up new ideas around the issue(s). A ticketing tool like Jira doesn’t shine at live collaboration and, in my experience, gets packed with symptoms that result in confusion. A shared Google document listing the problems, the items being looked at, the items discarded, and investigation traces will do wonders.

In short, build a collaborative document that contains: a summary, a list of symptoms, possible causes, discarded causes, and investigation traces.
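As an illustration only, the sections of such a document can be sketched as a small Python structure. The class name, field names, and sample entries below are hypothetical; a shared document with the same headings does the job just as well.

```python
from dataclasses import dataclass, field


@dataclass
class InvestigationNotes:
    """Hypothetical structure mirroring the sections listed above."""
    summary: str
    symptoms: list[str] = field(default_factory=list)
    possible_causes: list[str] = field(default_factory=list)
    discarded_causes: list[str] = field(default_factory=list)
    traces: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Return the notes as plain text, ready to paste into a shared doc."""
        sections = [
            ("Symptoms", self.symptoms),
            ("Possible causes", self.possible_causes),
            ("Discarded causes", self.discarded_causes),
            ("Investigation traces", self.traces),
        ]
        lines = ["Summary", f"  {self.summary}"]
        for title, items in sections:
            lines.append(title)
            # Show a placeholder so empty sections stay visible to newcomers.
            lines.extend(f"  - {item}" for item in items or ["(none yet)"])
        return "\n".join(lines)


# Made-up example entries, for illustration only.
notes = InvestigationNotes(summary="Checkout requests fail intermittently (example).")
notes.symptoms.append("HTTP 500 on checkout since 14:05 UTC")
notes.discarded_causes.append("Expired certificate: checked, valid")
print(notes.render())
```

Keeping discarded causes as their own section is the point: it lets newcomers see what has already been ruled out instead of re-checking it.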

Emergency Triage

Once you know where things are, verify the urgency of the issue at hand:

  • How many customers are impacted?
  • How many areas of systems are impacted?
  • Why is looking into the issue immediately a big deal?

While all-hands-on-deck troubleshooting gets the job done, it is rarely efficient for the company and should be the exception. Keep in mind that some fixes can be complicated and could require a full development cycle. If the release process for your systems alone takes 8 hours, it might not make sense to wake up a dev team to pull an all-nighter and rush a code fix that could do more harm than good due to fatigue.

Learn from the Past

As Mark Twain once said, "A favorite theory of mine [is] that no occurrence is sole and solitary, but is merely a repetition of a thing which has happened before, and perhaps often."

While each issue has the potential to be unique, make sure to search your knowledge base for the symptoms and cause. Search your ticketing systems, your intranet, your messaging system, and the support user base. You might get lucky: the problem may have happened before, and the resolution might already be documented. Some issues, like certificate expiration or external systems being down, are likely candidates for this scenario.

Hello Tunnel Vision, My Old Friend

Tunnel vision has been linked to numerous human mistakes: physicians providing incorrect treatment, detectives convicting innocents, and developers wasting a considerable amount of time on irrelevant stuff.

Systems are getting larger and more complex every day. In the midst of an emergency, it is challenging to find the signal in the noise. Our mind is built to focus on a single item and build proof to support it. You need to resist this.

Get a second opinion and have somebody verify the issue from the beginning. Make sure you’re not investigating a symptom or something completely unrelated. Production systems are noisy and have thousands of errors every day that go unnoticed. Don’t fall into the trap of thinking that the first stack trace you found is the cause of the issue. Verify whether the symptom is recent. Keep in mind that a single failure can trigger a myriad of other ones. Ensure that you are going up the chain of events, and not just investigating one. A good technique to use when troubleshooting issues is to ask the "Five Whys."
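One way to check whether a symptom is recent is to compare how often each error appeared before and after the incident started. The sketch below assumes you can already extract (timestamp, error signature) pairs from your logs; the function names and sample entries are made up for illustration.

```python
from collections import Counter
from datetime import datetime


def split_error_counts(entries, incident_start):
    """Count each error signature before and after the incident start.

    `entries` is an iterable of (timestamp, signature) pairs; how you extract
    them depends on your log format (an assumption of this sketch).
    """
    before, after = Counter(), Counter()
    for timestamp, signature in entries:
        (after if timestamp >= incident_start else before)[signature] += 1
    return before, after


def new_signatures(before, after):
    """Signatures seen only after the incident started, most frequent first."""
    fresh = {sig: n for sig, n in after.items() if sig not in before}
    return sorted(fresh, key=lambda sig: (-fresh[sig], sig))


# Made-up sample data, for illustration only.
incident_start = datetime(2020, 8, 5, 14, 0)
entries = [
    (datetime(2020, 8, 5, 9, 30), "NullPointerException in ReportJob"),
    (datetime(2020, 8, 5, 14, 2), "NullPointerException in ReportJob"),
    (datetime(2020, 8, 5, 14, 3), "TimeoutError calling billing"),
    (datetime(2020, 8, 5, 14, 4), "TimeoutError calling billing"),
]
before, after = split_error_counts(entries, incident_start)
print(new_signatures(before, after))  # ['TimeoutError calling billing']
```

In this sample, the NullPointerException was already present hours before the incident, so it is probably background noise; the timeout is the error that is actually new.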

In my experience, developers who excel at troubleshooting issues spend most of their time at a bird’s-eye view. They drill down only to understand a flow that they don’t know well or to assess a hypothesis. Pull out quickly if you’re not finding what you are looking for, because chances are that the problem lies elsewhere. This applies both to the codebase and to the server logs. It comes more easily with experience and platform knowledge, but being aware of your biases helps.

Your Toolkit Matters

There’s a famous saying by Abraham Maslow that goes: "I suppose it is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail." Taking this as a lesson, make sure you have more than one answer to every problem. Ask yourself a range of questions, like:

  • What are your tools?
  • Are you good at searching the server logs?
  • What about tracing requests or following a specific user’s actions?
  • Do you have system metrics monitoring on thread, memory, or CPU usage?

Unfortunately, the middle of a fire is not the right time to learn new tools. You need to build expertise in advance through your company’s training tools and peers. Ensure that employee onboarding materials cover most of the debugging tools.

Hopefully, you’ll bring more than a hammer to your next troubleshooting session.

Bonus Tip: Establish a Timeline

As an additional tip, make sure to build a timeline when troubleshooting an issue. Working backward through the chain of events has proven to be an effective way to limit tunnel vision and work as a group.

Most systems have a way to track a user journey via session IDs, IPs, or activity logs. Use these whenever possible. Additionally, request tracing is a must for system-to-system communication. If possible, match the timeline of events with customer and deployment schedules. Did the customer activate a product or feature recently? When was the latest deployment on the various systems involved? A timeline will prove handy when considering those questions.
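A timeline can be as simple as merging events from each source and sorting them by time. The following is a minimal sketch, assuming every source can be reduced to (time, source, detail) records; the Event type, the session ID, and the sample events are all hypothetical.

```python
from datetime import datetime
from typing import NamedTuple


class Event(NamedTuple):
    when: datetime
    source: str
    detail: str


def build_timeline(*sources):
    """Merge events from several sources into one list, oldest first."""
    return sorted((event for source in sources for event in source),
                  key=lambda event: event.when)


# Made-up events, for illustration only.
deployments = [Event(datetime(2020, 8, 5, 13, 50), "deploy", "billing-service v2 released")]
customer = [Event(datetime(2020, 8, 5, 13, 55), "customer", "feature X activated")]
session_log = [
    Event(datetime(2020, 8, 5, 14, 3), "session abc123", "checkout returned HTTP 500"),
    Event(datetime(2020, 8, 5, 14, 1), "session abc123", "user opened cart"),
]

timeline = build_timeline(deployments, customer, session_log)
# Print newest first to walk backward from the failure.
for event in reversed(timeline):
    print(f"{event.when:%H:%M}  {event.source:<15} {event.detail}")
```

Seeing the deployment and the feature activation a few minutes before the first failure, all in one list, is what makes those two questions easy to answer as a group.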

Jean-Philippe Boudreault is the Director of Engineering, Distribution Automation at AppDirect.