Is Predictability Overrated? The Case for a ‘Chaos Engineering’ Game Day

By Juan Ramollino / April 13, 2023

Software development practices have changed significantly over the last two decades. With DevOps now mainstream, it is no longer possible to build a career around a single technology or a single aspect of the software lifecycle. Teams are multidisciplinary and carry an increased cognitive load because of the breadth of knowledge they need for their day-to-day work.

What is an engineering game day?

A game day is an event designed to build a skill within an Engineering group. Your teams are constantly exposed to documentation, to the point of being saturated. Don't rely solely on written documentation or recorded videos to build an essential skill within your workforce. Various studies have demonstrated that the brain retains more information when it is associated with an emotion (e.g. fun, sadness).

A well-engineered game can increase the engagement of your workforce and reinforce the learnings that are critical to your organization's success. Some chaos engineering events involve a real team facing a simulated system failure to build incident response memory. What is chaos engineering, anyway? ChatGPT explains it like this: "Chaos engineering is like throwing a surprise party for your IT systems, but instead of balloons and cake, you bring chaos and disorder."

Growing pains

At the end of 2022, AppDirect engineering was facing scaling challenges. The platform was seeing a significant increase in volume, and teams were in the process of shifting toward DevOps practices. DataDog had just been deployed to simplify the stack, but teams needed more expertise with it. Finally, some groups had difficulty identifying the root cause of problems and only addressed the symptoms.

At that point, the organization needed to increase its expertise on DataDog and correctly identify the root cause of issues to address core problems. We designed a game day around those organizational needs.

The game day event: The rules of the game

Over a few hours, a small group of Kubernetes administrators introduced various issues in the continuous integration environment for teams to find and report on. Problems ranged from a complete database outage to more subtle issues like failures in Kafka.

Here’s how we structured our game day:

  • We had more than 90 participants on 30 teams in different locations

  • Teams could submit at most three incident reports to allow participants with limited time a fair chance at winning the game

  • A judging panel rated reports based on their precision and the proposed mitigation measures

  • The engineering department required each team to sign up at least one participant and organized live troubleshooting sessions with an expert before the event

  • Prior to the event, the teams also had access to curated Udemy training to level set on DataDog
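The submission cap and panel rating described above could be tracked with something as small as the sketch below. It is only an illustration, not the tooling used at the event: the tuple shapes, the rule that a team's first three reports are the ones kept, and the equal-weight sum of the precision and mitigation ratings are all assumptions.

```python
from collections import defaultdict

MAX_REPORTS_PER_TEAM = 3  # from the rules above


def accept_reports(submissions):
    """Keep at most three reports per team, in submission order.

    `submissions` is a list of (team, report_id) tuples (hypothetical shape).
    Keeping the first three and ignoring later ones is an assumption.
    """
    counts = defaultdict(int)
    accepted = []
    for team, report_id in submissions:
        if counts[team] < MAX_REPORTS_PER_TEAM:
            counts[team] += 1
            accepted.append((team, report_id))
    return accepted


def team_totals(accepted, ratings):
    """Sum the judges' ratings per team.

    `ratings` maps report_id -> (precision, mitigation). Adding the two
    ratings with equal weight is an assumption made for illustration.
    """
    totals = defaultdict(int)
    for team, report_id in accepted:
        precision, mitigation = ratings[report_id]
        totals[team] += precision + mitigation
    return dict(totals)
```

A panel with different priorities could swap the sum for a weighted score without touching the cap logic.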

Results

The chaos engineering game day event received stellar feedback. Overall, it helped build the DataDog expertise and reinforced DevOps practices within the company. Some examples of identified areas for improvement are described below.


Investigation traces

A look into the investigation traces showed that there was room for improvement in the reports. Some didn’t provide enough information for a peer to confirm the problem. Good investigation notes contain the following elements:

  • Timeline

  • Screenshot or link to the symptoms of the issue

  • Cause, if found

Some investigation traces only showed high-level configuration changes, without any proof of research or link to the reported symptoms.
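One way to make these elements hard to skip is to give the report a fixed structure. The following is a minimal sketch under assumptions: the class name, field names, and completeness check are hypothetical and were not part of the event's actual report format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class InvestigationNote:
    """Hypothetical record for one incident report (names are illustrative)."""

    timeline: List[str] = field(default_factory=list)        # ordered, timestamped observations
    evidence_links: List[str] = field(default_factory=list)  # screenshots or links to the symptoms
    cause: Optional[str] = None                               # root cause, if found

    def missing_elements(self) -> List[str]:
        """List the elements a peer would still need to confirm the problem."""
        missing = []
        if not self.timeline:
            missing.append("timeline")
        if not self.evidence_links:
            missing.append("evidence")
        # The cause is optional ("if found"), so it is never flagged as missing.
        return missing
```

A note that records only a configuration change, with no timeline or evidence, would be flagged on both counts before it reaches the judging panel.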

Reported problems

The scoring criteria encouraged teams to report “harder to spot” issues. Even so, an analysis of the incident reports showed that the most visible problems were also the most reported: the MySQL and AuthZ outages accounted for 48% of reported issues.

The issues introduced to RabbitMQ might have been spotted, but no team reported them.

Problem cause

Not surprisingly, participants reported on problems that they could explain. In fact, in 47% of the incident reports, the team had identified the exact change that was introduced.

Make it your own

Hosting chaos engineering game days is not new; technology leaders like Amazon recommend regularly running some with your incident response teams.

“40% of companies will adopt chaos engineering as part of their DevOps initiatives in 2023, reducing unplanned downtime by 20%.” (Gartner)

As you plan your own game day, don’t hesitate to enter uncharted territory and build something that suits your needs. An organization facing performance issues could organize an event where teams use load-testing tools to degrade target micro-services. Don't rely exclusively on documentation to evolve your organization's practices and culture. Instead, add a game day to your toolbox.

Have you hosted your own engineering game day? Connect with us on LinkedIn and share what worked best for you and your team. Or check out how we improved our platform performance by replacing our message broker and moving from RabbitMQ to Kafka.

We'd also like to provide a special thanks to Jean-Philippe Boudreault for his contributions in writing this blog with Juan.


Sources:

  1. https://www.frontiersin.org/articles/10.3389/fpsyg.2021.519729/full
  2. https://aws.amazon.com/blogs/architecture/chaos-engineering-in-the-cloud/