Chapter 5. Real Breaking

This time, it didn’t take four minutes to notice. Almost immediately, all of the applications failed. Game day had gone off-script, and the engineers suddenly realized it. The API team had spent the previous two weeks ensuring that the software could handle a master dying by seamlessly failing over into a read-only state. And yet, as soon as the master failed, our entire infrastructure went with it.

The master database failure was handled exactly as it should have been by the core API, but the identity service used the database directly (oh, technical debt, you take so little time to become costly). Instead of the planned switchover to read-only, an immediate failure spread across all of our applications. This was about the point where the line between reality and FAILARP-ing (live action role playing) really started to blur. Systems failing in unexpected ways bring out the same visceral mixture of fear, anger, and futility that a real incident does.

In the backchannel, we decided to bring the master back up. Identity was such a core service that there was no point in testing anything beyond what we had tested with the master down. We noted that certain clients would need to handle a downed identity server in a better way, and that our identity server would need to grow far more fault tolerant in the immediate future. The feeling of preparedness was gone.

Rather than just fix the permission group on the master, we promoted a replica so that we could see what that looked like in a controlled environment, and to better simulate reality; sometimes databases that die stay dead. As it turned out, if we hadn’t tested it that way, we wouldn’t have known that promoting a replica on RDS breaks replication immediately and leaves you with a single master, which you then need to stress by taking a backup in order to create a new replica. That turns out to be incredibly important information to have: if you know this occurs, you can disable endpoints to shed load and make sure that you don’t immediately kill your shiny new master.

We went on to simulate several other scenarios that mimicked real-life failures we had seen before: replicas flapping between available and unavailable (simulated by revoking access to them, then reinstating it), a broken caching layer, full disks (simulated by revoking write privileges), and human error. The easiest human error to simulate is skipping a piece of a process while saying that you did it: for example, starting replication.
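The flapping-replica drill can be captured in a toy harness like the one below (the class and method names are ours, purely for illustration): revoking access makes every query raise, reinstating it restores service, so client retry and failover logic can be exercised without touching a real database.

```python
class SimulatedReplica:
    """Stand-in for a read replica whose access can be flapped."""

    def __init__(self):
        self.access_revoked = False

    def revoke_access(self):
        # The game-day "flap down": equivalent to revoking DB grants.
        self.access_revoked = True

    def reinstate_access(self):
        # The "flap up": grants restored, queries succeed again.
        self.access_revoked = False

    def query(self, sql):
        if self.access_revoked:
            raise ConnectionError("replica unreachable: access revoked")
        return "rows for %r" % sql
```

Pointing a client's retry logic at an object like this lets you rehearse the flap repeatedly and deterministically, which a real revoke-and-reinstate cycle does not.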

In a four-to-five-hour torrent of breaking things, we had worked through our list of things we would be testing, and everyone was exhausted. In the game day incident response channel, we told the team that their nightmare was over, thanked everyone, and asked them to gather their notes and compile them per workstream. At the same time, in the backchannel, we decided to test the fallbacks. Almost all of our fallbacks for database failure relied on writing to an SQS queue for later processing. For many systems, SQS was the thing we relied on in case of failure.

So we killed it.

One or two of our stacks were deliberately set up to be decoupled: they wrote to a series of SQS queues that a separate process would later write to the database. They had handled all of the tests we had thrown at them that day because they had been built from day one to operate without a database. So we turned off SQS (in reality, what we did was modify /etc/hosts on all of the boxes to point the queue's hostname at a nonsense address). Even given the "all clear," appropriate logging and alerting were in place, and our final deceitful breakage of the day was found almost immediately.
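The hosts-file blackhole would have looked something like the fragment below. The hostname shown is AWS's regional SQS endpoint and the address is from the reserved TEST-NET range; both are illustrative, not the team's actual values.

```
# /etc/hosts on each application box, for the duration of the drill
192.0.2.1    sqs.us-east-1.amazonaws.com    # game day: blackhole SQS
```

Because hosts-file entries take precedence over DNS, every connection attempt to the queue fails fast without touching SQS itself, and removing the line restores service instantly.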

In reality, there was not much that we could do in the case of an SQS failure, but we now knew how the applications reacted to one, and we could fall back to logging or otherwise staging the data writes if need be.
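That fallback can be sketched in a few lines. This is a hedged illustration under our own assumptions: `send_to_queue` stands in for whatever client call the application makes, and the journal path is hypothetical. Try the queue first; if the endpoint is unreachable (as it was once the hostname was blackholed), stage the write in a local journal file for later replay instead of dropping it.

```python
import json

def write_event(event, send_to_queue, journal_path):
    """Send an event to the queue, journaling it locally on failure."""
    try:
        send_to_queue(json.dumps(event))
        return "queued"
    except OSError:  # ConnectionError and other socket-level failures
        # Stage the write locally; a replay job can drain the journal
        # into the queue once it is reachable again.
        with open(journal_path, "a") as journal:
            journal.write(json.dumps(event) + "\n")
        return "journaled"
```

The journal is append-only JSON lines, so the replay job only has to read the file in order and re-send each line.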