Network Partition Processing Delay Postmortem

This is a blog about the development of Yeller, The Exception Tracker with Answers

Read more about Yeller here

Hi,

Tuesday and Wednesday last week were pretty rough. Yeller underwent some severe exception processing delays - in a few cases up to 7 hours. I am very sorry this issue happened; providing a fast and stable exception tracker is my number one priority. There was planned maintenance at the datacenter Yeller uses, which ended up causing several network partitions in the cluster of servers that runs Yeller.

As far as I can tell, no exceptions were dropped during this time, and Yeller remained available to all customers, but there were delays in processing exceptions of up to 6-7 hours in some cases. It was also impossible to change user account and billing data for the duration, which sucks. However, the site remained up, you could still read exceptions, and they were still being recorded.

What Went Wrong?

On Tuesday at 1100 UTC, my hosting provider did some planned maintenance to move Yeller’s servers onto newer network infrastructure. After the maintenance was completed (at 1200 UTC), there were three distinct partitions in Yeller’s internal network.

I immediately contacted my hosting provider, and they opened a ticket on my behalf (my notes say that phone call finished at 1206 UTC), noting that they would investigate network connectivity. In the meantime, to rule out the issue being on my side, I took several steps: restarting my firewall software and mitigating the impact of the network partition as much as possible. The IP addresses remained the same, so I was convinced that the issue was on their end.

After a lot of investigation and debugging attempts spanning the next 17 hours or so (until 0804 UTC), my hosting provider suggested that I try restarting the cluster. I had not undertaken this step earlier because I was convinced that the issue was on their end rather than mine.

After a careful rolling restart of the cluster (done one node at a time, to minimize customer impact), the network partition was resolved at 0930 UTC. It is unclear at this time why the restart solved the partition - I suspect cached routes at some level, since the IP addresses involved didn’t change. Because the restart also eliminated those caches, it is (as far as I can tell) near impossible to determine the root cause at this point.
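
For illustration, here is a minimal sketch of what a rolling restart like this looks like. The helpers (remove-from-lb!, restart-node!, add-to-lb!, healthy?) are hypothetical stand-ins, not Yeller’s actual tooling:

```clojure
;; Illustrative sketch: remove-from-lb!, restart-node!, add-to-lb!, and
;; healthy? are hypothetical helpers, not Yeller's actual tooling.
(defn wait-until-healthy
  "Polls `healthy?` for `node` until it returns true, giving up after
   `max-attempts` polls spaced `interval-ms` apart."
  [healthy? node max-attempts interval-ms]
  (loop [attempt 1]
    (cond
      (healthy? node)           true
      (>= attempt max-attempts) false
      :else (do (Thread/sleep interval-ms)
                (recur (inc attempt))))))

(defn rolling-restart!
  "Restarts each node in turn, never taking more than one node out of
   the load balancer at a time, so the cluster keeps serving traffic."
  [{:keys [remove-from-lb! restart-node! add-to-lb! healthy?]} nodes]
  (doseq [node nodes]
    (remove-from-lb! node)   ; drain traffic away from the node first
    (restart-node! node)
    (when-not (wait-until-healthy healthy? node 30 10000)
      (throw (ex-info "node did not come back healthy" {:node node})))
    (add-to-lb! node)))      ; only then put it back in rotation
```

The key property is that at most one node is ever out of rotation, which is why the site stayed up throughout.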

What Impact Did It Have?

There were two main impacts on customers. The first was processing delays on exactly 42 exceptions. Some of these exceptions took up to 7 hours to be processed and show up for users. These exceptions ran into Riak write timeouts, and had to be manually reprocessed by me. The second impact was that customers could not sign up for new accounts, change passwords, reset passwords, change billing, or invite users to projects. That is because all of those actions require writes to the data store Yeller uses for user/billing information, and the network partition left that store unable to perform writes.
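
As background on why a partition causes Riak write timeouts like the ones behind the first impact: in a quorum-replicated store such as Riak, a write must be acknowledged by at least W of N replicas, and if a partition leaves fewer than W replicas reachable, the write stalls until it times out. A minimal sketch of that arithmetic (the replica counts here are illustrative, not Yeller’s configuration):

```clojure
;; Quorum arithmetic sketch. The replica counts are illustrative
;; defaults, not Yeller's actual configuration.
(defn write-can-succeed?
  "A quorum write needs acknowledgements from at least `w` of the `n`
   replicas; if a partition leaves fewer than `w` reachable, the write
   blocks and eventually times out."
  [{:keys [n w]} reachable-replicas]
  (>= reachable-replicas w))

(comment
  ;; healthy cluster: all 3 replicas reachable, quorum of 2 is met
  (write-can-succeed? {:n 3 :w 2} 3) ;=> true
  ;; partitioned off with only 1 replica reachable: writes time out
  (write-can-succeed? {:n 3 :w 2} 1) ;=> false
  )
```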

Where Does Yeller Go From Here?

  1. I need to get better at outage communication. Yeller had nearly 19 hours of delayed processing, and I only posted a few status updates in that time. Next time, I’ll set a recurring alarm on my phone to post customer updates on a regular schedule - even if the update is just “no changes, I’m still working to resolve the issue”. Keeping customers informed during problems is a crucial task, and I clearly failed at it here. As such, I’ve moved Yeller’s status page to statuspage.io - you can see the new status page here: http://www.yellerappstatus.com. I’ve also hooked statuspage.io up to my internal chat bot so that outage communication is more convenient in the future. Lastly, Yeller now displays the system status at the bottom of every page in the web UI (once you’re logged in), so you can tell if it has an issue.
  2. The processing delays caused by my having to manually reprocess exceptions need to be eliminated. I’m investigating improved retry logic so that if Yeller sees a partition like this in the future, the system won’t need human intervention to process exception writes that time out (a sketch of the idea follows this list).
  3. Whilst rebooting the entire cluster did fix the issue, it also prevented finding the root cause, as whatever corrupt machine state caused the issue was wiped. In future outages, I will leave at least one machine with the bad state running (but removed from the load balancers), so that the root cause can be better determined.
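
As a sketch of the retry logic mentioned in point 2: exponential backoff around a write that may time out during a partition. Here write-exception! is a hypothetical stand-in for the real Riak write, and the delays are illustrative:

```clojure
;; Sketch of retry-with-backoff around a write that can time out during
;; a partition. `write-exception!` is a hypothetical stand-in for the
;; real Riak write; the retry counts and delays are illustrative.
(defn with-retries
  "Calls `f`, retrying on exceptions with exponential backoff.
   Gives up (and rethrows) after `max-retries` failed attempts."
  [f max-retries base-delay-ms]
  (loop [attempt 0]
    (let [result (try
                   {:ok (f)}
                   (catch Exception e
                     (if (< attempt max-retries)
                       {:retry e}
                       (throw e))))]
      (if (contains? result :ok)
        (:ok result)
        (do (Thread/sleep (* base-delay-ms (bit-shift-left 1 attempt)))
            (recur (inc attempt)))))))

(comment
  ;; retry the write up to 8 times: 100ms, 200ms, 400ms... between tries
  (with-retries #(write-exception! conn exception) 8 100))
```

Backoff like this rides out short partitions on its own; writes that still fail after the last attempt would land in a queue for replay rather than on a human.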

What Went Well?

  1. On the alerting and monitoring front, I was paged within seconds of the network partition occurring. I was already watching the cluster because of the planned maintenance, but it was good confirmation that Yeller’s monitoring and alerting systems can detect this kind of issue (a sketch of this kind of check follows this list).
  2. I remain confident in my choice of software for queueing and data storage. That Yeller didn’t (as far as I can tell) lose any exceptions during 18 hours of a 4-way network partition (with some manual replaying of Riak write timeouts) is something I’m proud of, and want to keep as a track record.
  3. I am also happy with my choice of Datomic for user data storage - whilst Yeller couldn’t change user data during this partition, the website remained available for reads, and users could still see their exceptions.
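
As an illustration of the kind of check that catches a partition within seconds (point 1 above): each node tries to open a TCP connection to its peers and pages when one is unreachable. This is a sketch, not Yeller’s actual monitoring stack; the hostnames and ports are made up:

```clojure
;; Partition-detection sketch. The peer hostnames and ports are made up;
;; this is not Yeller's actual monitoring stack.
(import '(java.net Socket InetSocketAddress))

(defn reachable?
  "True if a TCP connection to host:port succeeds within timeout-ms."
  [host port timeout-ms]
  (try
    (with-open [sock (Socket.)]
      (.connect sock (InetSocketAddress. host (int port)) timeout-ms)
      true)
    (catch java.io.IOException _ false)))

(defn check-peers
  "Returns the subset of peers this node can't reach; paging on a
   non-empty result is how a partition shows up within seconds."
  [peers]
  (remove (fn [{:keys [host port]}] (reachable? host port 1000)) peers))

(comment
  (check-peers [{:host "riak-1.internal" :port 8087}
                {:host "riak-2.internal" :port 8087}]))
```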

Summary

I know you depend on Yeller, and I’m going to continue to work hard to live up to that. Incidents like the one Yeller experienced are not fun for us, and worse, they violate your trust in Yeller and hurt your ability to get your work done. I will use the lessons learned from this outage to help prevent issues like this from happening in the future.

Thank you for your patience throughout this issue.

Tom Crayford, Founder at Yeller

Many thanks to Bruce Spang, Coda Hale and Daniel Schauenberg for their review and suggestions regarding this postmortem. Their help was invaluable.
