No Exceptions Left Behind

This is a blog about the development of Yeller, The Exception Tracker with Answers

Read more about Yeller here

Yeller does one thing in particular that’s vastly different from many other exception tracking tools:

Yeller will never throttle you at the HTTP level, and no unique exception will ever be missed entirely

That is, Yeller says No Exception Left Behind. That’s not an absolute guarantee, because it’s a very difficult problem to get right, but Yeller works very hard to get as close to it as possible.

I think this is a real leap forward for exception trackers, as so many of them drop stuff on the floor (which you’ll notice if you look through your logs). But why is it important?

Imagine your boss calls you one day and tells you:

We just made %IMPORTANTCUSTOMER% mad, because the billing page errored at them when they were about to pay us thousands of dollars

You feel awful, and you want to understand the issue so that you can prevent it happening in the future. But it’s not in your error tracker. Why? Oh, because something else (that was much less important) broke on your site at the same time, and your error tracker rate limited you and dropped that particular error report on the floor.

That sucks.

So, Yeller has been carefully constructed from the ground up to Leave No Exception Behind. It does roll up data, but only duplicates: a unique exception is never dropped, so you can always see it.
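To make "roll up duplicates, keep every unique exception" concrete, here is a minimal sketch. It is not Yeller's actual implementation: the `fingerprint` function, the `Rollup` class, and the report fields (`type`, `frames`) are all hypothetical names chosen for illustration, and grouping by a hash of the error type plus stack frames is just one plausible way to decide what counts as a duplicate.

```python
import hashlib
from collections import defaultdict


def fingerprint(error_type, stack_frames):
    """Hypothetical grouping key: a hash of the error type plus its stack frames."""
    raw = "\n".join([error_type, *stack_frames])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class Rollup:
    """Keeps one full sample per unique exception and only counts the duplicates."""

    def __init__(self):
        self.first_seen = {}            # fingerprint -> first full report
        self.counts = defaultdict(int)  # fingerprint -> number of occurrences

    def record(self, report):
        key = fingerprint(report["type"], report["frames"])
        # A fingerprint we have not seen before is always stored in full.
        self.first_seen.setdefault(key, report)
        # Every occurrence, duplicate or not, bumps a counter.
        self.counts[key] += 1


rollup = Rollup()

# A noisy, unimportant error fires ten thousand times...
noisy = {"type": "TimeoutError", "frames": ["search.py:12", "http.py:88"]}
for _ in range(10_000):
    rollup.record(noisy)

# ...and the one billing error that matters arrives in the middle of the flood.
billing = {"type": "BillingError", "frames": ["billing.py:40", "pay.py:7"]}
rollup.record(billing)

billing_key = fingerprint(billing["type"], billing["frames"])
noisy_key = fingerprint(noisy["type"], noisy["frames"])
```

In this sketch the flood of timeouts costs one stored report and a counter, while the single billing error is still there in full. That is the property that matters in the boss-calls-you scenario above.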

How does it do it?

Yeller took a bunch of design work to get to a state where not rate limiting exceptions at the HTTP layer was possible. Here are the key components:

Lean Hard On Distributed Systems

The first and most obvious thing Yeller does is lean hard on distributed systems. Building an exception tracker that doesn’t drop anything means that every component must be able to survive the death or partition of a node without dropping exceptions. Yeller’s client libraries retry heavily, so that they can survive the death of the API servers they’re talking to.
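As a rough illustration of what client-side retries buy you, here is a sketch of a sender that rotates through a list of servers and backs off between attempts. Everything in it is an assumption made for the example: the `send_with_retries` function, its parameters, and the `.invalid` hostnames are hypothetical, and Yeller's real client libraries may retry quite differently.

```python
import time


def send_with_retries(report, servers, send, attempts=6, base_delay=0.1,
                      sleep=time.sleep):
    """Try servers in rotation, backing off exponentially between failures.

    `send(server, report)` is a caller-supplied function (an assumption of
    this sketch) that raises ConnectionError when the server is unreachable.
    Returns the server that accepted the report.
    """
    last_error = None
    for attempt in range(attempts):
        # Rotate through the list so a dead node does not block delivery.
        server = servers[attempt % len(servers)]
        try:
            send(server, report)
            return server
        except ConnectionError as err:
            last_error = err
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    raise RuntimeError("all attempts failed") from last_error


# Simulate one dead API server and one healthy one.
delivered = []


def flaky_send(server, report):
    if server == "api1.example.invalid":
        raise ConnectionError("node is down")
    delivered.append((server, report))


accepted_by = send_with_retries(
    {"type": "BillingError"},
    ["api1.example.invalid", "api2.example.invalid"],
    flaky_send,
    sleep=lambda seconds: None,  # skip real sleeping in the demo
)
```

The report still arrives even though the first server is down. A real client would also need to bound how many reports it buffers while retrying, which this sketch leaves out.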

Using “High Availability” distributed systems technologies is more difficult than just shoving everything in a SQL database, but the tradeoffs are worth it, given this hard rule of not dropping exceptions.

Fast VM/Runtime

The second thing follows directly from leaving no exceptions behind: Yeller has to burn CPU and network on every single exception. That completely ruled out less efficient languages and runtimes. Furthermore, the language and runtime selected had to have great performance tooling.

The only language I was really happy in that fit the bill at the time the decision was made was Clojure, on the JVM. And it’s paid off: Yeller’s 99th percentile response time on the API handler is a screaming 4 milliseconds. That means only 1% of requests see a slower latency than that (that’s measured internally, so it doesn’t include WAN transit time). On the throughput side of things, Yeller’s API servers can handle tens of thousands of requests per second per VM. Not something that’s possible in your average monorail.
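If percentiles are unfamiliar, this small sketch shows what a 99th percentile figure means. It uses the nearest-rank definition, which is one of several common ones, and the sample latencies are made up for illustration. They are not Yeller's measurements, and this is not how Yeller computes its metrics.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100.

    Returns the smallest sample such that at least pct% of all samples
    are less than or equal to it.
    """
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Made-up data: 1000 requests, 10 of which (exactly 1%) are slow.
one_percent_slow = [1.0] * 990 + [50.0] * 10
# One more slow request tips the 99th percentile over.
just_over = [1.0] * 989 + [50.0] * 11

p99_ok = percentile(one_percent_slow, 99)
p99_bad = percentile(just_over, 99)
```

With exactly 1% of requests slow, the 99th percentile is still the fast value. As soon as slightly more than 1% are slow, it jumps to the slow value. That is why a low 99th percentile is a stronger claim than a low average.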

Testing

Putting careful work into the design of a system is all well and good, but it can’t tell you if you can actually meet your goals in a production environment. Unit tests don’t help, nor do benchmarks run on your fucking laptop. Here’s what I did instead:

Months of stress testing, in production

I plan to run Yeller for years to come. As such, doing small amounts of stress testing for short periods of time doesn’t really tell me all that much about the long term feasibility of the system, or its capacity to reach the goals that are set out. Kelly Sommers has a great blog post about responsible benchmarking here, and many of the same things apply to load and stress testing.

Every day, for weeks and months on end, I would try and break Yeller. What happens if I partition this machine off from the cluster? What happens if I throw hundreds of thousands of connections at it? Does the scheme for rolling up old history cap disk usage effectively?

Now, that’s all with somewhat synthetic data, and predicting what real users are going to do is basically impossible, but I feel so much more confident having gone through a period of intense testing that Yeller can hold up to real load. I absolutely refuse to ship stuff that might break under real loads, just in the name of “getting something out of the door”.

Incuriosity Kills The Infrastructure

Even with testing, careful design, and a whole bunch of work, stuff still changes. It’s not like I stress tested the system, and that’s it, code freeze forever. Nope.

So, I strongly believe in metrics and monitoring, and getting out ahead of infrastructure problems. I love the phrase

Incuriosity Killed the Infrastructure

as popularized by Boundary. It’s important to remain curious and adaptable to what happens, and to try and explore problems proactively. Otherwise they wake you up at 3am and disappoint your customers.

Always working, always improving

This all remains a huge work in progress, as I ship more features and help out more customers. There remain some hard constraints on No Exception Left Behind, like the bandwidth I get from my hosting provider, plus any potential bugs in the software (Yeller hasn’t seen a single bug in the exception processing and receive code for months, but that doesn’t mean that there aren’t any left).

Tired of an exception tracker that drops your important notifications on the floor?

You should switch to Yeller. It’s built from the ground up to not lose your data, and give you detailed analysis on your exceptions.
