Quick Win: Measure Success as well as Failure

This is a blog about the development of Yeller, The Exception Tracker with Answers

Read more about Yeller here

The very first time I added a metric to a codebase, it measured failure rates. The code looked something like this:

def do_a_thing
  really_do_the_thing
rescue SomeExpectedTimeoutException
  Statsd.count('myapp.do_a_thing.exceptions.some_timeout')
end

In code review, a more experienced engineer asked:

Why do we only care about the failure case here? What happens if throughput goes up 10x and the error rate goes up in proportion?

I had made a mistake that is fairly typical of developers who are new to adding metrics to their systems.

If you only track failure rates, and nothing about successes, you have no baseline to judge those failures against.

  • You’ll continually wonder why the failure rate jumped, then realize that traffic may have jumped at the same time.
  • You won’t have a good number to alert on. An app doing 200k requests/second that errors on 200 requests/second is probably OK (depending on the domain). An app doing 300 requests/second that errors on 200 requests/second is very broken. But your app can move between those traffic levels, and without tracking successes you won’t know whether “200 requests a second fail” means badly broken or just normal.
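To make the alerting point concrete, here is a minimal sketch using the two apps from the example above. The helper names and the 1% threshold are hypothetical, chosen only for illustration; pick a threshold that suits your own domain.

```ruby
# Hypothetical helper: share of requests that failed, as a percentage.
def error_percentage(errors_per_second, requests_per_second)
  return 0.0 if requests_per_second.zero?

  100.0 * errors_per_second / requests_per_second
end

# Hypothetical alert rule: fire once more than threshold_percent of requests fail.
# The 1% default is an assumption, not a recommendation.
def alert?(errors_per_second, requests_per_second, threshold_percent: 1.0)
  error_percentage(errors_per_second, requests_per_second) > threshold_percent
end

# Same absolute error rate, very different health.
busy_app_alerts  = alert?(200, 200_000) # 200 errors out of 200k requests/second
quiet_app_alerts = alert?(200, 300)     # 200 errors out of 300 requests/second

puts "busy app alerts:  #{busy_app_alerts}"
puts "quiet app alerts: #{quiet_app_alerts}"
```

An alert on the raw error rate cannot tell these two apps apart, because both report 200 errors a second.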

The fix is very easy. Track successes as well as failures:

def do_a_thing
  result = really_do_the_thing
  Statsd.count('myapp.do_a_thing.success')
  result
rescue SomeExpectedTimeoutException
  Statsd.count('myapp.do_a_thing.exceptions.some_timeout')
end

Then, plot a percentage

Many graphing tools (I like Graphite and Riemann) make it easy to divide the failure count by the total of successes and failures, producing a failure percentage, which is vastly more useful than a raw failure rate. Failure percentages will help you diagnose and alert far better than a rate will, and they are trivial to derive from a success rate and a failure rate.
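The arithmetic behind that graph is small enough to sketch in Ruby. The function name is hypothetical, and the sketch assumes the two counters cover the same time window and that every call ends up in exactly one of them.

```ruby
# Hypothetical helper: derive a failure percentage from the two counters.
# Assumes successes and failures were counted over the same time window.
def failure_percentage(successes, failures)
  total = successes + failures
  return 0.0 if total.zero? # no traffic means nothing failed

  100.0 * failures / total
end

# Counts taken from the 'success' and 'exceptions.some_timeout' metrics.
healthy = failure_percentage(199_800, 200)
broken  = failure_percentage(100, 200)

puts format('healthy: %.2f%% failed', healthy)
puts format('broken:  %.2f%% failed', broken)
```

Note that the denominator is successes plus failures, not successes alone. Dividing failures by successes gives a ratio that grows without bound as the system degrades, which is harder to read and to alert on. In Graphite, the `asPercent` function can do this kind of division at graph time.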

This technique is especially important with retries and timeouts. You really want to know if a system is retrying 90% of its requests to some subsystem, or timing out 60% of the time. Even if the system is still working in those cases, you’ll likely get a big win from bringing those failure percentages down to something more reasonable.
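As a rough sketch of what that might look like for retries, the wrapper below counts attempts, successes, timeouts and retries, so that a retry percentage can be derived later. Everything here is hypothetical: the `COUNTS` hash and `record` method stand in for a real metrics client so the example runs on its own, and the metric names, the `SubsystemTimeout` class and the `with_retries` helper are made up for illustration.

```ruby
# In-memory stand-in for a metrics client (hypothetical), so this runs on its own.
COUNTS = Hash.new(0)

def record(metric)
  COUNTS[metric] += 1
end

# Hypothetical exception raised when the subsystem takes too long.
class SubsystemTimeout < StandardError; end

# Runs the block, retrying on SubsystemTimeout up to max_attempts times in total.
# Every attempt is counted, not only the ones that fail.
def with_retries(max_attempts: 3)
  attempts = 0
  begin
    attempts += 1
    record('myapp.subsystem.attempts')
    result = yield
    record('myapp.subsystem.success')
    result
  rescue SubsystemTimeout
    record('myapp.subsystem.timeouts')
    if attempts < max_attempts
      record('myapp.subsystem.retries')
      retry
    end
    nil # out of attempts: give up
  end
end

# Simulate a subsystem that times out on its first call only.
calls = 0
value = with_retries do
  calls += 1
  raise SubsystemTimeout if calls == 1

  :ok
end

retry_percentage =
  100.0 * COUNTS['myapp.subsystem.retries'] / COUNTS['myapp.subsystem.attempts']
puts "result: #{value}, retried on #{retry_percentage}% of attempts"
```

The caller still gets its result here, which is exactly the situation where a retry percentage is the only visible sign that the subsystem is struggling.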
