Quick Win: Measure Success as well as Failure
This is a blog about the development of Yeller, The Exception Tracker with Answers
The very first time I added a metric to a codebase, it measured failure rates. The code looked something like this:
def do_a_thing
  really_do_the_thing
rescue SomeExpectedTimeoutException
  Statsd.count('myapp.do_a_thing.exceptions.some_timeout')
end
Code review from a more experienced engineer asked:
Why do we only care about the failure case here? What happens if throughput goes up 10x and the error rate goes up in proportion?
I had made a mistake that's typical of developers who are new to adding metrics to their systems.
If you only track failure rates, and nothing about successes, you’re missing out.
- You’ll continually wonder why the failure rate jumped, then realize that maybe traffic jumped at the same time.
- You won’t have a good number to alert on. An app doing 200k requests/second that errors on 200 requests/second is probably ok (depending on domain). An app doing 300 requests/second that errors on 200 requests/second is very broken. But your app can jump between those rates, and without tracking success, you won’t know if the error rate of “200 requests a second fail” is super broken or just normal.
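To make the arithmetic in that second point concrete, here is a small sketch using the numbers above. The helper name `error_percentage` is made up for illustration and isn't part of any metrics library.

```ruby
# Error percentage = errors / total requests * 100, rounded to one decimal.
# (error_percentage is a hypothetical helper, named here for illustration.)
def error_percentage(errors_per_second, requests_per_second)
  (errors_per_second.to_f / requests_per_second * 100).round(1)
end

# Same absolute error rate, very different health.
busy_app  = error_percentage(200, 200_000)
quiet_app = error_percentage(200, 300)

puts "busy app:  #{busy_app}% of requests fail"
puts "quiet app: #{quiet_app}% of requests fail"
```

The busy app fails about 0.1% of its requests and the quiet one about 66.7%, even though both report "200 errors a second".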
The fix is very easy. Just track successes as well as failures:
def do_a_thing
  result = really_do_the_thing
  Statsd.count('myapp.do_a_thing.success')
  result
rescue SomeExpectedTimeoutException
  Statsd.count('myapp.do_a_thing.exceptions.some_timeout')
end
Then, plot a percentage
Many graphing tools (I like Graphite and Riemann) make it easy to combine these two metrics into a failure percentage: failures divided by the total of successes and failures. That percentage is vastly more useful than a raw failure rate. It will help you diagnose and alert far better than a rate will, and it’s trivial to derive from a success rate and a failure rate.
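If your graphing tool can't do the division for you, the calculation is simple enough to sketch by hand. This is a minimal sketch, assuming you already have success and failure counts for each time window. The `failure_percentage` helper and the sample counts are hypothetical, not something from a real dashboard.

```ruby
# Failure percentage for one time window, derived from two counters.
# Returns nil when there was no traffic, so an idle window is not
# reported as either 0% or 100% failing.
# (failure_percentage is a hypothetical helper for illustration.)
def failure_percentage(successes, failures)
  total = successes + failures
  return nil if total.zero?
  100.0 * failures / total
end

# Made-up per-minute counts: [successes, failures]
windows = [[990, 10], [9_900, 100], [100, 200], [0, 0]]

windows.each do |successes, failures|
  pct = failure_percentage(successes, failures)
  puts pct.nil? ? "no traffic" : format("%.1f%% failed", pct)
end
```

The first two windows differ by 10x in traffic and in raw failures, but the percentage stays flat at 1.0%. Only the third window, where two thirds of calls fail, would be worth an alert.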
This technique is especially important with retries and timeouts - you really want to know if a system is retrying 90% of its requests to some subsystem, or timing out 60% of the time. Even if the system’s still working in those cases, you’ll likely get a big win from bringing those failure percentages down to something more reasonable.
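Here is one way the retry case could be instrumented. This is a sketch, not code from a real system: the in-memory `COUNTS` hash stands in for a metrics client, and `SubsystemTimeout`, `with_counted_retries`, and the metric names are all invented for the example.

```ruby
# In-memory stand-in for a metrics client (hypothetical, for illustration).
COUNTS = Hash.new(0)

def count(metric)
  COUNTS[metric] += 1
end

# Hypothetical error type standing in for a subsystem timeout.
class SubsystemTimeout < StandardError; end

# Calls the block, retrying on timeout up to max_attempts times in total.
# Every attempt, timeout, retry, and final outcome is counted, so retry and
# timeout percentages can be derived from the attempts counter.
def with_counted_retries(name, max_attempts: 3)
  attempts = 0
  begin
    attempts += 1
    count("#{name}.attempts")
    result = yield
    count("#{name}.success")
    result
  rescue SubsystemTimeout
    count("#{name}.timeouts")
    if attempts < max_attempts
      count("#{name}.retries")
      retry
    end
    count("#{name}.gave_up")
    raise
  end
end

# Usage: a call that times out twice, then succeeds on the third attempt.
calls = 0
value = with_counted_retries('myapp.subsystem') do
  calls += 1
  raise SubsystemTimeout if calls < 3
  :ok
end

puts value
puts COUNTS.inspect
```

Dividing the `timeouts` counter by `attempts` gives the timeout percentage, and `retries` over `attempts` gives the retry percentage, which are the numbers you'd want on a graph.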