Logs are Liars

This is a blog about the development of Yeller, The Exception Tracker with Answers

Read more about Yeller here

I’ve seen a good number of outages extended by an engineer over-emphasizing something odd we saw in the log

Jeff Hodges (author of Notes on Distributed Systems for Young Bloods)

Regular exceptions get written to my log files. That is, candidly speaking, where data goes to die

Patrick McKenzie, Prolific Hacker News Commenter

How many times have you dug into production log files to fix a thing, only to emerge hours later, bleary-eyed, with no progress in sight? How many production outages have gone on for longer than they needed because you Sherlocked a solution to a problem that wasn’t even the one causing the outage?

These problems are all caused by the same thing:

Logs are Liars

Production logs can be great, and they’re very useful to have. But they get quite painful if not used well. There are three main problems with using log files for debugging:

Log files get incredibly noisy

Log files, as normally encountered, don’t have any notion of frequency. This is by design: they just log a single event at a time. However, that can really suck, because common, well-understood “errors” typically get overlogged.

One simple example of this that I’ve seen again and again is Apache Zookeeper logging when you try to create a key that already exists. Creating keys that already exist (and then going for an update when the create fails) is very common in many uses of Zookeeper. Developers or operators who aren’t used to this can jump to conclusions about what’s causing an outage just because “that error was all over the log files”.
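To make the frequency point concrete, here’s a rough Python sketch (not taken from any real tool) that collapses log lines into templates and compares an outage window against a normal one. The log line formats, the `normalize` and `suspicious` names, and the 3x ratio are all assumptions made up for illustration, and it assumes both windows cover the same length of time.

```python
import re
from collections import Counter


def normalize(line: str) -> str:
    """Replace variable parts (hex ids, numbers) so similar lines collapse together."""
    line = re.sub(r"0x[0-9a-fA-F]+", "<hex>", line)
    line = re.sub(r"\d+", "<n>", line)
    return line.strip()


def suspicious(baseline_lines, outage_lines, ratio=3.0):
    """Return (template, baseline_count, outage_count) for lines that are new or spiking.

    A template is reported only if it shows up more than `ratio` times as often
    during the outage as it did in the baseline window. Templates absent from the
    baseline have a baseline count of 0, so any occurrence is reported.
    """
    base = Counter(normalize(line) for line in baseline_lines)
    now = Counter(normalize(line) for line in outage_lines)
    return [(t, base[t], n) for t, n in now.most_common() if n > ratio * base[t]]


if __name__ == "__main__":
    # Hypothetical log lines, not real Zookeeper output.
    normal_day = ["NodeExists for /locks/1"] * 10
    outage = ["NodeExists for /locks/2"] * 11 + ["Connection refused to db-3"] * 2
    for template, before, during in suspicious(normal_day, outage):
        print(f"{template}: {before} -> {during}")
```

The noisy “key already exists” line drops out because it is just as noisy on a normal day. A line being “all over the log files” only means something if it wasn’t also all over them yesterday.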

Searching for the one line that says you’re fucked

Another side effect of log files being mostly noise is that during your production outage, you have to trawl through tens of thousands of lines, hoping that your brain manages to spot the single line that says “oh yeah, everything is fucked”. This doesn’t scale at all.

If your application knew what to do, you’d have already fixed the problem

The last, and most obvious, problem with logging (and actually many forms of error handling) is that you usually only add logging for behavior you already understand. This doesn’t help with buggy edge cases:

The reason why edge cases cause bugs is because you didn’t think of the edge case. If you had done really good logging and thought enough to get the edge case, there’s a pretty good chance you’d have just written the right code for it

Jeff Hodges (again)

If the program knew why it was broken, it probably wouldn’t be fucking broken

James Golick

Often, production outages and errors occur because you haven’t thought through the edge cases properly, or your users submitted some really weird data. If you had thought of those edge cases, you could have fixed them up front.

Ok, what do I do instead?

There are two good alternatives to logs that mitigate many of the problems:

Firstly, for errors you do expect, use a metric of some kind. Stick it on a graph, ideally of success/failure percentage, and monitor it. This is great for timeouts, for string encoding errors, for odd user data and so on. For more on metrics, I recommend Coda Hale’s talk Metrics Metrics Everywhere.
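A minimal sketch of what that could look like in Python, assuming a plain in-memory counter rather than any particular metrics library. `OutcomeCounter`, `track`, and the metric names are hypothetical; in production you’d ship these numbers to whatever graphing system you already run.

```python
from collections import defaultdict


class OutcomeCounter:
    """Counts successes and failures per named operation."""

    def __init__(self):
        self._counts = defaultdict(lambda: {"success": 0, "failure": 0})

    def record(self, name, ok):
        self._counts[name]["success" if ok else "failure"] += 1

    def failure_percentage(self, name):
        counts = self._counts.get(name, {"success": 0, "failure": 0})
        total = counts["success"] + counts["failure"]
        return 100.0 * counts["failure"] / total if total else 0.0


def track(counter, name, fn, expected=(TimeoutError,)):
    """Run fn and record the outcome.

    Only *expected* error types are counted as failures and swallowed;
    anything else propagates, because that belongs to the unexpected bucket.
    """
    try:
        result = fn()
    except expected:
        counter.record(name, ok=False)
        return None
    counter.record(name, ok=True)
    return result


if __name__ == "__main__":
    def flaky_call():
        raise TimeoutError("upstream took too long")

    counter = OutcomeCounter()
    track(counter, "upstream.fetch", lambda: "ok")
    track(counter, "upstream.fetch", flaky_call)
    print(counter.failure_percentage("upstream.fetch"))
```

Graphing the percentage rather than the raw failure count matters: a jump in failures during a traffic spike looks scary on its own, but the percentage tells you whether anything actually got worse.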

Secondly, for errors you don’t expect (e.g. logic errors that just make your app throw an HTTP 500), use an Exception Tracker (I like Yeller here, but am obviously biased). Exception Trackers deal very well with the frequency problem, and they typically let you capture a lot of extra context with the error and drill into it later. Mike Perham has a great write-up on this.
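As a toy illustration of why trackers cope with frequency better than logs do, here’s a Python sketch that groups exceptions by type and by the function that raised them, keeping a count and the most recent context. This is emphatically not how Yeller (or any real tracker) is implemented; `TinyTracker`, `report`, and the grouping rule are hypothetical.

```python
import traceback
from collections import defaultdict


class TinyTracker:
    """Groups reported exceptions by (exception type, raising function)."""

    def __init__(self):
        self.groups = defaultdict(lambda: {"count": 0, "last_context": None})

    def report(self, exc, context=None):
        # The innermost traceback frame is where the exception was raised.
        frames = traceback.extract_tb(exc.__traceback__)
        where = frames[-1].name if frames else "<unknown>"
        key = (type(exc).__name__, where)
        group = self.groups[key]
        group["count"] += 1
        group["last_context"] = context
        return key


if __name__ == "__main__":
    def parse_age(raw):
        return int(raw)

    tracker = TinyTracker()
    for raw in ["forty", "n/a", "12"]:
        try:
            parse_age(raw)
        except ValueError as exc:
            # Context is whatever helps you reproduce the problem later.
            tracker.report(exc, context={"raw": raw})

    for key, group in tracker.groups.items():
        print(key, group["count"], group["last_context"])
```

Instead of thousands of near-identical lines, you get one entry per distinct problem with a count next to it, plus the data that triggered it.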

Everything is a Tradeoff

Log files can be incredibly useful, but don’t fall into the trap of using them as your primary production debugging tool when they don’t fit. When they do fit the problem at hand, or you’re well aware of all the ways log files will try to lie to you (and are actively resisting those attempts), they can be a really useful tool.
