What to Page On

This is a blog about the development of Yeller, The Exception Tracker with Answers

Read more about Yeller here

“And then I got paged. At 3am”

It’s enough to send chills down the spine of anybody who runs server software in production. Getting paged hurts a bunch:

  • it’s 3AM!
  • I was asleep!
  • What if I mess up?
  • If I don’t get this fixed quickly, our users are gonna be disappointed

So designing your alerting well is critical. With a well-designed alerting system:

  • you won’t get woken up more than you need to
  • you’ll fix many problems before they surface into downtime
  • you can actually focus at work, because you won’t be awake at 3am every few days

There are a few pillars to getting alerting that keeps you sane (and your customers happy):

  • only page on downtime
  • make alerts as useful as possible
  • get ahead

Let’s dive in:

Only Page on Downtime

This is the most critical thing (I learned it from this blog post). Don’t page folk if there’s no downtime.

  • Don’t page for high cpu usage
  • Don’t page for high memory usage
  • Don’t page for high load average

If your CPUs are all pegged at 100% but your users aren’t affected, do you actually care? If your database load average is 1000 but there’s no impact to your customers, are they gonna call you or stop using your product?

Downtime

How do you detect “downtime”? What does it mean, especially for backend services as opposed to websites? It’s pretty simple:

Downtime is “essential work not getting done”. Yeller’s internal alerting is all rate-based, aka “if the throughput on system X drops below value Y, page”. Work not getting done is the thing your customers will notice:

  • website requests not happening
  • emails not being sent
  • background jobs not being processed

These are the things users really care about, and that you should be prepared to wake up at 3am for.
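To make the “throughput on system X drops below value Y” rule concrete, here’s a minimal sketch of a rate-based check. It is not Yeller’s actual implementation: the ThroughputMonitor class, its method names, and the 60 second window are all assumptions made up for illustration.

```python
import time
from collections import deque


class ThroughputMonitor:
    """Hypothetical helper: counts completed work in a sliding time window."""

    def __init__(self, min_per_minute, window_seconds=60.0):
        self.min_per_minute = min_per_minute  # the "value Y" threshold
        self.window_seconds = window_seconds  # assumed window size
        self._events = deque()                # timestamps of completed work

    def record(self, now=None):
        """Call once per unit of work done (request served, email sent, job run)."""
        self._events.append(time.time() if now is None else now)

    def rate_per_minute(self, now=None):
        """Completed work per minute, measured over the sliding window."""
        now = time.time() if now is None else now
        cutoff = now - self.window_seconds
        # Drop timestamps that have fallen out of the window.
        while self._events and self._events[0] < cutoff:
            self._events.popleft()
        return len(self._events) * 60.0 / self.window_seconds

    def should_page(self, now=None):
        """True when essential work has dropped below the threshold."""
        return self.rate_per_minute(now) < self.min_per_minute
```

You’d call record() wherever work completes and check should_page() on a timer. Note that this sketch would also fire right after startup or during a genuinely quiet period, so a real check probably needs a warm-up period and a threshold tuned to your traffic.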

Useful Alerts

So, you’re just alerting on downtime. What happens when you get paged? Do you start running around with your hair on fire, sshing into production boxes and running commands you think might work?

STOP

Don’t do this.

Instead, you want playbooks (an increasingly widespread practice amongst savvy ops teams).

Playbooks are a list of potential steps for fixing a type of alert. Here’s what a rough one might look like for a generic web application:

If the site is down, here are some places to start looking:

  1. Was there a recent deploy? If so, consider rolling it back
  2. What kind of errors is the site spitting out? Is it 500 errors? Check the error tracker. 504 errors? Check the web server logs at /var/log/nginx/error.log
  3. Is the database overloaded? Log into DATABASE SERVER and check the CPU usage and iowait times.
  4. Don’t be shy about restarting services if they’ve fallen over. To restart the webserver, sudo /etc/init.d/yourapp restart
  5. If you can’t diagnose the problem quickly, ask for help: call CTO on XXX or OPS PERSON on XXX
  6. If it appears to be a networking issue (i.e. you can’t even ping the servers), look at our hosting provider’s status page: http://HOSTINGSERVICESTATUS.com

You should include a link to the playbook for an alert in the alert body. Playbooks should be a list of remediation steps that anybody who is on call can follow. They should also end with “call person XXX if you’ve tried everything here and the issue remains”.

One way to take this one step further is to include relevant metrics in the alert text. Say your playbook says “go look at XXX metric and then look at YYY metric”. Well, it’s often pretty simple to just include those metrics in the alert text, so you can run through much of the playbook without leaving the PagerDuty app.
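Here’s a rough sketch of what building that kind of alert text could look like. The build_alert_body function, the example URL, and the metric names are all hypothetical, and the code only assembles a string. How you hand that string to your paging tool depends on the tool, so that part is left out.

```python
def build_alert_body(summary, playbook_url, metrics):
    """Assemble alert text with a playbook link and the metrics it asks for.

    All names here are illustrative; `metrics` maps metric name -> current value.
    """
    lines = [summary, "", "Playbook: " + playbook_url]
    if metrics:
        lines.append("")
        lines.append("Current metrics:")
        # Sort by name so the alert reads the same way every time.
        for name, value in sorted(metrics.items()):
            lines.append("  {}: {}".format(name, value))
    return "\n".join(lines)


# Example usage with made-up values:
body = build_alert_body(
    "web throughput below threshold",
    "https://wiki.example.com/playbooks/site-down",
    {"requests_per_minute": 3, "db_iowait_percent": 41},
)
print(body)
```

The point is that whoever gets paged sees the playbook link and the first few numbers the playbook would send them to look up, all in one place.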

Getting Ahead

Just alerting on downtime saves you from a lot of pager fatigue. But it doesn’t help you prevent downtime; it just lets you keep up with it. The solution is to get ahead of your infrastructure during normal working hours. Spend a little time each week glancing through dashboards. Pipe noncritical alerts to your Slack/IRC channel and act on them during the day. Be curious about what’s actually going on in production.

I’ve written about this before at length, so go check that out.
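As a small illustration of splitting pages from daytime alerts, here’s one way the routing decision might be sketched. The alert kind names and the Route enum are assumptions for the example, not anything from a real alerting tool.

```python
from enum import Enum


class Route(Enum):
    PAGE = "page"  # wake somebody up
    CHAT = "chat"  # post to the team channel, handle during the day


# Hypothetical alert kinds that mean essential work is not getting done.
DOWNTIME_KINDS = frozenset({"throughput_low"})


def route_alert(kind):
    """Page only for downtime; everything else goes to chat."""
    return Route.PAGE if kind in DOWNTIME_KINDS else Route.CHAT


# High CPU, memory, or load average never pages on its own:
for kind in ("throughput_low", "cpu_high", "memory_high", "load_average_high"):
    print(kind, "->", route_alert(kind).value)
```

Keeping the set of paging alert kinds explicit and tiny makes it harder for a “just in case” CPU alert to sneak onto the pager.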

So that’s it. Three pillars of a much saner alerting strategy that avoids pager fatigue but keeps everything running:

  • only alert on downtime
  • make your alerts useful
  • get ahead

This strategy is designed for solo/small team infrastructure folk, as that’s where my experience lies. Likely some of it will be useful for you though, even if you’re not in that kind of organization.
