What to Page On
This is a blog about the development of Yeller, The Exception Tracker with Answers
“And then I got paged. At 3am”
It’s enough to send chills down the spine of anybody who runs server software in production. Getting paged hurts a bunch:
- it’s 3am!
- I was asleep!
- What if I mess up?
- If I don’t get this fixed quickly, our users are gonna be disappointed
So designing your alerting well is critical. With a well designed alerting system:
- you won’t get woken up more than you need to
- you’ll fix many problems before they surface into downtime
- you can actually focus at work, because you won’t be awake at 3am every few days
There are a few pillars to getting alerting that keeps you sane (and your customers happy):
- only page on downtime
- make alerts as useful as possible
- get ahead
Let’s dive in:
Only Page on Downtime
This is the most critical thing (I learned it from this blog post). Don’t page folks if there’s no downtime.
- Don’t page for high cpu usage
- Don’t page for high memory usage
- Don’t page for high load average
If your CPUs are all pegged at 100% but your users aren’t affected, do you actually care? If your database load average is 1000 but there’s no impact to your customers, are they gonna call you or stop using your product?
Downtime
How do you detect “downtime”? What does it mean? (Especially for backend services as opposed to websites). It’s pretty simple:
Downtime is “essential work not getting done”. Yeller’s internal alerting is all rate-based, i.e. “if the throughput on system X drops below value Y, page”. Work not getting done is the thing your customers will notice:
- website requests not happening
- emails not being sent
- background jobs not being processed
These are the things users really care about, and that you should be prepared to wake up at 3am for.
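That rate-based check can be sketched as a tiny function. This is a minimal sketch, not Yeller’s actual implementation; the threshold ratio and example numbers are assumptions:

```python
def should_page(throughput_per_min, expected_per_min, floor_ratio=0.5):
    """Page only when essential work isn't getting done.

    Pages if measured throughput drops below floor_ratio of the
    expected rate -- a proxy for "work not getting done".
    """
    return throughput_per_min < expected_per_min * floor_ratio

# Example: web requests normally run ~1200/min (made-up numbers).
should_page(throughput_per_min=400, expected_per_min=1200)   # True  -> page
should_page(throughput_per_min=1100, expected_per_min=1200)  # False -> sleep
```

Note that there’s no CPU, memory, or load average anywhere in that check: the only input is whether work is flowing.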
Useful Alerts
So, you’re just alerting on downtime. What happens when you get paged? Do you start running around with your hair on fire, sshing into production boxes and running commands you think might work?
STOP
Don’t do this.
Instead, you want playbooks (an increasingly widespread practice amongst savvy ops teams).
Playbooks are a list of potential steps for fixing a type of alert. Here’s what a rough one might look like for a generic web application:
If the site is down, here are some places to start looking:
- Was there a recent deploy? If so, consider rolling it back
- What kind of errors is the site spitting out? Is it 500 errors? Check the error tracker. 504 errors? Check the web server logs at
/var/log/nginx/error.log
- Is the database overloaded? Log into DATABASE SERVER and check the CPU usage and iowait times.
- Don’t be shy about restarting services if they’ve fallen over. To restart the webserver,
sudo /etc/init.d/yourapp restart
- If you can’t diagnose the problem quickly, ask for help: call CTO on XXX or OPS PERSON on XXX
- If it appears to be a networking issue (i.e. you can’t even ping the servers), look at our hosting provider’s status page: http://HOSTINGSERVICESTATUS.com
You should include a link to the playbook for an alert in the alert body. Playbooks should be a list of remediation steps that anybody who is on call can follow. They should also end with “call person XXX if you’ve tried everything here and the issue remains”
One way to take this one step further is to include relevant metrics in the alert text. Say your playbook says “go look at XXX metric and then look at YYY metric”. Well, it’s often pretty simple to just include those metrics in the alert text, so you can run through much of the playbook without leaving the PagerDuty app.
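As a sketch of that idea, here’s one way an alert body could carry both the playbook link and the metrics the playbook would send you to look at. The service name, numbers, URL, and payload shape are all made up for illustration:

```python
def build_alert(service, current_rate, expected_rate, playbook_url):
    """Compose an alert body that carries its own context:
    the relevant metrics plus a link to the playbook."""
    return (
        f"{service} throughput dropped: {current_rate}/min "
        f"(expected ~{expected_rate}/min)\n"
        f"Playbook: {playbook_url}"
    )

alert = build_alert("email-sender", 12, 300,
                    "https://wiki.example.com/playbooks/email-sender")
print(alert)
```

Whoever gets paged now sees the throughput numbers and the first place to look, straight from the PagerDuty app.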
Getting Ahead
Just alerting on downtime saves you from a lot of pager fatigue. But it doesn’t help you prevent downtime; it just lets you keep up with it. The solution is to get ahead of your infrastructure during normal working hours. Spend a little time each week glancing through dashboards. Pipe noncritical alerts to your Slack/IRC channel and act on them during the day. Be curious about what’s actually going on in production.
I’ve written about this before at length, so go check that out.
So that’s it. Three pillars of a much saner alerting strategy that avoids pager fatigue but keeps everything running:
- only alert on downtime
- make your alerts useful
- get ahead
This strategy is designed for solo/small team infrastructure folk, as that’s where my experience lies. Likely some of it will be useful for you though, even if you’re not in that kind of organization.
References:
- only page on downtime
- Scalyr Blog: 99.9% uptime on a 9-5 schedule
- Yeller: Monitoring
- Yeller: Incuriosity Will Kill Your Infrastructure
- Heroku Operations
- Nagios Herald