What to Page On
This is a blog about the development of Yeller, The Exception Tracker with Answers
“And then I got paged. At 3am”
It’s enough to send chills down the spine of anybody who runs server software in production. Getting paged hurts a bunch:
- it’s 3am!
- I was asleep!
- What if I mess up?
- If I don’t get this fixed quickly, our users are gonna be disappointed
So designing your alerting well is critical. With a well designed alerting system:
- you won’t get woken up more than you need to
- you’ll fix many problems before they surface into downtime
- you can actually focus at work, because you won’t be awake at 3am every few days
There are a few pillars to getting alerting that keeps you sane (and your customers happy):
- only page on downtime
- make alerts as useful as possible
- get ahead
Let’s dive in:
Only Page on Downtime
This is the most critical thing (I learned it from this blog post). Don’t page folks if there’s no downtime.
- Don’t page for high cpu usage
- Don’t page for high memory usage
- Don’t page for high load average
If your CPUs are all pegged at 100% but your users aren’t affected, do you actually care? If your database load average is 1000 but there’s no impact to your customers, are they gonna call you or stop using your product?
Downtime
How do you detect “downtime”? What does it mean? (Especially for backend services as opposed to websites). It’s pretty simple:
Downtime is “essential work not getting done”. Yeller’s internal alerting is all rate-based, i.e. “if the throughput on system X drops below value Y, page”. Work not getting done is the thing your customers will notice:
- website requests not happening
- emails not being sent
- background jobs not being processed
These are the things users really care about, and that you should be prepared to wake up at 3am for.
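That rate-based check can be sketched as a tiny function. This is a minimal sketch, not Yeller’s actual implementation; the threshold ratio and example numbers are assumptions:

```python
def should_page(throughput_per_min, expected_per_min, floor_ratio=0.5):
    """Page only when essential work isn't getting done.

    Pages if measured throughput drops below floor_ratio of the
    expected rate -- a proxy for "work not getting done".
    """
    return throughput_per_min < expected_per_min * floor_ratio

# Example: web requests normally run ~1200/min (made-up numbers).
should_page(throughput_per_min=400, expected_per_min=1200)   # True  -> page
should_page(throughput_per_min=1100, expected_per_min=1200)  # False -> sleep
```

Note that there’s no CPU, memory, or load average anywhere in that check: the only input is whether work is flowing.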
Useful Alerts
So, you’re just alerting on downtime. What happens when you get paged? Do you start running around with your hair on fire, sshing into production boxes and running commands you think might work?
STOP
Don’t do this.
Instead, you want playbooks (an increasingly widespread practice amongst savvy ops teams).
Playbooks are a list of potential steps for fixing a type of alert. Here’s what a rough one might look like for a generic web application:
If the site is down, here are some places to start looking:
- Was there a recent deploy? If so, consider rolling it back
- What kind of errors is the site spitting out? Is it 500 errors? Check the error tracker. 504 errors? Check the web server logs at
/var/log/nginx/error.log
- Is the database overloaded? Log into DATABASE SERVER and check the CPU usage and iowait times.
- Don’t be shy about restarting services if they’ve fallen over. To restart the webserver,
sudo /etc/init.d/yourapp restart
- If you can’t diagnose the problem quickly, ask for help: call CTO on XXX or OPS PERSON on XXX
- If it appears to be a networking issue (i.e. you can’t even ping the servers), look at our hosting provider’s status page: http://HOSTINGSERVICESTATUS.com
You should include a link to the playbook for an alert in the alert body. Playbooks should be a list of remediation steps that anybody who is on call can follow. They should also end with “call person XXX if you’ve tried everything here and the issue remains”
One way to take this one step further is to include relevant metrics in the alert text. Say your playbook says “go look at XXX metric and then look at YYY metric”. Well, it’s often pretty simple to just include those metrics in the alert text, so you can run through much of the playbook without leaving the PagerDuty app.
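As a sketch of that idea, here’s one way an alert body could carry both the playbook link and the metrics the playbook would send you to look at. The service name, numbers, URL, and payload shape are all made up for illustration:

```python
def build_alert(service, current_rate, expected_rate, playbook_url):
    """Compose an alert body that carries its own context:
    the relevant metrics plus a link to the playbook."""
    return (
        f"{service} throughput dropped: {current_rate}/min "
        f"(expected ~{expected_rate}/min)\n"
        f"Playbook: {playbook_url}"
    )

alert = build_alert("email-sender", 12, 300,
                    "https://wiki.example.com/playbooks/email-sender")
print(alert)
```

Whoever gets paged now sees the throughput numbers and the first place to look, straight from the PagerDuty app.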
Getting Ahead
Just alerting on downtime saves you from a lot of pager fatigue. But it doesn’t help you prevent downtime; it just lets you keep up with it. The solution is to get ahead of your infrastructure during normal working hours. Spend a little time each week glancing through dashboards. Pipe noncritical alerts to your Slack/IRC channel and act on them during the day. Be curious about what’s actually going on in production.
I’ve written about this before at length, so go check that out.
So that’s it. Three pillars of a much saner alerting strategy that avoids pager fatigue but keeps everything running:
- only alert on downtime
- make your alerts useful
- get ahead
This strategy is designed for solo/small team infrastructure folk, as that’s where my experience lies. Likely some of it will be useful for you though, even if you’re not in that kind of organization.
References:
- only page on downtime
- Scalyr Blog: 99.9% uptime on a 9-5 schedule
- Yeller: Monitoring
- Yeller: Incuriosity Will Kill Your Infrastructure
- Heroku Operations
- Nagios Herald