Production Clojure Checklist
This is a blog about the development of Yeller, The Exception Tracker with Answers
Running an application in production is challenging. There are so many things to take care of, and it’s easy to miss a little thing that will cost you a lot in the future. Checklists are an obvious, easy fix for this, and are used well by Heroku, GitHub, and many other well-known companies.
This is just a starting point: adapt it for your own services.
First, a brief overview:
- Use bcrypt for password storage
- Servers are in UTC
- HTTPS only
- Application errors are sent to an exception tracker
- Alerts a human if it’s down
- Web serving is redundant
- Database Migrations are Automated
- Configuration is sensible
- Deploys are automated via a well documented script
- Deploys just copy an uberjar to the servers and restart them
- The database is backed up regularly
- Database backups are test-restored regularly
- High traffic pages are behind a CDN
- Code is visible on Github
- Credentials are available
- Application has a health check
- There’s an Ops Playbook
- High Fidelity Staging Environment
- The rest of the team knows the service exists
- Transactional email handled via an external service
- API requests via a separate domain
This seems like a long list. It is! Running apps in production is difficult.
Here’s some more details on each entry:
The Basics
All of these are non-negotiable if you’re running a production service whose code impacts humans:
use bcrypt if you store passwords (read more)
servers should be in UTC (read more)
HTTPS ONLY (http requests just redirect to https)
Application errors are tracked using an exception tracker (read more)
JavaScript errors are tracked using an exception tracker (read more)
It’s 2015.
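For a Ring-based Clojure app, the https-only rule above can be enforced with a tiny middleware. This is a minimal sketch, assuming your server or load balancer sets `:scheme` correctly on the request (if you’re behind a load balancer, check the `x-forwarded-proto` header instead):

```clojure
(defn wrap-force-https
  "Redirect any plain-http request to its https equivalent.
   Assumes :scheme on the request map reflects the original protocol."
  [handler]
  (fn [request]
    (if (= :https (:scheme request))
      (handler request)
      ;; 301 so browsers and crawlers remember the redirect
      {:status  301
       :headers {"Location" (str "https://"
                                 (get-in request [:headers "host"])
                                 (:uri request))}
       :body    ""})))
```

Wrap your whole app in it, so nothing is served over plain http by accident.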
Alerts a human if it’s down (I like pingdom and pagerduty for this)
You’d be surprised how many applications are deployed in production that don’t alert humans when they’re down. This one is specifically about not making your customers mad - I for one would much rather be woken up at 3am with a page than wake up at 10am to thousands of angry customer emails.
Web serving is redundant (at least two processes, so you can deploy without downtime)
Again, super obvious: a deploy that causes downtime isn’t good enough at all.
Database Migrations are Automated
I like conformity for Datomic, but similar things exist for SQL databases. Use them. Migrations should also be kept in the same repo as the app.
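With conformity, migrations are just data living in the app’s repo. A minimal sketch of a norms file (the norm and attribute names here are illustrative, not from the post):

```clojure
;; resources/migrations.edn: a conformity "norms" map.
;; Each norm is a named, run-once set of transactions.
{:my-app/add-user-email
 {:txes [[{:db/ident       :user/email
           :db/valueType   :db.type/string
           :db/cardinality :db.cardinality/one}]]}}
```

At startup, something like `(conformity/ensure-conforms conn norms)` applies any norms the database hasn’t seen yet; conformity records which norms have already run, so calling it on every boot is safe.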
Sensible Configuration
I like both the 12-factor approach and Dropwizard’s single static file approach.
Deploys happen via a well documented script and are automated
I like Fabric.
Deploys just copy an uberjar to the servers and restart processes
Read Phil Hagelberg on this. This is very standard these days.
The database is backed up regularly
Preventing production data loss in the case of a catastrophic event turns “oh shit we lost the DC” from a company-destroying event into a “we were down for a bit” event.
(I like tarsnap)
Restores of the production backups happen regularly
If your database isn’t being restored regularly, then you have no idea if the backups are working correctly. You don’t need to be able to make a backup in the case of disaster. You need to be able to make a restore in the case of disaster.
High Traffic static pages are behind a CDN
Lots of traffic can quite happily break many a Clojure app (especially if it’s sudden and/or unexpected). Putting static pages like your blog and homepage behind a CDN protects you from this in the future, and takes five minutes to do. Plus it means those pages will load super fast.
I like Fastly
Code is visible on Github
This sounds super silly: who would ever deploy an application or service without making the code available to other team members?
You’d be surprised.
Any credentials are readily available to all team members
Every credential a developer might need should be readily available to the whole team. This one’s obvious, but often not well followed: quite a few times I’ve seen team members unable to fix a broken app because they didn’t have access to what they needed, and the CTO was on vacation and unreachable.
Some common credentials you might have in play:
- SSH access
- third party service logins
- internal admin accounts
- Heroku access
The application has a health check
A health check is (typically) an HTTP route, hit by the load balancer and/or third-party monitoring services, that returns 200 OK if the service and all of its dependencies are working (e.g. database connections, required third-party services, etc), and 500 ERROR (alerting somebody) if the service or any of its dependencies aren’t.
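In Clojure terms, a health check can be as small as a function that turns a map of named dependency checks into a Ring-style handler. This is a sketch; the check thunks are hypothetical and yours to fill in (e.g. a cheap `SELECT 1` against the database):

```clojure
(defn health-check-handler
  "Build a Ring-style handler from a map of dependency-name -> check thunk.
   Each thunk should return truthy if the dependency is healthy."
  [dependency-checks]
  (fn [_request]
    (let [failures (for [[dep-name check!] dependency-checks
                         ;; a thrown exception counts as a failure too
                         :when (not (try (check!) (catch Exception _ false)))]
                     dep-name)]
      (if (empty? failures)
        {:status 200 :body "OK"}
        {:status 500 :body (str "FAILING: " (pr-str (vec failures)))}))))
```

Mount it at something like `/health`, then point both the load balancer and your external monitoring at that route.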
There’s an ops playbook
Ops playbooks detail what to do in the case of alerts. A super simple playbook for a basic Clojure web app that talks to a database might look something like this:
If the site is down, here are some places to start looking:
- Was there a recent deploy? If so, consider rolling it back
- What kind of errors is the site spitting out? Is it 500 errors? Check the error tracker. 504 errors? Check the web server logs at /var/log/nginx/error.log
- Is the database overloaded? Log into DATABASE SERVER and check the CPU usage and iowait times.
- Don’t be shy about restarting services if they’ve fallen over. To restart the webserver: sudo /etc/init.d/yourapp restart
- If you can’t diagnose quickly, ask for help: call CTO on XXX or OPS PERSON on XXX
- If it appears to be a networking issue (i.e. you can’t even ping the servers), look at our hosting provider’s status page: http://HOSTINGSERVICESTATUS.com
The application has a high fidelity staging environment, and larger changes are always tested on staging first
If you’re deploying a thing in production, you need a way to test significant ops changes without taking production down or losing data. Staging is the best way to do that.
The rest of the team knows the application exists (send an email)
You’d be surprised how many times I’ve heard of services going into production that nobody knew existed, then breaking, and folk having to learn of their existence super quickly.
Uses an external service for transactional email
Sending email is hard. Working around spam reports, delivering reliably with retries, and so on is really painful. Use an external service like Mailgun, Mandrill, or SendGrid.
API requests go via a separate domain (typically api.YOURDOMAIN.com)
This might sound silly, but you really want this from the start, because it means you can break out a separate service that handles the API at your leisure, without breaking backwards compatibility.
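At the start, the separate domain can simply proxy to the same app. A sketch of what that might look like at the nginx layer; the hostname and port here are placeholders:

```nginx
# api.YOURDOMAIN.com gets its own server block, so the API can be
# split into its own service later without clients noticing.
server {
    listen 443 ssl;
    server_name api.YOURDOMAIN.com;

    location / {
        # For now, proxy to the same uberjar that serves the main site.
        proxy_pass http://127.0.0.1:8080;
    }
}
```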
Next Steps
What things is your app missing from this list? What things do you think are important that are missing from it? Hit me up on twitter and let me know: twitter.com/t_crayford
Further Reading
Thoughtbot’s Playbook has a checklist section that covers many of these points for Rails apps.
Noah Zoschke gave a fantastic talk about Heroku Operations which briefly touched on production checklists. It was the original inspiration for this post.