Effective Feature Flags

This is a blog about the development of Yeller, The Exception Tracker with Answers

Read more about Yeller here

Feature Flags are by far and away my favorite tool when working on infrastructure. They let you gradually ramp up users onto new code paths, by small percentages of users at a time. They let you turn off the new feature for that 1% of users when it behaves badly.

If you haven’t seen feature flags before, here’s what they look like in code:

(if (feature-flags/active? flags "use-shiny-new-service-path" (:id user))
  (use-the-shiny-new-microservice user)
  (use-the-old-codepath user))

Behind the scenes, your feature flag library is saying something like this:

  • have we explicitly activated this path for this user?
  • OR does this user get to see this path because it’s on for 10% of users and they’re in that 10%?
  • OR does this user get to see this path because they’re a staff member and we’ve enabled it for all staff members?
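Those three checks can be sketched roughly as below. This is only an illustration, not the code of any particular library: the flag data layout, the `in-percentage?` helper and the user map with `:groups` are all assumptions made for the example (the snippet above passes a bare id instead).

```clojure
;; Hypothetical flag data; a real library keeps this in a backing store.
(def flags
  {"use-shiny-new-service-path"
   {:users      #{42}        ; explicitly activated user ids
    :percentage 10           ; on for roughly 10% of ids
    :groups     #{:staff}}}) ; on for every member of these groups

(defn in-percentage?
  "True when id lands in the first `percentage` of 100 hash buckets.
  The same id always lands in the same bucket for a given feature."
  [feature-name id percentage]
  (< (mod (hash [feature-name id]) 100) percentage))

(defn active?
  [flags feature-name {:keys [id groups]}]
  (let [{:keys [users percentage] flag-groups :groups} (get flags feature-name)]
    (boolean
      (or (contains? users id)                                ; explicit activation
          (in-percentage? feature-name id (or percentage 0))  ; percentage rollout
          (some (or flag-groups #{}) groups)))))              ; group membership
```

Hashing the feature name together with the id means a user who is in the 10% for one feature isn’t automatically in the 10% for every other feature.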

Feature flags let you gradually roll out both new features and new infrastructure. Worried about the performance of something but can’t test it properly outside production?

Use a feature flag.

Worried if the new thing will be correct, but can’t verify that fully without production data?

Use a feature flag.

Feature flags are great.

Over the past year or so of using them heavily in production, I’ve learnt a lot about using them effectively. As such, I have a bunch of tips and tricks that have helped. None of these are codebase specific - if you use feature flags (or you want to), they’ll help you out:

1. The Double Flag

One thing that happens a lot in Yeller’s codebase is shipping a new feature in the exception processing pipeline. This ingests new reports of exceptions, taking them from a POST request to the API to a database, ready to send to the user’s dashboard. Whilst that part of the code is relatively simple, reliability in it is crucial. Never dropping exceptions is a core value at Yeller. As such, changing that codebase is relatively scary. Or at least, it would be if new features in it weren’t always behind a feature flag.

To achieve the highest possible robustness there, Yeller always does this thing I’ve come to call the double flag:

(if (and (feature-flags/active? flags "enable-new-feature" (:id user))
         (feature-flags/active? flags "enable-new-feature-for-event" (:id event)))
  (use-new-feature event)
  (ignore-new-feature event))

The double flag is just that: you flag on both the user id and the event id. This means you can say things like:

  • enable this feature for 1% of the admin project’s events
  • enable this feature for 10% of the users, on 10% of their events

At that point, there’s usually very little risk in shipping a new feature. First you ship it just for the admin account, with 1% of events. Then you ramp it up to 10, 15, 20, 50, 75, 100% of the admin project’s events (whilst staring at your dashboards). Then you drop back down to 10% of events for 1% of users, and ramp up (slowly) to 10% of events for 100% of users. Eventually, you ramp up to 100% of users, 100% of events.

That ramp-up can be spread over whatever period you like. For Yeller it often takes a few weeks or so to get a new exception processing feature to 100%.
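To make the ramp-up concrete, here is a rough sketch of the double flag with two rollout steps written out as plain data. The `active?` function, the `:ids`/`:percentage` layout and the admin id of 1 are assumptions for the example, not Yeller’s real flag library.

```clojure
(defn in-percentage? [flag-name id percentage]
  ;; The same id always hashes to the same bucket (0-99) for a given flag.
  (< (mod (hash [flag-name id]) 100) percentage))

(defn active? [flags flag-name id]
  (let [{:keys [ids percentage]} (get flags flag-name)]
    (or (contains? ids id)
        (in-percentage? flag-name id (or percentage 0)))))

(defn use-new-feature?
  "The double flag: both the user check and the event check must pass."
  [flags user event]
  (and (active? flags "enable-new-feature" (:id user))
       (active? flags "enable-new-feature-for-event" (:id event))))

;; First step: admin account only (hypothetical id 1), 1% of its events.
(def step-1
  {"enable-new-feature"           {:ids #{1}}
   "enable-new-feature-for-event" {:percentage 1}})

;; A later step: 10% of users, on 10% of their events.
(def step-2
  {"enable-new-feature"           {:percentage 10}
   "enable-new-feature-for-event" {:percentage 10}})
```

Moving from one step to the next is then only a data change in the flag store, with no deploy involved.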

Over the past year, doing the double flag has let Yeller safely ship:

All of those ramped up slowly for new users. None of them ever caused a dropped event for any customer.

2. Keep feature flags visible

The next tip isn’t so applicable to Yeller (because there’s only one dev), but is crucial on larger teams:

Keep feature flags visible.

This takes a variety of forms, but making it obvious to your entire team what features are enabled and disabled is huge.

In the past, I’ve seen a press release for a new feature completely undermined because, thanks to a miscommunication, the feature was still only turned on for admins.

To combat that particular one:

Show logged-in admins which features were checked whilst rendering this page, and what happened with them

This is super easy to write, and a huge win for visibility, especially to non developers.
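One way this could look, as a minimal sketch: record every flag check made during a request, then hand the list to the page footer for admins. The names `*checked-flags*`, `recording-active?` and `with-flag-tracking` are made up for this example, and `lookup-fn` stands in for whatever check your flag library provides.

```clojure
;; Bound to a fresh atom for each request; nil outside a request.
(def ^:dynamic *checked-flags* nil)

(defn recording-active?
  "Runs the flag check and, if a request is being tracked, records the outcome."
  [lookup-fn feature-name id]
  (let [result (boolean (lookup-fn feature-name id))]
    (when *checked-flags*
      (swap! *checked-flags* conj
             {:feature feature-name :id id :active? result}))
    result))

(defn with-flag-tracking
  "Calls render-fn and returns its result alongside every flag check it made."
  [render-fn]
  (binding [*checked-flags* (atom [])]
    (let [page (render-fn)]
      {:page page :checked @*checked-flags*})))
```

Anything rendered lazily after `with-flag-tracking` returns won’t be recorded, so the page needs to be fully rendered inside `render-fn`.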

Log feature flag changes to Slack

This one helps with developer collaboration: log feature flag changes to your chat tool.

Again, it’s simple to write: just wrap your feature flag library so that mutations get logged.

It means that devs always get told when somebody else is messing with a feature.
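A rough sketch of such a wrapper is below. It assumes the flags live in an atom and that `notify!` is a one-argument function you supply that posts a message to your chat tool; both are assumptions for the example, and no real chat API is shown.

```clojure
(defn set-percentage!
  "Updates a flag's percentage in store (an atom of flag data) and reports
  the change through notify!. Returns the new percentage."
  [store notify! who feature-name percentage]
  (let [[before _after] (swap-vals! store assoc-in
                                    [feature-name :percentage] percentage)
        old             (get-in before [feature-name :percentage] 0)]
    (notify! (format "%s changed %s from %d%% to %d%%"
                     who feature-name old percentage))
    percentage))
```

Passing `notify!` in as an argument keeps the wrapper easy to test: in tests it can just append to an atom.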

Track enabled feature percentages with metrics

(image courtesy of my friends at Geckoboard, who put me onto the idea)

This one’s so tiny, but so helpful: Track enabled percentages over time, using metrics.

This lets you say “how many users did we have this feature turned on for, at 2pm last Friday”, and any other time you choose. It lets you put new feature ramp-up on a dashboard, so anybody can see “how is rolling out this feature going”.
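As a sketch, this can be as small as one function run on a timer. Here `gauge!` is assumed to be a two-argument function (metric name, value) from whatever metrics client you use, and the metric naming scheme is invented for the example.

```clojure
(defn report-flag-percentages!
  "Emits one gauge per flag with its current enabled percentage."
  [flags gauge!]
  (doseq [[feature-name {:keys [percentage]}] flags]
    (gauge! (str "feature_flags." feature-name ".percentage")
            (or percentage 0))))
```

Call it every minute or so with the current flag data, and the rollout history ends up in your metrics system for free.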

3. Don’t let feature flags break your reliability

You can implement a backing store for feature flags in a variety of ways. Yeller uses ZooKeeper watches and an in-memory atom, so feature lookups are always fast. But many folk use the Ruby implementation, which by default sits on top of Redis. That is fine for a lot of use cases, but Redis can be a bottleneck. So, as a simple trick for that: time out your feature flag checks, and default to the feature being off.

(defn is-feature-active? [flags feature-name user-id]
  ;; deref with a timeout returns the default if the lookup takes too long
  (deref (future (feature-flags/active? flags feature-name user-id))
         10       ; timeout in milliseconds
         false))  ; default to false if loading flags times out

Defaulting to “this feature is off” means your reliability isn’t impacted too much if your backing store dies, and ensures users can’t see new features before they are ready.
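The in-memory approach mentioned at the start of this section can be sketched along these lines. This is not Yeller’s actual code: `on-flags-changed` is a hypothetical callback that whatever watches your backing store would call with the full new flag data, and the data layout is assumed.

```clojure
;; Local copy of the flag data; reads never leave the process.
(defonce flag-cache (atom {}))

(defn on-flags-changed
  "Replaces the local copy whenever the backing store reports a change."
  [new-flags]
  (reset! flag-cache new-flags))

(defn cached-percentage
  "Reads from memory only. Unknown features count as 0%, i.e. off."
  [feature-name]
  (get-in @flag-cache [feature-name :percentage] 0))
```

If the backing store goes away, reads keep working from the last copy seen, and anything never loaded stays off.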

That’s it

That’s it for this post.

I love feature flags.

You should probably use them more.

What tips and tricks have you learnt from using feature flags in production? I’d love to hear about that. Toot at me @t_crayford or email me
