Feature Flags Done Right: How We Manage 150+ Flags Without Losing Our Minds

Zyptr Admin
29 April 2024
8 min read

Feature Flag Debt Is Real

Feature flags start as a great idea: ship code to production without exposing it to users, then gradually roll it out. The problem is what happens six months later when you have 150 flags and nobody remembers what half of them do or whether they can be removed. We reached this point across our product suite and it was a mess — flags referencing features that were fully launched months ago, flags that controlled A/B tests that ended in Q2, and flags with names like "new_dashboard_v2_final_FINAL" that told you nothing.

Here's the system we built to manage flag sprawl, and the discipline that keeps it working.

The Flag Lifecycle

Every flag has a lifecycle: Created → Active → Graduated → Archived. The key stage is "Graduated": once a flag's feature is fully launched and the flag is 100% ON for all users, it enters a grace period (we use 14 days). If no issues surface during that window, the flag is marked for removal. A Slack bot then reminds the flag owner weekly until the flag code is removed from the codebase and the flag is archived.
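As a rough illustration, the lifecycle can be modelled as a small state machine. This is a minimal sketch, not our production code: the names `Status`, `advance`, and `ready_for_removal` are hypothetical, and it assumes flags only move forward (rolling a Graduated flag back to Active after an incident is not modelled).

```python
from datetime import date, timedelta
from enum import Enum


class Status(Enum):
    CREATED = "created"
    ACTIVE = "active"
    GRADUATED = "graduated"
    ARCHIVED = "archived"


# Allowed forward transitions: Created -> Active -> Graduated -> Archived.
ALLOWED = {
    Status.CREATED: {Status.ACTIVE},
    Status.ACTIVE: {Status.GRADUATED},
    Status.GRADUATED: {Status.ARCHIVED},
    Status.ARCHIVED: set(),
}

GRACE_PERIOD = timedelta(days=14)  # the 14-day grace period from the text


def advance(current: Status, target: Status) -> Status:
    """Return the new status, or raise if the transition skips a stage."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move flag from {current.value} to {target.value}")
    return target


def ready_for_removal(graduated_on: date, today: date) -> bool:
    """True once the grace period after graduation has elapsed."""
    return today - graduated_on >= GRACE_PERIOD
```

A reminder bot could call `ready_for_removal` for each Graduated flag and nudge the owner once it returns True.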

We track flag metadata in a simple PostgreSQL table: name, description, owner (the engineer who created it), creation date, expected graduation date, current status, and the Jira ticket it relates to. The expected graduation date is mandatory: you must state when you expect the flag to be removed or made permanent. This forces the conversation upfront.
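To make the mandatory graduation date concrete, here is a minimal sketch of one metadata row as a Python dataclass. The field names are assumptions that mirror the columns listed above, not the real schema, and the check that the graduation date cannot precede the creation date is our own addition for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FlagRecord:
    name: str
    description: str
    owner: str                 # the engineer who created the flag
    created_on: date
    expected_graduation: date  # mandatory: no default value is offered
    status: str
    jira_ticket: str

    def __post_init__(self):
        # Reject records where the graduation date was skipped or is nonsensical.
        if not isinstance(self.expected_graduation, date):
            raise ValueError("expected_graduation is mandatory")
        if self.expected_graduation < self.created_on:
            raise ValueError("expected_graduation cannot precede created_on")
```

Because `expected_graduation` has no default, constructing a record without it fails immediately, which is the same conversation-forcing effect a NOT NULL column would have in the table.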

Naming Conventions That Scale

We enforce naming conventions via a linter that runs in CI. Flag names follow the pattern: team_feature_description. Examples: billing_usage_metering_enabled, dashboard_new_charts_rollout, auth_passkey_login_beta. The team prefix lets us quickly filter flags by owning team. The description should be specific enough that someone unfamiliar with the feature can understand what it controls.

We also prefix temporary flags (feature rollouts, A/B tests) with "temp_" and permanent flags (kill switches, ops controls) with "ops_". Temporary flags get extra scrutiny in the graduation process. Permanent flags (like ops_disable_email_notifications) are expected to live forever and are excluded from the cleanup reminders.
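A naming check along these lines is easy to run in CI. The sketch below is an assumption about what such a linter might look like, not our actual rule set: it requires lowercase snake_case with at least three segments, and classifies flags by the "temp_" and "ops_" prefixes. The function names are hypothetical.

```python
import re

# Lowercase snake_case with at least three segments, e.g. team_feature_description.
FLAG_NAME = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+){2,}$")


def lint_flag_name(name: str) -> list[str]:
    """Return a list of problems; an empty list means the name passes."""
    problems = []
    if name != name.lower():
        problems.append("flag names must be lowercase")
    if not FLAG_NAME.match(name):
        problems.append("expected snake_case with at least three segments")
    return problems


def flag_kind(name: str) -> str:
    """Classify a flag by its prefix."""
    if name.startswith("temp_"):
        return "temporary"
    if name.startswith("ops_"):
        return "permanent"
    return "unclassified"
```

With this check, billing_usage_metering_enabled passes, while new_dashboard_v2_final_FINAL is rejected for its uppercase suffix. A stricter version could also verify the first segment against a list of known team names.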

The Technical Implementation

We evaluated LaunchDarkly, Flagsmith, Unleash, and custom-built solutions. For client projects with budget constraints, we use Flagsmith (open-source, self-hosted). For our own products, we use LaunchDarkly because the targeting rules and analytics are significantly better. For very simple use cases (fewer than 20 flags, no complex targeting), we use environment variables and a simple config file — don't over-engineer this.
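For the simple case, an environment-variable flag can be a few lines. This is a minimal sketch under our own assumptions: the `FLAG_` prefix, the accepted spellings of true and false, and the `env_flag` helper are all hypothetical choices, not a convention from any particular tool.

```python
import os

TRUE_VALUES = {"1", "true", "yes", "on"}
FALSE_VALUES = {"0", "false", "no", "off"}


def env_flag(name: str, default: bool = False, environ=os.environ) -> bool:
    """Read a boolean flag from an environment variable named FLAG_<NAME>."""
    raw = environ.get(f"FLAG_{name.upper()}")
    if raw is None:
        return default
    value = raw.strip().lower()
    if value in TRUE_VALUES:
        return True
    if value in FALSE_VALUES:
        return False
    # Fail loudly on typos instead of silently treating them as off.
    raise ValueError(f"unrecognised value {raw!r} for flag {name}")
```

Passing `environ` as a parameter keeps the helper easy to test with a plain dictionary.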

On the code side, flag checks are wrapped in a utility function that also handles default values if the flag service is unreachable, logging of which flags were evaluated per request (for debugging), and type validation (a flag expected to return a boolean shouldn't return a string). We've had production incidents caused by flag type mismatches: a flag that was supposed to be a boolean was accidentally set to the string "true", which is truthy in a plain conditional but fails a strict comparison against the boolean true, so different code paths disagreed about whether the flag was on.
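A wrapper with those three responsibilities might look like the sketch below. It is illustrative only: `fetch` is a hypothetical stand-in for whatever call your flag provider's SDK exposes, and `evaluated` is an assumed per-request dictionary used for debugging.

```python
import logging
from typing import Callable

logger = logging.getLogger("flags")


def bool_flag(
    name: str,
    default: bool,
    fetch: Callable[[str], object],
    evaluated: dict,
) -> bool:
    """Evaluate a boolean flag with a fallback default and type validation."""
    try:
        value = fetch(name)
    except Exception:
        # Flag service unreachable or erroring: fall back to the default.
        logger.warning("flag service failed for %s; using default", name)
        value = default
    if not isinstance(value, bool):
        # Catches cases like the string "true" being stored for a boolean flag.
        logger.error("flag %s returned %r, expected a bool; using default", name, value)
        value = default
    evaluated[name] = value  # record what this request actually saw
    return value
```

Falling back to the default on a type mismatch is one design choice; raising an error in non-production environments would surface the misconfiguration sooner.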

The Cleanup Process

This is the hard part. Creating flags is easy; removing them requires discipline. Our process: every sprint planning session includes a "flag cleanup" item. The tech lead reviews flags approaching their graduation date and assigns removal tasks. Removing a flag involves deleting the flag check from code, removing the else/fallback path (which is now dead code), updating tests, and archiving the flag in the management system.
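The code side of a removal is usually small. The before-and-after below is a hypothetical example (the function names and return values are made up) using a flag name from earlier in this post.

```python
# Before cleanup: the flag check and the legacy fallback both exist.
def render_charts(flags: dict) -> str:
    if flags.get("dashboard_new_charts_rollout", False):
        return "new charts"
    return "legacy charts"  # dead code once the flag is 100% ON


# After cleanup: the check and the fallback path are gone.
def render_charts_after_cleanup() -> str:
    return "new charts"
```

Tests that exercised the legacy branch are deleted in the same change, and the flag is then archived in the management system.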

We measure "flag hygiene" as a team metric: the percentage of flags that are either Active (in use) or Graduated (pending removal within 14 days). Our target is 90%. Any flag that's been at 100% rollout for more than 30 days without being graduated is flagged (pun intended) in our weekly engineering review.
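The metric is simple enough to compute from the metadata table. The sketch below is one possible reading of the definition, with assumptions called out: archived flags are excluded from the denominator, and the dictionary keys (`status`, `graduated_on`, `full_rollout_since`) are hypothetical field names.

```python
from datetime import date

GRACE_DAYS = 14  # Graduated flags count as healthy within this window
STALE_DAYS = 30  # 100% rollout for longer than this without graduating is overdue


def hygiene_percent(flags: list[dict], today: date) -> float:
    """Percentage of non-archived flags that are Active or recently Graduated."""
    tracked = [f for f in flags if f["status"] != "archived"]
    if not tracked:
        return 100.0
    healthy = 0
    for f in tracked:
        if f["status"] == "active":
            healthy += 1
        elif f["status"] == "graduated" and (today - f["graduated_on"]).days <= GRACE_DAYS:
            healthy += 1
    return 100.0 * healthy / len(tracked)


def overdue_for_graduation(flags: list[dict], today: date) -> list[str]:
    """Names of Active flags at 100% rollout for more than STALE_DAYS."""
    return [
        f["name"]
        for f in flags
        if f["status"] == "active"
        and f.get("full_rollout_since") is not None
        and (today - f["full_rollout_since"]).days > STALE_DAYS
    ]
```

The second function produces the list that would be raised in the weekly engineering review.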

The ROI

Since implementing this system, we've reduced flag-related production incidents from about one per month to one per quarter. The average flag lifetime dropped from "forever" to 6 weeks. And new engineers can onboard onto the flag system in about 30 minutes because the naming, lifecycle, and documentation are consistent. The system itself took two weeks to build. Maintaining it takes about 2 hours per sprint. Worth it.
