The "Prompt Wizard" Era Is Over
Back in early 2023, we had one engineer on the team who was our "prompt wizard." He could coax GPT-4 into doing just about anything with clever prompting. We thought that was enough. It wasn't. The moment we had five different features using LLMs across three client projects, the whole thing fell apart. Prompts were hardcoded in application code, nobody knew which version was running in production, and debugging was a nightmare.
So we built a system. And it changed everything about how we deliver AI projects.
Prompts as Code, Not Strings
Every prompt in our projects now lives in its own file, version-controlled in Git, with a clear naming convention. We use a template system (we started with Jinja2-style templating; now we use our own lightweight TypeScript library) that separates the prompt structure from the variables. This means we can swap out the system prompt for a feature without touching application code.
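A minimal sketch of what this separation can look like. This is not the post's actual library (which is internal); the prompt id, variable names, and template text are illustrative:

```typescript
// A prompt "file": exports a template object instead of a raw string,
// so application code never concatenates prompt text itself.

type SummarizeVars = { documentText: string; maxSentences: number };

const summarizeV3 = {
  id: "summarize/v3", // naming convention: feature/version
  render: (vars: SummarizeVars): string =>
    [
      "You are a precise summarizer.",
      `Summarize the document below in at most ${vars.maxSentences} sentences.`,
      "---",
      vars.documentText,
    ].join("\n"),
};
```

Application code then calls `summarizeV3.render({ ... })`, and swapping the prompt means swapping the imported module, not editing strings in place.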
Each prompt file has metadata: the model it's optimized for, the expected input/output schema, the last eval date, and the author. When someone changes a prompt, the PR diff shows exactly what changed, and our CI pipeline runs the eval suite against the new version automatically. We've caught regressions this way that would've gone straight to production otherwise.
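One way to make that metadata machine-checkable is a typed header in each prompt file. The field names and values below are hypothetical, but they cover the items listed above:

```typescript
// Hypothetical shape of per-prompt metadata; a CI step can read this
// to decide which eval suite and model to run against.
interface PromptMeta {
  model: string;        // model the prompt is optimized for
  inputSchema: string;  // name of the expected input schema
  outputSchema: string; // name of the expected output schema
  lastEvalDate: string; // ISO date of the last full eval run
  author: string;
}

const meta: PromptMeta = {
  model: "gpt-4o",
  inputSchema: "SummarizeInput",
  outputSchema: "SummarizeOutput",
  lastEvalDate: "2024-05-01",
  author: "jane@example.com",
};
```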
The Eval Pipeline That Saved Us
We maintain eval datasets for every prompt in production. Minimum 50 test cases, though most of our critical prompts have 200+. Each test case has an input, expected output, and evaluation criteria (sometimes exact match, sometimes semantic similarity, sometimes LLM-as-judge). Our CI runs these on every PR that touches a prompt.
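A stripped-down sketch of that eval loop, assuming the three criterion types described above. The semantic-similarity and LLM-as-judge branches are stubbed here because they require model calls in the real pipeline:

```typescript
// Illustrative eval case and runner; names are assumptions, not the
// team's actual pipeline.
type Criterion = "exact" | "semantic" | "judge";

interface EvalCase {
  input: string;
  expected: string;
  criterion: Criterion;
}

function score(actual: string, c: EvalCase): boolean {
  switch (c.criterion) {
    case "exact":
      return actual.trim() === c.expected.trim();
    case "semantic": // real pipeline: embedding similarity over a threshold
    case "judge":    // real pipeline: LLM-as-judge verdict
      throw new Error("requires a model call; stubbed in this sketch");
  }
}

function runEvals(cases: EvalCase[], generate: (input: string) => string) {
  const results = cases.map((c) => score(generate(c.input), c));
  const passed = results.filter(Boolean).length;
  return { passed, total: cases.length, passRate: passed / cases.length };
}
```

CI then fails the PR if `passRate` drops below a threshold for any touched prompt.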
This sounds expensive — and it is, about $30-50 per CI run for a large project. But it's a fraction of the cost of shipping a broken prompt to production. We learned this the hard way when a "small optimization" to a summarization prompt caused 40% of outputs to truncate the last paragraph. No one noticed for three days. That incident is why we built the eval pipeline.
Dynamic Prompt Assembly
For complex features, we don't use a single monolithic prompt. We assemble prompts dynamically based on context. Our customer support AI, for example, builds its system prompt from: a base persona template + product-specific context + customer tier rules + conversation history summary. Each piece is independently versioned and tested.
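The assembly step itself can be very simple. A sketch, with hypothetical module ids following a name@version convention:

```typescript
// Each piece of the system prompt is a named, versioned module.
interface PromptModule {
  id: string;  // e.g. "persona/base@v2" — used for versioning and A/B tests
  text: string;
}

function assembleSystemPrompt(modules: PromptModule[]): string {
  return modules.map((m) => m.text).join("\n\n");
}

const systemPrompt = assembleSystemPrompt([
  { id: "persona/base@v2", text: "You are a helpful support agent for Acme." },
  { id: "context/product@v5", text: "Product facts: ..." },
  { id: "rules/tier-gold@v1", text: "This customer is on the Gold tier: ..." },
  { id: "history/summary", text: "Conversation so far: ..." },
]);
```

Because each module carries its own id, the eval suite can target a single module without re-testing the whole assembled prompt.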
This modular approach means we can A/B test individual components. Last month, we tested two different "tone" modules for the same support bot — one more formal, one more casual. The casual version had 23% higher user satisfaction scores. We never would have discovered that with monolithic prompts.
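For the assignment itself, one common approach (a sketch, not necessarily what the team used) is deterministic bucketing: hash a stable user id so each user always sees the same variant. The module ids are hypothetical:

```typescript
// Deterministic A/B assignment for a prompt module. FNV-1a hash gives
// stable, roughly uniform bucketing; it is not cryptographic.
function pickVariant(userId: string, variants: string[]): string {
  let h = 2166136261;
  for (let i = 0; i < userId.length; i++) {
    h ^= userId.charCodeAt(i);
    h = Math.imul(h, 16777619);
  }
  return variants[Math.abs(h) % variants.length];
}

const tone = pickVariant("user-42", ["tone/formal@v1", "tone/casual@v1"]);
```

Stable assignment matters here: satisfaction scores are only comparable if a given user isn't flip-flopping between tones mid-conversation.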
The Tooling We Use
For prompt management, we've evaluated Langfuse, PromptLayer, and Humanloop. Langfuse is our current pick for most projects — it's open-source, self-hostable (important for clients with data residency requirements), and the tracing UI is genuinely good. For evals, we use a combination of custom scripts and Braintrust for more complex evaluation scenarios. The key insight: the tooling matters less than the discipline. Even a spreadsheet of test cases run manually is better than nothing.
If you're still treating prompts as strings you tweak in a code editor, you're going to hit a wall. We hit it at about the 10-prompt mark across a project. Build the system early — you'll thank yourself when the client asks "why did the AI say that?" and you can actually answer them.