Agents Are Not Chatbots
There's a huge difference between a chatbot that answers questions and an agent that takes actions. When a chatbot hallucinates, the user sees a wrong answer and moves on. When an agent hallucinates, it might send an email to the wrong person, update a database incorrectly, or trigger a payment for the wrong amount. We've built both, and the engineering rigor required for agents is at least 3x what you need for a chatbot.
We shipped our first agentic system last year — an operations agent for a logistics client that could track shipments, update ETAs, notify customers, and escalate issues to human operators. It works. But getting there required building guardrails we'd never considered before.
The Confirmation Layer
Every action our agents take goes through a confirmation layer. For low-risk actions (looking up information, generating reports), the agent proceeds automatically. For medium-risk actions (sending notifications, updating records), the agent generates a preview and asks for human confirmation. For high-risk actions (financial transactions, deleting data), the agent requires explicit approval from an authorized user.
We define risk levels in a configuration file, not in the LLM prompt. The LLM doesn't get to decide what's risky — we do. This was a deliberate design choice after we watched a demo where the agent decided that sending a mass email to 10,000 customers was "low risk." It wasn't.
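A minimal sketch of what this config-driven gating can look like. The action names, risk tiers, and decision shapes below are illustrative, not the client's actual setup:

```typescript
// Risk-tiered confirmation layer. Risk levels live in configuration,
// not in the LLM prompt -- the model never classifies its own actions.
type RiskLevel = "low" | "medium" | "high";

const ACTION_RISK: Record<string, RiskLevel> = {
  lookup_shipment: "low",
  generate_report: "low",
  notify_customer: "medium",
  update_eta: "medium",
  issue_refund: "high",
  delete_record: "high",
};

type Decision =
  | { kind: "proceed" }                // low risk: run automatically
  | { kind: "preview_and_confirm" }    // medium: human confirms a preview
  | { kind: "require_approval" }       // high: explicit authorized sign-off
  | { kind: "reject"; reason: string };

function gateAction(actionName: string): Decision {
  const risk = ACTION_RISK[actionName];
  if (risk === undefined) {
    // Unregistered actions are rejected outright rather than guessed at.
    return { kind: "reject", reason: `unregistered action: ${actionName}` };
  }
  switch (risk) {
    case "low":    return { kind: "proceed" };
    case "medium": return { kind: "preview_and_confirm" };
    case "high":   return { kind: "require_approval" };
  }
}
```

Because the mapping is a plain lookup table, "sending a mass email" can never talk its way into the low-risk tier: anything not explicitly registered is rejected.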
Structured Output Is Non-Negotiable
Every tool call an agent makes must use structured output — JSON with a strict schema validated before execution. We use OpenAI's function calling with Zod schemas on the TypeScript side. If the LLM returns malformed JSON or unexpected values, the call is rejected and the agent is asked to retry. We cap retries at three, after which the task is escalated to a human.
This sounds basic, but you'd be surprised how many "agent frameworks" just pass raw LLM output to tool functions with minimal validation. We've seen agents pass string values where numbers were expected, fabricate parameter values that look plausible but are completely made up, and call tools with missing required fields. Schema validation catches all of this.
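The validate-then-retry loop can be sketched as follows. In our stack the schema is a Zod object; here a hand-rolled check keeps the example dependency-free, and the tool and field names are illustrative:

```typescript
const MAX_RETRIES = 3;

interface UpdateEtaArgs {
  trackingNumber: string;
  etaHours: number;
}

// Returns parsed args, or an error message describing what failed.
function validateUpdateEta(raw: string): UpdateEtaArgs | string {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return "malformed JSON";
  }
  const obj = parsed as Record<string, unknown>;
  if (typeof obj?.trackingNumber !== "string") return "trackingNumber must be a string";
  if (typeof obj?.etaHours !== "number") return "etaHours must be a number";
  return { trackingNumber: obj.trackingNumber, etaHours: obj.etaHours };
}

// Each validation failure is fed back to the model as a correction
// prompt; after MAX_RETRIES the task escalates to a human.
async function callWithValidation(
  generate: (feedback?: string) => Promise<string>,
): Promise<UpdateEtaArgs | "escalated"> {
  let feedback: string | undefined;
  for (let attempt = 0; attempt < MAX_RETRIES; attempt++) {
    const result = validateUpdateEta(await generate(feedback));
    if (typeof result !== "string") return result; // valid: execute the tool
    feedback = `Invalid tool call: ${result}. Please retry.`;
  }
  return "escalated";
}
```

The key property is that the tool function itself only ever receives a fully typed `UpdateEtaArgs` object; strings-where-numbers-belong and missing fields die at the boundary.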
The Memory Problem
Multi-step workflows require memory — the agent needs to remember what it's done, what information it's gathered, and what's left to do. We use a structured state object (not conversation history) to track this. Each step reads from and writes to a typed state object. This approach, inspired by LangGraph's state management, is more reliable than asking the LLM to maintain state in its context window.
For our logistics agent, the state object tracks: current task, gathered information (tracking numbers, customer details, shipment status), actions taken, actions remaining, and any errors encountered. The LLM sees a summary of this state at each step, not the full conversation history. This keeps the context window small and focused.
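The state object and its per-step summary can be sketched like this. The field names mirror the logistics agent described above but are illustrative, not the production schema:

```typescript
// Typed state object: each workflow step reads from and writes to this,
// rather than relying on the LLM to track state in its context window.
interface AgentState {
  currentTask: string;
  gathered: {
    trackingNumbers: string[];
    customerDetails?: { name: string; email: string };
    shipmentStatus?: string;
  };
  actionsTaken: string[];
  actionsRemaining: string[];
  errors: string[];
}

// The LLM sees this compact summary at each step, not the full
// conversation history -- keeping the context window small and focused.
function summarizeState(s: AgentState): string {
  return [
    `Task: ${s.currentTask}`,
    `Gathered: ${s.gathered.trackingNumbers.length} tracking number(s), ` +
      `status=${s.gathered.shipmentStatus ?? "unknown"}`,
    `Done: ${s.actionsTaken.join(", ") || "none"}`,
    `Remaining: ${s.actionsRemaining.join(", ") || "none"}`,
    s.errors.length ? `Errors: ${s.errors.join("; ")}` : "Errors: none",
  ].join("\n");
}
```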
Testing Agents Is Different
You can't unit test agents the way you test regular code. The LLM's behavior is non-deterministic, and the same input can produce different action sequences. We've developed a testing approach we call "trajectory testing" — instead of testing exact outputs, we test that the agent's action sequence is valid. Did it gather the required information before taking action? Did it stay within its authorized actions? Did it handle errors gracefully?
We run these tests with temperature set to 0 and a fixed seed for reproducibility, but we also run stochastic tests at higher temperatures to catch edge cases. It's expensive (each test run costs $5-15 in API calls), but cheaper than a production incident where the agent takes an unauthorized action.
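A trajectory test in this style asserts invariants over the recorded action sequence rather than exact outputs. The invariants and action names below are illustrative examples of the three checks described above, not our actual test suite:

```typescript
type Step = { action: string; ok: boolean };

const AUTHORIZED = new Set([
  "lookup_shipment", "update_eta", "notify_customer", "escalate",
]);

interface TrajectoryCheck {
  name: string;
  pass: boolean;
}

function checkTrajectory(steps: Step[]): TrajectoryCheck[] {
  const actions = steps.map((s) => s.action);
  return [
    // Did it stay within its authorized actions?
    {
      name: "only authorized actions",
      pass: actions.every((a) => AUTHORIZED.has(a)),
    },
    // Did it gather the required information before acting?
    {
      name: "lookup before notify",
      pass:
        !actions.includes("notify_customer") ||
        (actions.includes("lookup_shipment") &&
          actions.indexOf("lookup_shipment") < actions.indexOf("notify_customer")),
    },
    // Did it handle errors gracefully? A failed step must be followed
    // by a retry of the same action or an escalation, never dropped.
    {
      name: "failures are retried or escalated",
      pass: steps.every(
        (s, i) =>
          s.ok ||
          steps[i + 1]?.action === s.action ||
          steps[i + 1]?.action === "escalate",
      ),
    },
  ];
}
```

Because the checks constrain the shape of the trajectory rather than its exact contents, the same suite works across the different action sequences a non-deterministic model produces at higher temperatures.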