Agents Are Not Chatbots
There's a huge difference between a chatbot that answers questions and an agent that takes actions. When a chatbot hallucinates, the user sees a wrong answer and moves on. When an agent hallucinates, it might send an email to the wrong person, update a database incorrectly, or trigger a payment for the wrong amount. We've built both, and the engineering rigor required for agents is at least 3x what you need for a chatbot.
We shipped our first agentic system last year — an operations agent for a logistics client that could track shipments, update ETAs, notify customers, and escalate issues to human operators. It works. But getting there required building guardrails we'd never considered before.
The Confirmation Layer
Every action our agents take goes through a confirmation layer. For low-risk actions (looking up information, generating reports), the agent proceeds automatically. For medium-risk actions (sending notifications, updating records), the agent generates a preview and asks for human confirmation. For high-risk actions (financial transactions, deleting data), the agent requires explicit approval from an authorized user.
We define risk levels in a configuration file, not in the LLM prompt. The LLM doesn't get to decide what's risky — we do. This was a deliberate design choice after we watched a demo where the agent decided that sending a mass email to 10,000 customers was "low risk." It wasn't.
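A minimal sketch of what this config-driven gating can look like. The action names, risk tiers, and decision shapes below are illustrative, not the client's actual setup:

```typescript
// Risk-tiered confirmation layer. Risk levels live in configuration,
// not in the LLM prompt -- the model never classifies its own actions.
type RiskLevel = "low" | "medium" | "high";

const ACTION_RISK: Record<string, RiskLevel> = {
  lookup_shipment: "low",
  generate_report: "low",
  notify_customer: "medium",
  update_eta: "medium",
  issue_refund: "high",
  delete_record: "high",
};

type Decision =
  | { kind: "proceed" }                // low risk: run automatically
  | { kind: "preview_and_confirm" }    // medium: human confirms a preview
  | { kind: "require_approval" }       // high: explicit authorized sign-off
  | { kind: "reject"; reason: string };

function gateAction(actionName: string): Decision {
  const risk = ACTION_RISK[actionName];
  if (risk === undefined) {
    // Unregistered actions are rejected outright rather than guessed at.
    return { kind: "reject", reason: `unregistered action: ${actionName}` };
  }
  switch (risk) {
    case "low":    return { kind: "proceed" };
    case "medium": return { kind: "preview_and_confirm" };
    case "high":   return { kind: "require_approval" };
  }
}
```

Because the mapping is a plain lookup table, "sending a mass email" can never talk its way into the low-risk tier: anything not explicitly registered is rejected.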
Structured Output Is Non-Negotiable
Every tool call an agent makes must use structured output — JSON with a strict schema validated before execution. We use OpenAI's function calling with Zod schemas on the TypeScript side. If the LLM returns malformed JSON or unexpected values, the call is rejected and the agent is asked to retry. We cap retries at three, after which the task is escalated to a human.
This sounds basic, but you'd be surprised how many "agent frameworks" just pass raw LLM output to tool functions with minimal validation. We've seen agents pass string values where numbers were expected, fabricate parameter values that look plausible but are completely made up, and call tools with missing required fields. Schema validation catches all of this.
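The validate-then-retry loop can be sketched as follows. In our stack the schema is a Zod object; here a hand-rolled check keeps the example dependency-free, and the tool and field names are illustrative:

```typescript
const MAX_RETRIES = 3;

interface UpdateEtaArgs {
  trackingNumber: string;
  etaHours: number;
}

// Returns parsed args, or an error message describing what failed.
function validateUpdateEta(raw: string): UpdateEtaArgs | string {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return "malformed JSON";
  }
  const obj = parsed as Record<string, unknown>;
  if (typeof obj?.trackingNumber !== "string") return "trackingNumber must be a string";
  if (typeof obj?.etaHours !== "number") return "etaHours must be a number";
  return { trackingNumber: obj.trackingNumber, etaHours: obj.etaHours };
}

// Each validation failure is fed back to the model as a correction
// prompt; after MAX_RETRIES the task escalates to a human.
async function callWithValidation(
  generate: (feedback?: string) => Promise<string>,
): Promise<UpdateEtaArgs | "escalated"> {
  let feedback: string | undefined;
  for (let attempt = 0; attempt < MAX_RETRIES; attempt++) {
    const result = validateUpdateEta(await generate(feedback));
    if (typeof result !== "string") return result; // valid: execute the tool
    feedback = `Invalid tool call: ${result}. Please retry.`;
  }
  return "escalated";
}
```

The key property is that the tool function itself only ever receives a fully typed `UpdateEtaArgs` object; strings-where-numbers-belong and missing fields die at the boundary.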
The Memory Problem
Multi-step workflows require memory — the agent needs to remember what it's done, what information it's gathered, and what's left to do. We use a structured state object (not conversation history) to track this. Each step reads from and writes to a typed state object. This approach, inspired by LangGraph's state management, is more reliable than asking the LLM to maintain state in its context window.
For our logistics agent, the state object tracks: current task, gathered information (tracking numbers, customer details, shipment status), actions taken, actions remaining, and any errors encountered. The LLM sees a summary of this state at each step, not the full conversation history. This keeps the context window small and focused.
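The state object and its per-step summary can be sketched like this. The field names mirror the logistics agent described above but are illustrative, not the production schema:

```typescript
// Typed state object: each workflow step reads from and writes to this,
// rather than relying on the LLM to track state in its context window.
interface AgentState {
  currentTask: string;
  gathered: {
    trackingNumbers: string[];
    customerDetails?: { name: string; email: string };
    shipmentStatus?: string;
  };
  actionsTaken: string[];
  actionsRemaining: string[];
  errors: string[];
}

// The LLM sees this compact summary at each step, not the full
// conversation history -- keeping the context window small and focused.
function summarizeState(s: AgentState): string {
  return [
    `Task: ${s.currentTask}`,
    `Gathered: ${s.gathered.trackingNumbers.length} tracking number(s), ` +
      `status=${s.gathered.shipmentStatus ?? "unknown"}`,
    `Done: ${s.actionsTaken.join(", ") || "none"}`,
    `Remaining: ${s.actionsRemaining.join(", ") || "none"}`,
    s.errors.length ? `Errors: ${s.errors.join("; ")}` : "Errors: none",
  ].join("\n");
}
```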
Testing Agents Is Different
You can't unit test agents the way you test regular code. The LLM's behavior is non-deterministic, and the same input can produce different action sequences. We've developed a testing approach we call "trajectory testing" — instead of testing exact outputs, we test that the agent's action sequence is valid. Did it gather the required information before taking action? Did it stay within its authorized actions? Did it handle errors gracefully?
We run these tests with temperature set to 0 and a fixed seed for reproducibility, but we also run stochastic tests at higher temperatures to catch edge cases. It's expensive (each test run costs $5-15 in API calls), but cheaper than a production incident where the agent takes an unauthorized action.
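A trajectory test in this style asserts invariants over the recorded action sequence rather than exact outputs. The invariants and action names below are illustrative examples of the three checks described above, not our actual test suite:

```typescript
type Step = { action: string; ok: boolean };

const AUTHORIZED = new Set([
  "lookup_shipment", "update_eta", "notify_customer", "escalate",
]);

interface TrajectoryCheck {
  name: string;
  pass: boolean;
}

function checkTrajectory(steps: Step[]): TrajectoryCheck[] {
  const actions = steps.map((s) => s.action);
  return [
    // Did it stay within its authorized actions?
    {
      name: "only authorized actions",
      pass: actions.every((a) => AUTHORIZED.has(a)),
    },
    // Did it gather the required information before acting?
    {
      name: "lookup before notify",
      pass:
        !actions.includes("notify_customer") ||
        (actions.includes("lookup_shipment") &&
          actions.indexOf("lookup_shipment") < actions.indexOf("notify_customer")),
    },
    // Did it handle errors gracefully? A failed step must be followed
    // by a retry of the same action or an escalation, never dropped.
    {
      name: "failures are retried or escalated",
      pass: steps.every(
        (s, i) =>
          s.ok ||
          steps[i + 1]?.action === s.action ||
          steps[i + 1]?.action === "escalate",
      ),
    },
  ];
}
```

Because the checks constrain the shape of the trajectory rather than its exact contents, the same suite works across the different action sequences a non-deterministic model produces at higher temperatures.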