There's a moment in most AI agent projects where everything seems to be working. The agent handles the test cases. The demo goes well. Someone says "let's put this in production." Then things get complicated.
This isn't about the model. Most production failures we've seen have little to do with which LLM you're using. They're about the infrastructure around the model — the retry logic, the output validation, the error states, the monitoring. The parts that don't appear in demos because demos are curated.
The demo environment is always cleaner than your production environment
Every vendor demo runs against clean data. Consistent formats. Known inputs. No legacy systems. No auth edge cases. No network timeouts.
Your production environment has all of those. When we start a new project, we build a sampling harness in week one and run the agent candidate against 200–300 real examples from the client's actual data. Not synthetic data. Not curated examples. Whatever the real distribution looks like, including the outliers.
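A sampling harness of this kind can be quite small. The following is a minimal sketch, not our actual tooling: `sample_records`, `run_harness`, the `agent` callable and the `is_correct` checker are all hypothetical names, and it assumes the client's records are already loaded as a list of dictionaries.

```python
import random
from typing import Any, Callable

def sample_records(records: list[dict], k: int, seed: int = 0) -> list[dict]:
    """Draw up to k records uniformly at random, outliers included."""
    rng = random.Random(seed)  # fixed seed so a run can be reproduced
    return rng.sample(records, min(k, len(records)))

def run_harness(
    agent: Callable[[dict], Any],
    is_correct: Callable[[dict, Any], bool],
    records: list[dict],
) -> tuple[float, list[dict]]:
    """Run the agent on every record; return the pass rate and the failing records."""
    failures = []
    for record in records:
        try:
            ok = is_correct(record, agent(record))
        except Exception:
            ok = False  # a crash on real data counts as a failure
        if not ok:
            failures.append(record)
    total = len(records)
    pass_rate = (total - len(failures)) / total if total else 0.0
    return pass_rate, failures
```

Returning the failing records alongside the pass rate is deliberate: the failures are what show where the process is messy.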
A system that gets 97% of demo cases right might get 81% right on the real distribution. The gap is often the interesting part: it tells you where the process is actually messy, which is also where human escalation paths need to be designed carefully.
Retry logic is not optional
LLM APIs have latency variance. They time out. They occasionally return malformed responses. If your agent is calling an LLM and treating every call as atomic and guaranteed, it will fail in production in ways that are difficult to debug.
We implement exponential backoff with jitter as standard. We log every call with its input hash, model version, response time, and output. We validate outputs structurally before acting on them. None of this is glamorous. All of it prevents the 2am incident.
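As an illustration, exponential backoff with jitter can be sketched in a few lines. This is one common variant ("full jitter", where each wait is a random fraction of an exponentially growing cap), not necessarily the exact policy used on any given project. `TransientLLMError` and `call_with_backoff` are hypothetical names; in practice you would catch whatever timeout or rate-limit exceptions your client library raises.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

class TransientLLMError(Exception):
    """Hypothetical error type standing in for timeouts and other retryable failures."""

def call_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Retry `call` on transient errors, waiting a jittered, exponentially growing delay."""
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientLLMError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)  # full jitter: uniform in [0, cap)
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` and `rng` parameters exist only so the retry schedule can be tested without real waiting.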
The most useful thing we ever added to an agent pipeline was a structured log entry for every LLM call. Not for debugging — for monitoring. Within two weeks we had enough data to see exactly when output quality started drifting.
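A structured log entry carrying the fields mentioned above (input hash, model version, response time, output) might look like the sketch below. The field names, the `llm_calls` logger name and the `logged_call` wrapper are assumptions made for illustration; the useful property is that each call produces one machine-readable JSON line.

```python
import hashlib
import json
import logging
import time
from typing import Callable

logger = logging.getLogger("llm_calls")  # hypothetical logger name

def build_log_entry(prompt: str, model_version: str, output: str, elapsed_s: float) -> dict:
    """Assemble one structured record per LLM call."""
    return {
        # Hash rather than raw input: lets you group identical inputs without storing them.
        "input_hash": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "model_version": model_version,
        "response_time_ms": round(elapsed_s * 1000, 1),
        "output": output,
    }

def logged_call(call: Callable[[str], str], prompt: str, model_version: str) -> str:
    """Invoke `call(prompt)`, time it, and emit the record as a single JSON line."""
    start = time.perf_counter()
    output = call(prompt)
    entry = build_log_entry(prompt, model_version, output, time.perf_counter() - start)
    logger.info(json.dumps(entry))
    return output
```

Because every record has the same keys, response times and outputs can be aggregated per model version later, which is what makes the log usable for monitoring rather than only for debugging.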
Output validation is where most agents cut corners
When an LLM returns an output, what are you actually checking? If you're parsing JSON, are you validating the schema? Are you handling the case where the model returns valid JSON in the wrong structure? Are you handling the case where it returns text that isn't JSON at all?
We use strict schema validation on every structured output, with a defined fallback path for validation failures. The fallback is usually: log the failure, route to human review, alert if the failure rate exceeds a threshold. "Send it anyway and see what happens" is not a fallback.
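To make the three failure cases concrete (not JSON at all, valid JSON in the wrong structure, right structure with wrong types), here is a minimal stdlib-only sketch. The `SCHEMA` fields, the 5% alert threshold and the `FallbackRouter` class are all hypothetical; a real deployment would more likely use a dedicated schema library and a real review queue.

```python
import json
from typing import Optional

# Hypothetical schema: required field name -> expected Python type.
# Strict on purpose: a JSON integer for "confidence" is rejected.
SCHEMA = {"category": str, "confidence": float, "summary": str}

def validate_output(raw: str, schema: dict = SCHEMA) -> tuple[Optional[dict], Optional[str]]:
    """Return (parsed, None) on success or (None, reason) on failure."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None, "not_json"
    if not isinstance(data, dict) or set(data) != set(schema):
        return None, "wrong_structure"  # valid JSON, but not the shape we asked for
    for key, expected in schema.items():
        if not isinstance(data[key], expected):
            return None, f"wrong_type:{key}"
    return data, None

class FallbackRouter:
    """Record validation failures, queue them for human review, and flag high failure rates."""

    def __init__(self, alert_threshold: float = 0.05, min_calls: int = 20):
        self.alert_threshold = alert_threshold
        self.min_calls = min_calls
        self.total = 0
        self.failed = 0
        self.review_queue: list[dict] = []  # stand-in for a real human review queue

    def handle(self, raw: str) -> Optional[dict]:
        self.total += 1
        data, reason = validate_output(raw)
        if reason is None:
            return data
        self.failed += 1
        self.review_queue.append({"raw": raw, "reason": reason})
        return None  # the caller must not act on an invalid output

    def should_alert(self) -> bool:
        if self.total < self.min_calls:
            return False  # too few calls for the rate to be meaningful
        return self.failed / self.total > self.alert_threshold
```

The key design choice is that `handle` returns `None` on failure, so downstream code cannot act on an invalid output by accident.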
Design the escalation path before you design the agent
An agent that can't fail gracefully is a liability. Before we write a line of agent code, we define exactly what happens when the agent encounters a case it shouldn't handle. Who does it go to? What does the notification look like? What does the human reviewer actually need to see in order to make a decision?
The escalation path also tells you something important: if the human escalation queue is regularly getting more than 15–20% of volume, the agent specification probably needs tightening. Either the confidence thresholds are wrong, or the task scope is wider than the agent can reliably handle.
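The 15–20% rule of thumb is easy to turn into a check. The sketch below assumes daily counts of escalated and total cases are available. The function names and the "at least five days in the window" reading of "regularly" are assumptions made for illustration.

```python
def escalation_rate(escalated: int, total: int) -> float:
    """Fraction of cases routed to a human; 0.0 when there was no volume."""
    return escalated / total if total else 0.0

def needs_spec_review(
    daily_counts: list[tuple[int, int]],
    threshold: float = 0.15,
    min_days: int = 5,
) -> bool:
    """True if the escalation rate exceeded `threshold` on at least `min_days` days.

    `daily_counts` holds one (escalated, total) pair per day in the window.
    """
    days_over = sum(
        1 for escalated, total in daily_counts
        if escalation_rate(escalated, total) > threshold
    )
    return days_over >= min_days
```

Counting days over the threshold, rather than averaging the whole window, keeps a single bad day from triggering a review while still catching a sustained pattern.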
Monitoring is not logging
Logs tell you what happened. Monitoring tells you whether what happened was normal. The two are different instruments.
For every production agent we deploy, we set up output quality tracking from day one. This means running a sample of outputs through an evaluation harness on a regular schedule — typically daily for high-volume agents, weekly for lower-volume ones — and comparing the distribution to a baseline established during the testing phase.
When the distribution shifts, we get an alert. Sometimes the shift is benign: new categories of input that weren't in the baseline sample. Sometimes it signals a problem: a model update, a data source change, or accumulated prompt drift. Either way, you want to know before your ops team does.
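One simple way to compare an output distribution against a baseline is total variation distance over categorical evaluation labels. This is a sketch of that one technique, chosen here for illustration; it is not a description of any specific evaluation harness. The label values and the 0.1 threshold are assumptions.

```python
from collections import Counter

def distribution(labels: list[str]) -> dict[str, float]:
    """Turn a list of categorical labels into relative frequencies."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """Total variation distance: 0.0 for identical distributions, 1.0 for disjoint ones."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def drift_alert(baseline_labels: list[str], current_labels: list[str],
                threshold: float = 0.1) -> dict:
    """Compare today's sample to the baseline and report whether to alert."""
    p = distribution(baseline_labels)
    q = distribution(current_labels)
    distance = total_variation(p, q)
    return {
        "distance": distance,
        "alert": distance > threshold,
        # Labels never seen in the baseline often explain a benign shift.
        "new_categories": sorted(set(q) - set(p)),
    }
```

Reporting `new_categories` next to the distance helps separate the benign case (new kinds of input) from the worrying one (the same inputs producing different outputs).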
The handover documentation is part of the product
The agent will eventually be managed by someone who wasn't involved in building it. That person needs to understand: what the agent does, what it doesn't do, how to tell if it's working, what to do if it isn't, and how to update the configuration without breaking it.
We write two documents for every deployment. An architecture document for whoever maintains the infrastructure, and a runbook for whoever manages the operational workflow. They have different audiences and different formats. Neither is a substitute for the other.
A well-documented agent that lives in your stack for two years is worth more than a poorly documented agent that needs a specialist to fix it whenever something breaks.