How to evaluate an AI vendor without getting lost in the demo

You've sat through three AI vendor demos in the past quarter. All three were impressive. All three showed the same cherry-picked examples performing well. You're no closer to knowing which one — if any — can actually handle your use case in your environment.

This is the standard situation. Vendors optimise for demos because demos win deals. The questions that reveal production readiness are the ones that make the sales team less comfortable. Here they are.

What to ask immediately after the demo

Before they start talking pricing, ask to see the system running on your data. Not on a sample they've prepared — on data you provide, right now. A vendor who can't accommodate this is telling you something important about how the system was built.

Ask specifically about failure modes. "What does the system do when it encounters input it doesn't recognise?" is a better question than "how accurate is it?" The accuracy number comes from their test set. The failure mode question reveals how much thought went into the unhappy path.

The questions that separate vendors

These are the questions we've found most revealing when evaluating AI vendors for client engagements. Ask them directly and watch what happens to the energy in the room:

Can I see your evaluation dataset? Accuracy claims without visibility into what they were measured against are not useful. A vendor that says "94% accuracy" and can't show you the eval set is offering you a number without context.

What happens when your API is down? This is less about the answer and more about whether they have an answer. A production system needs defined fallback behaviour, not "our uptime is very high."
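To make "defined fallback behaviour" concrete, here is a minimal sketch of what it could look like on your side of the integration. Everything in it is a hypothetical illustration: `classify_with_fallback`, the `review_queue`, and the stub vendor functions are assumptions, and a real client library will raise its own exception types rather than the built-in ones used here.

```python
def classify_with_fallback(vendor_call, document, review_queue):
    """Try the vendor; on an outage or timeout, park the document for manual review."""
    try:
        return {"source": "vendor", "label": vendor_call(document)}
    except (TimeoutError, ConnectionError) as exc:
        # Defined fallback: nothing is dropped, and the caller knows what happened.
        review_queue.append(document)
        return {"source": "fallback", "label": None, "reason": type(exc).__name__}


# Hypothetical stand-ins for a healthy and an unavailable vendor API.
def healthy_vendor(document):
    return "invoice"


def unavailable_vendor(document):
    raise TimeoutError("vendor did not respond in time")


review_queue = []
ok = classify_with_fallback(healthy_vendor, "doc-1", review_queue)
degraded = classify_with_fallback(unavailable_vendor, "doc-2", review_queue)
print(ok)
print(degraded)
print(review_queue)
```

The point of the sketch is the shape of the answer you want from the vendor: what gets returned, where the unprocessed work goes, and how your team finds out.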

Who maintains the prompts, and how? A prompt that performs well today can degrade as the data it sees drifts away from what it was tuned on. If there's no clear owner and process for prompt maintenance, there's no clear answer to what happens six months after deployment.

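What's the latency at p95? Not the average, the p95. An average hides the slow tail: when most requests are fast, a minority of very slow ones can still leave the mean looking acceptable. p95 is the latency that one request in twenty exceeds. It tells you what happens on a bad day, which is what matters for user-facing workflows.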
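A small worked example shows how far apart the two numbers can be. This is a sketch using made-up latencies and the nearest-rank definition of a percentile; the `p95` helper and the sample figures are assumptions for illustration, and a vendor's monitoring tool may use a slightly different percentile method.

```python
import statistics


def p95(latencies_ms):
    """Return the 95th percentile using the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Nearest rank: ceiling of 0.95 * n, computed in integer arithmetic.
    rank = -(-95 * len(ordered) // 100)
    return ordered[rank - 1]


# Hypothetical sample: 90 fast responses and 10 slow ones, in milliseconds.
latencies_ms = [200] * 90 + [3000] * 10

mean_ms = statistics.mean(latencies_ms)
p95_ms = p95(latencies_ms)
print(f"mean: {mean_ms:.0f} ms, p95: {p95_ms} ms")  # mean: 480 ms, p95: 3000 ms
```

In this sample the mean suggests a system that responds in about half a second, while one user in ten waits three seconds.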

What observability do you provide out of the box? Can they show you what a typical week of monitoring looks like? If observability is an add-on or a "contact us" item, factor that into your evaluation.

Responses that should slow you down

Vague answers to specific questions about infrastructure are the clearest signal. If the question is "how do you handle API timeouts" and the answer involves the word "seamlessly", that's not an answer.

Reluctance to provide references at companies similar to yours is worth noting. References from large enterprises don't tell you much about how the system performs for a 200-person operations team.

Sales-led evaluation processes where you can't speak to their engineering team are another warning sign. If the people who built the system aren't available during evaluation, getting access to them when something breaks post-deployment is going to be interesting.

Structure the pilot correctly

If you run a pilot, run it with full production data, including the edge cases, the messy inputs, and the unusual formats your team knows about but wouldn't think to mention. Brief whoever runs the pilot not to clean the data first.

Define success criteria for the pilot in writing before the pilot starts. Not after, when there's pressure to interpret results favourably. What accuracy on what task, at what volume, with what latency, over what time period.
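One way to keep yourself honest is to write the criteria down in a form that can be checked mechanically once the pilot ends. The sketch below is only an illustration: the class names, field names, and every threshold are hypothetical, and your own criteria will differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PilotCriteria:
    task: str
    min_accuracy: float        # fraction correct on the agreed task
    min_volume: int            # number of real items processed
    max_p95_latency_ms: float
    min_duration_days: int


@dataclass(frozen=True)
class PilotResults:
    accuracy: float
    volume: int
    p95_latency_ms: float
    duration_days: int


def evaluate_pilot(criteria, results):
    """Return a pass/fail flag for each criterion agreed before the pilot."""
    return {
        "accuracy": results.accuracy >= criteria.min_accuracy,
        "volume": results.volume >= criteria.min_volume,
        "latency": results.p95_latency_ms <= criteria.max_p95_latency_ms,
        "duration": results.duration_days >= criteria.min_duration_days,
    }


# Hypothetical numbers, agreed in writing before the pilot starts.
criteria = PilotCriteria("invoice field extraction", 0.92, 5000, 2000, 30)
# Hypothetical measurements collected at the end of the pilot.
results = PilotResults(accuracy=0.94, volume=6200, p95_latency_ms=2600, duration_days=30)

outcome = evaluate_pilot(criteria, results)
print(outcome)
print("pilot passed" if all(outcome.values()) else "pilot did not meet the agreed criteria")
```

Here the pilot clears the accuracy bar and still fails on latency, which is exactly the kind of result that gets reinterpreted when the criteria were never written down.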

A vendor who pushes back on pre-defined success criteria is worth scrutinising. A vendor who welcomes them is worth taking more seriously.