AI agent reliability is a systems problem, not a model problem

AI agent reliability isn't about a flakier model. Ookla's 2026 report shows the real shift: an agent is a dependency chain across systems you don't monitor end to end.

A wave of “agentic AI” hit production this year, and right behind it came the complaint that the agents are flaky. The instinct is to blame the model: pick a smarter one, or wait for the next release. That instinct is wrong. The model isn’t what got less reliable. The thing you built around it did.

TL;DR: AI agent reliability is a systems problem, not a model problem. An agent isn’t a single call. It’s a dependency chain across the model provider, your retrieval layer, and several internal systems, and it fails at the seams between them, not at the model. Ookla’s June 2026 reliability report logged 3.72 million user-reported AI incidents over roughly 16 months and found high-signal disruption days rising from 6 in the first quarter of 2025 to 51 in the first quarter of 2026. The fix isn’t a better model. It’s the boring distributed-systems work (observability, timeouts, circuit breakers, an owner for the whole chain) that almost no agent deployment actually has.

What Ookla actually measured

Be precise about the data, because the framing usually outruns it. Ookla published its AI reliability report on June 10, 2026, built from 471 days of U.S. Downdetector reports — January 1, 2025 through April 16, 2026 — covering ChatGPT, Claude, Gemini, Microsoft Copilot, AWS, and Azure. Over that window it counted 3.72 million user-reported incidents.

The number that matters isn’t the raw total; it’s the trend in what Ookla calls high-signal disruption days. Those are days when a single service’s user reports exceed ten times its own normal daily volume — a real outage, not background noise. They went from 6 in the first quarter of 2025, to 16 in the fourth quarter, to 51 in the first quarter of 2026. And they concentrate: of those 51 days in early 2026, one provider accounted for 39. The point isn’t to name a worst offender — every provider has its quarter. The point is that if your agent is wired to a single provider, that provider’s bad quarter is now your bad quarter, on its schedule, not yours.
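
To make that definition concrete, here is a minimal sketch of the threshold test in Python. It is an illustration, not Ookla’s method: the description above says only that a day counts when reports exceed ten times the service’s normal daily volume, so the use of the median as the “normal” baseline, the function name `high_signal_days`, and the sample counts are all assumptions.

```python
from statistics import median

def high_signal_days(daily_reports, multiplier=10):
    """Return the indexes of days whose report count exceeds multiplier x baseline.

    Assumption: "normal daily volume" is taken to be the median of the series.
    The source does not say how Ookla computes its baseline.
    """
    baseline = median(daily_reports)
    return [i for i, count in enumerate(daily_reports) if count > multiplier * baseline]

# Made-up daily report counts for one service (not Ookla data).
reports = [40, 35, 52, 38, 900, 41, 44, 37, 410, 39]
print(high_signal_days(reports))  # [4, 8]
```

With these invented numbers the baseline is 40.5, so only the two spike days clear the 10x bar. Ordinary day-to-day wobble never gets close, which is the point of the metric.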

Ookla’s lead analyst, Luke Kehoe, named the mechanism: these patterns are making AI a deeper single point of failure than cloud alone, because enterprises are inheriting both the hyperscaler risk and the provider-specific orchestration risk on top of it. The report’s own language is that agentic workflows increase dependency depth and amplify the impact of any single failure. That is the whole story, and it has nothing to do with model quality.

An agent is a dependency chain, not a feature

Here’s the assertion that reframes the problem: a chatbot is one call, and an agent is a distributed system. When you ship an agent, you didn’t add a feature. You stood up a chain — plan, retrieve, call a tool, read from a system of record, synthesize — and every link is a separate thing that can be slow, wrong, or down.

A single-step chatbot fails honestly. The model times out, you get an error, you retry, the user sees a spinner. Annoying, visible, contained. A five-step agent fails in the middle. Step three calls your inventory API, the call times out, and unless someone wrote the code to treat that timeout as a hard stop, the agent does the worst possible thing: it proceeds. The synthesis step doesn’t know step three came back empty. It produces a fluent, confident answer built on a hole. That’s not an outage you can see. It’s silent degradation, and it’s far more expensive than a clean failure because nobody gets paged.

This is the same shape as every integration problem I keep coming back to. Multi-agent orchestration doesn’t fix a process that loses state at every handoff; it just adds more handoffs to lose it at. The seam was the problem before the agents arrived. Chaining a model across five systems doesn’t remove the seams; it strings the agent through all of them. The whole run is then at best as reliable as its flakiest hop, and in practice less, because a run succeeds only if every hop does and the per-hop failure rates compound.
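
A rough way to put numbers on that: if hop failures are roughly independent, a run that needs every hop succeeds with a probability equal to the product of the per-hop success rates, which can never exceed the weakest hop’s rate. The sketch below assumes independence and uses invented success rates purely for illustration.

```python
from math import prod

def chain_success_rate(hop_success_rates):
    """Probability that every hop succeeds, assuming independent failures."""
    return prod(hop_success_rates)

# Hypothetical per-hop success rates for a five-step run (illustrative only).
hops = {
    "plan (model call)": 0.995,
    "retrieve": 0.999,
    "inventory API": 0.99,
    "system of record": 0.998,
    "synthesize (model call)": 0.995,
}

overall = chain_success_rate(hops.values())
weakest = min(hops.values())
print(f"weakest hop: {weakest:.3f}, whole run: {overall:.3f}")
```

With these numbers no single hop is worse than 99 percent, yet the run as a whole lands near 97.7 percent. Correlated failures (two hops sitting on the same degraded provider, say) would change the arithmetic, so treat this as intuition, not a forecast.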

The failure you don’t see coming

Drop down a level, because the concrete failure mode is where this gets real. When a model provider is degraded, it doesn’t go cleanly dark. It returns a rate-limit response, or an “overloaded” error, or it just gets slow. Your agent framework hits that mid-chain. What happens next is entirely a function of code you may not have written carefully: does the retry use backoff, or does it hammer an already-overloaded endpoint? Does a swallowed exception turn a provider error into an empty string that the next step happily treats as a valid result? Is there a circuit breaker, or does every concurrent run pile into the same failing dependency at once?

Most agent stacks shipped this year answer those questions badly, because they were built to demo, not to degrade. The demo never sees the provider’s 51st bad day. Production does. And the gap between “works in the demo” and “survives the bad quarter” is the entire discipline of distributed systems — timeouts, retries, idempotency, bulkheads — none of which is new, and none of which a smarter model provides.
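
To make the first of those questions concrete, here is a minimal sketch of a retry that backs off instead of hammering. It assumes a hypothetical `ProviderOverloaded` exception standing in for whatever rate-limit or overloaded error your client library raises; the function name and the default delays are illustrative, not recommendations.

```python
import random
import time

class ProviderOverloaded(Exception):
    """Hypothetical stand-in for a rate-limit or 'overloaded' response."""

def call_with_backoff(fn, max_attempts=4, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call fn(), retrying on ProviderOverloaded with capped exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except ProviderOverloaded:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error, don't swallow it
            # Double the ceiling each attempt, cap it, then wait a random amount
            # below it so concurrent runs don't all retry at the same instant.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))

# Simulated provider that recovers on the third call.
attempts = {"n": 0}

def flaky_model_call():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ProviderOverloaded("overloaded")
    return "plan: check inventory"

# The no-op sleep keeps the demo instant; drop it to get real waits.
result = call_with_backoff(flaky_model_call, sleep=lambda seconds: None)
print(result, attempts["n"])  # plan: check inventory 3
```

The two details that matter are the jitter and the final `raise`. Without jitter, every concurrent run retries in lockstep; without the re-raise, the last failure turns into exactly the silent empty result described earlier.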

| Dimension | Single AI call (chatbot) | Multi-step agent chain |
| --- | --- | --- |
| Points of failure | One: the model | The model + retrieval + every tool + every internal API |
| How failure shows up | Clean error, visible to the user | Silent: a bad step gets carried forward |
| Provider outage impact | Request fails, user retries | Chain stalls or proceeds on partial data |
| Who’s watching | The user, in real time | Often no one, until output is wrong at scale |
| What fixes it | Retry | Observability, circuit breakers, fallbacks, an owner |

Every row is the same lesson: the reliability question moved from the model to the wiring around it, and the wiring is where almost no one is looking.

Why no one is watching the chain

The deeper reason agents fail quietly is organizational, not technical. Walk the dependency chain and ask who’s on call for each link. The model provider owns its own uptime. Your retrieval system has a team. Each internal API has an owner. But the agent that threads through all of them belongs to none of those on-call rotations. The seams between systems are exactly the part no single team watches, which is why a degraded agent can run for hours — generating confident, wrong answers — before anyone connects the dots.

I’ve made this argument about cost and about data, and it’s the same argument here. Integration and data contracts are the unglamorous layer that decides whether any of this works; reliability is that layer wearing a different hat. You can’t operate what you don’t observe, and most teams observe the model (latency, token count) while the actual failures happen one layer out, in the calls between systems that no dashboard is tracking end to end.

The working version: engineer the chain, not the model

None of this is an argument against agents. It’s an argument about where the work is — and it’s the same order of operations I bring to every job.

Instrument the whole run, not the model call. You need to see which step failed, with what input, and how long it took — per run, end to end. A bad final answer with no trace is unfixable; a trace that shows step three timed out is a ten-minute fix. This is the single biggest thing most agent deployments are missing, and it pays for itself the first time something breaks.
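
As a rough sketch of what per-run, per-step tracing can look like, the following uses only the Python standard library. The `RunTrace` and `StepRecord` names, their fields, and the step names are hypothetical; a real deployment would more likely export the same information to whatever tracing backend it already runs.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field

@dataclass
class StepRecord:
    name: str
    input_summary: str
    status: str = "ok"
    error: str = ""
    duration_s: float = 0.0

@dataclass
class RunTrace:
    run_id: str
    steps: list = field(default_factory=list)

    @contextmanager
    def step(self, name, input_summary=""):
        """Record one step's name, input, duration, and outcome."""
        record = StepRecord(name=name, input_summary=input_summary)
        start = time.perf_counter()
        try:
            yield record
        except Exception as exc:
            record.status = "failed"
            record.error = f"{type(exc).__name__}: {exc}"
            raise  # record the failure, then let it propagate
        finally:
            record.duration_s = time.perf_counter() - start
            self.steps.append(record)

    def first_failure(self):
        return next((s for s in self.steps if s.status == "failed"), None)

trace = RunTrace(run_id="run-001")
try:
    with trace.step("plan", input_summary="user question"):
        pass  # the model call would go here
    with trace.step("inventory_lookup", input_summary="sku=ABC-123"):
        raise TimeoutError("inventory API timed out")  # simulated failure
    with trace.step("synthesize"):
        pass  # never reached
except TimeoutError:
    pass

failed = trace.first_failure()
print(failed.name, "|", failed.error)
```

The trace answers “which step failed, with what input, and how long did it take” without anyone having to reconstruct it from a bad final answer.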

Treat provider errors as hard stops. A rate limit, an overloaded response, or a timeout is a signal to stop or fall back — never a null to quietly slide past. The most dangerous line in an agent codebase is the one that catches an exception and returns an empty result the next step trusts.
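
Here is a small sketch of the difference, using a simulated inventory client. Every name in it (`StepFailed`, `fetch_inventory`, `synthesize`, the SKU) is hypothetical; the point is only the contrast between swallowing a timeout and stopping on it.

```python
class StepFailed(Exception):
    """Raised when a step cannot produce a trustworthy result (hypothetical name)."""

def fetch_inventory_unsafe(client, sku):
    # Anti-pattern: the timeout is swallowed and the next step receives "".
    try:
        return client(sku)
    except TimeoutError:
        return ""

def fetch_inventory(client, sku):
    # A timeout or an empty payload stops the run instead of flowing downstream.
    try:
        result = client(sku)
    except TimeoutError as exc:
        raise StepFailed(f"inventory lookup timed out for {sku}") from exc
    if not result:
        raise StepFailed(f"inventory lookup returned nothing for {sku}")
    return result

def synthesize(inventory):
    return f"Current stock: {inventory}"

def timing_out_client(sku):
    raise TimeoutError("simulated timeout")

# Unsafe path: a fluent-looking answer with nothing behind it.
print(repr(synthesize(fetch_inventory_unsafe(timing_out_client, "ABC-123"))))

# Safe path: the run halts with a reason someone can act on.
try:
    answer = synthesize(fetch_inventory(timing_out_client, "ABC-123"))
except StepFailed as exc:
    answer = None
    print(f"run halted: {exc}")
```

What to do after the halt (retry, fall back, or tell the user) is a separate decision. The sketch only makes sure the decision gets made on purpose.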

Put circuit breakers and fallbacks on every external hop. When a provider has its bad day — and the data says it will — one degraded dependency should trip a breaker and route to a fallback, not drag every concurrent run down with it. If a single workflow can’t tolerate one provider being slow, it isn’t production-ready; it’s a demo with users attached.
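
A circuit breaker can be sketched in a few dozen lines. This is a simplified, single-threaded illustration with assumed names and thresholds (`CircuitBreaker`, three consecutive failures, a 30-second cool-down), not a production implementation; a real one would need thread safety and a narrower definition of which errors count.

```python
import time

class CircuitOpen(Exception):
    """Raised while the breaker is refusing calls (hypothetical name)."""

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpen("dependency marked unhealthy; skipping call")
            # Cool-down elapsed: allow one trial call. A single failure re-opens.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result

def with_fallback(breaker, primary, fallback):
    """Try the primary through the breaker; on any failure, use the fallback."""
    try:
        return breaker.call(primary)
    except Exception:
        return fallback()

# Simulated degraded provider and an explicit, labeled fallback.
calls = {"primary": 0}

def degraded_provider():
    calls["primary"] += 1
    raise TimeoutError("provider overloaded")

def cached_answer():
    return "fallback: cached result"

breaker = CircuitBreaker(failure_threshold=3, reset_after_s=30.0)
results = [with_fallback(breaker, degraded_provider, cached_answer) for _ in range(10)]
print(calls["primary"])  # 3
```

Ten runs arrive, but only three of them touch the failing provider; the other seven are routed to the fallback without adding load. Note that the fallback returns something explicitly marked as a fallback, not an empty value the next step might mistake for a real result.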

Give the chain one owner. The reliability of the whole run is somebody’s job, by name — the person who reconciles “the model is up” with “the workflow is actually working,” because those are different facts. That role rarely exists on an org chart yet, and it’s the one that keeps the agent honest. It’s the work I do first on every engagement.

Do it in the other order — ship the chain, then discover the seams when the bad quarter hits — and you’ve automated a process whose failure modes you’ve never mapped.

The operator read

The Ookla numbers will get read as “the AI tools are getting less reliable.” That’s the lazy version. What actually changed is that we started chaining those tools across five systems and calling the result an agent, while monitoring exactly one of the five. The model didn’t become a single point of failure. We made it one, and then pointed it at production without the plumbing that every other distributed system requires.

You didn’t deploy an agent. You deployed a dependency chain, and right now most of it is running where no one is looking. If a workflow in your shop broke this week and the honest answer to “which step failed” is a shrug, the model was never the problem. That’s the conversation worth having.

FAQ

Why do AI agents fail more often than a regular chatbot?
Because an agent isn't one call; it's a chain. A chatbot answers in a single step: one prompt, one response, and if the model is having a bad minute you see an error and retry. An agent plans, retrieves, calls tools, reads from your systems, and synthesizes across five or more steps, often spanning your retrieval layer, the model provider, and a couple of internal APIs. Any one of those links can stall or return garbage, and the failure usually lands in the middle of the chain rather than at the front door. So the same underlying flakiness that a chatbot surfaces as a clean error, an agent absorbs and carries forward: it keeps going on partial inputs and hands you a confident answer built on a step that quietly failed.

What did the Ookla AI reliability report find?
Ookla published the report on June 10, 2026. Drawing on 471 days of U.S. Downdetector data from January 1, 2025 to April 16, 2026, it logged 3.72 million user-reported incidents across ChatGPT, Claude, Gemini, Microsoft Copilot, AWS, and Azure. Its sharpest metric is 'high-signal disruption days': days when a service's user reports run more than ten times its own normal daily volume. Those rose from 6 in the first quarter of 2025 to 16 in the fourth quarter to 51 in the first quarter of 2026. The report frames agentic workflows as increasing dependency depth and amplifying the impact of any single failure.

Is agentic AI reliable enough for production in 2026?
The models are reliable enough. The chains built on top of them often aren't, not because the model degraded but because most agent deployments have no end-to-end observability and no plan for what happens when one link is slow or down. Ookla's lead analyst, Luke Kehoe, put it well: enterprises are inheriting both the hyperscaler risk and the provider-specific orchestration risk on top of it, which makes AI a deeper single point of failure than cloud alone. Agentic AI is production-ready exactly to the degree that you've engineered the chain (timeouts, retries with backoff, circuit breakers, fallbacks, and monitoring of every hop) the same way you would any distributed system.

Who is responsible when an AI agent breaks?
Usually no one, and that's the actual problem. The model provider owns its uptime. Your retrieval system owns its index. Each internal API owns its slice. But the agent that strings them together crosses all of those boundaries, and the seams between them are where it breaks: a tool call that times out, a provider returning an overloaded error, a stale read passed downstream. Those seams belong to no single team's on-call, so a degraded agent can run for hours producing plausible, wrong output before anyone notices. Reliability requires naming an owner for the whole chain, not just the parts.

How do you make an AI agent reliable?
Treat it as the distributed system it is. Instrument every step so you can see where a run actually failed instead of guessing from a bad final answer. Set explicit timeouts and treat provider errors — rate limits, overloaded responses, timeouts — as hard stops, not as nulls to quietly proceed past. Add circuit breakers and fallbacks for the steps that touch an external provider, so one bad quarter from one vendor doesn't take your workflow down with it. Make each step idempotent so a retry can't double-act. And give one person ownership of the end-to-end chain. None of that is a better model. All of it is plumbing — and the plumbing is what decides whether the agent is reliable.