Five years ago, “using AI” at most companies meant a recommendation model tucked behind a product surface. Today, agents are executing end-to-end work on behalf of the business. They pick up IT tickets and run diagnostics. They triage tier-1 customer support and issue refunds. In financial services, they screen transactions, reconcile ledgers, and draft first-pass credit memos.
An AI agent is not a chat window. It is software that decides and acts. It uses tools - API calls, database queries, MCP servers, function calls - to read and write real systems. It has credentials. It has a scope. It replaces or augments a person doing a specific job.
Once an agent is load-bearing, “the model was down” is not an excuse the business can absorb. It is a production incident.
Three risks you have to plan for
Traditional BCP thinks about outages, human error, and security. AI agents map to all three - but the failure modes look different enough that most existing plans do not actually cover them.
1. Model outages
Every major provider has had multi-hour outages in the last year. If your agent depends on a single frontier model, you have taken on an availability dependency that your vendor does not price as tier-1. When the endpoint is down, the agent is down - which means the workflow behind it is down. Tickets pile up. Tier-1 support stops answering. Reconciliation stalls. The outage is not abstract anymore; it is your outage.
Treat model providers like any other critical dependency. Subscribe to their public status pages and wire them into your own monitoring - for example, status.openai.com and status.claude.com. When a provider declares an incident, your on-call should already know, and your failover should already be starting.
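As a minimal sketch of that wiring: many provider status pages are Statuspage-style and expose a JSON summary with a severity `indicator`. The exact URLs, paths, and schema below are assumptions to verify against each provider, and the failover threshold is a policy choice, not a standard.

```python
import json
import urllib.request

# ASSUMPTION: Statuspage-style JSON summaries at these paths; verify per provider.
STATUS_URLS = {
    "openai": "https://status.openai.com/api/v2/status.json",
    "anthropic": "https://status.claude.com/api/v2/status.json",
}

# Statuspage-style indicators, ordered by severity.
SEVERITY = {"none": 0, "minor": 1, "major": 2, "critical": 3}

def should_fail_over(payload: dict, threshold: str = "major") -> bool:
    """True when the reported indicator is at or above the failover threshold."""
    indicator = payload.get("status", {}).get("indicator", "none")
    return SEVERITY.get(indicator, 0) >= SEVERITY[threshold]

def poll(provider: str) -> dict:
    """Fetch one provider's status summary (call this from your monitor loop)."""
    with urllib.request.urlopen(STATUS_URLS[provider], timeout=5) as resp:
        return json.load(resp)
```

Feed `should_fail_over` into the same alerting path as your other dependency checks, so a provider incident pages the on-call instead of surfacing as a pile of user complaints.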
2. Hallucination and error in responses
An outage is loud. A silent wrong answer is worse. An agent that confidently miscategorizes a ticket, issues a refund for the wrong amount, or writes a bad entry into a ledger creates downstream work that may take weeks to find. Error rates on agent workflows need the same observability as any other production path: sampling, evals, regression tests on golden traces, and a feedback loop back into the system prompt and tool definitions.
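The golden-trace part of that loop can be small. A hedged sketch, assuming a ticket-categorization agent: `run_agent` stands in for whatever function drives your agent, and the trace fields are illustrative, not a fixed schema.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GoldenTrace:
    ticket: str             # the input the agent saw
    expected_category: str  # the answer a human reviewer signed off on

def regression_pass_rate(run_agent: Callable[[str], str],
                         traces: list[GoldenTrace]) -> float:
    """Replay reviewed inputs and report the fraction still answered correctly."""
    passed = sum(1 for t in traces if run_agent(t.ticket) == t.expected_category)
    return passed / len(traces)
```

Run this on every prompt or tool-definition change, the same way you run unit tests on every code change; a drop in the pass rate is the loud signal that a silent wrong answer would otherwise not give you.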
3. The agent is a risk to the systems it connects to
An agent with tool access is a confused deputy waiting to happen. Prompt injection from a malicious email, a poisoned ticket, or a crafted document turns your helper into an insider threat with legitimate credentials. The blast radius is the set of tools you gave it: the databases it can read, the money it can move, the code it can deploy. That risk is not a model problem. It is a systems problem you own.
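One systems-level control is deny-by-default tool gating: every tool call passes through a check against the agent's allowlist before it dispatches. A minimal sketch, with hypothetical agent and tool names:

```python
# ASSUMPTION: illustrative agent and tool names; your registry will differ.
AGENT_SCOPES = {
    "support-agent": {"read_ticket", "categorize_ticket"},  # no refund access
    "refund-agent": {"read_ticket", "issue_refund"},
}

class ToolDenied(Exception):
    """Raised when an agent attempts a tool outside its allowlist."""

def call_tool(agent: str, tool: str, dispatch: dict, **kwargs):
    """Refuse any tool outside the agent's scope before dispatching it."""
    if tool not in AGENT_SCOPES.get(agent, set()):
        raise ToolDenied(f"{agent} may not call {tool}")
    return dispatch[tool](**kwargs)
```

Gating at the dispatch layer rather than in the prompt matters: a prompt-injected agent can be talked out of its instructions, but it cannot be talked past a check the model never sees.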
BIA: map the agent as a dependency
A Business Impact Analysis is, at its core, a dependency map. For every business process you care about, you list what it depends on to run, and you score what happens when each of those dependencies fails. An AI agent is a dependency of the process it automates - the same way a database, a payment rail, or a vendor API is. The question is how you define it.
“We use GPT” is not a definition. The model is one sub-dependency of the agent. To capture the agent as a dependency in your BIA, you need to include:
- The system prompt - the instructions, constraints, and persona that turn a general model into a specific worker. Version it. Review it. Treat prompt changes like code changes.
- The tools it has - the set of functions, APIs, and MCP servers the agent can call. Each tool has a scope, a set of credentials, and a blast radius in the systems downstream of it.
- The model (or models) behind it - provider, endpoint, fallback path. This is a dependency of the agent, not the agent itself.
- The processes it automates - which human work goes sideways when the agent is down, and what the manual SLA looks like.
Walk the tool graph. For each tool the agent owns: which system does it touch, what can it change there, what is the worst case if it fires incorrectly, and how quickly would you notice? A “read-only” tool that can exfiltrate customer data is not low-risk. A “small” tool that can issue refunds is not small.
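The walk above can live as a structured inventory rather than a spreadsheet nobody updates. A sketch, with illustrative fields and a deliberately crude risk score (the weighting is an assumption, not a standard):

```python
from dataclasses import dataclass

@dataclass
class ToolEntry:
    name: str
    system: str          # which system it touches
    can_write: bool      # can it change that system?
    worst_case: str      # what happens if it fires incorrectly
    detect_hours: float  # how long before you would notice

    @property
    def risk(self) -> float:
        # Crude ranking: write access is worse, and slow detection compounds it.
        return (2.0 if self.can_write else 1.0) * self.detect_hours

def riskiest_first(tools: list[ToolEntry]) -> list[ToolEntry]:
    """Order the inventory so review time goes to the biggest blast radius."""
    return sorted(tools, key=lambda t: t.risk, reverse=True)
```

Even a toy score like this surfaces the point in the text: a refund tool with a 24-hour detection lag outranks a read-only lookup, whatever the tool's name suggests.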
Once you can draw the picture above for every agent you run, you have the input the BCP actually needs. Until then, you are planning for a system whose shape you do not know.
BCP: what to do when it breaks
A continuity plan that exists only on paper is not a plan. Three moves, all tested regularly:
Practice the outage
Run game-days. Block the primary model endpoint at the network layer and watch what happens. Measure how long it takes you to detect the failure, how clean the failover is, and whether the degraded mode actually degrades - or quietly breaks in a way nobody notices until Monday. An agent outage you have rehearsed costs you hours. One you have not rehearsed costs you days.
Human as replacement
Every agent task needs a documented human runbook. Who picks up tier-1 tickets when the agent is down? What is the SLA in manual mode? How does work get routed back to humans without dropping on the floor? The handoff has to be written down, staffed, and practiced. If your answer is “we’ll figure it out,” you do not have a plan.
Another model as replacement
Multi-provider fallback is table stakes now. Abstract the model call so the agent can switch from Anthropic to OpenAI - or to a self-hosted open-weights model - without code changes in the business logic. The fallback path has to be exercised regularly, because an untested fallback is not a fallback; it is a promise. And when you switch, expect your evals to shift; the behavior will not be identical, and that is fine as long as the degraded behavior is acceptable.
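The abstraction can be a thin router that tries an ordered list of providers. A minimal sketch, assuming each provider is wrapped as a callable; real clients raise SDK-specific exceptions (timeouts, 5xx, rate limits) that you would catch more narrowly than this:

```python
from typing import Callable

class ModelRouter:
    """Try providers in order; business logic only ever calls complete()."""

    def __init__(self, providers: list[tuple[str, Callable[[str], str]]]):
        self.providers = providers  # ordered: primary first, fallbacks after

    def complete(self, prompt: str) -> tuple[str, str]:
        errors = []
        for name, call in self.providers:
            try:
                return name, call(prompt)  # (which provider answered, response)
            except Exception as exc:  # narrow this to transport/rate-limit errors
                errors.append((name, repr(exc)))
        raise RuntimeError(f"all providers failed: {errors}")
```

Returning which provider answered, not just the response, is deliberate: it lets your metrics show when traffic has shifted to the fallback, which is exactly the eval drift the paragraph above tells you to expect.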
Resilience is a practice, not a product
None of this is new. BCP discipline has been applied to databases, payment rails, and data centers for decades. AI agents are simply the next class of production system that needs the same treatment: an inventory, a risk model, a written plan, and a rehearsal schedule.
The organizations that treat their agents as critical infrastructure are the ones that will keep running when a provider has a bad Tuesday. The rest will find out the hard way that convenience is not the same as continuity.