Validating AI-Generated Code Against Real Kubernetes Dependencies

Coding agents now write code faster than teams can verify it, and the bottleneck has moved from generation to proof. This guide explains why a closed validation loop, real cloud-native dependencies, and infrastructure that scales to thousands of parallel agents are what turn a flood of AI-generated pull requests into software you can ship.

Validating AI-generated code means proving that the change a coding agent produced actually works against the real services, databases, and message queues it will run with in production, not just that it compiles or passes a unit test. As agents move from autocomplete to opening pull requests on their own, the constraint on software delivery shifts from writing code to verifying it. This guide walks through why verification is the step that decides whether agents speed your team up or bury it, why the problem is hardest in cloud-native systems, what it actually takes to call a change verified, and how to make that verification fast and cheap enough to run on every change an agent produces.

The short version: agents need a closed loop they can run themselves, against real dependencies, on infrastructure that scales to their volume. Everything below builds on that idea.

Why verification is the step that lets agents keep going

For most of software’s history, typing the code was the slow part. That assumption no longer holds. A single engineer running coding agents can open more pull requests in a day than a small team used to open in a week. The agents are fast, tireless, and increasingly capable. The problem is that generation speed only turns into shipped product if every change can be verified quickly, and verification is now the rate limiter.

There is a deeper reason verification matters so much for agents specifically, and it has to do with how they work. A coding agent makes progress by acting, observing the result, and deciding what to do next. When it cannot observe whether its change worked, it has to stop and ask a human, or worse, it guesses and keeps building on top of an unverified assumption. Either way the agent’s run ends early. Give the same agent a way to test its change, read real results, and correct itself, and it can stay productive for far longer without supervision. The verification step is what closes the loop.

[Figure: Write or fix the change, validate in a sandbox against real dependencies, read the results. Passes: ready for review. Fails or regresses: the agent fixes it and runs again, no human needed.]
The closed loop is what keeps an agent moving on its own until the change actually works.
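
The loop in the figure can be sketched in a few lines of Python. This is a minimal illustration, not any tool's actual API: `closed_loop`, `ValidationResult`, `apply_fix`, and `validate` are hypothetical names, and the attempt budget is an assumption added so the sketch always terminates.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ValidationResult:
    passed: bool
    failures: List[str]


def closed_loop(
    apply_fix: Callable[[List[str]], None],
    validate: Callable[[], ValidationResult],
    max_attempts: int = 5,
) -> bool:
    """Write or fix, validate, read the results; repeat until pass or budget runs out."""
    failures: List[str] = []
    for _ in range(max_attempts):
        apply_fix(failures)         # write or fix the change, given the last failures
        result = validate()         # hypothetical: run checks against real dependencies
        if result.passed:
            return True             # ready for human review
        failures = result.failures  # read the results and feed them into the next fix
    return False                    # budget exhausted: hand off to a human
```

The point of the shape is that the failure list from one pass is the input to the next, so the agent never builds on an unobserved result.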

This is why the most interesting agent results share a shape. When an agent can verify and correct its own work, it behaves less like an autocomplete and more like an engineer, as the team behind Ramp’s Inspect described in detail. The same pattern shows up in our own walkthrough of Claude Code validating a microservices change end to end: the agent provisions an environment, runs checks, reads what broke, and fixes it without a person in the path. Industry data is starting to name the gap too. The 2026 CircleCI report on the agent validation gap found that teams adopting agents fastest are the ones whose biggest unsolved problem is verifying what those agents produce. Closing the loop, as we have argued for cloud-native systems, is the difference between an agent that generates code and one that ships changes you can trust.

Why this is a cloud-native problem first

The reason verification is hard is almost never the changed file by itself. It is the system around it. A modern application is a mesh of dozens or hundreds of services talking over the network, backed by databases, caches, and queues, deployed on Kubernetes. The behavior that matters emerges from how those pieces interact, and a coding agent has no way to predict that from the source alone. To know whether a change works, you have to run it in something that looks like production.

That is why Signadot is built for cloud-native environments first. The hard cases live there: request-level routing between services, real data stores, service meshes like Istio and Linkerd, asynchronous messaging. An approach that handles validation cleanly in that setting handles the easier cases for free. We think about the problem more broadly than Kubernetes, because the principle (give the agent a real environment and a loop it can drive) applies anywhere agents write code. But cloud-native is where the need is sharpest and where most teams running agents at scale already are. Claude and other agents need a real place to run a change, not a mock of one, and in a cloud-native system that place is non-trivial to provide.

What “verified” means for a cloud-native change, and why it is hard

It is tempting to treat verification as a single checkbox: the tests passed. In a cloud-native system it is closer to a long tail of distinct questions, and a change can answer most of them correctly while quietly failing one that matters.

On the functional side, you want to know that the change still does the right thing once it is wired into the live system: that its API contract did not shift in a way a caller cannot handle, that service-to-service calls still succeed, that an end-to-end user flow still completes, that a schema or data migration did not strand existing records. Unit tests and mocks do not catch these, because a mock encodes the very assumption an agent is most likely to get wrong. Then there is the non-functional side, which agents tend to ignore entirely: latency under realistic traffic, behavior under load, memory and resource limits, security, and the handling of authorization and secrets. A change can be perfectly correct and still be too slow, too expensive, or insecure.

Functional: does it still behave correctly across services?
API contracts; service-to-service calls; end-to-end user flows; schema and data migrations; event and queue handling; backward compatibility

Non-functional: does it still meet the bar under real conditions?
Latency; behavior under load; resource and memory limits; security; authorization and secrets; cost

The practical consequence is that good verification is not one test but a portfolio of them, and the portfolio only produces a trustworthy signal when it runs against the real thing. An agent that reasons over noisy or fake signals reaches confident, wrong conclusions, which is why coding agents are only as good as the signals you feed them. Real dependencies are what make the signal worth acting on.

Who writes the checks, and who keeps them current

A long tail of checks raises an obvious question. Someone has to write all of that, and keep it current as the system changes. At human pace, with humans authoring every integration test, this was already the part of testing that teams quietly let rot. At agent pace it is hopeless. If a fleet of agents is producing changes all day, a separate group of people cannot hand-write and maintain the verification for each one.

So the checks have to be written and maintained by agents too, as a first-class part of doing the work rather than an afterthought. When an agent changes a behavior, it should also update or add the checks that prove the new behavior, the same way a careful engineer would. This is the line between an agent that produces code and one that does engineering, a distinction we have written about directly: real engineering includes leaving the system verifiable for the next change. Tooling can make this the default. The signadot-validate skill, for example, gives an agent a repeatable way to set up and run validation for a microservices change, so producing the proof is part of the same motion as producing the code.

The human role does not disappear; it moves up a level. People define what good looks like: the scenarios that matter, the regressions that are unacceptable, and the acceptance criteria a change must meet. The agents fill in and maintain the checks that enforce it.

Which checks run, and when

Running every possible check on every change does not scale and is not even desirable. A one-line copy fix and a change to the payments service do not warrant the same battery of tests. Deciding which subset of the long tail is relevant to a given change is itself a judgment, and it is exactly the kind of judgment an agent is well suited to make, because the agent knows what it changed and can reason about the blast radius.

The way to make that judgment reliable is to give the agent a library of validation it can compose rather than improvise. That is the idea behind Signadot Plans: reusable, deterministic validation flows that an agent selects and runs based on the change at hand. What makes the selection tractable is that each plan carries a selectionHint in its frontmatter, a short natural-language description of what the plan validates. An agent reads those hints, matches them against the diff in front of it, and decides which plans are relevant to the change, so it pulls in the booking-flow plan when it touches checkout and leaves the rest alone, rather than running the entire library every time. We went deeper on why this matters for agents specifically when we introduced Plans as validation superpowers for coding agents. The agent decides which plan fits a change and when to run it; the plan guarantees the check itself is consistent and repeatable. Intelligence picks the tests; determinism runs them.
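
To make the selection step concrete, here is a rough sketch of matching hints against a change. It is an illustration under stated assumptions, not how Plans are implemented: the plan names, hint texts, and the dictionary layout are invented, and plain keyword overlap stands in for the natural-language reasoning an agent would actually do.

```python
import re
from typing import Dict, List, Set

# Hypothetical plan library; only the selectionHint field name comes from the text above.
PLANS: List[Dict[str, str]] = [
    {"name": "booking-flow", "selectionHint": "Validates checkout and booking end to end"},
    {"name": "auth-contract", "selectionHint": "Validates login and token API contract"},
    {"name": "search-latency", "selectionHint": "Validates search latency under load"},
]

# Words too generic to count as a match.
STOPWORDS = {"validates", "and", "the", "end", "to", "under", "api", "a", "of"}


def keywords(text: str) -> Set[str]:
    """Lowercase alphabetic words in text, minus generic stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def select_plans(changed_paths: List[str], plans: List[Dict[str, str]] = PLANS) -> List[str]:
    """Return the names of plans whose hint shares a keyword with any changed path."""
    changed: Set[str] = set()
    for path in changed_paths:
        changed |= keywords(path)
    return [p["name"] for p in plans if keywords(p["selectionHint"]) & changed]


print(select_plans(["services/checkout/handler.py"]))  # ['booking-flow']
print(select_plans(["docs/README.md"]))                # []
```

A change under a checkout path pulls in only the booking-flow plan, and a docs-only change selects nothing, which mirrors the behavior described above even though the matching here is far cruder than an agent's.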

The validation layer has to scale to agent volume

Everything above assumes you can actually run all of this, constantly. That assumption is where most teams hit a wall, because the traditional ways to get a real environment all break under agent load. Replicating the full stack locally was already impractical for a thirty-service application on a laptop, and asking an agent to do it for every change is a non-starter. Shared staging turns into a queue: one environment cannot absorb dozens of concurrent agent pull requests without constant contention, broken state, and the forensic question of whose change caused the failure. That is the same staging bottleneck that already slows human teams, now multiplied by agent throughput. Duplicating a full environment per change solves isolation but explodes cost, since the bill scales with the number of services times the number of in-flight changes.
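
The cost argument is simple arithmetic, sketched below. The thirty-service figure is the example used above; the fifty in-flight changes and the assumption that each change touches one service are illustrative numbers, and "units" here means running service instances, not dollars.

```python
def full_copy_units(services: int, changes_in_flight: int) -> int:
    """Every in-flight change gets its own copy of every service."""
    return services * changes_in_flight


def shared_baseline_units(
    services: int, changes_in_flight: int, touched_per_change: int = 1
) -> int:
    """One shared baseline, plus only the services each change actually touches."""
    return services + changes_in_flight * touched_per_change


print(full_copy_units(30, 50))        # 1500
print(shared_baseline_units(30, 50))  # 80
```

The first function grows with the product of the two inputs, the second with their sum, which is why the gap widens as either the system or the number of agents grows.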

The picture to plan for is hundreds or thousands of agents, each opening and validating changes in parallel, every day. As we have put it bluntly, most teams’ infrastructure is not ready for agentic development at scale. What works is request-level isolation: instead of cloning the stack, deploy only the service a change touches into a shared baseline cluster, and use request routing to send test traffic to the changed version while everything else borrows the baseline. If you already run a service mesh, you are halfway to solving this.

[Figure: hundreds to thousands of agents validating in parallel against one shared Kubernetes cluster, which holds a shared baseline and per-change sandboxes. Each change runs only what it touched, on the shared baseline.]
Request-level isolation lets one cluster host hundreds of agent sandboxes at once, instead of duplicating the stack per change.
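
The routing decision behind request-level isolation can be sketched as a lookup. This is a simplified model, not Signadot's implementation: the header name `x-routing-key`, the service addresses, and the `resolve` function are all hypothetical, and in a real cluster a service mesh or sidecar would make this decision rather than application code.

```python
from typing import Dict, Mapping

ROUTING_HEADER = "x-routing-key"  # hypothetical header name

# service name -> {routing key -> address of the sandboxed version}
SandboxRoutes = Dict[str, Dict[str, str]]


def resolve(
    service: str,
    headers: Mapping[str, str],
    baseline: Mapping[str, str],
    sandboxes: SandboxRoutes,
) -> str:
    """Pick the sandboxed version if this request's key has one, else the baseline."""
    key = headers.get(ROUTING_HEADER)
    if key is not None:
        override = sandboxes.get(service, {}).get(key)
        if override is not None:
            return override
    return baseline[service]


baseline = {"checkout": "checkout.baseline:8080", "inventory": "inventory.baseline:8080"}
sandboxes = {"checkout": {"pr-123": "checkout-pr-123:8080"}}
test_request = {ROUTING_HEADER: "pr-123"}

print(resolve("checkout", test_request, baseline, sandboxes))   # checkout-pr-123:8080
print(resolve("inventory", test_request, baseline, sandboxes))  # inventory.baseline:8080
print(resolve("checkout", {}, baseline, sandboxes))             # checkout.baseline:8080
```

Only the tagged request reaches the changed checkout service; its calls to inventory, and all untagged traffic, stay on the baseline. For this to hold across several hops, each service has to pass the key along on its outgoing calls.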

Spun up this way, a sandbox costs a fraction of a full environment and starts in seconds, so running one on every change becomes the default rather than a budget decision. That economics is the whole game: validation that is cheap enough to always run is the only kind that keeps up with agents.

How Signadot closes the loop

Signadot puts these ideas together as a layer agents can drive. The environment layer is Sandboxes: lightweight, ephemeral environments that deploy only the changed service onto your existing cluster and route test traffic to it, so hundreds can run in parallel at low cost. On top of that sits the validation layer. SmartTests compare the baseline and changed versions of an API and use Smart Diff to flag meaningful behavioral changes while ignoring benign noise, catching regressions without the hand-written contracts that tools like Pact require. Jobs run the integration, end-to-end, and load suites you already trust, inside the cluster, against the sandbox. Plans tie these into single, repeatable workflows an agent can select and run.
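
The idea of comparing baseline and changed responses can be illustrated with a small hand-written diff. This is not Smart Diff's algorithm, which this guide does not describe; it is a rule-based stand-in that flags a different status code, a removed field, or a changed type, and treats value-only differences and an explicit list of noisy fields as benign. The response shape (`status` and `body` keys) and the function name are assumptions.

```python
from typing import Any, Dict, FrozenSet, List


def breaking_changes(
    baseline: Dict[str, Any],
    changed: Dict[str, Any],
    noisy_fields: FrozenSet[str] = frozenset(),
) -> List[str]:
    """List differences between two responses that a caller could not tolerate."""
    findings: List[str] = []
    if baseline["status"] != changed["status"]:
        findings.append(f"status code: {baseline['status']} -> {changed['status']}")
    base_body, new_body = baseline["body"], changed["body"]
    for field in sorted(base_body):
        if field in noisy_fields:
            continue  # e.g. timestamps or request IDs that differ on every call
        if field not in new_body:
            findings.append(f"removed field: {field}")
        elif type(base_body[field]) is not type(new_body[field]):
            old, new = type(base_body[field]).__name__, type(new_body[field]).__name__
            findings.append(f"changed type: {field} ({old} -> {new})")
    return findings


base = {"status": 200, "body": {"id": 1, "total": 9.5, "ts": "10:00", "currency": "USD"}}
new = {"status": 200, "body": {"id": "1", "total": 9.5, "ts": "10:01"}}
print(breaking_changes(base, new, frozenset({"ts"})))
# ['removed field: currency', 'changed type: id (int -> str)']
```

Because both versions are called with the same request against the same dependencies, no hand-written contract is needed: the baseline's own response is the expectation.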

The loop itself is closed by Signadot’s MCP server, which lets agents such as Claude Code and Cursor provision a sandbox, run these checks against real dependencies, read the results, and iterate, all without a human in the path until the change is genuinely ready for review. To see this in practice, watch Claude Code validate a microservices change in a closed loop. The result is a pre-merge gate where every agent-generated change is exercised against the real system, automatically, in parallel, before it reaches a reviewer or production.

Give every coding agent a real environment to prove its work

Signadot spins up an isolated, production-like sandbox for every change on your existing cluster, and its MCP server lets agents validate their own pull requests against real dependencies. The free tier is open to every developer.

Validation is the new build step

Agentic development only pays off when generation speed converts into shipped, working software, and that conversion happens at the validation layer. The teams pulling ahead are the ones that treat verification as core infrastructure: a closed loop the agent can run itself, against the real cloud-native dependencies a change will meet in production, with checks the agents write and maintain, chosen intelligently per change, on infrastructure that scales to thousands of validations a day. Get that right and a flood of AI-generated pull requests becomes a stream of changes you can trust. For the broader testing context, see the complete guide to microservices testing and the guide to ephemeral environments in Kubernetes.

Frequently asked questions

Why isn’t passing unit tests enough to validate AI-generated code?

Coding agents are good at producing code that is locally plausible, which is exactly what a unit test checks. The failures that matter in a cloud-native system are emergent: a changed response shape a downstream service cannot parse, a new call pattern that trips a rate limit, an assumption about data that holds in isolation but not against the real database. A mock confirms the agent called a dependency the way the author expected, which is the assumption the agent is most likely to get wrong. The only reliable signal is running the changed service against the real dependencies it will meet in production.

Should developers or QA own AI-generated test suites?

In practice, ownership follows the closed loop. When agents can validate their own changes against real dependencies before opening a PR, the developer who runs the agent owns the first line of validation, and QA shifts toward defining what good looks like: the high-value scenarios, the regression checks, and the acceptance criteria the agent must satisfy. Increasingly the agent also writes and maintains the checks themselves, since a human cannot keep hand-written suites current at agent volume.

What happens when coding agents go wrong in enterprise environments?

The risk is not that an agent writes one bad line; it is that it writes many plausible-looking changes that only fail on integration. The mitigation is a validation gate that every change must pass: an isolated environment, real dependencies, and automated regression detection before merge. With that gate in place, a wrong change fails fast in its own sandbox instead of breaking a shared environment or reaching production. See why coding agents break CI/CD pipelines and how to fix it.

How does validation infrastructure keep up with hundreds of agents?

Duplicating a full environment per agent does not scale, because the bill grows with the number of services times the number of in-flight changes. The approach that scales is request-level isolation: deploy only the service a change touches into a shared baseline cluster and route test traffic to it. One cluster can then host hundreds of concurrent sandboxes at a fraction of the cost, which is what lets thousands of agent-driven validations run in parallel every day.

Can AI detect release regressions in Kubernetes?

Yes. By comparing the behavior of a changed service against the current baseline and filtering out benign differences, AI-assisted checks like Smart Diff surface the breaking changes that matter, such as a removed field, a changed type, or a different status code, while ignoring noise. Run against real dependencies in a sandbox, this catches regressions before merge rather than after deploy.
