The hidden tax on every engineering team running microservices — and why adding more staging environments makes it worse, not better.
It's 2:17pm on a Tuesday. A senior developer has been blocked for forty minutes. Not on a hard problem; on a queue. Three other teams have changes deployed to staging, and one of them broke something. Nobody's sure whose change it was. The PR that was supposed to ship today will slip to tomorrow, maybe Thursday if the investigation runs long.
This scene plays out in engineering organizations everywhere, dozens of times a week. It's so normal that most teams have stopped registering it as a cost. It's just how things work.
But it isn't free. And the gap between what teams think shared staging costs and what it actually costs is one of the largest unexamined line items in modern software engineering.
Shared staging has three distinct cost buckets: time spent waiting, cross-team blast radius, and bugs that escape to production. They're rarely measured together, which is part of why the total stays invisible.
When two or more teams need staging at the same time, someone waits. At a 100-person engineering organization shipping at pace, this isn't a rare edge case; it's a daily condition. Consider a conservative estimate: if the average developer loses one hour per day to staging-related delays (waiting for a slot, waiting for an environment to stabilize, waiting to understand whether a failure is theirs or another team's), the math is stark.
At a fully loaded cost of $125,000 per year, one hour daily represents roughly $15,600 per developer annually. For a hundred-developer team, that's $1.56M in lost velocity, before counting a single incident, a single deployment failure, or a single bug that reached production because the debugging environment was too contested to investigate properly.
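The arithmetic behind that estimate is easy to check directly. The figures below are the article's stated assumptions (round numbers, not measured data):

```python
# Back-of-envelope cost of staging contention, using the
# assumptions stated above (illustrative round numbers).

FULLY_LOADED_SALARY = 125_000   # USD per developer per year
WORK_HOURS_PER_YEAR = 2_080     # 52 weeks * 40 hours
HOURS_LOST_PER_DAY = 1          # staging-related delays
WORK_DAYS_PER_YEAR = 260        # 52 weeks * 5 days
TEAM_SIZE = 100

hourly_rate = FULLY_LOADED_SALARY / WORK_HOURS_PER_YEAR        # ~$60/hour
per_dev_annual = hourly_rate * HOURS_LOST_PER_DAY * WORK_DAYS_PER_YEAR
team_annual = per_dev_annual * TEAM_SIZE

print(f"per developer: ${per_dev_annual:,.0f}/year")    # $15,625/year
print(f"team of {TEAM_SIZE}: ${team_annual:,.0f}/year")  # $1,562,500/year
```

Any of the inputs can be swapped for your own numbers; the point is that a small daily delay, multiplied across a workyear and a hundred developers, lands in seven figures.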
That number doesn't appear on any invoice. It doesn't show up in your infrastructure costs or your headcount budget. But it's real, and it compounds.
"We had developers deploying unapproved code into staging just to test. That meant when bugs showed up, we couldn't tell what broke what."
Engineering Team · Wealthsimple

This describes something that happens in most scaling engineering organizations: developers start working around the bottleneck. They push code to staging before it's review-ready because waiting is expensive. The environment becomes noisier. Signal degrades. Eventually nobody trusts staging, and the entire point of having it begins to erode.
In a shared environment, one team's bad deploy is every team's problem. When a service breaks in staging, it doesn't just block that team; it blocks every team with a downstream dependency. In a microservices architecture with dozens of interconnected services, that blast radius can be substantial.
The less visible cost is investigative overhead. When staging breaks and multiple teams have active changes, the first thing that happens is an unplanned multi-team debugging session. Time is spent in Slack threads trying to attribute the failure rather than building features. DoorDash experienced this directly before moving to sandboxes: staging instability meant that when regressions appeared, it was impossible to quickly determine which team's change was responsible, causing delays across teams that had nothing to do with the original issue.
The Pattern That Develops
Teams start serializing their deployments to avoid collisions, effectively turning a shared parallel environment into a queue. A 100-person team that could theoretically be testing in parallel instead tests sequentially. Throughput gets capped by the throughput of a single environment.
Teams then game the system further, pushing to staging at off-peak hours to avoid conflicts. Testing happens at the margins of the workday. Bugs get found on Friday afternoon. The problem perpetuates itself.
Perhaps the most significant cost of staging contention is the hardest to attribute: the bugs that slip through because the environment wasn't trustworthy enough to catch them.
When staging is unreliable, developers learn to discount what it tells them. A test failure might be their bug, or it might be noise from another team's deploy, or it might be configuration drift in the shared environment. Over time, engineers develop a kind of learned skepticism about staging signals. They merge code that staging flagged as failing because "staging is always broken." And occasionally they're right. But every so often the flag was real.
The research on production incident costs is consistent: a bug caught in development costs orders of magnitude less to fix than one caught in production. Industry estimates put the cost of a production incident at $10,000 or more when you account for engineer time, customer impact, and remediation. A team that lets three additional bugs per month reach production because staging was too noisy to catch them is adding $30,000 to its monthly cost base, well over $300,000 a year, without ever seeing it as a line item.
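The escaped-defect math is equally simple to run against your own incident data (the inputs below are the estimates from the paragraph above, not measurements):

```python
# Cost of escaped defects under the stated assumptions
# (illustrative industry estimates, not measured data).

COST_PER_PRODUCTION_INCIDENT = 10_000  # USD, industry estimate
EXTRA_BUGS_PER_MONTH = 3               # bugs missed because staging was noisy

monthly_cost = EXTRA_BUGS_PER_MONTH * COST_PER_PRODUCTION_INCIDENT
annual_cost = monthly_cost * 12

print(f"monthly: ${monthly_cost:,}")  # $30,000
print(f"annual:  ${annual_cost:,}")   # $360,000
```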
The intuitive response to staging contention is to give each team their own staging environment. More environments, fewer collisions.
This works briefly and at a small scale. It stops working as your microservices architecture grows: duplicating a full stack of services doesn't just duplicate compute costs. It duplicates the cost of maintenance, data seeding, configuration management, third-party integrations, and the operational burden on your platform team.
Brex learned this the hard way. They built a system using duplicated Kubernetes namespaces to give developers isolated environments. As their microservices footprint grew, those environments began pushing the scaling limits of a single Kubernetes control plane. Availability issues pushed teams back toward staging, and the duplicated environments fell into disrepair. As Connor Braa, Software Engineering Manager at Brex, put it, the cost of getting something wrong in a preview environment was never high enough to prevent people from leaving things broken, so they constantly were.
"On the margin, with the Signadot approach, 99.8% of the isolated environment's infrastructure costs look wasteful. That percentage looks like an exaggeration, but it's really not."
Connor Braa · Software Engineering Manager, Brex

The fundamental issue is architectural. Full environment duplication is the wrong model for microservices. When a developer changes one service out of 50, the cost of running the other 49 is pure overhead. The question isn't "How do we give each developer their own environment?" It's "How do we give each developer isolation for the services they're actually changing?"
The teams that have solved this problem well share a common pattern. They've stopped thinking of testing environments as finite, shared resources, and started treating isolation as something that gets created on demand, per change, and torn down when it's no longer needed.
This is what the industry calls "shift-left" testing: moving integration validation as early as possible in the development cycle, ideally to the moment a developer opens a pull request. The core insight is that the cost of finding a bug rises with time. A bug found before merge is cheap. Found in staging, it costs more. Found in production, it's expensive. The goal is to catch as many bugs as possible at the earliest, cheapest stage.
Rather than giving each developer a full clone of the environment, the approach that scales works like this: maintain a single stable baseline representing your production cluster. When a developer changes one or two services, spin up an isolated context, a Sandbox for just those changed services. Route test traffic through that Sandbox. Everything else flows from the shared baseline, unchanged.
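One common mechanism for this kind of selective routing is a routing key carried on test traffic: requests tagged with a sandbox's key are forwarded to the sandboxed version of any service that sandbox changed, and fall through to the shared baseline everywhere else. A minimal sketch of the idea (all names, headers, and addresses here are hypothetical, not Signadot's actual API):

```python
# Minimal sketch of sandbox-aware request routing.
# A routing-key header on test traffic selects sandboxed
# service versions; everything else falls back to baseline.
# All identifiers are illustrative, not a real API.

BASELINE = {
    "checkout": "checkout.baseline.svc:8080",
    "payments": "payments.baseline.svc:8080",
    "ledger":   "ledger.baseline.svc:8080",
}

# Each sandbox overrides only the services its change touches.
SANDBOXES = {
    "pr-1234": {"payments": "payments.pr-1234.svc:8080"},
}

def resolve(service: str, headers: dict) -> str:
    """Pick the upstream address for a request to `service`."""
    key = headers.get("x-routing-key")
    overrides = SANDBOXES.get(key, {})
    # Use the sandboxed version if this sandbox changed the
    # service; otherwise fall through to the shared baseline.
    return overrides.get(service, BASELINE[service])

# Tagged test traffic reaches the forked service...
print(resolve("payments", {"x-routing-key": "pr-1234"}))
# ...while unchanged services still come from the baseline.
print(resolve("checkout", {"x-routing-key": "pr-1234"}))
```

The key property: a sandbox's footprint is proportional to the size of the change, not the size of the architecture. Untagged traffic never sees the fork at all.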
This model scales in a way that full duplication never can. A 100-developer team can run 100 Sandboxes simultaneously. Each contains only the 1-3 services being changed. The shared baseline handles everything else, which is precisely what Brex found: infrastructure costs reduced by 99% compared to their previous namespace-duplication approach.
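The scaling difference is easy to see with illustrative numbers (assumed here for the sketch, not figures from any of the companies mentioned):

```python
# Service instances needed: full duplication vs. sandboxes.
# Illustrative numbers only.

DEVELOPERS = 100
SERVICES = 50            # services in the stack
CHANGED_PER_SANDBOX = 2  # avg services a change touches

full_duplication = DEVELOPERS * SERVICES                 # every dev runs everything
sandboxed = DEVELOPERS * CHANGED_PER_SANDBOX + SERVICES  # forks + one shared baseline

reduction = 1 - sandboxed / full_duplication
print(f"full duplication: {full_duplication} instances")  # 5000
print(f"sandbox model:    {sandboxed} instances")         # 250
print(f"reduction:        {reduction:.0%}")               # 95%
```

Push the service count up or the average change size down and the reduction climbs further, which is how figures in the 99% range become plausible at larger scale.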
This isn't theoretical. Several engineering organizations have moved from shared staging to sandbox-based isolation and documented what happened.
Customer Outcomes · Signadot
The pattern across these organizations is consistent. The pain isn't unique to their scale or stack. It's the natural consequence of treating a shared environment as an infinite, low-friction resource, when it isn't either of those things.
The numbers are illustrative until they're yours. Most teams that run this exercise are surprised not because the numbers are challenging to find, but because nobody had put them side by side before. The cost of the problem is usually much larger than the cost of solving it. The gap just hasn't been visible.
Shared staging isn't bad because of a tooling mistake or a specific architectural choice. It's a scaling problem. It works reasonably well for small teams with a handful of services. It stops working when the number of concurrent developers, active PRs, and interconnected services grows, and it stays broken until the underlying model changes, not just the number of environments provisioned.
The next question, once the cost is visible, is what a replacement model looks like technically and what it takes to adopt it without disrupting the teams already in flight. That's the subject of the next piece in this series.