The hidden tax on every engineering team running microservices — and why adding more staging environments makes it worse, not better.
It's 2:17pm on a Tuesday. A senior developer has been blocked for forty minutes. Not on a hard problem; on a queue. Three other teams have changes deployed to staging, and one of them broke something. Nobody's sure whose change it was. The PR that was supposed to ship today will slip to tomorrow, maybe Thursday if the investigation runs long.
This scene plays out in engineering organizations everywhere, dozens of times a week. It's so normal that most teams have stopped registering it as a cost. It's just how things work.
But it isn't free. And the gap between what teams think shared staging costs and what it actually costs is one of the largest unexamined line items in modern software engineering.
Shared staging has three distinct cost buckets: time spent waiting, cross-team blast radius, and bugs that escape to production. They're rarely measured together, which is part of why the total stays invisible.
When two or more teams need staging at the same time, someone waits. At a 100-person engineering organization shipping at pace, this isn't a rare edge case; it's a daily condition. Consider a conservative estimate: if the average developer loses one hour per day to staging-related delays (waiting for a slot, waiting for an environment to stabilize, waiting to understand whether a failure is theirs or another team's), the math is stark.
At a fully loaded cost of $125,000 per year, one hour daily represents roughly $15,600 per developer annually. For a hundred-developer team, that's $1.56M in lost velocity, before counting a single incident, a single deployment failure, or a single bug that reached production because the debugging environment was too contested to investigate properly.
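The arithmetic behind that estimate is easy to check directly. The figures below are the article's stated assumptions (round numbers, not measured data):

```python
# Back-of-envelope cost of staging contention, using the
# assumptions stated above (illustrative round numbers).

FULLY_LOADED_SALARY = 125_000   # USD per developer per year
WORK_HOURS_PER_YEAR = 2_080     # 52 weeks * 40 hours
HOURS_LOST_PER_DAY = 1          # staging-related delays
WORK_DAYS_PER_YEAR = 260        # 52 weeks * 5 days
TEAM_SIZE = 100

hourly_rate = FULLY_LOADED_SALARY / WORK_HOURS_PER_YEAR        # ~$60/hour
per_dev_annual = hourly_rate * HOURS_LOST_PER_DAY * WORK_DAYS_PER_YEAR
team_annual = per_dev_annual * TEAM_SIZE

print(f"per developer: ${per_dev_annual:,.0f}/year")    # $15,625/year
print(f"team of {TEAM_SIZE}: ${team_annual:,.0f}/year")  # $1,562,500/year
```

Any of the inputs can be swapped for your own numbers; the point is that a small daily delay, multiplied across a workyear and a hundred developers, lands in seven figures.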
That number doesn't appear on any invoice. It doesn't show up in your infrastructure costs or your headcount budget. But it's real, and it compounds.
"We had developers deploying unapproved code into staging just to test. That meant when bugs showed up, we couldn't tell what broke what."
Engineering Team · Wealthsimple

This describes something that happens in most scaling engineering organizations: developers start working around the bottleneck. They push code to staging before it's review-ready because waiting is expensive. The environment becomes noisier. Signal degrades. Eventually nobody trusts staging, and the entire point of having it begins to erode.
In a shared environment, one team's bad deploy is every team's problem. When a service breaks in staging, it doesn't just block that team; it blocks every team with a downstream dependency. In a microservices architecture with dozens of interconnected services, that blast radius can be substantial.
The less visible cost is investigative overhead. When staging breaks and multiple teams have active changes, the first thing that happens is an unplanned multi-team debugging session. Time is spent in Slack threads trying to attribute the failure rather than building features. DoorDash experienced this directly before moving to sandboxes: staging instability meant that when regressions appeared, it was impossible to quickly determine which team's change was responsible, causing delays across teams that had nothing to do with the original issue.
The Pattern That Develops
Teams start serializing their deployments to avoid collisions, effectively turning a shared parallel environment into a queue. A 100-person team that could theoretically be testing in parallel instead tests sequentially. Throughput gets capped by the throughput of a single environment.
Teams then game the system further, pushing to staging at off-peak hours to avoid conflicts. Testing happens at the margins of the workday. Bugs get found on Friday afternoon. The problem perpetuates itself.
Perhaps the most significant cost of staging contention is the hardest to attribute: the bugs that slip through because the environment wasn't trustworthy enough to catch them.
When staging is unreliable, developers learn to discount what it tells them. A test failure might be their bug, or it might be noise from another team's deploy, or it might be configuration drift in the shared environment. Over time, engineers develop a kind of learned skepticism about staging signals. They merge code that staging flagged as failing because "staging is always broken." And occasionally they're right. But every so often the flag was real.
The research on production incident costs is consistent: a bug caught in development costs orders of magnitude less to fix than one caught in production. Industry estimates put the cost of a production incident at $10,000 or more when you account for engineer time, customer impact, and remediation. A team that lets three additional bugs per month reach production because staging was too noisy to catch them is adding $30,000 to its monthly cost base, well over $300,000 a year, without ever seeing it as a line item.
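The escaped-defect math is equally simple to run against your own incident data (the inputs below are the estimates from the paragraph above, not measurements):

```python
# Cost of escaped defects under the stated assumptions
# (illustrative industry estimates, not measured data).

COST_PER_PRODUCTION_INCIDENT = 10_000  # USD, industry estimate
EXTRA_BUGS_PER_MONTH = 3               # bugs missed because staging was noisy

monthly_cost = EXTRA_BUGS_PER_MONTH * COST_PER_PRODUCTION_INCIDENT
annual_cost = monthly_cost * 12

print(f"monthly: ${monthly_cost:,}")  # $30,000
print(f"annual:  ${annual_cost:,}")   # $360,000
```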
The intuitive response to staging contention is to give each team their own staging environment. More environments, fewer collisions.
This works briefly and at a small scale. It stops working as your microservices architecture grows: duplicating a full stack of services doesn't just duplicate compute costs. It duplicates the cost of maintenance, data seeding, configuration management, third-party integrations, and the operational burden on your platform team.
Brex learned this the hard way. They built a system using duplicated Kubernetes namespaces to give developers isolated environments. As their microservices footprint grew, those environments began pushing the scaling limits of a single Kubernetes control plane. Availability issues pushed teams back toward staging, and the duplicated environments fell into disrepair. As Connor Braa, Software Engineering Manager at Brex, put it, the cost of getting something wrong in a preview environment was never high enough to prevent people from leaving things broken, so they constantly were.
"On the margin, with the Signadot approach, 99.8% of the isolated environment's infrastructure costs look wasteful. That percentage looks like an exaggeration, but it's really not."
Connor Braa · Software Engineering Manager, Brex

The fundamental issue is architectural. Full environment duplication is the wrong model for microservices. When a developer changes one service out of 50, the cost of running the other 49 is pure overhead. The question isn't "How do we give each developer their own environment?" It's "How do we give each developer isolation for the services they're actually changing?"
The teams that have solved this problem well share a common pattern. They've stopped thinking of testing environments as finite, shared resources, and started treating isolation as something that gets created on demand, per change, and torn down when it's no longer needed.
This is what the industry calls "shift-left" testing: moving integration validation as early as possible in the development cycle, ideally to the moment a developer opens a pull request. The core insight is that the cost of finding a bug rises with time. A bug found before merge is cheap. Found in staging, it costs more. Found in production, it's expensive. The goal is to catch as many bugs as possible at the earliest, cheapest stage.
Rather than giving each developer a full clone of the environment, the approach that scales works like this: maintain a single stable baseline representing your production cluster. When a developer changes one or two services, spin up an isolated context, a Sandbox for just those changed services. Route test traffic through that Sandbox. Everything else flows from the shared baseline, unchanged.
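One common mechanism for this kind of selective routing is a routing key carried on test traffic: requests tagged with a sandbox's key are forwarded to the sandboxed version of any service that sandbox changed, and fall through to the shared baseline everywhere else. A minimal sketch of the idea (all names, headers, and addresses here are hypothetical, not Signadot's actual API):

```python
# Minimal sketch of sandbox-aware request routing.
# A routing-key header on test traffic selects sandboxed
# service versions; everything else falls back to baseline.
# All identifiers are illustrative, not a real API.

BASELINE = {
    "checkout": "checkout.baseline.svc:8080",
    "payments": "payments.baseline.svc:8080",
    "ledger":   "ledger.baseline.svc:8080",
}

# Each sandbox overrides only the services its change touches.
SANDBOXES = {
    "pr-1234": {"payments": "payments.pr-1234.svc:8080"},
}

def resolve(service: str, headers: dict) -> str:
    """Pick the upstream address for a request to `service`."""
    key = headers.get("x-routing-key")
    overrides = SANDBOXES.get(key, {})
    # Use the sandboxed version if this sandbox changed the
    # service; otherwise fall through to the shared baseline.
    return overrides.get(service, BASELINE[service])

# Tagged test traffic reaches the forked service...
print(resolve("payments", {"x-routing-key": "pr-1234"}))
# ...while unchanged services still come from the baseline.
print(resolve("checkout", {"x-routing-key": "pr-1234"}))
```

The key property: a sandbox's footprint is proportional to the size of the change, not the size of the architecture. Untagged traffic never sees the fork at all.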
This model scales in a way that full duplication never can. A 100-developer team can run 100 Sandboxes simultaneously. Each contains only the 1-3 services being changed. The shared baseline handles everything else, which is precisely what Brex found: infrastructure costs reduced by 99% compared to their previous namespace-duplication approach.
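The scaling difference is easy to see with illustrative numbers (assumed here for the sketch, not figures from any of the companies mentioned):

```python
# Service instances needed: full duplication vs. sandboxes.
# Illustrative numbers only.

DEVELOPERS = 100
SERVICES = 50            # services in the stack
CHANGED_PER_SANDBOX = 2  # avg services a change touches

full_duplication = DEVELOPERS * SERVICES                 # every dev runs everything
sandboxed = DEVELOPERS * CHANGED_PER_SANDBOX + SERVICES  # forks + one shared baseline

reduction = 1 - sandboxed / full_duplication
print(f"full duplication: {full_duplication} instances")  # 5000
print(f"sandbox model:    {sandboxed} instances")         # 250
print(f"reduction:        {reduction:.0%}")               # 95%
```

Push the service count up or the average change size down and the reduction climbs further, which is how figures in the 99% range become plausible at larger scale.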
This isn't theoretical. Several engineering organizations have moved from shared staging to sandbox-based isolation and documented what happened.
Customer Outcomes · Signadot
The pattern across these organizations is consistent. The pain isn't unique to their scale or stack. It's the natural consequence of treating a shared environment as an infinite, low-friction resource, when it isn't either of those things.
The numbers are illustrative until they're yours. Most teams that run this exercise are surprised not because the numbers are challenging to find, but because nobody had put them side by side before. The cost of the problem is usually much larger than the cost of solving it. The gap just hasn't been visible.
Shared staging isn't bad because of a tooling mistake or a specific architectural choice. It's a scaling problem. It works reasonably well for small teams with a handful of services. It stops working when the number of concurrent developers, active PRs, and interconnected services grows, and it stays broken until the underlying model changes, not just the number of environments provisioned.
The next question, once the cost is visible, is what a replacement model looks like technically and what it takes to adopt it without disrupting the teams already in flight. That's the subject of the next piece in this series.