Engineering

The YAML janitor problem: why Kubernetes dev environments always drift and how to fix it

Lu 2026-01-20

A platform engineer at a 40-person startup told us he spent every Monday morning working through a backlog of "my environment is broken" tickets. He was not writing code. He was manually patching dev namespaces because someone had updated a ConfigMap in production and the dev copies had drifted again. "I'm basically a janitor for YAML," he said.

He was not unusual. We heard the same story from team after team. Platform engineers spending 5 to 10 hours a week on dev environment tickets. Senior developers waiting half a day to get a working namespace before they could even start on a feature. Onboarding taking a week because the dev environment setup was a tribal knowledge document that was never quite right.

Why the common approaches break down

There are four ways most teams try to solve this. Each one has a predictable failure mode.

Helm values per environment. A values-dev.yaml alongside values-prod.yaml is the most common starting point. It feels controlled because the files are in version control. But they are still two copies. Someone updates an image tag or a resource limit in prod and forgets the dev file. Six weeks later a developer is debugging a behaviour that does not exist in their environment because their values are six weeks stale.
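The failure mode looks something like this. The two files, the service values, and the incident story here are illustrative, but the shape of the drift is the same everywhere:

```yaml
# values-prod.yaml -- bumped during an incident last sprint
image:
  tag: "2.14.1"
resources:
  limits:
    memory: "1Gi"    # raised after an OOMKill in production

# values-dev.yaml -- nobody remembered to mirror the change
image:
  tag: "2.12.0"      # six weeks stale
resources:
  limits:
    memory: "512Mi"  # dev still crashes at the old limit
```

Both files are in version control, both pass review, and nothing flags that they disagree.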

Kustomize overlays per environment. An overlays/dev directory alongside overlays/prod has the same problem. ArgoCD or Flux deploying it faithfully does not prevent it from drifting. It just means the drift is version controlled. Someone still has to keep the dev overlay in sync with the prod overlay, and at some point they will not.
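Concretely, the two overlays end up as sibling files that each pin their own state. This is a hypothetical layout (the `payments` service name and tags are made up), but it shows why GitOps alone does not help:

```yaml
# overlays/prod/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
images:
  - name: payments
    newTag: "2.14.1"   # bumped automatically by the release pipeline

# overlays/dev/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
images:
  - name: payments
    newTag: "2.12.0"   # updated by hand, last touched six weeks ago
```

ArgoCD will happily reconcile both overlays forever. The drift is not a sync failure; it is committed, reviewed, and deployed exactly as written.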

Share a staging namespace across the team. This works until two developers need to test conflicting changes at the same time. One person's deployment overwrites another's. Debugging becomes guesswork because nobody is sure whose version is running.

Give every developer a full clone of the stack. This solves the conflict problem, but at serious cost. A 30-service stack with 10 developers means 300 running service replicas just for dev. On a managed cluster, that is roughly $15,000 to $30,000 a month in infrastructure (assuming m5.xlarge or equivalent on AWS), for environments that sit idle most of the time. Most teams hit this wall and start looking for shortcuts, which brings them back to the first approaches.

The root cause

All four approaches share the same mistake: they treat dev environment config as a separate thing from production config. The moment you have two copies, they start to diverge. The fix is not to copy better or sync more often. The fix is to stop copying.

The pattern that fixes it

What you actually want is a system with three properties. First, the production namespace is the only source of manifest truth. Second, dev environments are computed from it automatically, never hand-copied. Third, when production changes, developers are notified and can sync on their own schedule rather than being surprised by silent drift.

If you have that, you have eliminated the janitor problem entirely. There is nothing to patch, because there is no copy.

The architecture has four conceptual pieces.

A read-only view of your production namespace. Instead of copying manifests, your tooling reads the workloads already running. This view is the only source of truth. Nothing is authored separately for dev.

Environments computed from that view, not provisioned by hand. A developer requests an environment and gets one that matches production at that point in time. When production changes, they are notified. They choose when to sync. No manual patching, no silent drift.

Routing-based isolation instead of full clones. Each developer runs only the services they are actively changing. Everything else is shared from a baseline. A request header carries the developer's identity through the stack, so each person sees their own version end to end without running 30 services each. For a 30-service stack with 10 developers, this brings active replicas from 300 down to roughly 30 to 50, an 85 to 90% reduction in dev infrastructure cost.
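The header-based routing idea can be sketched with a service mesh rule. This is not Lapdev's implementation, just a minimal illustration using an Istio VirtualService; the `payments` service, the `x-dev-id` header name, and the subset names are all assumptions:

```yaml
# Requests carrying a developer's identity header are routed to that
# developer's replica of the one service they changed; everything else
# falls through to the shared baseline.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payments
spec:
  hosts:
    - payments
  http:
    - match:
        - headers:
            x-dev-id:
              exact: alice
      route:
        - destination:
            host: payments
            subset: alice      # Alice's modified replica
    - route:
        - destination:
            host: payments
            subset: baseline   # shared baseline for everyone else
```

As long as every service propagates the header on outbound calls, one rule like this per modified service is enough to give each developer an isolated end-to-end view on top of a single shared stack.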

Local debugging that stays connected to the cluster. When a developer needs to step through code, they run one service on their laptop and let the cluster route traffic to it. They get real requests, real config, and real dependencies, without running the full stack locally.

This is the pattern we implemented in Lapdev. If you want to see how it works in practice, the install guide walks through it from cluster connection to first environment.

The outcome

After this setup, the platform team maintains one set of manifests: the ones they already ship to production. Dev environments are derived automatically. The Monday morning backlog goes away because there is nothing to patch.

Developers can spin up their own environment in minutes rather than waiting on a ticket. Onboarding goes from a week of tribal knowledge to a single command. And when something behaves differently from production, the environment config is not the first suspect, because it cannot be.