VM Creep in Deployments
Most systems don’t fail because they lack redundancy. They fail because nobody noticed when a 3-VM service silently became a 200-VM platform. Redundancy is not inherently bad — in critical systems it is essential. The real question is: how much redundancy is the right amount when deploying services?
Let’s imagine we are a small team preparing to deploy a new service. The early steps are already behind us, and now we are getting ready for production. Sounds great; let’s walk through the lifecycle of the service and watch how the VM count evolves over time.
Step 1: production
For production we need 3 VMs. Deploy and done. The first users are joining! 🎉
- production: 3 VMs
- total: 3 VMs
Step 2: testing
We have updates for the service! 📈 However, we are no gigachads, so testing in production is a no-go. Let’s create a test environment with 3 VMs. It just works, nice.
- production: 3 VMs
- testing: 3 VMs
- total: 6 VMs
Step 3: scaling
New users are joining our service, so we scale production: more nodes, load balancers, proxies. 🚀 Testing grows along with it to roughly mirror the new setup; one possible breakdown is sketched after the tally below.
- production: 16 VMs
- testing: 9 VMs
- total: 25 VMs
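The split below is just one plausible way those fleets could break down; only the 16 and 9 VM totals come from the tally above, the rest is assumed for illustration:

```python
# One plausible breakdown of the scaled environments.
# Only the 16/9 VM totals come from the tally above; the split into
# app nodes, load balancers and proxies is assumed for illustration.
production = {
    "app_nodes": 12,      # horizontally scaled service instances
    "load_balancers": 2,  # redundant pair in front of the app nodes
    "proxies": 2,         # e.g. TLS termination / reverse proxies
}
testing = {
    "app_nodes": 7,
    "load_balancers": 1,
    "proxies": 1,
}

assert sum(production.values()) == 16
assert sum(testing.values()) == 9
print("total:", sum(production.values()) + sum(testing.values()))  # 25
```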
Step 4: other environments
Because our testing environment is nowhere near matching production, we introduce a production-like final test environment: staging. To support agile development, short-lived experimental environments also start appearing… 🧪
- production: 16 VMs
- staging: 16 VMs
- testing: 9 VMs
- experimental: 5 x 3 VMs
- total: 56 VMs
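A quick sanity check on how the environments add up at this point:

```python
# Step 4 tally: each environment contributes its own fleet.
environments = {
    "production": 16,
    "staging": 16,
    "testing": 9,
    "experimental": 5 * 3,  # five short-lived environments, 3 VMs each
}
print(sum(environments.values()))  # 56
```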
Step 5: geo redundancy
Because we fear downtime and blackouts (or some other relevant disaster scenario), we decide to introduce geo-redundant infrastructure; the effect on the VM count is worked out after the tally below.
- production: 2 x 16 VMs
- staging: 2 x 16 VMs
- testing: 2 x 9 VMs
- single site testing: 9 VMs
- experimental: 5 x 3 VMs
- total: 106 VMs
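Geo-redundancy here is essentially a site multiplier per environment; which environments get a second site is taken straight from the tally above:

```python
# Geo-redundancy modelled as a per-environment site multiplier.
environments = {
    # name: (vms_per_site, sites)
    "production": (16, 2),
    "staging": (16, 2),
    "testing": (9, 2),
    "single_site_testing": (9, 1),
    "experimental": (5 * 3, 1),  # experimental environments stay on one site
}
print(sum(vms * sites for vms, sites in environments.values()))  # 106
```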
Bad design choice (could have happened at any step)
We made a bad design decision with the best intentions. E.g., we dug too deep into a microservice-oriented architecture, doubling VMs everywhere.
- total: 212 VMs
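The effect is yet another multiplier, this time on the whole estate. As a sketch, assuming the microservice split roughly doubles the VM need of every environment from step 5:

```python
# A bad architectural choice multiplies the whole estate.
# The factor of 2 is an assumption: the microservice split roughly
# doubles the VM footprint of each environment from step 5.
geo_redundant = {
    "production": 32,
    "staging": 32,
    "testing": 18,
    "single_site_testing": 9,
    "experimental": 15,
}
microservice_factor = 2

print(sum(vms * microservice_factor for vms in geo_redundant.values()))  # 212
```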
How did we end up here, and was it inevitable?
We now have 212 VMs spread across multiple environments. This gets very expensive, not to mention the man-hours it takes to manage all of it.
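To put a very rough price tag on it, here is a back-of-the-envelope sketch; every number in it is an assumption, and your provider, VM sizes and salaries will differ:

```python
# Back-of-the-envelope monthly cost; all figures are assumptions.
vm_count = 212
price_per_vm = 70        # USD/month for a modest general-purpose VM (assumed)
ops_hours_per_vm = 0.5   # admin time per VM per month (assumed)
hourly_rate = 60         # USD/hour for that admin time (assumed)

compute_cost = vm_count * price_per_vm
ops_cost = vm_count * ops_hours_per_vm * hourly_rate
print(f"compute: ~${compute_cost:,}/month, operations: ~${ops_cost:,.0f}/month")
```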
Each of these steps is reasonable on its own, but combined they can easily grow over our heads. At conferences and in blog posts these techniques are quite often praised for their pros, while the cons are mentioned much less often. So we should ask ourselves every time whether we really need each one:
- Testing: If it’s a rarely used service, would scheduled maintenance windows be enough to resolve issues directly in production?
- Scaling: Would vertical scaling be sufficient? Is there room for optimization instead?
- Staging: Could a larger testing environment, somewhere between testing and staging, catch most scale-related issues?
- Geo: How quickly can we restore from an offsite backup? Would that be sufficient instead of full geo-redundancy? In many cases, geo-redundancy only makes sense for truly global or critically high-availability services.
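For the geo question in particular, the trade-off can be made explicit. The numbers below are placeholders, to be replaced with your own recovery objective and measured restore times:

```python
# Is an offsite-backup restore good enough, or do we need geo-redundancy?
# All numbers are placeholders for your own measurements.
rto_hours = 4                  # how long the business tolerates being down
restore_from_backup_hours = 3  # measured time to rebuild from the offsite backup

if restore_from_backup_hours <= rto_hours:
    print("Backup restore fits the RTO; full geo-redundancy is optional.")
else:
    print("Restore is too slow for the RTO; speed it up or consider geo-redundancy.")
```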
None of these decisions are wrong on their own. Complexity grows when reasonable decisions accumulate without regular re-evaluation. Redundancy is not just about availability — it is about the operational cost we silently accept.