It’s 2 AM. Your team just pushed a critical security patch to production. Users are still actively shopping, chatting, streaming—completely unaware that you’re replacing the very servers handling their requests. No downtime. No service interruption. Just a seamless transition from old to new.
This is the promise of rolling updates, and when done right, it’s one of the most elegant deployment strategies available to modern DevOps teams.
What Are Rolling Updates?
A rolling update is a deployment strategy that incrementally replaces instances of your application with new versions, ensuring that some instances remain available throughout the deployment process. Instead of taking down your entire application, updating it, and bringing it back up (the “big bang” approach), you update instances in batches while maintaining service availability.
Think of it like replacing planks on a bridge while people are still walking across—challenging, but absolutely possible with the right approach.
Why Rolling Updates Matter
The Zero-Downtime Imperative
In today’s always-on digital economy, downtime is measured not just in lost revenue, but in damaged reputation and customer trust. Consider these realities:
- Financial Impact: For many e-commerce platforms, even 5 minutes of downtime during peak hours can mean tens of thousands of dollars in lost revenue
- User Experience: Modern users expect 24/7 availability. A maintenance window announcement feels antiquated
- Competitive Advantage: The ability to deploy features and fixes without service interruption is a genuine competitive differentiator
Faster Feedback Loops
Rolling updates enable a progressive rollout model. You can:
- Deploy to a small percentage of your fleet first
- Monitor metrics and error rates in real-time
- Detect issues before they impact all users
- Roll back quickly if problems emerge
Risk Mitigation
By updating incrementally, you create natural checkpoints. If the first of ten instances fails after its update, you’ve impacted 10% of capacity, not 100%. This built-in safety mechanism is invaluable in production environments.
How Rolling Updates Work: The Mechanics
At its core, a rolling update follows this pattern:
- Remove an instance from the load balancer (or service mesh)
- Stop the old version on that instance
- Start the new version on that instance
- Run health checks to verify the new version is working
- Add the instance back to the load balancer
- Repeat for the next instance
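To make the loop concrete, here is a minimal Go sketch of the pattern, not any particular orchestrator’s implementation. The Instance type and the deregister/register/health helpers are hypothetical stand-ins for whatever your platform provides; real orchestrators layer surge capacity, retries, and timeouts on top of this.

package main

import (
    "errors"
    "fmt"
    "time"
)

// Instance is a hypothetical handle to one running copy of the application.
type Instance struct{ ID string }

// These helpers stand in for your load balancer / process manager APIs.
func deregister(i *Instance)      { fmt.Println("removed from LB:", i.ID) }
func register(i *Instance)        { fmt.Println("added back to LB:", i.ID) }
func stopOldVersion(i *Instance)  { fmt.Println("stopped v1 on:", i.ID) }
func startNewVersion(i *Instance) { fmt.Println("started v2 on:", i.ID) }
func healthy(i *Instance) bool    { return true } // stand-in for a real health check

func rollingUpdate(instances []*Instance) error {
    for _, inst := range instances {
        deregister(inst)            // 1. stop routing new traffic to it
        time.Sleep(5 * time.Second) // 2. let in-flight requests drain
        stopOldVersion(inst)        // 3. stop the old version
        startNewVersion(inst)       // 4. start the new version
        if !healthy(inst) {         // 5. verify before taking traffic again
            return errors.New("health check failed on " + inst.ID)
        }
        register(inst)              // 6. put it back behind the load balancer
    }
    return nil
}

func main() {
    _ = rollingUpdate([]*Instance{{ID: "i-1"}, {ID: "i-2"}})
}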
The Critical Details
The devil is in the details. Here’s what actually happens behind the scenes:
Health Checks Are Everything
Your orchestrator relies on health checks to determine if a new instance is ready. You need two types:
- Liveness probes: Is the application running?
- Readiness probes: Is the application ready to serve traffic?
A common mistake is using the same endpoint for both. Your app might be “alive” (process running) but not “ready” (database connections not established).
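One way to keep the two probes honest is to serve them from separate handlers: liveness only confirms the process can respond, while readiness also checks dependencies. A minimal Go sketch, where the db variable is a placeholder for your own connection pool:

package main

import (
    "database/sql"
    "net/http"
)

var db *sql.DB // placeholder: opened with sql.Open(...) at startup in a real app

func main() {
    // Liveness: the process is up and can answer HTTP. No dependency checks here,
    // so a flaky database doesn't cause the kubelet to restart a healthy process.
    http.HandleFunc("/health/live", func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusOK)
    })

    // Readiness: only report ready once dependencies are reachable, so the
    // instance doesn't receive traffic before it can actually serve it.
    http.HandleFunc("/health/ready", func(w http.ResponseWriter, r *http.Request) {
        if db == nil || db.Ping() != nil {
            w.WriteHeader(http.StatusServiceUnavailable)
            return
        }
        w.WriteHeader(http.StatusOK)
    })

    http.ListenAndServe(":8080", nil)
}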
Connection Draining
When you remove an instance from a load balancer, existing connections need time to complete. Good rolling update implementations:
- Stop sending new requests to the instance
- Wait for existing requests to complete (connection draining)
- Only then terminate the old version
Skip this, and you’ll drop active requests—exactly what you’re trying to avoid.
Surge and Unavailable Instances
In Kubernetes terms:
- maxSurge: How many extra instances you can create during the update (e.g., 25% means that with 4 instances, you can temporarily have 5)
- maxUnavailable: How many instances can be down during the update (e.g., 25% means that with 4 instances, at most 1 can be down)
These parameters control update speed vs. resource usage. More surge = faster updates but higher temporary resource consumption.
Implementation Approaches
Kubernetes RollingUpdate Strategy
Kubernetes makes rolling updates a first-class citizen:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: app
          image: myapp:v2
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10
Why this configuration?
- maxSurge: 1, maxUnavailable: 0: Never reduces capacity below the desired state. Always maintains 4 instances, temporarily scaling to 5 during the update.
- readinessProbe: Waits 5 seconds after startup, then checks every 5 seconds. The instance only receives traffic once this passes.
- livenessProbe: Gives the app 15 seconds to start, then checks every 10 seconds. If it fails, the kubelet restarts the container.
AWS ECS Rolling Deployments
Amazon ECS offers similar capabilities with different terminology:
{
  "deploymentConfiguration": {
    "maximumPercent": 200,
    "minimumHealthyPercent": 100,
    "deploymentCircuitBreaker": {
      "enable": true,
      "rollback": true
    }
  }
}
- minimumHealthyPercent: 100: Never drop below desired capacity
- maximumPercent: 200: Can double capacity during the deployment
- deploymentCircuitBreaker: Automatically rolls back failed deployments
Docker Swarm
Docker Swarm’s approach is similarly straightforward:
docker service update \
--update-parallelism 2 \
--update-delay 30s \
--update-failure-action rollback \
--image myapp:v2 \
web-app
This updates 2 containers at a time, waits 30 seconds between batches, and automatically rolls back if updates fail.
Best Practices for Production
1. Implement Comprehensive Health Checks
Your health checks should verify:
- Application process is running
- Critical dependencies are reachable (database, cache, etc.)
- Application can serve requests (not just that the port is open)
Example robust health check in Go:
func healthCheck(w http.ResponseWriter, r *http.Request) {
    // Check database connectivity
    if err := db.Ping(); err != nil {
        w.WriteHeader(http.StatusServiceUnavailable)
        json.NewEncoder(w).Encode(map[string]string{
            "status": "unhealthy",
            "reason": "database unreachable",
        })
        return
    }

    // Check Redis connectivity
    if err := redisClient.Ping().Err(); err != nil {
        w.WriteHeader(http.StatusServiceUnavailable)
        json.NewEncoder(w).Encode(map[string]string{
            "status": "unhealthy",
            "reason": "cache unreachable",
        })
        return
    }

    w.WriteHeader(http.StatusOK)
    json.NewEncoder(w).Encode(map[string]string{
        "status": "healthy",
    })
}
2. Set Appropriate Timeouts
Configure timeouts at every layer:
- Health check timeout: How long to wait for health check response
- Connection draining timeout: How long to wait for existing connections
- Update timeout: Overall time limit for the entire rolling update
In Kubernetes:
spec:
  progressDeadlineSeconds: 600   # Fail the deployment if not done in 10 min
  minReadySeconds: 10            # Wait 10s after ready before continuing
  template:
    spec:
      containers:
        - name: app
          readinessProbe:
            timeoutSeconds: 5      # Health check must respond in 5s
            successThreshold: 1    # Must succeed once to be ready
            failureThreshold: 3    # Allow 3 failures before marking unready
3. Use PodDisruptionBudgets (Kubernetes)
Prevent rolling updates from causing capacity issues during node maintenance:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb
spec:
  minAvailable: 3
  selector:
    matchLabels:
      app: web-app
This ensures at least 3 pods are always available, even during voluntary disruptions like node drains.
4. Implement Graceful Shutdown
Your application should handle termination signals properly:
package main

import (
    "context"
    "log"
    "net/http"
    "os"
    "os/signal"
    "syscall"
    "time"
)

func main() {
    server := &http.Server{Addr: ":8080"}

    go func() {
        if err := server.ListenAndServe(); err != http.ErrServerClosed {
            log.Fatal(err)
        }
    }()

    // Wait for interrupt signal
    quit := make(chan os.Signal, 1)
    signal.Notify(quit, syscall.SIGINT, syscall.SIGTERM)
    <-quit
    log.Println("Shutting down server...")

    // Give existing requests 30 seconds to complete
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()
    if err := server.Shutdown(ctx); err != nil {
        log.Fatal("Server forced to shutdown:", err)
    }

    log.Println("Server exited gracefully")
}
5. Monitor Deployment Progress
Use metrics to track rollout health:
Key metrics to watch:
- Error rate: Should not spike during rollout
- Response latency: Should remain stable
- Request success rate: Should stay at baseline
- Health check failures: Should be zero or minimal
Set up alerts that automatically pause or roll back deployments if these metrics degrade.
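As a rough sketch of such a guardrail, the loop below polls an error-rate metric during the rollout and triggers a rollback when it crosses a threshold. The errorRate function and the 5% threshold are assumptions; in practice you would query Prometheus, Datadog, or your APM, and the rollback command assumes a Kubernetes Deployment named web-app.

package main

import (
    "log"
    "os/exec"
    "time"
)

// errorRate is a placeholder for a real metrics query (Prometheus, Datadog, ...).
func errorRate() float64 { return 0.01 }

func main() {
    const threshold = 0.05 // assumption: roll back if more than 5% of requests fail

    // Watch the metric for roughly the duration of the rollout.
    for i := 0; i < 30; i++ {
        if errorRate() > threshold {
            log.Println("error rate above threshold, rolling back")
            // "kubectl rollout undo" reverts the Deployment to its previous revision.
            out, err := exec.Command("kubectl", "rollout", "undo", "deployment/web-app").CombinedOutput()
            if err != nil {
                log.Fatalf("rollback failed: %v: %s", err, out)
            }
            return
        }
        time.Sleep(10 * time.Second)
    }
    log.Println("rollout completed without metric degradation")
}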
Common Pitfalls and How to Avoid Them
Pitfall 1: Database Migrations
Problem: New code expects new schema, but rolling update means old and new code run simultaneously.
Solution: Use backward-compatible migrations:
- Deploy code that works with both old and new schema
- Run migration to add new columns/tables (don’t remove old ones yet)
- Deploy code that uses only new schema
- Run migration to remove old columns/tables
This “expand-contract” pattern ensures compatibility during rollout.
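During the expand phase, the application writes to both the old and the new schema so that either version of the code reads consistent data. A hedged sketch, assuming a hypothetical users table whose name column is being split into first_name and last_name:

package user

import "database/sql"

// CreateUser writes both the old and the new columns during the expand phase,
// so v1 (which reads "name") and v2 (which reads "first_name"/"last_name") see
// consistent data while both versions run side by side. The schema is hypothetical.
func CreateUser(db *sql.DB, first, last string) error {
    _, err := db.Exec(
        `INSERT INTO users (name, first_name, last_name) VALUES ($1, $2, $3)`,
        first+" "+last, first, last,
    )
    return err
}

Once every instance runs the new code, the contract migration can drop the old column and the dual write disappears.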
Pitfall 2: Incorrect Health Checks
Problem: Health check returns 200 before app is truly ready, causing traffic to fail.
Solution:
- Separate readiness from liveness
- Test dependencies in readiness probe
- Use initialDelaySeconds to account for startup time
- Don’t check external services in liveness (they’re not under your control)
Pitfall 3: Insufficient Resources
Problem: Using maxSurge creates temporary resource spike, hitting cluster limits.
Solution:
- Set resource requests/limits on pods
- Ensure cluster has headroom for surge instances
- Or use maxSurge: 0 with maxUnavailable: 1 for resource-constrained environments (slower but no extra resources needed)
Pitfall 4: Ignoring Connection Draining
Problem: Terminating pods immediately drops active connections.
Solution:
- Configure terminationGracePeriodSeconds
- Implement graceful shutdown in application
- Ensure load balancer respects connection draining
spec:
  terminationGracePeriodSeconds: 60
  containers:
    - name: app
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 10"]
The sleep gives the load balancer time to deregister the pod before shutdown begins.
Pitfall 5: Too Aggressive Rollout
Problem: Updating many instances simultaneously increases risk if new version has issues.
Solution:
- Start with conservative maxSurge and maxUnavailable
- Use progressive delivery (canary) for high-risk changes
- Implement automatic rollback based on metrics
When NOT to Use Rolling Updates
Rolling updates aren’t always the right choice:
Use Blue-Green Instead When:
- Database migrations are complex: Can’t easily make backward-compatible changes
- Need instant rollback: Blue-green offers near-instantaneous rollback by switching traffic
- Testing new infrastructure: Want to validate entire new stack before switching
Use Canary Instead When:
- High-risk changes: Want to test with small percentage of users first
- Need gradual exposure: Progressive rollout based on metrics
- A/B testing features: Want controlled comparison between versions
Use Recreate Strategy When:
- Breaking changes: New and old versions absolutely cannot coexist
- Resource constrained: Can’t afford any temporary surge in resource usage
- Development/staging environments: Downtime is acceptable
Monitoring and Validation
Successful rolling updates require robust monitoring:
Pre-Deployment Checks
# Verify image exists
docker pull myapp:v2
# Check resource quotas
kubectl describe quota
# Verify configuration
kubectl diff -f deployment.yaml
During Deployment
# Watch rollout status
kubectl rollout status deployment/web-app
# Monitor pod events
kubectl get events --watch
# Check logs of new pods
kubectl logs -f deployment/web-app
Post-Deployment Validation
- Run smoke tests against production
- Verify metrics dashboards (Grafana, Datadog, etc.)
- Check error tracking (Sentry, Rollbar, etc.)
- Monitor SLA metrics (uptime, latency, error rate)
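Smoke tests don’t need to be elaborate; a small program that hits a few critical endpoints and exits non-zero on any failure is enough to gate the rollout. A minimal sketch, with a placeholder base URL and paths:

package main

import (
    "fmt"
    "net/http"
    "os"
    "time"
)

func main() {
    client := &http.Client{Timeout: 5 * time.Second}

    // Endpoints that must return 200 for the deployment to be considered good.
    // The base URL and paths are placeholders for your own service.
    endpoints := []string{
        "https://api.example.com/health/ready",
        "https://api.example.com/api/v1/products",
    }

    for _, url := range endpoints {
        resp, err := client.Get(url)
        if err != nil {
            fmt.Println("FAIL:", url, err)
            os.Exit(1)
        }
        resp.Body.Close()
        if resp.StatusCode != http.StatusOK {
            fmt.Println("FAIL:", url, "status", resp.StatusCode)
            os.Exit(1)
        }
        fmt.Println("OK:", url)
    }
}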
Real-World Example: Updating a Production API
Let’s walk through a real scenario:
Context: API serving 1000 req/sec, 10 instances, need to deploy security patch
Deployment configuration:
replicas: 10
strategy:
  rollingUpdate:
    maxSurge: 2        # Can temporarily have 12 instances
    maxUnavailable: 0  # Never drop below 10 instances
Timeline:
- T+0s: Deployment starts, creates 2 new pods (v2)
- T+15s: New pods pass health checks, join load balancer
- T+16s: Terminate 2 old pods, create 2 more new pods
- T+31s: Next 2 v2 pods ready, 2 more v1 pods terminated
- T+90s: All 10 instances updated, 2 surge pods terminated
Result: Update completed in 90 seconds with zero dropped requests.
Key success factors:
- Fast health checks (5s interval)
- Proper readiness probe (waited for DB connections)
- Sufficient surge capacity (20% overhead acceptable)
- Application graceful shutdown (30s drain time)
Conclusion
Rolling updates represent a mature, battle-tested approach to zero-downtime deployments. They’re not magic—they require careful configuration, robust health checks, and monitoring—but when implemented correctly, they enable teams to deploy with confidence.
The key takeaways:
- Health checks are non-negotiable: Invest time in making them accurate
- Understand your parameters: maxSurge and maxUnavailable control everything
- Plan for the worst: Implement graceful shutdown and connection draining
- Monitor actively: Deployments should be observable events, not black boxes
- Know the alternatives: Rolling updates aren’t always the right choice
In production, the ability to update systems without downtime isn’t just a nice-to-have—it’s table stakes. Master rolling updates, and you’ll have a deployment strategy that scales from small services to massive distributed systems.
Now go forth and deploy fearlessly. Your users won’t even notice.