It’s 2 AM. Your team just pushed a critical security patch to production. Users are still actively shopping, chatting, streaming—completely unaware that you’re replacing the very servers handling their requests. No downtime. No service interruption. Just a seamless transition from old to new.
This is the promise of rolling updates, and when done right, it’s one of the most elegant deployment strategies available to modern DevOps teams.
What Are Rolling Updates?
A rolling update is a deployment strategy that incrementally replaces instances of your application with new versions, ensuring that some instances remain available throughout the deployment process. Instead of taking down your entire application, updating it, and bringing it back up (the “big bang” approach), you update instances in batches while maintaining service availability.
Think of it like replacing planks on a bridge while people are still walking across—challenging, but absolutely possible with the right approach.
Why Rolling Updates Matter
The Zero-Downtime Imperative
In today’s always-on digital economy, downtime is measured not just in lost revenue, but in damaged reputation and customer trust. Consider these realities:
- Financial Impact: For many e-commerce platforms, even 5 minutes of downtime during peak hours can mean tens of thousands of dollars in lost revenue
- User Experience: Modern users expect 24/7 availability. A maintenance window announcement feels antiquated
- Competitive Advantage: The ability to deploy features and fixes without service interruption is a genuine competitive differentiator
Faster Feedback Loops
Rolling updates enable a progressive rollout model. You can:
- Deploy to a small percentage of your fleet first
- Monitor metrics and error rates in real-time
- Detect issues before they impact all users
- Roll back quickly if problems emerge
Risk Mitigation
By updating incrementally, you create natural checkpoints. If the first of ten instances fails after its update, you’ve impacted 10% of capacity, not 100%. This built-in safety mechanism is invaluable in production environments.
How Rolling Updates Work: The Mechanics
At its core, a rolling update follows this pattern:
- Remove an instance from the load balancer (or service mesh)
- Stop the old version on that instance
- Start the new version on that instance
- Run health checks to verify the new version is working
- Add the instance back to the load balancer
- Repeat for the next instance
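To make the loop concrete, here is a minimal Go sketch of the pattern, not any particular orchestrator’s implementation. The Instance type and the deregister/register/health helpers are hypothetical stand-ins for whatever your platform provides; real orchestrators layer surge capacity, retries, and timeouts on top of this.

package main

import (
    "errors"
    "fmt"
    "time"
)

// Instance is a hypothetical handle to one running copy of the application.
type Instance struct{ ID string }

// These helpers stand in for your load balancer / process manager APIs.
func deregister(i *Instance)      { fmt.Println("removed from LB:", i.ID) }
func register(i *Instance)        { fmt.Println("added back to LB:", i.ID) }
func stopOldVersion(i *Instance)  { fmt.Println("stopped v1 on:", i.ID) }
func startNewVersion(i *Instance) { fmt.Println("started v2 on:", i.ID) }
func healthy(i *Instance) bool    { return true } // stand-in for a real health check

func rollingUpdate(instances []*Instance) error {
    for _, inst := range instances {
        deregister(inst)            // 1. stop routing new traffic to it
        time.Sleep(5 * time.Second) // 2. let in-flight requests drain
        stopOldVersion(inst)        // 3. stop the old version
        startNewVersion(inst)       // 4. start the new version
        if !healthy(inst) {         // 5. verify before taking traffic again
            return errors.New("health check failed on " + inst.ID)
        }
        register(inst)              // 6. put it back behind the load balancer
    }
    return nil
}

func main() {
    _ = rollingUpdate([]*Instance{{ID: "i-1"}, {ID: "i-2"}})
}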
The Critical Details
The devil is in the details. Here’s what actually happens behind the scenes:
Health Checks Are Everything
Your orchestrator relies on health checks to determine if a new instance is ready. You need two types:
- Liveness probes: Is the application running?
- Readiness probes: Is the application ready to serve traffic?
A common mistake is using the same endpoint for both. Your app might be “alive” (process running) but not “ready” (database connections not established).
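One way to keep the two probes honest is to serve them from separate handlers: liveness only confirms the process can respond, while readiness also checks dependencies. A minimal Go sketch, where the db variable is a placeholder for your own connection pool:

package main

import (
    "database/sql"
    "net/http"
)

var db *sql.DB // placeholder: opened with sql.Open(...) at startup in a real app

func main() {
    // Liveness: the process is up and can answer HTTP. No dependency checks here,
    // so a flaky database doesn't cause the kubelet to restart a healthy process.
    http.HandleFunc("/health/live", func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusOK)
    })

    // Readiness: only report ready once dependencies are reachable, so the
    // instance doesn't receive traffic before it can actually serve it.
    http.HandleFunc("/health/ready", func(w http.ResponseWriter, r *http.Request) {
        if db == nil || db.Ping() != nil {
            w.WriteHeader(http.StatusServiceUnavailable)
            return
        }
        w.WriteHeader(http.StatusOK)
    })

    http.ListenAndServe(":8080", nil)
}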
Connection Draining
When you remove an instance from a load balancer, existing connections need time to complete. Good rolling update implementations:
- Stop sending new requests to the instance
- Wait for existing requests to complete (connection draining)
- Only then terminate the old version
Skip this, and you’ll drop active requests—exactly what you’re trying to avoid.
Surge and Unavailable Instances
In Kubernetes terms:
- maxSurge: How many extra instances you can create during the update (e.g., 25% means that with 4 instances, you can temporarily have 5)
- maxUnavailable: How many instances can be down during the update (e.g., 25% means that with 4 instances, at most 1 can be down)
These parameters control update speed vs. resource usage. More surge = faster updates but higher temporary resource consumption.
Implementation Approaches
Kubernetes RollingUpdate Strategy
Kubernetes makes rolling updates a first-class citizen:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: app
          image: myapp:v2
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10
Why this configuration?
- maxSurge: 1, maxUnavailable: 0: Never reduces capacity below the desired state. Always maintains 4 instances, temporarily scaling to 5 during the update.
- readinessProbe: Waits 5 seconds after startup, then checks every 5 seconds. The instance only receives traffic once this passes.
- livenessProbe: Gives the app 15 seconds to start, then checks every 10 seconds. If it fails, the kubelet restarts the container.
AWS ECS Rolling Deployments
Amazon ECS offers similar capabilities with different terminology:
{
  "deploymentConfiguration": {
    "maximumPercent": 200,
    "minimumHealthyPercent": 100,
    "deploymentCircuitBreaker": {
      "enable": true,
      "rollback": true
    }
  }
}
- minimumHealthyPercent: 100: Never drop below desired capacity
- maximumPercent: 200: Can double capacity during the deployment
- deploymentCircuitBreaker: Automatically rolls back failed deployments
Docker Swarm
Docker Swarm’s approach is similarly straightforward:
docker service update \
--update-parallelism 2 \
--update-delay 30s \
--update-failure-action rollback \
--image myapp:v2 \
web-app
This updates 2 containers at a time, waits 30 seconds between batches, and automatically rolls back if updates fail.
Best Practices for Production
1. Implement Comprehensive Health Checks
Your health checks should verify:
- Application process is running
- Critical dependencies are reachable (database, cache, etc.)
- Application can serve requests (not just that the port is open)
Example robust health check in Go:
func healthCheck(w http.ResponseWriter, r *http.Request) {
    // Check database connectivity
    if err := db.Ping(); err != nil {
        w.WriteHeader(http.StatusServiceUnavailable)
        json.NewEncoder(w).Encode(map[string]string{
            "status": "unhealthy",
            "reason": "database unreachable",
        })
        return
    }

    // Check Redis connectivity
    if err := redisClient.Ping().Err(); err != nil {
        w.WriteHeader(http.StatusServiceUnavailable)
        json.NewEncoder(w).Encode(map[string]string{
            "status": "unhealthy",
            "reason": "cache unreachable",
        })
        return
    }

    w.WriteHeader(http.StatusOK)
    json.NewEncoder(w).Encode(map[string]string{
        "status": "healthy",
    })
}
2. Set Appropriate Timeouts
Configure timeouts at every layer:
- Health check timeout: How long to wait for health check response
- Connection draining timeout: How long to wait for existing connections
- Update timeout: Overall time limit for the entire rolling update
In Kubernetes:
spec:
  progressDeadlineSeconds: 600   # Fail the deployment if not done in 10 min
  minReadySeconds: 10            # Wait 10s after ready before continuing
  template:
    spec:
      containers:
        - name: app
          readinessProbe:
            timeoutSeconds: 5      # Health check must respond in 5s
            successThreshold: 1    # Must succeed once to be ready
            failureThreshold: 3    # Allow 3 failures before marking unready
3. Use PodDisruptionBudgets (Kubernetes)
Prevent rolling updates from causing capacity issues during node maintenance:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb
spec:
  minAvailable: 3
  selector:
    matchLabels:
      app: web-app
This ensures at least 3 pods are always available, even during voluntary disruptions like node drains.
4. Implement Graceful Shutdown
Your application should handle termination signals properly:
package main

import (
    "context"
    "log"
    "net/http"
    "os"
    "os/signal"
    "syscall"
    "time"
)

func main() {
    server := &http.Server{Addr: ":8080"}

    go func() {
        if err := server.ListenAndServe(); err != http.ErrServerClosed {
            log.Fatal(err)
        }
    }()

    // Wait for interrupt signal
    quit := make(chan os.Signal, 1)
    signal.Notify(quit, syscall.SIGINT, syscall.SIGTERM)
    <-quit
    log.Println("Shutting down server...")

    // Give existing requests 30 seconds to complete
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()
    if err := server.Shutdown(ctx); err != nil {
        log.Fatal("Server forced to shutdown:", err)
    }

    log.Println("Server exited gracefully")
}
5. Monitor Deployment Progress
Use metrics to track rollout health:
Key metrics to watch:
- Error rate: Should not spike during rollout
- Response latency: Should remain stable
- Request success rate: Should stay at baseline
- Health check failures: Should be zero or minimal
Set up alerts that automatically pause or roll back deployments if these metrics degrade.
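As a rough sketch of such a guardrail, the loop below polls an error-rate metric during the rollout and triggers a rollback when it crosses a threshold. The errorRate function and the 5% threshold are assumptions; in practice you would query Prometheus, Datadog, or your APM, and the rollback command assumes a Kubernetes Deployment named web-app.

package main

import (
    "log"
    "os/exec"
    "time"
)

// errorRate is a placeholder for a real metrics query (Prometheus, Datadog, ...).
func errorRate() float64 { return 0.01 }

func main() {
    const threshold = 0.05 // assumption: roll back if more than 5% of requests fail

    // Watch the metric for roughly the duration of the rollout.
    for i := 0; i < 30; i++ {
        if errorRate() > threshold {
            log.Println("error rate above threshold, rolling back")
            // "kubectl rollout undo" reverts the Deployment to its previous revision.
            out, err := exec.Command("kubectl", "rollout", "undo", "deployment/web-app").CombinedOutput()
            if err != nil {
                log.Fatalf("rollback failed: %v: %s", err, out)
            }
            return
        }
        time.Sleep(10 * time.Second)
    }
    log.Println("rollout completed without metric degradation")
}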
Common Pitfalls and How to Avoid Them
Pitfall 1: Database Migrations
Problem: New code expects new schema, but rolling update means old and new code run simultaneously.
Solution: Use backward-compatible migrations:
- Deploy code that works with both old and new schema
- Run migration to add new columns/tables (don’t remove old ones yet)
- Deploy code that uses only new schema
- Run migration to remove old columns/tables
This “expand-contract” pattern ensures compatibility during rollout.
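During the expand phase, the application writes to both the old and the new schema so that either version of the code reads consistent data. A hedged sketch, assuming a hypothetical users table whose name column is being split into first_name and last_name:

package user

import "database/sql"

// CreateUser writes both the old and the new columns during the expand phase,
// so v1 (which reads "name") and v2 (which reads "first_name"/"last_name") see
// consistent data while both versions run side by side. The schema is hypothetical.
func CreateUser(db *sql.DB, first, last string) error {
    _, err := db.Exec(
        `INSERT INTO users (name, first_name, last_name) VALUES ($1, $2, $3)`,
        first+" "+last, first, last,
    )
    return err
}

Once every instance runs the new code, the contract migration can drop the old column and the dual write disappears.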
Pitfall 2: Incorrect Health Checks
Problem: Health check returns 200 before app is truly ready, causing traffic to fail.
Solution:
- Separate readiness from liveness
- Test dependencies in readiness probe
- Use initialDelaySeconds to account for startup time
- Don’t check external services in liveness (they’re not under your control)
Pitfall 3: Insufficient Resources
Problem: Using maxSurge creates temporary resource spike, hitting cluster limits.
Solution:
- Set resource requests/limits on pods
- Ensure cluster has headroom for surge instances
- Or use maxSurge: 0 with maxUnavailable: 1 for resource-constrained environments (slower but no extra resources needed)
Pitfall 4: Ignoring Connection Draining
Problem: Terminating pods immediately drops active connections.
Solution:
- Configure terminationGracePeriodSeconds
- Implement graceful shutdown in application
- Ensure load balancer respects connection draining
spec:
  terminationGracePeriodSeconds: 60
  containers:
    - name: app
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 10"]
The sleep gives the load balancer time to deregister the pod before shutdown begins.
Pitfall 5: Too Aggressive Rollout
Problem: Updating many instances simultaneously increases risk if new version has issues.
Solution:
- Start with conservative maxSurge and maxUnavailable
- Use progressive delivery (canary) for high-risk changes
- Implement automatic rollback based on metrics
When NOT to Use Rolling Updates
Rolling updates aren’t always the right choice:
Use Blue-Green Instead When:
- Database migrations are complex: Can’t easily make backward-compatible changes
- Need instant rollback: Blue-green offers near-instantaneous rollback by switching traffic
- Testing new infrastructure: Want to validate entire new stack before switching
Use Canary Instead When:
- High-risk changes: Want to test with small percentage of users first
- Need gradual exposure: Progressive rollout based on metrics
- A/B testing features: Want controlled comparison between versions
Use Recreate Strategy When:
- Breaking changes: New and old versions absolutely cannot coexist
- Resource constrained: Can’t afford any temporary surge in resource usage
- Development/staging environments: Downtime is acceptable
Monitoring and Validation
Successful rolling updates require robust monitoring:
Pre-Deployment Checks
# Verify image exists
docker pull myapp:v2
# Check resource quotas
kubectl describe quota
# Verify configuration
kubectl diff -f deployment.yaml
During Deployment
# Watch rollout status
kubectl rollout status deployment/web-app
# Monitor pod events
kubectl get events --watch
# Check logs of new pods
kubectl logs -f deployment/web-app
Post-Deployment Validation
- Run smoke tests against production
- Verify metrics dashboards (Grafana, Datadog, etc.)
- Check error tracking (Sentry, Rollbar, etc.)
- Monitor SLA metrics (uptime, latency, error rate)
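Smoke tests don’t need to be elaborate; a small program that hits a few critical endpoints and exits non-zero on any failure is enough to gate the rollout. A minimal sketch, with a placeholder base URL and paths:

package main

import (
    "fmt"
    "net/http"
    "os"
    "time"
)

func main() {
    client := &http.Client{Timeout: 5 * time.Second}

    // Endpoints that must return 200 for the deployment to be considered good.
    // The base URL and paths are placeholders for your own service.
    endpoints := []string{
        "https://api.example.com/health/ready",
        "https://api.example.com/api/v1/products",
    }

    for _, url := range endpoints {
        resp, err := client.Get(url)
        if err != nil {
            fmt.Println("FAIL:", url, err)
            os.Exit(1)
        }
        resp.Body.Close()
        if resp.StatusCode != http.StatusOK {
            fmt.Println("FAIL:", url, "status", resp.StatusCode)
            os.Exit(1)
        }
        fmt.Println("OK:", url)
    }
}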
Real-World Example: Updating a Production API
Let’s walk through a real scenario:
Context: API serving 1000 req/sec, 10 instances, need to deploy security patch
Deployment configuration:
replicas: 10
strategy:
  rollingUpdate:
    maxSurge: 2        # Can temporarily have 12 instances
    maxUnavailable: 0  # Never drop below 10 instances
Timeline:
- T+0s: Deployment starts, creates 2 new pods (v2)
- T+15s: New pods pass health checks, join load balancer
- T+16s: Terminate 2 old pods, create 2 more new pods
- T+31s: Next 2 v2 pods ready, 2 more v1 pods terminated
- T+90s: All 10 instances updated, 2 surge pods terminated
Result: Update completed in 90 seconds with zero dropped requests.
Key success factors:
- Fast health checks (5s interval)
- Proper readiness probe (waited for DB connections)
- Sufficient surge capacity (20% overhead acceptable)
- Application graceful shutdown (30s drain time)
Conclusion
Rolling updates represent a mature, battle-tested approach to zero-downtime deployments. They’re not magic—they require careful configuration, robust health checks, and monitoring—but when implemented correctly, they enable teams to deploy with confidence.
The key takeaways:
- Health checks are non-negotiable: Invest time in making them accurate
- Understand your parameters: maxSurge and maxUnavailable control everything
- Plan for the worst: Implement graceful shutdown and connection draining
- Monitor actively: Deployments should be observable events, not black boxes
- Know the alternatives: Rolling updates aren’t always the right choice
In production, the ability to update systems without downtime isn’t just a nice-to-have—it’s table stakes. Master rolling updates, and you’ll have a deployment strategy that scales from small services to massive distributed systems.
Now go forth and deploy fearlessly. Your users won’t even notice.