The Zero-Downtime Kubernetes Upgrade: A Step-by-Step Guide

Remember Reddit’s Pi Day outage in 2023? A single Kubernetes upgrade triggered hours of downtime. Upgrades are messy, error-prone, and full of hidden deprecations, and even with scripts or partial automation they remain risky and inconsistent.

Kubernetes upgrades don't have to mean downtime. Plural.sh eliminates this risk with Upgrade Autopilot, an AI-powered feature that automates the entire upgrade process. It handles pre-flight checks, add-on compatibility analysis, and AI-driven insights, turning weeks of manual work into a predictable, one-click rollout across Amazon Elastic Kubernetes Service (EKS), Azure Kubernetes Service (AKS), Google Kubernetes Engine (GKE), and other managed services.

In this article, you'll learn how to perform a zero-downtime Kubernetes upgrade of an Amazon EKS cluster using Plural’s Upgrade Autopilot. We'll walk through automated pre-flight checks, canary-based node upgrades, and real-time health monitoring to show you how simple the process can be.

What Is Plural and Upgrade Autopilot?

Plural.sh is an enterprise-scale, AI-native Kubernetes management platform built for platform teams running cloud-native, production-ready infrastructure. It integrates AI into day-2 operations, automates complex upgrades, and manages Kubernetes fleets seamlessly across any cloud, data center, or edge environment—all under your control.

Upgrade Autopilot is Plural’s AI-powered upgrade assistant. It turns one of the riskiest and most time-consuming operational tasks—Kubernetes upgrades—into a predictable, one-click rollout. Notable features include: 

  • Automated pre-flight checks: Validate core Kubernetes components and third-party add-ons before upgrading.
  • API deprecation detection: Monitor GitOps configs and cloud provider deployments for deprecated APIs.
  • Add-on compatibility checks: Verify version requirements against Plural’s compatibility database.
  • Upgrade blocker analysis: Surface blockers with actionable recommendations.
  • Smart version selection: Identify the latest safe Kubernetes version using an automated compatibility matrix.
  • Automated rollout pipelines: Orchestrate phased upgrades across environments with built-in safety checks and GitOps-based rollback.

Step-by-Step Kubernetes Upgrade Process with Plural

Let's walk through upgrading an Amazon EKS cluster using Plural's Upgrade Autopilot.

Prerequisites

Before you start, make sure you have the following prerequisites:

  • An Amazon EKS cluster managed by Plural: Your target cluster must be registered with Plural so that Upgrade Autopilot can run checks and orchestrate rollouts. For setup guidance, refer to this video tutorial.
  • Access to the Plural Console and kubectl: Use the Plural web console to view Kubernetes dashboards, and kubectl for command-line management of Kubernetes resources.
  • Basic Kubernetes knowledge: You should be comfortable checking cluster health, reviewing pod status, and performing basic troubleshooting.

Once you have all these prerequisites, you're ready to begin.

Set Up Upgrade Insights

Upgrade Insights provides a dashboard for upgrade-related information across cloud accounts and regions. It combines intelligence from providers like Amazon EKS with Plural's own analysis to provide a comprehensive view of cluster upgradeability.

For EKS users, this is especially valuable because the service has its own add-on ecosystem that can affect upgrades. In the latest Plural release, Upgrade Insights is enabled by default for all Plural-managed EKS clusters. Any cluster configured with plural up --cloud comes with Upgrade Insights pre-installed as part of Global Services. No additional setup is required.

To view cluster insights, log in to the Plural app and click Go to Console to access your management cluster console.

Find your target cluster and click the Upgrade button on the right.

In the panel that opens, expand the Check API Deprecations section and switch to the Detected by Cloud Provider tab to see insights from your cloud provider.
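For readers who also want to inspect the provider-side findings outside the console, here is a minimal sketch that filters the output of the EKS Insights API. It assumes the `boto3` SDK is installed and AWS credentials are configured; the helper names and the cluster name `demo` are hypothetical, and this is not a description of how Plural collects the data.

```python
from typing import Any


def failing_insights(response: dict[str, Any]) -> list[tuple[str, str]]:
    """Return (name, status) for every insight that is not PASSING.

    `response` is expected to look like an EKS ListInsights result:
    {"insights": [{"name": ..., "insightStatus": {"status": ...}}, ...]}
    """
    results = []
    for insight in response.get("insights", []):
        # Treat a missing status as UNKNOWN so it is surfaced, not hidden.
        status = insight.get("insightStatus", {}).get("status", "UNKNOWN")
        if status != "PASSING":
            results.append((insight.get("name", "<unnamed>"), status))
    return results


def fetch_insights(cluster_name: str) -> dict[str, Any]:
    """Fetch one page of insights for a cluster (needs boto3 and credentials)."""
    import boto3  # imported lazily so the filter above works without the SDK

    client = boto3.client("eks")
    return client.list_insights(clusterName=cluster_name)
```

Calling `failing_insights(fetch_insights("demo"))` would list the warnings and errors for a cluster named `demo`. The sketch reads only the first page of results, so a real script would also need to follow `nextToken`.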

View the Upgrade Plan

The upgrade plan gives you a consolidated view of everything you need to upgrade a cluster safely. It highlights potential blockers, warnings, and eligibility, along with any preparation or remediation steps needed. This makes Kubernetes version updates more predictable and less risky.

To view the upgrade plan, log in to the Plural app and click Go to Console to access your management cluster console. Then, find your target cluster and click Upgrade.

Review Pre-Flight Checks and Other Upgrade Blockers

Before starting a cluster upgrade, Plural automatically runs a detailed analysis of potential blockers and issues. Review the upgrade plan carefully and resolve all warnings before proceeding.

Pre-Flight Checklist

Plural runs a pre-flight checklist to validate infrastructure prerequisites before upgrading. For example, a control plane can only be upgraded one minor version at a time (e.g., v1.32 to v1.33), and kubelets must stay within the supported version skew of the control plane. The checklist confirms these requirements before proceeding.
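To make the version rules concrete, the following sketch checks an upgrade step and a kubelet skew by hand. It is an illustration, not Plural's checklist logic: the function names are hypothetical, and the default `max_skew` of 3 reflects the upstream Kubernetes skew policy for v1.28 and later, so lower it if your platform is stricter.

```python
def parse_minor(version: str) -> tuple[int, int]:
    """Parse 'v1.33', '1.33', or 'v1.33.1-eks-abc' into (major, minor)."""
    parts = version.lstrip("v").split(".")
    return int(parts[0]), int(parts[1])


def check_upgrade_step(current: str, target: str) -> bool:
    """True if `target` is exactly one minor version above `current`."""
    cur_major, cur_minor = parse_minor(current)
    tgt_major, tgt_minor = parse_minor(target)
    return cur_major == tgt_major and tgt_minor - cur_minor == 1


def check_kubelet_skew(control_plane: str, kubelet: str, max_skew: int = 3) -> bool:
    """True if the kubelet is not newer than the control plane and trails
    it by at most `max_skew` minor versions."""
    cp_major, cp_minor = parse_minor(control_plane)
    kl_major, kl_minor = parse_minor(kubelet)
    if cp_major != kl_major:
        return False
    return 0 <= cp_minor - kl_minor <= max_skew
```

For instance, `check_upgrade_step("v1.32", "v1.33")` passes, while a jump from v1.31 straight to v1.33 does not.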

Check Deprecated APIs

Deprecated APIs are a common cause of upgrade failures. Plural automatically detects them in two ways:

  • GitOps-based detection: Plural's fleet-scale GitOps engine inspects applied resources to flag any deprecated APIs in use.
  • Cloud provider insights: Upgrade Insights surfaces deprecations detected by your cloud provider (e.g., the EKS Insights API).

This gives you a single, consolidated view of API deprecations across multiple accounts, regions, and providers, which is important when managing large-scale environments.
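As a rough illustration of what deprecation detection involves, the sketch below scans parsed manifests against a small table of API removals. The table is an illustrative subset taken from the upstream deprecation guide, the function name is hypothetical, and Plural's own detection is certainly more thorough than this.

```python
# (apiVersion, kind) -> Kubernetes 1.x minor version in which the API
# stops being served. Illustrative subset only, not a complete list.
REMOVED_IN = {
    ("policy/v1beta1", "PodDisruptionBudget"): 25,
    ("batch/v1beta1", "CronJob"): 25,
    ("autoscaling/v2beta2", "HorizontalPodAutoscaler"): 26,
    ("flowcontrol.apiserver.k8s.io/v1beta3", "FlowSchema"): 32,
}


def find_removed_apis(manifests: list[dict], target_minor: int) -> list[str]:
    """Describe every manifest whose API is gone at the target minor version."""
    findings = []
    for manifest in manifests:
        api_version = manifest.get("apiVersion")
        kind = manifest.get("kind")
        removed = REMOVED_IN.get((api_version, kind))
        if removed is not None and removed <= target_minor:
            name = manifest.get("metadata", {}).get("name", "<unnamed>")
            findings.append(f"{kind}/{name} uses {api_version} (removed in 1.{removed})")
    return findings
```

Each manifest is a plain dictionary, for example one document loaded from a YAML file in your GitOps repository.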

Add-On Compatibility

Add-ons like the aws-ebs-csi-driver or aws-load-balancer-controller are critical for EKS clusters. The Plural agent scans installed add-ons, sends the data to the management cluster, and cross-references it against regularly updated compatibility tables.

When you select an add-on, Plural shows its compatibility table for your target version (e.g., v1.32 to v1.33) and whether it blocks the upgrade. A status like "Not Blocking" confirms it's safe to proceed.
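Conceptually, a compatibility check is a table lookup. The sketch below shows the idea with an invented add-on and invented version data; the table contents, the function name, and every status string other than "Not Blocking" are assumptions made for illustration, not values from Plural's compatibility database.

```python
# Hypothetical table: add-on -> installed version -> supported Kubernetes
# 1.x minor versions. Names and numbers are invented for illustration.
COMPATIBILITY = {
    "example-addon": {
        "2.0.0": {31, 32},
        "2.1.0": {32, 33},
    }
}


def addon_status(addon: str, installed: str, target_minor: int) -> str:
    """Classify an installed add-on version against a target Kubernetes minor."""
    table = COMPATIBILITY.get(addon)
    if table is None or installed not in table:
        return "Unknown"
    if target_minor in table[installed]:
        return "Not Blocking"
    # The installed version is too old or too new; see if any listed
    # version of the add-on supports the target.
    if any(target_minor in minors for minors in table.values()):
        return "Blocking: upgrade add-on first"
    return "Blocking: no compatible version listed"
```

With this table, `addon_status("example-addon", "2.0.0", 33)` reports that the add-on must be upgraded before the cluster.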

Warnings

The Upgrade Plan also surfaces warnings, such as deprecated custom resources that may cause issues during the upgrade. Plural’s AI-powered insights recommend fixes or link to documentation, helping you resolve blockers quickly and reduce risk.

Perform Cluster Upgrade

Once everything is prepared and you've resolved any upgrade flags or compatibility issues, you can trigger the upgrade. Plural makes this easy by integrating with its GitOps engine through Pull Request Automation (PRA). The Fleet Upgrader PRA is designed to automate upgrades across entire fleets of clusters, which is especially valuable in enterprise environments.

To perform the upgrade, log in to the Plural app and click Go to Console to access your management cluster console. From the left menu, open the Self-Service page. Switch to the PR Automations tab, search for "fleet-upgrader", and click Create PR.

Enter the name of the fleet where the clusters reside (e.g., demo) and the target Kubernetes version (e.g., v1.33), then click Review.

Provide the GitOps repository and branch where the cluster configuration lives.

Review the PR created by the fleet-upgrader PRA and merge it.

After you merge, Plural generates a new PR for each cluster in the fleet. Review the changes (which include the Kubernetes version upgrade) and merge them.

Finally, go to the Stacks page and approve the PR changes. Under the hood, Plural updates the Terraform infrastructure to apply the upgrade.

Drain and Upgrade Nodes (Canary Rollout)

Plural Upgrade Autopilot minimizes disruption with a safe node upgrade strategy. On EKS, it uses Amazon's blue-green deployment strategy, creating a new node pool backed by an Auto Scaling group. Upgrades begin with a canary rollout: one node is drained, upgraded, and tested before the process continues cluster-wide. Each node drain reschedules that node's pods and brings up an upgraded kubelet, helping you catch issues early.
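The canary pattern itself is simple to express. The sketch below is a generic illustration rather than Autopilot's implementation: `upgrade` and `is_healthy` are hypothetical callbacks you would supply, and the rollout halts at the first node that fails its health check.

```python
from typing import Callable


def canary_rollout(
    nodes: list[str],
    upgrade: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> list[str]:
    """Upgrade nodes one at a time, treating the first node as the canary.

    Returns the nodes that were upgraded. Raises RuntimeError and stops
    as soon as any node fails its health check.
    """
    upgraded = []
    for index, node in enumerate(nodes):
        upgrade(node)
        if not is_healthy(node):
            stage = "canary" if index == 0 else "rollout"
            raise RuntimeError(f"{stage} node {node} is unhealthy; halting upgrade")
        upgraded.append(node)
    return upgraded
```

If the canary fails, no other node has been touched, which is what keeps the blast radius small.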

A major challenge during node upgrades is workload rescheduling. Nodes are drained one at a time (kubectl drain), and their pods are restarted without regard for each workload's deployment strategy, which can cause downtime. Plural addresses this with the ClusterDrain CRD, which adds guardrails like controlled concurrency and label-based targeting to ensure graceful pod restarts.

```yaml
apiVersion: deployments.plural.sh/v1alpha1
kind: ClusterDrain
metadata:
  name: drain-{{ cluster.metadata.master_version }}
spec:
  flowControl:
    maxConcurrency: 10
  labelSelector:
    matchLabels:
      deployments.plural.sh/drainable: "true"
```

When resources opt into this process, Plural performs a graceful restart by temporarily annotating their podTemplate. You can check out the Cluster Drain docs for more examples.
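To build intuition for the two guardrails, here is a small model of label-based targeting and bounded concurrency. It only mirrors the idea behind `labelSelector` and `maxConcurrency`; the function names and the workload dictionaries are hypothetical, and how Plural's controller actually schedules restarts is not covered here.

```python
def select_drainable(workloads: list[dict], selector: dict[str, str]) -> list[str]:
    """Names of workloads whose labels contain every key/value in `selector`."""
    return [
        w["name"]
        for w in workloads
        if all(w.get("labels", {}).get(k) == v for k, v in selector.items())
    ]


def batches(names: list[str], max_concurrency: int) -> list[list[str]]:
    """Split names into consecutive groups of at most `max_concurrency`."""
    if max_concurrency < 1:
        raise ValueError("max_concurrency must be at least 1")
    return [
        names[i:i + max_concurrency]
        for i in range(0, len(names), max_concurrency)
    ]
```

Workloads without the opt-in label are left alone, and the opted-in ones are restarted in groups no larger than the configured limit.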

Monitor Cluster Health

As Autopilot runs the upgrade, the Plural Operational Console provides real-time visibility into your cluster. Key panels to monitor include:

Cluster Health Overview

The main dashboard provides an overview of cluster status and helps you spot issues during an upgrade. A healthy cluster is marked as "Healthy", while a "Degraded" status indicates reduced performance with workloads still online. An "Unavailable" status signals that a platform-level event is disrupting cluster health.

Nodes Panel

From the Kubernetes page in the Plural Console, you can open Cluster > Nodes to monitor worker nodes. During upgrades, nodes transition from Ready to NotReady or SchedulingDisabled while being drained, and then return to Ready. Watching CPU and memory usage confirms workloads are shifting off as expected.
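If you also want to track node state from the command line, one option is to parse the output of `kubectl get nodes -o json`. The sketch below assumes that JSON has already been loaded into a dictionary; the function name is hypothetical.

```python
def node_states(node_list: dict) -> dict[str, str]:
    """Map each node name to a status string like the kubectl STATUS column."""
    states = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        conditions = node.get("status", {}).get("conditions", [])
        # A node is Ready only if its Ready condition reports "True".
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        state = "Ready" if ready else "NotReady"
        # Cordoned or draining nodes have spec.unschedulable set to true.
        if node.get("spec", {}).get("unschedulable"):
            state += ",SchedulingDisabled"
        states[name] = state
    return states
```

Polling this during an upgrade shows each node passing through a scheduling-disabled state before returning to Ready.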

Workloads Panel

You can access the Workloads panel from the Kubernetes page. It tracks the health of applications and services, including Deployments, Pods, StatefulSets, DaemonSets, and Jobs. During a node drain, pods should migrate smoothly to other nodes and restart. A spike in Pending or Failed pods can indicate problems like limited cluster capacity.
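A similar check can be scripted for pods. This sketch counts pod phases from the parsed output of `kubectl get pods -A -o json` and flags Pending or Failed pods; the function names and the zero thresholds are assumptions you would tune for your own cluster.

```python
from collections import Counter


def pod_phase_counts(pod_list: dict) -> Counter:
    """Count pods by status.phase (Pending, Running, Succeeded, Failed, Unknown)."""
    return Counter(
        pod.get("status", {}).get("phase", "Unknown")
        for pod in pod_list.get("items", [])
    )


def needs_attention(counts: Counter, max_pending: int = 0, max_failed: int = 0) -> bool:
    """True if more pods are Pending or Failed than the thresholds allow."""
    return counts["Pending"] > max_pending or counts["Failed"] > max_failed
```

A few Pending pods are normal while a node drains, so a nonzero `max_pending` is usually more practical than the default.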

Events Panel

On the Kubernetes page, under Cluster > Events, the console displays a chronological feed of Kubernetes control plane events. This log makes it easy to trace what's happening before, during, and after an upgrade. Informational messages like Node upgraded successfully or Pod scheduled confirm normal progress, while warnings such as FailedScheduling or CrashLoopBackOff flag issues that need attention.
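The same triage can be approximated outside the console by filtering the parsed output of `kubectl get events -A -o json` for warnings. The function name below is hypothetical, and the sketch only formats fields that core Kubernetes events carry.

```python
def warning_events(event_list: dict) -> list[str]:
    """Format every event of type Warning as 'Reason Kind/name: message'."""
    lines = []
    for event in event_list.get("items", []):
        if event.get("type") != "Warning":
            continue  # skip Normal events such as Scheduled or Pulled
        obj = event.get("involvedObject", {})
        target = f"{obj.get('kind', '?')}/{obj.get('name', '?')}"
        lines.append(f"{event.get('reason', '?')} {target}: {event.get('message', '')}")
    return lines
```

An empty result during a drain is a good sign; repeated FailedScheduling entries usually point at capacity.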

Production Best Practices

Even with Upgrade Autopilot handling the heavy lifting, it's important to follow best practices to ensure smooth rollouts. 

  • Schedule upgrades during maintenance windows: Limit user impact by aligning upgrades with low-traffic periods.
  • Test in staging first: Validate the process in a staging cluster before applying changes to production. 
  • Use GitOps for configuration management: Keep upgrade changes consistent, traceable, and version-controlled.
  • Monitor for API deprecations: Stay ahead of breaking changes by monitoring deprecated APIs before releases to prevent unexpected failures.
  • Establish observability baselines: Measure performance before and after the upgrade to quickly spot regressions and confirm the cluster is stable.
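As one way to act on the baseline practice, the sketch below compares metrics captured before and after an upgrade. The metric names, the 10% tolerance, and the function name are all assumptions, and it treats higher values as worse, which suits latency and error rates but not throughput.

```python
def regressions(
    before: dict[str, float],
    after: dict[str, float],
    tolerance: float = 0.10,
) -> dict[str, float]:
    """Return metrics that grew by more than `tolerance` relative to baseline.

    Values are fractional changes (0.25 means 25% worse). Metrics missing
    from `after`, or with a zero baseline, are skipped.
    """
    flagged = {}
    for name, base in before.items():
        if name not in after or base == 0:
            continue
        change = (after[name] - base) / base
        if change > tolerance:
            flagged[name] = round(change, 4)
    return flagged
```

Capturing `before` shortly ahead of the maintenance window keeps the comparison fair, since traffic patterns drift over time.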

Conclusion

Upgrading Kubernetes doesn't have to mean downtime or weeks of manual effort. With Plural’s Upgrade Autopilot, you can perform a safe, zero-downtime upgrade.

Plural goes beyond upgrades. Its platform supports application installs, policy management, and multi-cluster control, giving you a unified way to manage Kubernetes at enterprise scale. Check out Plural's open-source repository to learn more.