Deciding to implement Kubernetes (Day 0) and then getting your first deployment up and running (Day 1) is hard enough. But then there’s everything that comes after, commonly known as Day 2 Kubernetes. Many organizations overlook this stage, which is fraught with challenges and problems.
Once the initial excitement wears off, Day 2 is the make-or-break moment when your team needs to figure out how to manage and maintain Kubernetes for the long term. Otherwise, as you add features to your app and grow the complexity of your deployment, costs can and will pile up in the form of expensive outages, integration headaches, and lost developer velocity.
I have spent the past year talking with dozens of best-in-class DevOps teams about the common operational challenges engineering teams face when wrangling Kubernetes, and how to overcome them.
Here is what I learned:
Why solving Day 2 Kubernetes is crucial
Day 2 Kubernetes covers DevOps processes—like monitoring, testing, runbooks, and alerting—that maintain the performance and reliability of your clusters. Often, these operations aren’t given careful thought in the initial push to deploy Kubernetes as quickly as possible. After all, there’s an extensive amount of terminology and concepts to learn in order to break into Kubernetes and just figure out the basics, like how to convert a Docker Compose file into a production K8s service.
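To make that last point concrete, here is a hedged sketch of what a single Compose service might become in Kubernetes: a Deployment (which keeps the replicas running) plus a Service (which gives them a stable address). The names, image, and ports are placeholders, not a prescription.

```yaml
# Hypothetical translation of a Compose service named "web" into K8s objects.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2                  # Compose "scale" becomes declarative replicas
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: myorg/web:1.0.0   # stand-in image name
        ports:
        - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web                   # routes traffic to pods with this label
  ports:
  - port: 80
    targetPort: 8080
```

Even this minimal pair introduces concepts (selectors, labels, replicas) that have no Compose equivalent, which is exactly where the learning curve starts.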
However, while figuring out your initial deployment, it’s important to also think ahead to Day 2 and beyond. As with any open-source technology, choosing to self-host Kubernetes rather than use a managed solution can provide huge cost savings and flexibility, but it also comes with risks.
If your Kubernetes clusters are not well managed, monitored, or understood, your engineers can end up spending a significant amount of time root-causing and fixing failures. Security breaches or governance issues could lead to PR or compliance disasters. You could run up cloud costs as a result of misconfigurations. And overall, morale can take a hit as engineers spend more time writing Helm charts than they spend working on product features.
What problems do organizations face with Day 2 Kubernetes?
The problems that engineering organizations encounter when managing K8s tend to break down into these four areas:
Learning curve & knowledge transfer
Whether you’re using Kubernetes for just your data stack or converting your entire monolithic system into distributed microservices, you want to avoid a situation where just one or two engineers are responsible for maintaining your solution. However, there’s a steep learning curve and an overwhelming amount of material out there about K8s.
Furthermore, not only do you have to master the core Kubernetes API, you also have to master the toolchains to manage K8s. With so many options out there for different tools (Helm or Kustomize? Terraform or Ansible?), your solution will often end up being very specialized, making it painful to onboard new engineers and risky to concentrate knowledge in just a few people in the org.
In most cases, especially if you use AWS, you won’t have a built-in dashboard for Kubernetes. To understand what all your resources are, you’ll need to use the command-line interface (kubectl)—and while some people are very comfortable with this, most aren’t, and benefit from a visual interface.
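For a sense of what that CLI-driven workflow looks like, here are a few of the kubectl commands a dashboard would otherwise surface. These require a live cluster (and `kubectl top` requires metrics-server to be installed), so treat them as illustrative:

```shell
kubectl get pods --all-namespaces        # list every pod in the cluster
kubectl describe deployment web          # inspect one workload's state and events
kubectl top nodes                        # node CPU/memory usage (needs metrics-server)
kubectl get events --sort-by=.metadata.creationTimestamp   # recent cluster events
```

Each of these is simple on its own; the friction comes from knowing which one to reach for when something breaks.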
Third-party app integrations
Often, the problems you’ll face with Day 2 Kubernetes aren’t technically Kubernetes problems. Rather, it’s the operational idiosyncrasies of how other applications interact with K8s that will give you headaches. For example, if you want to deploy Airflow on Kubernetes, you might not know how to scale the database underneath it or how to scale the workers, which metrics to visualize, or what CPU/memory tradeoffs to make.
This operational knowledge is unique to each application and has to be learned from scratch every time there’s a new open-source tool you want to use on Kubernetes. Any misconfigurations could result in a higher cloud bill than necessary.
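The CPU/memory tradeoff mentioned above usually comes down to the `resources` stanza on each container. Here is a sketch for a hypothetical Airflow worker; the numbers are illustrative only, and the right values depend on your actual workloads:

```yaml
# Pod-spec fragment, illustrative only. Start from observed usage, not guesses.
containers:
- name: airflow-worker
  image: apache/airflow:2.9.0
  resources:
    requests:            # what the scheduler reserves when placing the pod
      cpu: "500m"
      memory: 1Gi
    limits:              # hard ceiling; exceeding the memory limit gets the pod OOM-killed
      cpu: "2"
      memory: 2Gi
```

Setting requests too high wastes money on idle headroom; setting limits too low causes throttling and OOM kills. Tuning this per application is exactly the kind of learned-from-scratch knowledge described above.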
Monitoring, alerting, and disaster recovery
While you can get some logging built-in with K8s, in Day 2 it’s essential to set up your logs to connect to a central system (or set of tools) that you use for observability and alerting. Logging a dynamic, distributed system like Kubernetes is complicated. You’ll want to monitor multiple layers (e.g. Node and Cluster levels), each with its own lifecycle and different kinds of logs.
Along with logging, alerting and disaster recovery strategies are a must for Day 2 Kubernetes. Again, teams can run into problems here because of the distributed nature of the system. It may not always be clear who the owner is for each service, so the person on-call might have no idea what to do or even who to contact in the case of an outage.
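One common way to bake ownership into alerting is to attach team labels and runbook links directly to the alert rules. The sketch below assumes a Prometheus setup with kube-state-metrics; the team name and runbook URL are placeholders:

```yaml
# Prometheus alerting rule (assumes kube-state-metrics is scraping the cluster).
groups:
- name: day2-examples
  rules:
  - alert: PodCrashLooping
    expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
    for: 15m
    labels:
      severity: page
      team: payments       # hypothetical owning team, so on-call knows who to contact
    annotations:
      summary: "{{ $labels.namespace }}/{{ $labels.pod }} is restarting repeatedly"
      runbook_url: https://runbooks.example.com/pod-crash-loop   # placeholder URL
```

With this pattern, the pager message itself answers the two questions on-call engineers most often lack: whose service is this, and where is the fix documented.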
Security and governance
Kubernetes can be beneficial from a security perspective. If you have a consolidated networking layer using K8s, you don’t have to worry about exposing more data than you need to, and you can run an extra-secure layer on top of potentially less-secure third-party apps.
However, the way you store secrets and check for vulnerabilities will need to be adapted to work for Kubernetes, which can be especially challenging if you’re new to managing a distributed system. Furthermore, you’ll need to set up new access controls that follow your company’s best practices around governance and compliance.
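Kubernetes access controls are expressed as RBAC objects. A minimal sketch of a read-only role for one team, scoped to a single namespace, might look like this (the namespace and group names are assumptions; the group would typically map to your identity provider):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: app-readonly
  namespace: payments            # hypothetical namespace
rules:
- apiGroups: ["", "apps"]
  resources: ["pods", "pods/log", "deployments"]
  verbs: ["get", "list", "watch"]   # read-only: no create/update/delete
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-readonly-binding
  namespace: payments
subjects:
- kind: Group
  name: payments-devs            # hypothetical group from your identity provider
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: app-readonly
  apiGroup: rbac.authorization.k8s.io
```

The hard part on Day 2 is not writing one of these, but managing hundreds of them consistently across teams and clusters while keeping an audit trail.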
What a solution to Kubernetes Day 2 looks like
In my experience, a solution to Day 2 Kubernetes needs to have the following components at a minimum:
- Dashboarding: A visual interface for managing your resources, for people who don’t want to use the command line.
- Integration testing suite: When you push a new version of a package to production, you want some way to automatically deploy it to test clusters and run health checks to make sure that everything is working perfectly.
- Access controls: It should be easy to set up access controls for your cluster from a central location, and audit trails should be baked in.
- Observability and alerting: If anything goes wrong, you need to be able to root-cause the issue quickly and alert the right people.
- Runbooks for disaster recovery: When there’s an issue, you need runbooks so that anyone on-call can quickly implement a fix. Which leads to the final point…
- Automation: Too often, teams end up reinventing the wheel when managing Kubernetes. When you want to deploy anything on K8s, you should be able to quickly find all the dashboards you need, all the hooks for scaling, and interactive runbooks that make the process repeatable.
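The health checks mentioned in the integration-testing bullet above are typically expressed as liveness and readiness probes on each container, so that both your test clusters and production can verify a rollout the same way. The probe paths here are assumptions; use whatever endpoints your app actually exposes:

```yaml
# Pod-spec fragment: probes that automated health checks can rely on.
containers:
- name: web
  image: myorg/web:1.0.0       # stand-in image name
  livenessProbe:               # restart the container if this fails repeatedly
    httpGet:
      path: /healthz           # assumed endpoint
      port: 8080
    initialDelaySeconds: 10
    periodSeconds: 15
  readinessProbe:              # gate traffic until the app is actually ready
    httpGet:
      path: /ready             # assumed endpoint
      port: 8080
    periodSeconds: 5
```

Once every service declares probes like these, an automated test pipeline can deploy a new version and simply wait for readiness, rather than each team inventing its own health-check logic.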
Many companies try to string these components together from different fragmented DevOps solutions. However, to have a really effective solution, you need the whole suite to work together. When an alert fires, it should hook up to a runbook and point you to the fix. All your operations should be automated—and the knowledge around these operations should be accessible and available to everyone on the team, not just a few engineers.
To learn more about how Plural works and how we are helping engineering teams across the world deploy open-source applications in a cloud production environment, check out our GitHub to get started today.
Join us on our Discord channel for questions, discussions, and to meet the rest of the community.