Kubernetes Monitoring with Prometheus: A Complete Guide

Adopting Prometheus Kubernetes monitoring is a standard move for most engineering teams, but many only scratch the surface of its capabilities. Its real power lies in an architecture designed specifically for dynamic environments. The pull-based model and native service discovery allow Prometheus to automatically track pods, services, and nodes as they are created and destroyed, eliminating the manual configuration burden that plagues other systems.

This guide moves beyond a basic setup tutorial. We will explore the core components that make Prometheus so effective, show you how to write powerful PromQL queries to extract critical insights, and detail the strategies required to scale your monitoring infrastructure without compromising performance or reliability.

Key takeaways:

  • Adopt a Declarative Monitoring Stack: A robust Prometheus setup requires more than just the server. Use the Prometheus Operator to declaratively manage the full ecosystem—including exporters for data collection, Alertmanager for notifications, and Grafana for visualization—as a unified toolchain.
  • Use PromQL to Diagnose, Not Just Observe: Move beyond simple data collection by writing targeted PromQL queries to analyze resource consumption, application performance, and error rates. This allows you to proactively identify resource pressure and performance bottlenecks before they cause outages.
  • Automate Fleet-Wide Monitoring with GitOps: Managing Prometheus configurations across many clusters manually is error-prone and leads to drift. Use a platform like Plural to treat your monitoring stack as code, automating deployments and enforcing consistent rules with Global Services to eliminate inconsistencies and reduce operational load.

What is Prometheus and Why Use It for Kubernetes?

Prometheus is the leading open-source system for monitoring Kubernetes environments. Built for dynamic infrastructure, it collects and stores time-series metrics by scraping HTTP endpoints on your workloads, nodes, and control plane components.

In Kubernetes, where containers spin up and down constantly, Prometheus shines with:

  • A pull-based model that automatically discovers targets via Kubernetes service discovery
  • A multi-label data model for rich, contextual metrics (e.g., container="nginx", namespace="prod")
  • PromQL, a powerful query language for slicing and aggregating data on the fly

With Prometheus, you can track everything from pod restarts and memory usage to API server latency and node health, enabling proactive alerting, historical analysis, and real-time dashboards.

For platform teams running Kubernetes in production, Prometheus isn’t just nice to have—it’s the backbone of a reliable observability stack. It integrates seamlessly with tools like Grafana for visualization and Alertmanager for real-time incident response.

Whether you're debugging a performance spike or enforcing SLAs, Prometheus gives you the visibility needed to run Kubernetes with confidence.

Why Prometheus Excels in Kubernetes Environments

Prometheus is practically tailor-made for Kubernetes. Its native service discovery detects new pods, nodes, and services in real time—automatically scraping metrics as workloads scale or shift. No manual reconfiguration is needed.

Combined with its label-based data model, Prometheus allows developers and platform engineers to gain deep insights across their clusters:

  • View memory usage per pod in a namespace
  • Track CPU across node pools
  • Monitor specific containers by label
  • Alert on outliers or anomalies

You get fine-grained observability without coupling your metrics system to specific app logic or static infrastructure.

Debunking Common Prometheus Misconceptions

“Prometheus doesn’t do dashboards.”
That’s true—by design. Prometheus focuses on high-performance data collection, not visualization. For dashboards, it pairs perfectly with Grafana, which has become the default frontend for Prometheus metrics.

“Prometheus Operator is too complex.”
Running Prometheus Operator adds some overhead, but it dramatically simplifies production deployments. It automates everything from scraping config to Alertmanager setup. Tools like Plural make this even easier by packaging Prometheus and its ecosystem (Operator, Alertmanager, Grafana) as part of a managed GitOps-friendly stack.

How to Set Up Prometheus in Your Kubernetes Cluster

Setting up Prometheus is a foundational step for building observability into your Kubernetes environment. While many teams eventually automate this with operators or GitOps workflows, understanding the manual setup gives you visibility into how each component works—and how to scale it responsibly across clusters.

This guide breaks the process down into three phases: manual setup, Kubernetes-native configuration, and scaling via GitOps with Plural.

Installation Prerequisites and Steps

Before deploying Prometheus, make sure you have:

  • A running Kubernetes cluster
  • kubectl configured and authenticated
  • Sufficient permissions to create cluster-scoped resources

Step-by-step:

Create a Namespace
Use a dedicated namespace to isolate Prometheus and related monitoring tools:

kubectl create namespace monitoring

Grant Permissions
Prometheus needs cluster-wide access to discover targets and scrape metrics. Create a ClusterRole and ClusterRoleBinding that give it read-only access to Kubernetes resources.
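
A minimal read-only RBAC setup might look like the following sketch (it assumes a prometheus ServiceAccount in the monitoring namespace; tighten the rules to match what you actually scrape):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  - apiGroups: [""]
    resources: ["nodes", "nodes/metrics", "services", "endpoints", "pods"]
    verbs: ["get", "list", "watch"]
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: monitoring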

Create a ConfigMap for prometheus.yml
The prometheus.yml file defines scrape targets and discovery behavior:

kubectl create configmap prometheus-config \
  --from-file=prometheus.yml \
  -n monitoring
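
If you don't have a prometheus.yml yet, a minimal starting point is a global scrape interval plus a self-scrape job (expand it with the service discovery config shown later in this guide):

global:
  scrape_interval: 30s
  evaluation_interval: 30s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']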

Deploy Prometheus
Define a Deployment to run the Prometheus server pod, mounting the ConfigMap and setting up storage (e.g., with emptyDir or a PersistentVolumeClaim).
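
A minimal Deployment sketch, assuming the ServiceAccount and ConfigMap created above (swap the emptyDir volume for a PersistentVolumeClaim in production):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitoring
  labels:
    app: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus
      containers:
        - name: prometheus
          image: prom/prometheus:v2.53.0   # pin to the release you want
          args:
            - --config.file=/etc/prometheus/prometheus.yml
            - --storage.tsdb.path=/prometheus
          ports:
            - containerPort: 9090
          volumeMounts:
            - name: config
              mountPath: /etc/prometheus
            - name: data
              mountPath: /prometheus
      volumes:
        - name: config
          configMap:
            name: prometheus-config
        - name: data
          emptyDir: {}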

Expose the Service
Create a ClusterIP or NodePort service for internal access:

apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitoring
spec:
  ports:
    - port: 9090
      targetPort: 9090
  selector:
    app: prometheus

Access the UI
Use kubectl port-forward to view Prometheus in your browser:

kubectl port-forward svc/prometheus 9090:9090 -n monitoring

Configure Kubernetes Service Discovery

One of Prometheus's core strengths is dynamic Kubernetes service discovery. Instead of manually listing targets, Prometheus can discover services, endpoints, and pods in real time.

In prometheus.yml, you configure scrape_configs like this:

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__

This tells Prometheus to scrape all pods with the prometheus.io/scrape: "true" annotation, dynamically adjusting as pods are added or removed.
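
For example, adding these annotations to a workload's pod template is enough for the job above to pick it up (the path and port shown here are assumptions; use whatever your application actually exposes):

# In the workload's pod template, e.g. a Deployment's spec.template:
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/path: "/metrics"   # optional; defaults to /metrics when omitted
    prometheus.io/port: "8080"       # the container port serving metrics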

Integrate with Plural's Deployment Pipelines

While manual setup is useful for local clusters and learning, it doesn't scale across teams or environments. Plural solves this by enabling you to manage Prometheus as infrastructure-as-code with GitOps.

With Plural’s Continuous Deployment engine:

  • Define Prometheus (or the full Prometheus Operator) as a deployable resource
  • Store prometheus.yaml, scrape configs, and alert rules in Git
  • Use Plural’s Global Services to enforce consistent configs and alerting logic across multiple clusters
  • Automatically deploy dashboards, alerts, and exporters via reusable Helm charts or Plural marketplace apps

This approach turns Prometheus setup into a repeatable, versioned process, eliminating config drift and making it easy to scale observability alongside your platform.

Core Components of the Prometheus Stack

To effectively monitor Kubernetes, you need more than just a Prometheus server—you need the full observability ecosystem around it. The real power of Prometheus comes from its integration with exporters, Alertmanager, and optional tooling like the Prometheus Operator. Together, these components form a scalable, flexible, and production-grade monitoring stack.

Prometheus Server, Exporters, and Alertmanager

At the core of the stack is the Prometheus server, which scrapes and stores metrics in a high-efficiency time-series database. It follows a pull-based model, making regular HTTP requests to known endpoints to retrieve metrics.

To expose metrics from your infrastructure, you use exporters—lightweight agents that expose Prometheus-formatted metrics from otherwise opaque systems. Common examples include:

  • node_exporter for host-level CPU, memory, disk, and network metrics
  • kube-state-metrics for the state of Kubernetes objects (deployments, pods, nodes)
  • cAdvisor (built into the kubelet) for per-container resource usage

Once metrics are flowing, you need to act on them. That’s the job of Alertmanager. Prometheus evaluates alerting rules and sends alerts to Alertmanager, which handles deduplication, grouping, and routing to destinations like Slack, PagerDuty, Opsgenie, or email. This setup helps prevent alert fatigue while ensuring critical events are never missed.

Plural makes this easy by offering a pre-configured Prometheus + Alertmanager stack through its marketplace, ensuring all components are deployed and integrated correctly.

Simplify Management with the Prometheus Operator

Managing Prometheus manually in a fast-changing Kubernetes environment is error-prone. That’s why most teams adopt the Prometheus Operator, which extends the Kubernetes API with purpose-built Custom Resource Definitions (CRDs).

With the Operator, you can define your entire monitoring stack declaratively using these objects:

  • ServiceMonitor: tells Prometheus to scrape metrics from a Service
  • PodMonitor: defines scrape configs at the pod level
  • PrometheusRule: encapsulates alerting and recording rules
  • Prometheus: manages the lifecycle and config of the Prometheus instance itself
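
For example, a minimal ServiceMonitor that scrapes every Service labeled app: my-api on its metrics port might look like this (the labels, namespace, and port name are assumptions to adapt):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-api
  namespace: monitoring
  labels:
    release: prometheus          # must match your Prometheus CR's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: my-api
  namespaceSelector:
    matchNames:
      - prod
  endpoints:
    - port: metrics              # the named port in the Service spec
      interval: 30s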

This declarative model makes Prometheus fully GitOps-compatible. Instead of editing configuration files, you write Kubernetes YAML. With Plural’s Continuous Deployment engine, you can manage these CRDs as code and deploy consistent monitoring across clusters with one pipeline.

Automate Configuration with Kubernetes Service Discovery

In Kubernetes, resources like pods and services are created, updated, and destroyed frequently. Static scrape configs become unmanageable in this dynamic environment. Prometheus solves this with built-in Kubernetes service discovery.

By querying the Kubernetes API directly, Prometheus can dynamically find and monitor:

  • All pods with specific annotations
  • Nodes and their system metrics
  • API servers, kubelets, and control plane components
  • Services that expose Prometheus endpoints

For example, any pod with this annotation:

prometheus.io/scrape: "true"

can be automatically detected and scraped—no manual config updates needed.

Plural enhances this further with Global Services. You define a base configuration once, and it’s applied fleet-wide, ensuring uniform discovery behavior and eliminating config drift across clusters.

Monitor Key Kubernetes Metrics with PromQL

Once Prometheus is scraping metrics from your cluster, the next step is to derive insights from that data, and that’s where PromQL comes in. PromQL (Prometheus Query Language) is a purpose-built language for querying time-series data. It allows you to filter, aggregate, and transform metrics to uncover performance issues, diagnose service disruptions, and monitor key business indicators.

By mastering a few core PromQL techniques, you can shift from simply collecting metrics to actively driving uptime, efficiency, and performance across your Kubernetes environment.

Essential Node, Pod, and Container Metrics

The health of your cluster starts with its infrastructure. At a minimum, you should continuously monitor the following components:

  • Nodes: CPU, memory, disk I/O, and network throughput
  • Pods: lifecycle phase, restarts, and resource usage
  • Containers: memory limits, CPU throttling, and OOM kills

With kube-state-metrics and node_exporter, Prometheus exposes metrics like:

kube_pod_status_phase{phase!="Running"} > 0

This query shows all pods that are not in a healthy running state—often the first sign of misconfiguration or resource pressure.

Other useful queries include:

  • node_memory_Active_bytes / node_memory_MemTotal_bytes: tracks the node memory usage ratio
  • sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod): shows CPU usage by pod over time
  • kube_pod_container_status_restarts_total > 0: flags containers with non-zero restart counts

Track Application-Specific Metrics

While infrastructure metrics show where a problem might exist, application-specific metrics tell you what’s going wrong. For production-grade observability, you need to instrument your applications with a Prometheus client library (Go, Python, Java, etc.).

With instrumentation, you can monitor:

  • Business KPIs: transactions, user sessions, orders placed
  • Service-level performance: request latencies, error rates, cache hit ratios

Follow the “Four Golden Signals” framework:

  1. Latency: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
  2. Traffic: sum(rate(http_requests_total[1m]))
  3. Errors: rate(http_requests_total{status=~"5.."}[5m])
  4. Saturation: compare rate(container_cpu_usage_seconds_total[5m]) against the container's CPU limit

These metrics provide a high-fidelity view of how your services perform under real-world load—and how that performance impacts users.

Write Powerful Queries with PromQL

PromQL is more than a query language—it’s a full analytical engine for time-series data. You can use it to:

  • Calculate moving averages and growth rates
  • Compare current performance to historical baselines
  • Alert on patterns like sudden drops, slow creep, or burst errors

Example: To track the 5-minute error rate for a service:

rate(http_requests_total{job="my-api", status=~"5.."}[5m])

To detect sustained memory usage above 90%:

(sum(container_memory_working_set_bytes{container!="",pod!=""}) by (pod)) /
(sum(kube_pod_container_resource_limits{resource="memory"}) by (pod)) > 0.9

These queries can be visualized using Grafana or directly within a dashboard tool like Plural.

Unified Insights with Plural

While PromQL gives you the raw power, interpreting results—especially across multiple clusters—can be time-consuming. That’s where Plural’s embedded observability dashboard adds value:

  • View Prometheus metrics alongside logs and events in one pane
  • Use AI-assisted root cause analysis to automatically surface the source of issues
  • Standardize dashboards across environments using Global Services

Whether you’re dealing with a spike in latency or a degraded node, PromQL gives you the answers, and Plural turns those answers into action.

Visualize Data and Configure Alerts

Monitoring is only useful when it drives action. Once Prometheus is collecting metrics, the next step is making that data visible and actionable. This involves two critical components:

  • Dashboards, for visual understanding of trends and anomalies
  • Alerts, for real-time notification when something goes wrong

Together, these tools turn raw metrics into operational intelligence. But traditional workflows—jumping between Grafana panels, alert messages, and logs—can slow down response times. Modern observability platforms solve this by integrating data, context, and automation in one place. Plural’s AI Insight Engine, for example, uses Prometheus data as a foundation and adds intelligent root cause analysis on top, transforming detection into fast diagnosis.

Set Up Kubernetes Alerts with Alertmanager

Alertmanager is Prometheus’s alerting companion. When Prometheus detects a threshold breach—like high latency or an offline node—it fires an alert. Alertmanager receives these alerts, groups them intelligently, suppresses duplicates, and routes them to the right communication channels.

How to set it up:

Define alert rules in Prometheus:

groups:
  - name: pod-alerts
    rules:
      - alert: PodCrashLooping
        expr: increase(kube_pod_container_status_restarts_total[10m]) > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod is restarting frequently"

Configure Alertmanager with your desired receivers:

route:
  group_by: ['namespace']
  receiver: 'slack-alerts'

receivers:
  - name: 'slack-alerts'
    slack_configs:
      - api_url: '<your-slack-webhook-url>'   # required unless global slack_api_url is set
        channel: '#alerts'
        send_resolved: true

Integrate with Slack, PagerDuty, email, or any webhook.

With proper grouping and routing, your team gets fewer, more meaningful notifications, reducing noise and response fatigue.

Build Effective Grafana Dashboards

Grafana is the de facto visualization layer for Prometheus. It turns queries into rich, interactive dashboards that help you:

  • Monitor resource usage over time
  • Spot patterns in application behavior
  • Correlate metrics across services

Start by importing community dashboards for Kubernetes, nodes, and container runtimes. Then, customize your panels with PromQL queries tailored to your workloads.

For example:

  • Pod restarts over time: rate(kube_pod_container_status_restarts_total[5m])
  • 95th percentile request latency: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

With Plural, you don’t have to manage Grafana separately. The platform provides an embedded Kubernetes dashboard out of the box—SSO-enabled, RBAC-compliant, and integrated with Prometheus—giving you immediate visibility without additional tooling.

Accelerate Troubleshooting with Plural’s AI Insight Engine

Traditional alerting tells you that something is broken—but not why. Most teams still spend valuable time jumping between logs, dashboards, manifests, and incidents to track down root causes.

Plural’s AI Insight Engine automates this step.

It ingests:

  • Prometheus metrics
  • Kubernetes event streams
  • GitOps deployment data (like Terraform or Helm changes)

It then constructs a causal graph of your system. When an alert fires, the engine analyzes relationships across infrastructure and application layers to pinpoint the root cause—whether it's a misconfigured Deployment, a resource overrun, or a failed rollout.

This drastically reduces MTTR (mean time to resolution) by providing contextual, actionable insights, not just symptom alerts.

Example: A spike in latency is traced to a recent config change in your Helm chart that reduced CPU limits—without you needing to dig manually.

Scale and Optimize Prometheus for Large Clusters

As your Kubernetes footprint grows, the limitations of a single Prometheus instance become increasingly apparent. You’ll start to see slower queries, shorter data retention windows, and inconsistent configurations across clusters. These issues signal the need for a more scalable monitoring architecture—one that can support multi-cluster environments, handle massive metric volumes, and maintain consistency across your fleet.

To meet these demands, advanced strategies like federation, remote storage, and GitOps-based configuration management are essential. This section covers each in detail, so you can extend Prometheus from a simple local setup into a robust observability platform for enterprise-scale Kubernetes deployments.

Use Federation for Multi-Cluster Monitoring

If you're running Prometheus in multiple clusters, aggregating all raw metrics into a central instance is both costly and inefficient. Instead, use Prometheus federation to build a hierarchical monitoring architecture.

Here’s how it works:

  • Each cluster runs its own local Prometheus instance, scraping high-cardinality data from pods, nodes, and services.
  • A global Prometheus server periodically scrapes pre-aggregated metrics from each local instance via /federate.
  • This gives you a centralized, high-level view of system health across environments (e.g., dev, staging, prod) without overwhelming the global server.

Tip: Limit the federated metrics to coarse-grained KPIs (e.g., job:request_latency:avg) by using recording rules on the local Prometheus servers. This reduces transfer volume and query complexity.

Federation is ideal when you need aggregate monitoring and alerting across clusters, but want to keep detailed, high-volume scraping close to the data source.
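
A sketch of the federation job on the global Prometheus server (the match[] selector and the local Prometheus addresses are assumptions to adapt):

scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"job:.*"}'   # pull only pre-aggregated recording rules
    static_configs:
      - targets:
          - 'prometheus.cluster-a.example.com:9090'
          - 'prometheus.cluster-b.example.com:9090'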


Optimize with Remote Storage Solutions

While Prometheus’s TSDB is excellent for short-term metric storage and fast local queries, it’s not built for:

  • Long-term retention (weeks/months of data)
  • High durability
  • Horizontal scalability

For large-scale deployments, offload your metrics using Prometheus’s remote_write feature. This allows Prometheus to continuously push metrics to remote storage backends like:

  • Thanos: Adds global querying, downsampling, and object storage support (e.g., S3, GCS)
  • Cortex: Offers multi-tenancy, horizontal scalability, and HA
  • VictoriaMetrics: A fast, cost-efficient time-series database for large volumes of Prometheus data

These backends give you:

  • Long-term durability
  • Global querying across clusters
  • Lower total storage costs via compression and deduplication

This decoupling of storage from scraping lets you scale each independently, enabling high-throughput ingestion and deep historical analysis without impacting Prometheus performance.
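
A minimal remote_write stanza in prometheus.yml might look like this (the URL depends on the backend you run and is an assumption here; tune the queue settings to your ingest volume):

remote_write:
  - url: "http://thanos-receive.monitoring.svc:19291/api/v1/receive"
    queue_config:
      max_samples_per_send: 5000
      capacity: 20000
    write_relabel_configs:
      - source_labels: [__name__]
        regex: 'go_gc_duration_seconds.*'   # example: drop noisy series before shipping them
        action: drop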

Ensure Uniform Monitoring with Plural’s Global Services

Managing Prometheus configurations—like scrape jobs, ServiceMonitors, and alert rules—across dozens of clusters is a recipe for drift and maintenance overhead. One typo or version mismatch can lead to missed alerts or blind spots in monitoring.

Plural’s Global Services solves this by turning configuration into a declarative, fleet-wide service.

How it works:

  1. You define your Prometheus stack—including the Operator, Alertmanager, and all CRDs like PrometheusRule and PodMonitor—in a Git repo.
  2. Plural uses a GitOps pipeline to apply this configuration across all your clusters.
  3. Any update to your config is automatically and uniformly rolled out to your entire fleet.

This approach ensures:

  • Consistency across environments
  • Zero manual drift
  • Simplified updates with version control

Roll out a new alert rule or dashboard update? Just commit to Git—Plural propagates it everywhere.

Global Services are essential for platform teams managing multiple clusters at scale, offering a centralized source of truth for monitoring configurations.

Troubleshoot Common Prometheus Issues

Even in a well-architected Kubernetes observability stack, Prometheus can occasionally run into trouble. Whether you're dealing with misconfigured scrape jobs, performance bottlenecks, or alert noise, effective troubleshooting is critical to maintaining visibility into your systems. Here’s how to solve the most common problems—and how Plural makes it easier to manage them at scale.

Secure Your Prometheus Setup

A default Prometheus install is not secure enough for production. Out of the box, it lacks authentication, persistent storage, and redundancy—all essential in multi-cluster environments.

To secure and stabilize your deployment:

  • Use the Prometheus Operator to automate lifecycle management, apply configuration as code, and manage CRDs like ServiceMonitor and PrometheusRule.
  • Enable persistent storage with PVCs to avoid losing metrics after pod restarts or node failures.
  • Integrate Thanos to add high availability, object storage integration, and long-term metric retention.
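
With the Prometheus Operator, persistent storage, retention, and replicas are set directly on the Prometheus custom resource; a minimal sketch (the storage class and sizes are assumptions):

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: monitoring
spec:
  replicas: 2                      # two replicas for basic high availability
  retention: 15d
  serviceAccountName: prometheus
  serviceMonitorSelector: {}       # select all ServiceMonitors
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: gp3      # use your cluster's storage class
        resources:
          requests:
            storage: 100Gi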

With Plural Global Services, you can define these production best practices in a single manifest and automatically apply them across your entire Kubernetes fleet—ensuring consistency, security, and resilience.

Tune Performance and Allocate Resources

Prometheus is resource-intensive, especially in large clusters. Without proper limits and tuning, it can overconsume CPU and memory, leading to eviction or degraded cluster performance.

Key performance tuning tips:

  • Set realistic CPU/memory requests and limits on your Prometheus and Alertmanager pods.
  • Use --storage.tsdb.retention.time and --storage.tsdb.retention.size flags to control data retention and reduce disk usage.
  • Monitor query latency, series cardinality, and scrape duration using internal Prometheus metrics like prometheus_engine_query_duration_seconds.
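
As a rough sketch, the retention flags and resource settings live on the Prometheus container spec (the numbers are placeholders; size them for your own metric volume):

# In the Prometheus Deployment/StatefulSet pod spec:
containers:
  - name: prometheus
    image: prom/prometheus:v2.53.0
    args:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.retention.time=15d
      - --storage.tsdb.retention.size=50GB
    resources:
      requests:
        cpu: "1"
        memory: 4Gi
      limits:
        memory: 8Gi

If you run the Prometheus Operator instead, the equivalent knobs are the retention, retentionSize, and resources fields on the Prometheus custom resource.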

With Plural’s unified Kubernetes dashboard, you can monitor Prometheus resource consumption alongside your other workloads, making it easy to spot bottlenecks and fine-tune limits without hopping between tools.

Resolve Common Configuration Errors

One of the most frequent sources of Prometheus issues is misconfigured YAML—especially in scrape jobs and relabeling rules. A small typo can silently prevent metric ingestion for critical services.

To prevent this:

  • Use the promtool utility to validate Prometheus configs before applying them.
  • Manage all configuration through GitOps, using Plural Stacks to version, peer-review, and roll out changes consistently.

Plural automatically applies validated changes across clusters and reloads Prometheus at runtime—no restarts or downtime required.

Apply Fixes Instantly with Plural AI

Identifying an alert is just the beginning. The real challenge is figuring out what caused it—and fixing it quickly.

Plural’s AI Insight Engine automates root cause analysis by building a causal evidence graph that correlates:

  • Prometheus alerts
  • Kubernetes events and logs
  • GitOps history and code diffs

For example, if a deployment enters a CrashLoopBackOff state, Plural can trace the root cause to a bad config change merged two hours ago. It then recommends the exact change needed to resolve the issue, turning hours of investigation into a one-click fix.

This shifts your monitoring from reactive alerting to proactive resolution, dramatically reducing mean time to recovery (MTTR).

Apply Advanced Prometheus Techniques

As your monitoring needs grow beyond out-of-the-box solutions, Prometheus provides the flexibility to adapt. From writing custom exporters to integrating with the broader CNCF observability ecosystem, mastering advanced Prometheus techniques lets you monitor what matters most to your business. Combined with a GitOps workflow, these capabilities ensure both flexibility and consistency across environments.

Create Custom Exporters and ServiceMonitors

Most core system metrics are already exposed through standard exporters like node_exporter or kube-state-metrics. But for application-specific insights—like transaction rates, user signup counts, or job queue depths—you'll need to expose custom metrics.

You can do this by:

  • Using a Prometheus client library (available in Go, Python, Java, etc.) to emit metrics from your application.
  • Serving them over an HTTP /metrics endpoint using the text/plain; version=0.0.4 format.
  • Deploying the application along with a ServiceMonitor CRD to dynamically register it with Prometheus.

A ServiceMonitor acts as a Kubernetes-native replacement for static scrape_configs. It uses label selectors to find matching Services, making metrics collection decentralized and scalable. This lets individual development teams manage their own monitoring configs—shipped right alongside the application code—without needing to touch the central Prometheus configuration.

Integrate with Other CNCF Tools

Prometheus shines as part of a composable observability stack. Some essential integrations include:

  • Grafana: The de facto tool for visualizing Prometheus metrics. Query with PromQL to build interactive, real-time dashboards.
  • Alertmanager: Handles deduplication, grouping, and routing of Prometheus alerts to tools like Slack, Opsgenie, or PagerDuty.
  • Prometheus Operator: Introduces CRDs like Prometheus, Alertmanager, ServiceMonitor, and PrometheusRule, enabling a fully declarative monitoring setup.

These tools help you move from a static, centralized model to a Kubernetes-native, GitOps-friendly architecture where observability components scale alongside your workloads. With Plural, these components are pre-integrated and lifecycle-managed, giving you a production-ready monitoring stack from day one.

Manage Prometheus as Code with Plural Stacks

Treating your observability setup as code is essential for scaling across environments. Plural Stacks provides a robust framework for defining, versioning, and deploying your entire monitoring stack declaratively.

With Plural Stacks, you can:

  • Define your full observability stack (Prometheus Operator, Grafana, Alertmanager, etc.) as code.
  • Version ServiceMonitor, PrometheusRule, and dashboard definitions in Git.
  • Use GitOps workflows to apply changes automatically and consistently across clusters.
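
For instance, the crash-loop alert from earlier can be versioned as a PrometheusRule resource that the Operator picks up automatically (the release label is an assumption; match it to your Prometheus CR's ruleSelector):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-alerts
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
    - name: pod-alerts
      rules:
        - alert: PodCrashLooping
          expr: increase(kube_pod_container_status_restarts_total[10m]) > 3
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod is restarting frequently"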

When you push a change—say, a new alert or dashboard—it’s deployed across all clusters by Plural’s Continuous Deployment engine. This eliminates manual configuration, prevents drift, and ensures reliable, reproducible monitoring setups everywhere. Whether you're managing 3 clusters or 300, Plural ensures every environment has the same observability baseline, enforced through code and backed by CI/CD.

Frequently Asked Questions

What's the difference between Prometheus and Grafana? They seem to always be used together. Think of them as two specialists that excel at different tasks. Prometheus is the engine for collecting and storing time-series data. Its main job is to scrape metrics from your services and provide a powerful query language, PromQL, to analyze that data. Grafana, on the other hand, is a visualization tool. It connects to data sources like Prometheus and allows you to build rich, interactive dashboards to display the metrics in a human-readable format. While Prometheus has a basic UI, Grafana is where you create the detailed graphs and charts your team will use daily. Plural simplifies this by providing an embedded Kubernetes dashboard that offers powerful visualization capabilities without needing to manage a separate Grafana instance.

Why is the Prometheus Operator recommended? Can't I just deploy Prometheus myself? You certainly can deploy Prometheus manually, but the Prometheus Operator automates the complex and repetitive tasks involved in managing it within a Kubernetes environment. Instead of manually editing configuration files, the Operator lets you use Kubernetes-native resources like ServiceMonitor to declaratively manage scrape targets, alerting rules, and even Prometheus server configurations. This approach is far more scalable and less error-prone, especially as your cluster changes. It integrates perfectly with GitOps workflows, which is why Plural uses it as a foundation for managing monitoring stacks consistently across your entire fleet.

How can I monitor a custom application that doesn't natively support Prometheus? If your application doesn't already expose metrics in the Prometheus format, you'll need to use an exporter. An exporter is a small, specialized service that acts as a translator. It collects metrics from your application or a third-party system and converts them into the format that Prometheus can understand and scrape. For custom applications, you can either find a pre-built exporter for the language or framework you're using or write a simple one yourself using a Prometheus client library. This allows you to track anything from application-specific performance indicators to business-level KPIs.

My team is already using Prometheus. How does Plural actually improve our setup? Plural enhances your existing Prometheus setup by addressing the operational challenges that arise when you manage it at scale. First, our Global Services feature ensures every cluster in your fleet has a consistent, standardized monitoring configuration, which eliminates drift and manual effort. Second, Plural Stacks allows you to manage your entire monitoring infrastructure as code, integrating it into a secure GitOps workflow. Most importantly, when an alert does fire, Plural's AI Insight Engine goes beyond just showing you a metric on a dashboard. It performs automatic root cause analysis by correlating Prometheus data with logs, cluster events, and code changes to pinpoint the exact source of the problem and even suggest a fix.

We're worried about Prometheus becoming a bottleneck as we scale. How do we handle long-term storage and performance? This is a common concern in large environments. The standard solution is to configure Prometheus to send its metrics to a dedicated remote storage system using its remote_write capability. Tools like Thanos or VictoriaMetrics are designed specifically for long-term, durable metric storage and offer features like data deduplication and a global query view across all your clusters. This architecture decouples storage from the Prometheus server, allowing you to scale each component independently. Plural helps you manage this entire stack as a cohesive unit, making it straightforward to deploy and configure these advanced, scalable monitoring architectures.