GitOps Setup of Cilium Multi-Cluster with Plural
One of the major difficulties for multi-cluster Kubernetes environments is finding the appropriate network topology. There are two main approaches:
- Isolated networks: Each cluster maintains its own isolated network and service discovery, exposing services to each other via either ingress or gateway APIs. This is the simplest solution and is robust.
- Service mesh bridge: Bridge all clusters in a common service mesh, giving full network availability from the ingress down to the pod level. While more complex to manage, this approach is advantageous if you need direct pod or service-level communication between clusters. It's a common approach with distributed databases and clustered solutions like WebSocket-based distribution networks.
If you're considering the multi-cluster network route, this guide will help you get started quickly.
Why Cilium Multi-Cluster and GitOps Don't Play Well Together
It's important to understand why Cilium isn't necessarily the best bedfellow of a GitOps process.
Cilium's installation process creates friction with GitOps workflows. Their documentation centers around CLI-driven installs:
```sh
cilium install --set cluster.name=$CLUSTER1 --set cluster.id=1 --context $CLUSTER1
cilium install --set cluster.name=$CLUSTER2 --set cluster.id=2 --context $CLUSTER2
```
The typical setup process looks like this:
- Create clusters with Terraform
- Get kubeconfig access to each cluster
- Install Cilium via CLI, joining clusters to the mesh
- Perform all future Cilium upgrades through CLI
Steps 2 and onward often require manual intervention or scripting in CI systems like GitHub Actions. This isn't true GitOps, though; it's imperative and prone to drift.
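For context, the imperative pattern often ends up looking something like this hypothetical GitHub Actions job (cluster names, IDs, and auth setup are placeholders, not part of this setup):

```yaml
# Hypothetical CI job illustrating the imperative approach this guide moves away from.
# Cluster names, IDs, and credentials are placeholders.
name: join-cilium-mesh
on: workflow_dispatch
jobs:
  join-mesh:
    runs-on: ubuntu-latest
    steps:
      - name: Install Cilium and join the mesh
        run: |
          aws eks update-kubeconfig --name "$CLUSTER_NAME" --region "$AWS_REGION"
          cilium install --set cluster.name="$CLUSTER_NAME" --set cluster.id="$CLUSTER_ID"
          cilium clustermesh enable
```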
There is technically a way around this; you can reverse-engineer the Cilium CLI commands via Helm (since the CLI ultimately uses Helm for Kubernetes installs), and with a bit of investigation the values file patterns can be reconstructed (a rough sketch follows the list below). That said, you'll still face the following manual steps:
- Ensuring Cilium installs on each cluster
- Defining DNS entries for each exposed Cilium gateway so they can discover each other
- Modifying the Helm values file (likely in Git) on each new cluster; Cilium sometimes fails to discover peers whose gateways don't exist in time and won't retry, though this should ultimately be fixed upstream.
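For reference, here's a minimal sketch of the Helm values those CLI commands roughly translate to, using standard Cilium chart fields (verify the exact names against your chart version):

```yaml
# Rough Helm-values equivalent of the CLI install shown earlier (a sketch, not lifted from the Cilium docs)
cluster:
  name: cluster-1      # must be unique across the mesh
  id: 1                # integer between 1 and 255, unique per cluster
clustermesh:
  useAPIServer: true   # roughly what `cilium clustermesh enable` toggles on
```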
A GitOps Solution with Plural
Fortunately, you can set all this up through Plural with this approach:
- Terraform stacks define clusters, determining cluster ID and gateway DNS names, then spawn PR automation to register clusters in the mesh post creation.
- Root CA creation and registration as a secret distributed to all clusters via global services (required by Cilium for mesh authentication).
- Global services sync Cilium into each registered cluster.
- PR automation triggers to declare a new Cilium mesh cluster.
This creates a clean GitOps setup operated entirely through declarative PRs. Once onboarded with PR automations, operators only need to approve PRs—no manual code changes required.
Here’s how it all works:
The Setup
This setup assumes you've configured Plural using the base GitOps setup created by the `plural up` command. While other configurations are possible, this is the simplest approach.
Start with the PR automation resource to create a new Cilium clustermesh-enabled cluster, which is defined in `bootstrap/pr-automations/cilium-cluster-creator.yaml`:
```yaml
apiVersion: deployments.plural.sh/v1alpha1
kind: PrAutomation
metadata:
  name: cilium-cluster-creator
spec:
  name: cilium-cluster-creator
  icon: https://plural-assets.s3.us-east-2.amazonaws.com/uploads/repos/d1a82b07-b809-4eb4-b615-8f24365b72b8/k8s.png?v=63861145828
  identifier: mgmt
  documentation: |
    Sets up a PR to add a new cluster with prerequisites for Cilium cluster-mesh to the provided fleet
  creates:
    templates:
    - source: 'templates/cilium-cluster.yaml'
      destination: "services/{{ context.fleet }}/clusters/{{ context.tier }}/{{ context.name }}.yaml.liquid"
      external: true
  catalogRef:
    name: infra
  scmConnectionRef:
    name: plural # you'll need to add this ScmConnection manually before this is functional
  title: "Setting up {{ context.name }} cluster in fleet {{ context.fleet }}"
  message: |
    Setting up {{ context.name }} cluster in fleet {{ context.fleet }}
    Plural Service: mgmt/{{ context.fleet }}-{{ context.tier }}
  configuration:
  - name: fleet
    type: STRING
    documentation: Name for the fleet you want this cluster to belong to.
  - name: name
    type: STRING
    documentation: the name for this cluster
    validation:
      regex: '[a-z\-]+'
  - name: tier
    type: ENUM
    documentation: What tier to place this cluster in.
    values:
    - dev
    - prod
  - name: region
    type: STRING
    documentation: Region where the cluster should be created.
  - name: kubernetesVersion
    type: STRING
    documentation: Kubernetes version to use for this cluster.
    validation:
      regex: '^1\.[2-3][0-9]$'
  - name: clusterId
    type: STRING
    documentation: Cilium Cluster ID to give to this cluster, must be an integer between 1 and 255.
    validation:
      regex: '^[1-9][0-9]*$'
```
This takes the necessary inputs for defining the new cluster, including the numeric cluster ID, and renders a single template to define the GitOps manifests that instantiate the new stack. You can find it at `templates/cilium-cluster.yaml`:
```yaml
{% capture templated %}{{ context.fleet }}-{{ context.tier }}{% endcapture %}
{% assign name = context.name | default: templated %}
apiVersion: deployments.plural.sh/v1alpha1
kind: InfrastructureStack
metadata:
  name: cluster-{{ name }}
spec:
  {% if context.ai %}
  agentId: {{ context.ai.session.agent_id }}
  {% endif %}
  name: cluster-{{ name }}
  detach: false
  type: TERRAFORM
  approval: true
  manageState: true
  actor: console@plural.sh
  configuration:
    version: '1.8'
  repositoryRef:
    name: infra
    namespace: infra
  clusterRef:
    name: mgmt
    namespace: infra
  git:
    ref: main
    folder: terraform/modules/clusters/aws
  variables:
    cluster: {{ name }}
    fleet: {{ context.fleet }}
    tier: {{ context.tier }}
    region: {{ context.region }}
    cluster_id: {{ context.clusterId }}
    {% raw %}
    kubernetes_version: "{{ configuration.kubernetesVersion }}"
    {% endraw %}
---
apiVersion: deployments.plural.sh/v1alpha1
kind: Cluster
metadata:
  name: {{ name }}
spec:
  handle: {{ name }}
```
In this code, we mostly just define a Terraform `InfrastructureStack` using the variables specified in `spec.variables`. The cluster is also registered to accept deployments via your Kubernetes operator with the `Cluster` CR.
The stack itself only has a few modifications:
- At `terraform/modules/clusters/aws/plural.tf`, you modify the `plural_cluster` resource setup to include the `cluster_id` and other necessary metadata:
resource "plural_cluster" "this" { handle = var.cluster name = var.cluster tags = { fleet = var.fleet tier = var.tier role = "workload" } metadata = jsonencode({ tier = var.tier dns_zone = try(local.vpc.ingress_dns_zone, "example.com") # the dns zone is also defined cilium_cluster_id = var.cluster_id # set cluster id # everything else is the default setup, but worth noting that externaldns is necessary iam = { load_balancer = module.addons.gitops_metadata.aws_load_balancer_controller_iam_role_arn cluster_autoscaler = module.addons.gitops_metadata.cluster_autoscaler_iam_role_arn external_dns = module.externaldns_irsa_role.iam_role_arn cert_manager = module.externaldns_irsa_role.iam_role_arn } vpc_id = local.vpc.vpc_id region = var.region
network = { private_subnets = local.vpc.private_subnets public_subnets = local.vpc.public_subnets } }) kubeconfig = { host = module.eks.cluster_endpoint cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data) token = data.aws_eks_cluster_auth.cluster.token } depends_on = [ module.addons, module.ebs_csi_irsa_role, module.vpc_cni_irsa_role, module.externaldns_irsa_role ] } |
- At `terraform/modules/aws/cilium.tf`, you add the following file to call the cluster registration PR:
data "plural_pr_automation" "cilium_cluster_registrar" { name = "cilium-cluster-registrar" } resource "plural_pr_automation_trigger" "cilium" { pr_automation_id = data.plural_pr_automation.cilium_cluster_registrar.id pr_automation_branch = "cilium/register/${var.cluster}" context = { name = var.cluster tier = var.tier ciliumApiserverIp = "10.0.255.${var.cluster_id}" dnsZone = try(local.vpc.ingress_dns_zone, "example.com") } } |
- And finally, at `terraform/core-infra/cilium.tf`, you define a common root certificate for all clusters in the `core-infra` stack:
resource "tls_private_key" "cilium_ca_key" { algorithm = "RSA" rsa_bits = 4096 } resource "tls_self_signed_cert" "cilium_ca_cert" { private_key_pem = tls_private_key.cilium_ca_key.private_key_pem is_ca_certificate = true subject { common_name = "Cilium CA" organization = "Pluralsh" } allowed_uses = [ "crl_signing", "cert_signing", "key_encipherment", "digital_signature", "server_auth", "client_auth" ] validity_period_hours = 87600 # 10 years early_renewal_hours = 240 # Renew 10 days before expiry } output "cilium_ca_cert" { value = tls_self_signed_cert.cilium_ca_cert.cert_pem sensitive = true } output "cilium_ca_key" { value = tls_private_key.cilium_ca_key.private_key_pem sensitive = true } resource "kubernetes_secret" "cilium_ca_cert" { # this will ultimately be used in the cilium global service metadata { name = "cilium-ca-cert" namespace = "infra" } data = { "ca.crt" = tls_self_signed_cert.cilium_ca_cert.cert_pem "ca.key" = tls_private_key.cilium_ca_key.private_key_pem "ca.cert.b64" = base64encode(tls_self_signed_cert.cilium_ca_cert.cert_pem) "ca.key.b64" = base64encode(tls_private_key.cilium_ca_key.private_key_pem) } } |
This is all you need for the Terraform configuration; the rest is handled via Helm. The Helm configuration is split into two values files.
At `helm/cilium/base.yaml.liquid`, you have the following:
```yaml
cni:
  chainingMode: aws-cni
  exclusive: false
enableIPv4Masquerade: false
routingMode: native
tls:
  ca:
    cert: {{ configuration["ca.cert.b64"] }} # getting the root ca setup from core-infra
    key: {{ configuration["ca.key.b64"] }}
cluster:
  id: {{ cluster.metadata.cilium_cluster_id }} # cluster id from `plural_cluster` resource
  name: {{ cluster.handle }}
clustermesh:
  useAPIServer: true
  config:
    enabled: true
  mcsapi:
    enabled: true
  apiserver:
    tls:
      server:
        extraDnsNames:
        - {{ cluster.handle }}-cilium-apiserver.{{ cluster.metadata.dns_zone }} # use external dns to register a unique dns name for this cluster's gateway
      auto:
        enabled: true
        method: "certmanager"
        certManagerIssuerRef:
          group: cert-manager.io
          kind: ClusterIssuer
          name: cilium
    service:
      type: LoadBalancer
      annotations:
        service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
        service.beta.kubernetes.io/aws-load-balancer-scheme: "internal"
        external-dns.alpha.kubernetes.io/hostname: {{ cluster.handle }}-cilium-apiserver.{{ cluster.metadata.dns_zone }}
```
And then for dev and prod clusters, you have separate values files declaring all the registered clusters at `helm/cilium/dev-clusters.yaml`:
```yaml
clustermesh:
  config:
    clusters:
    - address: orchid-dev-cilium-apiserver.dev.pocs.plural.sh
      name: orchid-dev
      port: 2379
    - address: orchid-dev-usw1-cilium-apiserver.dev.pocs.plural.sh
      name: orchid-dev-usw1
      port: 2379
    - address: orchid-dev-usw2-cilium-apiserver.dev.pocs.plural.sh
      name: orchid-dev-usw2
      port: 2379
```
Here's the global service that sets up the Helm chart (and references these values files):
```yaml
apiVersion: deployments.plural.sh/v1alpha1
kind: GlobalService
metadata:
  name: cilium-dev
  namespace: infra
spec:
  mgmt: false
  tags:
    tier: dev # target only dev clusters
  template:
    name: cilium
    namespace: kube-system
    configurationRef:
      kind: Secret
      name: cilium-ca-cert # note this is referencing the secret we declared in the core-infra stack
      namespace: infra
    protect: false
    helm:
      version: "1.18.1"
      chart: cilium
      url: https://helm.cilium.io
      valuesFiles: # the two values files
      - base.yaml.liquid
      - dev-clusters.yaml
      git:
        folder: helm/cilium
        ref: main
      repositoryRef:
        kind: GitRepository
        name: infra
        namespace: infra
```
The `cilium-cluster-registrar` PR automation updates the cluster-specific YAML files:
```yaml
apiVersion: deployments.plural.sh/v1alpha1
kind: PrAutomation
metadata:
  name: cilium-cluster-registrar
spec:
  name: cilium-cluster-registrar
  documentation: Registers a new cluster with the Cilium cluster-mesh
  updates:
    yamlOverlays:
    - file: "helm/cilium/{{ context.tier }}-clusters.yaml"
      listMerge: APPEND
      yaml: |
        clustermesh:
          config:
            clusters:
            - name: {{ context.name }}
              port: 2379
              address: "{{ context.name }}-cilium-apiserver.{{ context.dnsZone }}"
  scmConnectionRef:
    name: plural
  title: "Registering {{ context.name }} cluster in {{ context.tier }} with Cilium cluster-mesh"
  message: "Registering {{ context.name }} cluster in {{ context.tier }} with Cilium cluster-mesh"
  identifier: mgmt
  configuration:
  - name: tier
    type: ENUM
    documentation: "the tier of the cluster"
    values:
    - dev
    - prod
  - name: name
    type: STRING
    documentation: The name of the cluster to register
  - name: dnsZone
    type: STRING
    documentation: The DNS zone of the cluster
```
This automation appends the new cluster to the existing YAML and generates a PR for approval, adding the cluster to the mesh.
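For example, registering a hypothetical cluster named orchid-dev-euw1 in the dev tier (with the dev.pocs.plural.sh DNS zone) would open a PR whose only change appends an entry like this to `helm/cilium/dev-clusters.yaml`:

```yaml
clustermesh:
  config:
    clusters:
    # ...existing entries stay untouched...
    - name: orchid-dev-euw1                                           # from context.name (hypothetical)
      port: 2379
      address: "orchid-dev-euw1-cilium-apiserver.dev.pocs.plural.sh"  # context.name + context.dnsZone
```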
The Final Product
Once configured, cluster creation is extremely clean. Using the cilium-cluster-creator PR automation in the infra catalog, you complete a UI wizard, which then creates the PR to instantiate the stack. Everything runs automatically from there.
Conclusion
Setting up custom Kubernetes networking is never going to be an easy process. From certificate authorities to DNS registration and config management, there are many concerns that need to be addressed from the ground up. We hope this walkthrough gave you a maintainable solution that ensures:
- Declarative and Git-based workflows: No hidden sources of truth or complex scripts.
- Maintainability: All operations run via pre-defined PRs and UI wizards, minimizing misconfiguration.
- Observability: All Cilium instances are registered and monitored within the Plural UI, rather than being hidden in Helm charts.