GitOps Setup of Cilium Multi-Cluster with Plural
One of the major difficulties for multi-cluster Kubernetes environments is finding the appropriate network topology. There are two main approaches:
- Isolated networks: Each cluster maintains its own isolated network and service discovery, exposing services to each other via either ingress or gateway APIs. This is the simplest solution and is robust.
- Service mesh bridge: Bridge all clusters in a common service mesh, giving full network availability from the ingress down to the pod level. While more complex to manage, this approach is advantageous if you need direct pod or service-level communication between clusters. It's a common approach with distributed databases and clustered solutions like WebSocket-based distribution networks.
If you're considering the multi-cluster network route, this guide will help you get started quickly.
Why Cilium Multi-Cluster and GitOps Don't Play Well Together
It's important to understand why Cilium isn't necessarily the best bedfellow of a GitOps process.
Cilium's installation process creates friction with GitOps workflows. Their documentation centers around CLI-driven installs:
```sh
cilium install --set cluster.name=$CLUSTER1 --set cluster.id=1 --context $CLUSTER1
cilium install --set cluster.name=$CLUSTER2 --set cluster.id=2 --context $CLUSTER2
```
The typical setup process looks like this:
- Create clusters with Terraform
- Get kubeconfig access to each cluster
- Install Cilium via CLI, joining clusters to the mesh
- Perform all future Cilium upgrades through CLI
Steps 2 and onward often require manual intervention or scripting in CI systems like GitHub Actions. This isn't true GitOps, though; it's imperative and prone to drift.
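For context, the imperative pattern often ends up looking something like this hypothetical GitHub Actions job (cluster names, IDs, and auth setup are placeholders, not part of this setup):

```yaml
# Hypothetical CI job illustrating the imperative approach this guide moves away from.
# Cluster names, IDs, and credentials are placeholders.
name: join-cilium-mesh
on: workflow_dispatch
jobs:
  join-mesh:
    runs-on: ubuntu-latest
    steps:
      - name: Install Cilium and join the mesh
        run: |
          aws eks update-kubeconfig --name "$CLUSTER_NAME" --region "$AWS_REGION"
          cilium install --set cluster.name="$CLUSTER_NAME" --set cluster.id="$CLUSTER_ID"
          cilium clustermesh enable
```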
There is technically a way around this; you can reverse-engineer the Cilium CLI commands via Helm (since the CLI ultimately uses Helm for Kubernetes installs), and with a bit of investigation the values file patterns can be reconstructed (a rough sketch follows the list below). That said, you'll still face the following manual steps:
- Ensuring Cilium installs on each cluster
- Defining DNS entries for each exposed Cilium gateway so they can discover each other
- Modifying the Helm values file (likely in Git) on each new cluster; Cilium sometimes fails to discover peers whose gateways don't exist in time and won't retry, though this should ultimately be fixed upstream.
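For reference, here's a minimal sketch of the Helm values those CLI commands roughly translate to, using standard Cilium chart fields (verify the exact names against your chart version):

```yaml
# Rough Helm-values equivalent of the CLI install shown earlier (a sketch, not lifted from the Cilium docs)
cluster:
  name: cluster-1      # must be unique across the mesh
  id: 1                # integer between 1 and 255, unique per cluster
clustermesh:
  useAPIServer: true   # roughly what `cilium clustermesh enable` toggles on
```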
A GitOps Solution with Plural
Fortunately, you can set all this up through Plural with this approach:
- Terraform stacks define clusters, determining cluster ID and gateway DNS names, then spawn PR automation to register clusters in the mesh post creation.
- Root CA creation and registration as a secret distributed to all clusters via global services (required by Cilium for mesh authentication).
- Global services sync Cilium into each registered cluster.
- PR automation triggers to declare a new Cilium mesh cluster.
This creates a clean GitOps setup operated entirely through declarative PRs. Once onboarded with PR automations, operators only need to approve PRs—no manual code changes required.
Here’s how it all works:
The Setup
This setup assumes you've configured Plural using the base GitOps setup created by the `plural up` command. While other configurations are possible, this is the simplest approach.
Start with the PR automation resource to create a new Cilium clustermesh-enabled cluster, which is defined in `bootstrap/pr-automations/cilium-cluster-creator.yaml`:
```yaml
apiVersion: deployments.plural.sh/v1alpha1
kind: PrAutomation
metadata:
  name: cilium-cluster-creator
spec:
  name: cilium-cluster-creator
  icon: https://plural-assets.s3.us-east-2.amazonaws.com/uploads/repos/d1a82b07-b809-4eb4-b615-8f24365b72b8/k8s.png?v=63861145828
  identifier: mgmt
  documentation: |
    Sets up a PR to add a new cluster with prerequisites for Cilium cluster-mesh to the provided fleet
  creates:
    templates:
    - source: 'templates/cilium-cluster.yaml'
      destination: "services/{{ context.fleet }}/clusters/{{ context.tier }}/{{ context.name }}.yaml.liquid"
      external: true
  catalogRef:
    name: infra
  scmConnectionRef:
    name: plural # you'll need to add this ScmConnection manually before this is functional
  title: "Setting up {{ context.name }} cluster in fleet {{ context.fleet }}"
  message: |
    Setting up {{ context.name }} cluster in fleet {{ context.fleet }}
    Plural Service: mgmt/{{ context.fleet }}-{{ context.tier }}
  configuration:
  - name: fleet
    type: STRING
    documentation: Name for the fleet you want this cluster to belong to.
  - name: name
    type: STRING
    documentation: the name for this cluster
    validation:
      regex: '[a-z\-]+'
  - name: tier
    type: ENUM
    documentation: What tier to place this cluster in.
    values:
    - dev
    - prod
  - name: region
    type: STRING
    documentation: Region where the cluster should be created.
  - name: kubernetesVersion
    type: STRING
    documentation: Kubernetes version to use for this cluster.
    validation:
      regex: '^1\.[2-3][0-9]$'
  - name: clusterId
    type: STRING
    documentation: Cilium Cluster ID to give to this cluster, must be an integer between 1 and 255.
    validation:
      regex: '^[1-9][0-9]*$'
```
This takes the necessary inputs for defining the new cluster, including the numeric cluster ID, and renders a single template to define the GitOps manifests that instantiate the new stack. You can find it at `templates/cilium-cluster.yaml`:
```yaml
{% capture templated %}{{ context.fleet }}-{{ context.tier }}{% endcapture %}
{% assign name = context.name | default: templated %}
apiVersion: deployments.plural.sh/v1alpha1
kind: InfrastructureStack
metadata:
  name: cluster-{{ name }}
spec:
  {% if context.ai %}
  agentId: {{ context.ai.session.agent_id }}
  {% endif %}
  name: cluster-{{ name }}
  detach: false
  type: TERRAFORM
  approval: true
  manageState: true
  actor: console@plural.sh
  configuration:
    version: '1.8'
  repositoryRef:
    name: infra
    namespace: infra
  clusterRef:
    name: mgmt
    namespace: infra
  git:
    ref: main
    folder: terraform/modules/clusters/aws
  variables:
    cluster: {{ name }}
    fleet: {{ context.fleet }}
    tier: {{ context.tier }}
    region: {{ context.region }}
    cluster_id: {{ context.clusterId }}
    {% raw %}
    kubernetes_version: "{{ configuration.kubernetesVersion }}"
    {% endraw %}
---
apiVersion: deployments.plural.sh/v1alpha1
kind: Cluster
metadata:
  name: {{ name }}
spec:
  handle: {{ name }}
```
In this code, we mostly just define a Terraform `InfrastructureStack` using the variables specified in `spec.variables`. The cluster is also registered to accept deployments via your Kubernetes operator with the `Cluster` CR.
The stack itself only has a few modifications:
- At `terraform/modules/clusters/aws/plural.tf`, you modify the `plural_cluster` resource setup to include the `cluster_id` and other necessary metadata:
resource "plural_cluster" "this" { handle = var.cluster name = var.cluster tags = { fleet = var.fleet tier = var.tier role = "workload" } metadata = jsonencode({ tier = var.tier dns_zone = try(local.vpc.ingress_dns_zone, "example.com") # the dns zone is also defined cilium_cluster_id = var.cluster_id # set cluster id # everything else is the default setup, but worth noting that externaldns is necessary iam = { load_balancer = module.addons.gitops_metadata.aws_load_balancer_controller_iam_role_arn cluster_autoscaler = module.addons.gitops_metadata.cluster_autoscaler_iam_role_arn external_dns = module.externaldns_irsa_role.iam_role_arn cert_manager = module.externaldns_irsa_role.iam_role_arn } vpc_id = local.vpc.vpc_id region = var.region
network = { private_subnets = local.vpc.private_subnets public_subnets = local.vpc.public_subnets } }) kubeconfig = { host = module.eks.cluster_endpoint cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data) token = data.aws_eks_cluster_auth.cluster.token } depends_on = [ module.addons, module.ebs_csi_irsa_role, module.vpc_cni_irsa_role, module.externaldns_irsa_role ] } |
- At `terraform/modules/aws/cilium.tf`, you add the following file to call the cluster registration PR:
data "plural_pr_automation" "cilium_cluster_registrar" { name = "cilium-cluster-registrar" } resource "plural_pr_automation_trigger" "cilium" { pr_automation_id = data.plural_pr_automation.cilium_cluster_registrar.id pr_automation_branch = "cilium/register/${var.cluster}" context = { name = var.cluster tier = var.tier ciliumApiserverIp = "10.0.255.${var.cluster_id}" dnsZone = try(local.vpc.ingress_dns_zone, "example.com") } } |
- And finally, at `terraform/core-infra/cilium.tf`, you define a common root certificate for all clusters in the `core-infra` stack:
resource "tls_private_key" "cilium_ca_key" { algorithm = "RSA" rsa_bits = 4096 } resource "tls_self_signed_cert" "cilium_ca_cert" { private_key_pem = tls_private_key.cilium_ca_key.private_key_pem is_ca_certificate = true subject { common_name = "Cilium CA" organization = "Pluralsh" } allowed_uses = [ "crl_signing", "cert_signing", "key_encipherment", "digital_signature", "server_auth", "client_auth" ] validity_period_hours = 87600 # 10 years early_renewal_hours = 240 # Renew 10 days before expiry } output "cilium_ca_cert" { value = tls_self_signed_cert.cilium_ca_cert.cert_pem sensitive = true } output "cilium_ca_key" { value = tls_private_key.cilium_ca_key.private_key_pem sensitive = true } resource "kubernetes_secret" "cilium_ca_cert" { # this will ultimately be used in the cilium global service metadata { name = "cilium-ca-cert" namespace = "infra" } data = { "ca.crt" = tls_self_signed_cert.cilium_ca_cert.cert_pem "ca.key" = tls_private_key.cilium_ca_key.private_key_pem "ca.cert.b64" = base64encode(tls_self_signed_cert.cilium_ca_cert.cert_pem) "ca.key.b64" = base64encode(tls_private_key.cilium_ca_key.private_key_pem) } } |
This is all you need for the Terraform configuration; the rest is handled via Helm. The Helm configuration is split into two values files.
At `helm/cilium/base.yaml.liquid`, you have the following:
```yaml
cni:
  chainingMode: aws-cni
  exclusive: false
enableIPv4Masquerade: false
routingMode: native
tls:
  ca:
    cert: {{ configuration["ca.cert.b64"] }} # getting the root ca setup from core-infra
    key: {{ configuration["ca.key.b64"] }}
cluster:
  id: {{ cluster.metadata.cilium_cluster_id }} # cluster id from `plural_cluster` resource
  name: {{ cluster.handle }}
clustermesh:
  useAPIServer: true
  config:
    enabled: true
  mcsapi:
    enabled: true
  apiserver:
    tls:
      server:
        extraDnsNames:
        - {{ cluster.handle }}-cilium-apiserver.{{ cluster.metadata.dns_zone }} # use external dns to register a unique dns name for this cluster's gateway
      auto:
        enabled: true
        method: "certmanager"
        certManagerIssuerRef:
          group: cert-manager.io
          kind: ClusterIssuer
          name: cilium
    service:
      type: LoadBalancer
      annotations:
        service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
        service.beta.kubernetes.io/aws-load-balancer-scheme: "internal"
        external-dns.alpha.kubernetes.io/hostname: {{ cluster.handle }}-cilium-apiserver.{{ cluster.metadata.dns_zone }}
```
And then for dev and prod clusters, you have separate values files declaring all the registered clusters at `helm/cilium/dev-clusters.yaml`:
```yaml
clustermesh:
  config:
    clusters:
    - address: orchid-dev-cilium-apiserver.dev.pocs.plural.sh
      name: orchid-dev
      port: 2379
    - address: orchid-dev-usw1-cilium-apiserver.dev.pocs.plural.sh
      name: orchid-dev-usw1
      port: 2379
    - address: orchid-dev-usw2-cilium-apiserver.dev.pocs.plural.sh
      name: orchid-dev-usw2
      port: 2379
```
Here's the global service that sets up the Helm chart (and references these values files):
```yaml
apiVersion: deployments.plural.sh/v1alpha1
kind: GlobalService
metadata:
  name: cilium-dev
  namespace: infra
spec:
  mgmt: false
  tags:
    tier: dev # target only dev clusters
  template:
    name: cilium
    namespace: kube-system
    configurationRef:
      kind: Secret
      name: cilium-ca-cert # note this is referencing the secret we declared in the core-infra stack
      namespace: infra
    protect: false
    helm:
      version: "1.18.1"
      chart: cilium
      url: https://helm.cilium.io
      valuesFiles: # the two values files
      - base.yaml.liquid
      - dev-clusters.yaml
      git:
        folder: helm/cilium
        ref: main
      repositoryRef:
        kind: GitRepository
        name: infra
        namespace: infra
```
The `cilium-cluster-registrar` PR automation updates the cluster-specific YAML files:
```yaml
apiVersion: deployments.plural.sh/v1alpha1
kind: PrAutomation
metadata:
  name: cilium-cluster-registrar
spec:
  name: cilium-cluster-registrar
  documentation: Registers a new cluster with the Cilium cluster-mesh
  updates:
    yamlOverlays:
    - file: "helm/cilium/{{ context.tier }}-clusters.yaml"
      listMerge: APPEND
      yaml: |
        clustermesh:
          config:
            clusters:
            - name: {{ context.name }}
              port: 2379
              address: "{{ context.name }}-cilium-apiserver.{{ context.dnsZone }}"
  scmConnectionRef:
    name: plural
  title: "Registering {{ context.name }} cluster in {{ context.tier }} with Cilium cluster-mesh"
  message: "Registering {{ context.name }} cluster in {{ context.tier }} with Cilium cluster-mesh"
  identifier: mgmt
  configuration:
  - name: tier
    type: ENUM
    documentation: "the tier of the cluster"
    values:
    - dev
    - prod
  - name: name
    type: STRING
    documentation: The name of the cluster to register
  - name: dnsZone
    type: STRING
    documentation: The DNS zone of the cluster
```
This automation appends the new cluster to the existing YAML and generates a PR for approval, adding the cluster to the mesh.
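For example, registering a hypothetical cluster named orchid-dev-euw1 in the dev tier (with the dev.pocs.plural.sh DNS zone) would open a PR whose only change appends an entry like this to `helm/cilium/dev-clusters.yaml`:

```yaml
clustermesh:
  config:
    clusters:
    # ...existing entries stay untouched...
    - name: orchid-dev-euw1                                           # from context.name (hypothetical)
      port: 2379
      address: "orchid-dev-euw1-cilium-apiserver.dev.pocs.plural.sh"  # context.name + context.dnsZone
```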
The Final Product
Once configured, cluster creation is extremely clean. Using the cilium-cluster-creator PR automation in the infra catalog, you complete a UI wizard, which then creates the PR to instantiate the stack. Everything runs automatically from there.
Conclusion
Setting up custom Kubernetes networking is never going to be an easy process. From certificate authorities to DNS registration and config management, there are many concerns that need to be addressed from the ground up. We hope this walkthrough gave you a maintainable solution that ensures:
- Declarative and Git-based workflows: No hidden sources of truth or complex scripts.
- Maintainability: All operations run via pre-defined PRs and UI wizards, minimizing misconfiguration.
- Observability: All Cilium instances are registered and monitored within the Plural UI, rather than being hidden in Helm charts.