GitOps Setup of Cilium Multi-Cluster with Plural

When you're running multiple Kubernetes clusters, finding the right network topology is difficult. If you're considering the multi-cluster network route, this guide will help you get started quickly. 

Michael Guarino

When you're running multiple Kubernetes clusters, one of the major difficulties is finding the right network topology. There are two main approaches:

  • Isolated networks: Each cluster maintains its own isolated network and service discovery, exposing services to each other via either ingress or gateway APIs. This is the simplest solution, and a robust one.
  • Service mesh bridge: Bridge all clusters in a common service mesh, giving full network availability from the ingress down to the pod level. While more complex to manage, this approach is advantageous if you need direct pod or service-level communication between clusters. It's a common approach with distributed databases and clustered solutions like WebSocket-based distribution networks. 

If you're considering the multi-cluster network route, this guide will help you get started quickly. 

Why Cilium Multi-Cluster and GitOps Don't Play Well Together

It's important to understand why Cilium isn't necessarily the best bedfellow of a GitOps process. 

Cilium's installation process creates friction with GitOps workflows. Its documentation centers on CLI-driven installs:

cilium install --set cluster.name=$CLUSTER1 --set cluster.id=1 --context $CLUSTER1
cilium install --set cluster.name=$CLUSTER2 --set cluster.id=2 --context $CLUSTER2

The typical setup process looks like this: 

  1. Create clusters with Terraform
  2. Get kubeconfig access to each cluster
  3. Install Cilium via CLI, joining clusters to the mesh
  4. Perform all future Cilium upgrades through CLI

Steps 2 and onward often require manual intervention or scripting in CI systems like GitHub Actions. This isn't true GitOps, though; it's imperative and prone to drift.
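
As an illustration only (a hypothetical workflow, not part of the setup below), the CI-driven variant tends to look something like this, with kubeconfig access and the install baked into a pipeline someone has to remember to re-run:

# Hypothetical GitHub Actions workflow illustrating the imperative approach
name: install-cilium
on: workflow_dispatch            # someone still has to remember to trigger it
jobs:
  cluster-1:
    runs-on: ubuntu-latest
    steps:
      - name: Fetch kubeconfig                        # manual step 2
        run: aws eks update-kubeconfig --name cluster-1
      - name: Install Cilium and join the mesh        # manual steps 3 and 4
        run: cilium install --set cluster.name=cluster-1 --set cluster.id=1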

There is technically a way around this: you can reverse-engineer the Cilium CLI commands via Helm (the CLI ultimately uses Helm for its Kubernetes installs), and with a bit of investigation you can recreate the values file patterns it relies on (a rough sketch follows the list below). Even then, you'll still face the following manual steps:

  1. Ensuring Cilium installs on each cluster
  2. Defining DNS entries for each exposed Cilium gateway so they can discover each other
  3. Modifying the Helm values file (likely in Git) on each cluster creation (Cilium sometimes fails to discover peers whose gateways don't exist in time and won't retry, though this should ultimately be fixed in Cilium)
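
For reference, the CLI flags shown earlier map onto plain Helm values roughly like the following. This is a sketch of the flag-to-values mapping only; the clustermesh-specific values this guide actually uses appear later.

# Rough Helm-values equivalent of `cilium install --set cluster.name=... --set cluster.id=...`
cluster:
  name: cluster-1   # unique, human-readable name for this mesh member
  id: 1             # unique integer ID (1-255) across the mesh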

A GitOps Solution with Plural

Fortunately, you can set all this up through Plural with this approach:

  1. Terraform stacks define the clusters, setting each cluster's ID and gateway DNS name, then spawn a PR automation to register the cluster in the mesh after creation.
  2. A root CA is created and registered as a secret that global services distribute to all clusters (Cilium requires it for mesh authentication).
  3. Global services sync Cilium into each registered cluster.
  4. The PR automation triggers to declare the new Cilium mesh cluster.

This creates a clean GitOps setup operated entirely through declarative PRs. Once onboarded with PR automations, operators only need to approve PRs; no manual code changes are required.

Here’s how it all works:

The Setup 

This setup assumes you've configured Plural using the base GitOps setup defined with the plural up command. While other configurations are possible, this is the simplest approach.

Start with the PR automation resource used to create a new Cilium clustermesh-enabled cluster, defined in bootstrap/pr-automations/cilium-cluster-creator.yaml:

apiVersion: deployments.plural.sh/v1alpha1
kind: PrAutomation
metadata:
  name: cilium-cluster-creator
spec:
  name: cilium-cluster-creator
  icon: https://plural-assets.s3.us-east-2.amazonaws.com/uploads/repos/d1a82b07-b809-4eb4-b615-8f24365b72b8/k8s.png?v=63861145828
  identifier: mgmt
  documentation: |
    Sets up a PR to add a new cluster with prerequisites for Cilium cluster-mesh to the provided fleet
  creates:
    templates:
      - source: 'templates/cilium-cluster.yaml'
        destination: "services/{{ context.fleet }}/clusters/{{ context.tier }}/{{ context.name }}.yaml.liquid"
        external: true
  catalogRef:
    name: infra
  scmConnectionRef:
    name: plural  # you'll need to add this ScmConnection manually before this is functional
  title: "Setting up {{ context.name }} cluster in fleet {{ context.fleet }}"
  message: |
    Setting up {{ context.name }} cluster in fleet {{ context.fleet }}

    Plural Service: mgmt/{{ context.fleet }}-{{ context.tier }}
  configuration:
    - name: fleet
      type: STRING
      documentation: Name for the fleet you want this cluster to belong to.
    - name: name
      type: STRING
      documentation: the name for this cluster
      validation:
        regex: '[a-z\-]+'
    - name: tier
      type: ENUM
      documentation: What tier to place this cluster in.
      values:
        - dev
        - prod
    - name: region
      type: STRING
      documentation: Region where the cluster should be created.
    - name: kubernetesVersion
      type: STRING
      documentation: Kubernetes version to use for this cluster.
      validation:
        regex: '^1\.[2-3][0-9]$'
    - name: clusterId
      type: STRING
      documentation: Cilium Cluster ID to give to this cluster, must be an integer between 1 and 255.
      validation:
        regex: '^[1-9][0-9]*$'

This takes the necessary inputs for defining the new cluster, including the numeric cluster ID, and renders a single template to define the GitOps manifests that instantiate the new stack. You can find it at templates/cilium-cluster.yaml:

{% capture templated %}{{ context.fleet }}-{{ context.tier }}{% endcapture %}
{% assign name = context.name | default: templated %}
apiVersion: deployments.plural.sh/v1alpha1
kind: InfrastructureStack
metadata:
  name: cluster-{{ name }}
spec:
{% if context.ai %}
  agentId: {{ context.ai.session.agent_id }}
{% endif %}
  name: cluster-{{ name }}
  detach: false
  type: TERRAFORM
  approval: true
  manageState: true
  actor: console@plural.sh
  configuration:
    version: '1.8'
  repositoryRef:
    name: infra
    namespace: infra
  clusterRef:
    name: mgmt
    namespace: infra
  git:
    ref: main
    folder: terraform/modules/clusters/aws
  variables:
    cluster: {{ name }}
    fleet: {{ context.fleet }}
    tier: {{ context.tier }}
    region: {{ context.region }}
    cluster_id: {{ context.clusterId }}
    {% raw %}
    kubernetes_version: "{{ configuration.kubernetesVersion }}"
    {% endraw %}
---
apiVersion: deployments.plural.sh/v1alpha1
kind: Cluster
metadata:
  name: {{ name }}
spec:
  handle: {{ name }}

In this code, we mostly just define a Terraform InfrastructureStack using the variables specified in spec.variables. The Cluster CR also registers the cluster to accept deployments via your Kubernetes operator.

The stack itself only has a few modifications:

  • At terraform/modules/clusters/aws/plural.tf, you modify the plural_cluster resource setup to include the cluster_id and other necessary metadata:

resource "plural_cluster" "this" {

    handle = var.cluster

    name   = var.cluster

    tags   = {

        fleet = var.fleet

        tier = var.tier

        role = "workload"

    }


    metadata = jsonencode({

        tier = var.tier

        dns_zone = try(local.vpc.ingress_dns_zone, "example.com") # the dns zone is also defined

        cilium_cluster_id = var.cluster_id # set cluster id


        # everything else is the default setup, but worth noting that externaldns is necessary

        iam = {

          load_balancer = module.addons.gitops_metadata.aws_load_balancer_controller_iam_role_arn

          cluster_autoscaler = module.addons.gitops_metadata.cluster_autoscaler_iam_role_arn

          external_dns = module.externaldns_irsa_role.iam_role_arn

          cert_manager = module.externaldns_irsa_role.iam_role_arn

        }


        vpc_id = local.vpc.vpc_id

        region = var.region

        

        network = {

          private_subnets = local.vpc.private_subnets

          public_subnets  = local.vpc.public_subnets

        }

    })


    kubeconfig = {

      host                   = module.eks.cluster_endpoint

      cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data)

      token                  = data.aws_eks_cluster_auth.cluster.token

    }


    depends_on = [ 

      module.addons,

      module.ebs_csi_irsa_role, 

      module.vpc_cni_irsa_role, 

      module.externaldns_irsa_role 

    ]

}

  • At terraform/modules/aws/cilium.tf, you add the following file, which triggers the cluster registration PR automation:

          data "plural_pr_automation" "cilium_cluster_registrar" {

    name = "cilium-cluster-registrar"

}


resource "plural_pr_automation_trigger" "cilium" {

    pr_automation_id = data.plural_pr_automation.cilium_cluster_registrar.id

    pr_automation_branch = "cilium/register/${var.cluster}"

    context = {

        name = var.cluster

        tier = var.tier

        ciliumApiserverIp = "10.0.255.${var.cluster_id}"

        dnsZone = try(local.vpc.ingress_dns_zone, "example.com")

    }

}

  • And finally, at terraform/core-infra/cilium.tf, you define a common root certificate for all clusters in the core-infra stack:

resource "tls_private_key" "cilium_ca_key" {

    algorithm = "RSA"

    rsa_bits  = 4096

}


resource "tls_self_signed_cert" "cilium_ca_cert" {

    private_key_pem = tls_private_key.cilium_ca_key.private_key_pem


    is_ca_certificate = true


    subject {

        common_name  = "Cilium CA"

        organization = "Pluralsh"

    }


    allowed_uses = [

        "crl_signing",

        "cert_signing",

        "key_encipherment",

        "digital_signature",

        "server_auth",

        "client_auth"

    ]


    validity_period_hours = 87600 # 10 years

    early_renewal_hours   = 240   # Renew 10 days before expiry

}


output "cilium_ca_cert" {

    value = tls_self_signed_cert.cilium_ca_cert.cert_pem

    sensitive = true

}


output "cilium_ca_key" {

    value = tls_private_key.cilium_ca_key.private_key_pem

    sensitive = true

}


resource "kubernetes_secret" "cilium_ca_cert" { # this will ultimately be used in the cilium global service

    metadata {

        name = "cilium-ca-cert"

        namespace = "infra"

    }


    data = {

        "ca.crt" = tls_self_signed_cert.cilium_ca_cert.cert_pem

        "ca.key" = tls_private_key.cilium_ca_key.private_key_pem

        "ca.cert.b64" = base64encode(tls_self_signed_cert.cilium_ca_cert.cert_pem)

        "ca.key.b64" = base64encode(tls_private_key.cilium_ca_key.private_key_pem)

    }

}

This is all you need for the Terraform configuration; the rest is handled via Helm.

The Helm configuration is split into two values files.

At helm/cilium/base.yaml.liquid, you have the following:

cni:
  chainingMode: aws-cni
  exclusive: false
enableIPv4Masquerade: false
routingMode: native

tls:
  ca:
    cert: {{ configuration["ca.cert.b64"] }} # getting the root ca setup from core-infra
    key: {{ configuration["ca.key.b64"] }}

cluster:
  id: {{ cluster.metadata.cilium_cluster_id }} # cluster id from the `plural_cluster` resource
  name: {{ cluster.handle }}

clustermesh:
  useAPIServer: true
  config:
    enabled: true
  mcsapi:
    enabled: true

  apiserver:
    tls:
      server:
        extraDnsNames:
        - {{ cluster.handle }}-cilium-apiserver.{{ cluster.metadata.dns_zone }} # use external-dns to register a unique dns name for this cluster's gateway
      auto:
        enabled: true
        method: "certmanager"
        certManagerIssuerRef:
          group: cert-manager.io
          kind: ClusterIssuer
          name: cilium
    service:
      type: LoadBalancer
      annotations:
        service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
        service.beta.kubernetes.io/aws-load-balancer-scheme: "internal"
        external-dns.alpha.kubernetes.io/hostname: {{ cluster.handle }}-cilium-apiserver.{{ cluster.metadata.dns_zone }}

Then, for dev and prod, separate values files declare all the clusters registered in each tier. The dev file at helm/cilium/dev-clusters.yaml looks like this:

clustermesh:
  config:
    clusters:
      - address: orchid-dev-cilium-apiserver.dev.pocs.plural.sh
        name: orchid-dev
        port: 2379
      - address: orchid-dev-usw1-cilium-apiserver.dev.pocs.plural.sh
        name: orchid-dev-usw1
        port: 2379
      - address: orchid-dev-usw2-cilium-apiserver.dev.pocs.plural.sh
        name: orchid-dev-usw2
        port: 2379

Here's the global service that sets up the Helm chart (and references these values files):

apiVersion: deployments.plural.sh/v1alpha1
kind: GlobalService
metadata:
  name: cilium-dev
  namespace: infra
spec:
  mgmt: false
  tags:
    tier: dev # target only dev clusters
  template:
    name: cilium
    namespace: kube-system
    configurationRef:
      kind: Secret
      name: cilium-ca-cert # note this is referencing the secret we declared in the core-infra stack
      namespace: infra
    protect: false
    helm:
      version: "1.18.1"
      chart: cilium
      url: https://helm.cilium.io
      valuesFiles:
      - base.yaml.liquid # the two values files
      - dev-clusters.yaml
    git:
      folder: helm/cilium
      ref: main
    repositoryRef:
      kind: GitRepository
      name: infra
      namespace: infra

The cilium-cluster-registrar PR automation updates these tier-specific cluster files:

apiVersion: deployments.plural.sh/v1alpha1
kind: PrAutomation
metadata:
  name: cilium-cluster-registrar
spec:
  name: cilium-cluster-registrar
  documentation: Registers a new cluster with the Cilium cluster-mesh
  updates:
    yamlOverlays:
    - file: "helm/cilium/{{ context.tier }}-clusters.yaml"
      listMerge: APPEND
      yaml: |
        clustermesh:
          config:
            clusters:
            - name: {{ context.name }}
              port: 2379
              address: "{{ context.name }}-cilium-apiserver.{{ context.dnsZone }}"
  scmConnectionRef:
    name: plural
  title: "Registering {{ context.name }} cluster in {{ context.tier }} with Cilium cluster-mesh"
  message: "Registering {{ context.name }} cluster in {{ context.tier }} with Cilium cluster-mesh"
  identifier: mgmt
  configuration:
  - name: tier
    type: ENUM
    documentation: "the tier of the cluster"
    values:
      - dev
      - prod
  - name: name
    type: STRING
    documentation: The name of the cluster to register
  - name: dnsZone
    type: STRING
    documentation: The DNS zone of the cluster

This automation appends the new cluster to the existing YAML and generates a PR for approval, adding the cluster to the mesh. 
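
For example, registering a hypothetical orchid-dev-usw3 cluster in the dev tier would leave helm/cilium/dev-clusters.yaml looking roughly like this once the PR merges:

clustermesh:
  config:
    clusters:
      - address: orchid-dev-cilium-apiserver.dev.pocs.plural.sh
        name: orchid-dev
        port: 2379
      # ...existing entries...
      - name: orchid-dev-usw3   # appended by the cilium-cluster-registrar overlay
        port: 2379
        address: "orchid-dev-usw3-cilium-apiserver.dev.pocs.plural.sh"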

The Final Product

Once configured, cluster creation is extremely clean. Using the cilium-cluster-creator PR automation in the infra catalog, you complete a UI wizard, which then creates the PR to instantiate the stack. Everything runs automatically from there.

Conclusion

Setting up custom Kubernetes networking is never an easy process. From certificate authorities to DNS registration and config management, there are many concerns that need to be addressed from the ground up. We hope this walkthrough gave you a maintainable solution that ensures:

  • Declarative and Git-based workflows: No hidden sources of truth or complex scripts. 
  • Maintainability: All operations run via pre-defined PRs and UI wizards, minimizing misconfiguration. 
  • Observability: All Cilium instances are registered and monitored within the Plural UI, rather than being hidden in Helm charts.