Self-Hosting LLMs on Kubernetes: NVIDIA Jetson + K3s

Large language models (LLMs) are ubiquitous, and there's a vibrant open source ecosystem with tons of implementations to choose from. But here's the thing: actually running them yourself is surprisingly difficult. This guide walks you through the basics of operating your own LLM infrastructure. We cover: 

  • Why you might want to run your own LLM or AI model.
  • How to set up a low-cost NVIDIA Jetson with K3s.
  • How to use Plural to deploy a K3s cluster remotely and manage the LLM via GitOps.

Why Self-Host an LLM?

Before we dive into the technical stuff, let's talk about whether you should even bother running your own LLM. There are three main forcing functions for self-hosting an LLM, ranked by increasing relevance: cost, data governance, and edge compatibility.

Cost

Cost might seem like the obvious reason, but it's rarely a compelling one. Running an LLM carries a lot of fixed cost because you need expensive GPU hardware for any decent model. Those fixed costs often overwhelm the on-demand cost of using a mainstream LLM API like OpenAI, especially since cloud providers amortize infrastructure costs across many users. You'd need serious scale (and I mean serious scale) to make the operational complexity worth it.

Data Governance

Data governance is also a common issue, but major cloud providers have mostly solved this. Amazon Bedrock, Google Vertex AI, and Azure OpenAI all offer strong business associate agreements (BAAs) around data usage. Even Anthropic and OpenAI offer similar protections. There are still edge cases where you can't trust vendors (think defense contracts), but they're rarer than you might assume.

Edge Compatibility

This is where things get interesting. Neural networks, including LLMs, have a superpower: portability. You can literally copy the weights and run them anywhere. This opens up commoditized access to extremely powerful machine learning in places where reliable internet isn't guaranteed—robotics operations, remote agriculture, manufacturing in rural areas, military/police operations, etc. When you need machine vision, automated decision-making, or other AI capabilities in these environments, a low-powered edge GPU with a deployed model is often your best bet. This use case is still emerging, but I anticipate it will grow exponentially as the technology matures. 

This tutorial focuses on this sort of deployment model, using:

  • NVIDIA Jetson: A small-scale, affordable GPU device (around $250 USD) that's somewhat indicative of edge hardware.
  • K3s: A Kubernetes distribution that uses minimal resources (can be backed by SQLite) and has good tooling for edge deployment and remote GitOps updates. 

The Setup

First, we need to install K3s on the Jetson and then install the Plural agent on it. This can be done from a terminal session on the device like so:

# necessary on jetson because a lot of drivers are pre-installed, and the nvidia container toolkit just inherits them
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd
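
# (optional sanity check, assuming a stock JetPack install) confirm the L4T release on the device;
# the dustynv/ollama image used later in this guide targets a specific L4T version (r36.x)
cat /etc/nv_tegra_release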

# set up k3s
# the kubeconfig flags let us use kubectl w/o sudo, and servicelb is disabled because it doesn't work with the iptables builtins on jetson devices
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="server --disable=servicelb --node-label nvidia.com/gpu.present=true --write-kubeconfig-mode=644 --write-kubeconfig=$HOME/.kube/config" sh - 

# install plural cli
export VERSION=0.12.29 # change to whichever version you prefer
curl -L https://github.com/pluralsh/plural-cli/releases/download/v"$VERSION"/plural-cli_"$VERSION"_Linux_arm64.tar.gz | tar zx

sudo mv ./plural /usr/local/bin/plural


# install plural agent
plural cd login # you'll want to create an access token for this
plural cd clusters bootstrap --name k3s-jetson --tag gpu=nvidia # or whatever name you wish
kubectl get pods -A --watch # if you want to wait for cluster to converge to healthy
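
Before moving on, a quick optional sanity check (an illustration, not part of the original script) that the node registered with the GPU label and that K3s picked up the NVIDIA runtime:

# the node should carry the label we passed via --node-label
kubectl get nodes -L nvidia.com/gpu.present

# k3s runs its own embedded containerd; its generated config should reference the nvidia runtime
# (the exact path can vary between k3s versions)
sudo grep nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml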

Once this script finishes, you should see a cluster named k3s-jetson in your Plural Console.

From there, you need to build out a GitOps setup for the NVIDIA GPU runtime and Ollama (a simple OSS LLM runtime harness). If you set up your install with plural up, add a file like bootstrap/globalservices/nvidia.yaml with the following:

apiVersion: deployments.plural.sh/v1alpha1
kind: GlobalService
metadata:
  name: nvidia-device-plugin # this sets up the nvidia tooling to expose gpus into any k8s cluster
spec:
  tags:
    gpu: nvidia # target only clusters with the gpu: nvidia tag, which we added to the jetson cluster above
  template:
    namespace: nvidia-device-plugin
    helm:
      url: https://nvidia.github.io/k8s-device-plugin
      chart: nvidia-device-plugin # jetson is not compatible with gpu-operator; on a more mainstream device, gpu-operator is recommended instead
      version: 0.17.4
      values:
        nfd:
          enabled: true
        gfd:
          enabled: false
        affinity: {}
---
apiVersion: deployments.plural.sh/v1alpha1
kind: GlobalService
metadata:
  name: gpu-setup
spec:
  tags:
    gpu: nvidia
  template:
    namespace: nvidia-device-plugin
    repositoryRef:
      name: infra
      namespace: infra
    git:
      folder: services/gpu # sets up the nvidia runtime class, which for some reason isn't part of the chart
      ref: main
---
apiVersion: deployments.plural.sh/v1alpha1
kind: GlobalService
metadata:
  name: ollama
spec:
  tags:
    gpu: nvidia
  template:
    namespace: ollama
    helm:
      url: https://helm.otwld.com/
      chart: ollama
      version: 1.x.x
      values:
        runtimeClassName: nvidia
        extraArgs: ["/bin/bash", "-c", "/start_ollama && tail -f /dev/null"]
        securityContext:
          privileged: true
        image:
          repository: dustynv/ollama
          tag: r36.4.0-cu128-24.04 # a specific container build of ollama that works on jetson
        ollama:
          gpu:
            enabled: true
        lifecycle:
          postStart: # the custom ollama container doesn't support the commands used by the chart's pre-built pull declarations
            exec:
              command:
              - "/bin/sh"
              - "-c"
              - |
                ollama pull gemma3:270m # feel free to use any model you wish; most jetsons have limited memory, so 270m is recommended

You'll also need the runtime class definition at services/gpu/runtimeclass.yaml:

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
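
Once the device plugin converges, you can confirm the GPU is actually visible to the scheduler (an optional check, not part of the original guide) by looking for the nvidia.com/gpu resource on the node:

# the jetson node should now advertise nvidia.com/gpu under its Capacity/Allocatable resources
kubectl describe node | grep -i 'nvidia.com/gpu'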

That's it! Once everything converges to a healthy state, you can test it out. Here's a simple way using Open WebUI:

plural cd clusters kubeconfig @k3s-jetson # get kubeconfig access through our k8s proxy
kubectl port-forward svc/ollama-ollama 11434:http -n ollama # port-forward to ollama running on the jetson

# run this in another terminal since the port-forward is blocking
docker run -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data ghcr.io/open-webui/open-webui:main

Visit http://localhost:3000, choose the preloaded Ollama gemma3 model, and start chatting.
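
If you'd rather skip the UI, you can also hit the Ollama API directly over the same port-forward (a quick illustration; swap in whichever model you pulled above):

curl http://localhost:11434/api/generate -d '{
  "model": "gemma3:270m",
  "prompt": "Why is the sky blue?",
  "stream": false
}'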

Plural's Kubernetes proxy handles the complex networking between your laptop and the Jetson. As long as both you and the agent can reach your Plural Console, Plural will securely proxy you into the remote cluster.

Why Kubernetes At All?

You're probably wondering: if this is an edge use case, why add the complexity of Kubernetes? Seems like overkill for something that should be simple, right? There are two main reasons.

First, containerization is the way to go, and Kubernetes is the most mature orchestration platform. The benefits of containers over a straight OS install are obvious, especially portability across heterogeneous hardware. Sure, you could just use Docker Compose, but Kubernetes has far more features and a massive ecosystem of operators and Helm charts. 

The bigger reason, though, is deployment simplicity. In particular, it comes from realizing that:

GitOps pull deployments + edge Kubernetes == trivial over-the-air updates

The typical open source approach is installing something like Argo or Flux (with Argo being the more resource-intensive of the two) on each device and having it sync from Git continuously. Plural does this too, but with some notable advantages:

  • Built to be an agent from day one: Guaranteed low resource usage, ideal for suboptimal hardware and network resources.
  • No credential management issues: You don't need to distribute Git or Helm credentials to every agent. Everything gets proxied through Plural's management API and cached with a CDN-like architecture for high-performance delivery of GitOps manifests.
  • Full UI support: Unlike Flux or Argo in pull mode (which is what you need for edge cases), you get complete visibility into all your GitOps resources. 

The main downside of Kubernetes is resource utilization. Distributions like K3s address this by using low-cost datastores like SQLite and by avoiding heavy operations like container image pulls, for example by pre-loading images at the node level (sketched below). That said, every team needs to weigh these tradeoffs for their specific situation—there's no universal right answer.
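
As a concrete sketch of that image pre-loading trick (K3s's documented auto-import directory; not something this tutorial strictly requires), you can bake the Ollama image onto the device so the node never has to pull it over the network:

# on a machine with good connectivity, export the image used above to a tarball
docker save -o ollama.tar dustynv/ollama:r36.4.0-cu128-24.04

# copy the tarball to the jetson, then drop it where k3s auto-imports images on startup
sudo mkdir -p /var/lib/rancher/k3s/agent/images/
sudo cp ollama.tar /var/lib/rancher/k3s/agent/images/
sudo systemctl restart k3s

With the image baked in ahead of time, the only thing the device has to pull over the network is the GitOps manifests themselves, which are tiny by comparison.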