Self-Hosting LLMs on Kubernetes: NVIDIA Jetson + K3s
Large language models (LLMs) are ubiquitous, and there's a vibrant open source ecosystem with tons of implementations to choose from. But here's the thing: actually running them yourself is surprisingly difficult. This guide walks you through the basics of operating your own LLM infrastructure. We cover:
- Why you might want to run your own LLM or AI model.
- How to set up a low-cost NVIDIA Jetson with K3s.
- How to use Plural to deploy a K3s cluster remotely and manage the LLM via GitOps.
Why Self-Host an LLM?
Before we dive into the technical stuff, let's talk about whether you should even bother running your own LLM. There are three main forcing functions for self-hosting an LLM, ranked by increasing relevance: cost, data governance, and edge compatibility.
Cost
Cost might seem like the obvious driver, but it's actually a fairly rare one. Running an LLM has a lot of fixed costs because you need expensive GPU hardware for any decent model. Those costs often overwhelm the on-demand cost of using a mainstream LLM API like OpenAI, especially since cloud providers let you share infrastructure costs with other users. You'd need serious scale (and I mean serious scale) to make the operational complexity worth it.
Data Governance
Data governance is also a common issue, but major cloud providers have mostly solved this. Amazon Bedrock, Google Vertex AI, and Azure OpenAI all offer strong business associate agreements (BAAs) around data usage. Even Anthropic and OpenAI offer similar protections. There are still edge cases where you can't trust vendors (think defense contracts), but they’re rarer than you might assume.
Edge Compatibility
This is where things get interesting. Neural networks, including LLMs, have a superpower: portability. You can literally copy the weights and run them anywhere. This opens up commoditized access to extremely powerful machine learning in places where reliable internet isn't guaranteed—robotics operations, remote agriculture, manufacturing in rural areas, military/police operations, etc. When you need machine vision, automated decision-making, or other AI capabilities in these environments, a low-powered edge GPU with a deployed model is often your best bet. This use case is still emerging, but I anticipate it will grow exponentially as the technology matures.
This tutorial focuses on this sort of deployment model, using:
- NVIDIA Jetson: A small-scale, affordable GPU device (around $250 USD) that's somewhat indicative of edge hardware.
- K3s: A Kubernetes distribution that uses minimal resources (can be backed by SQLite) and has good tooling for edge deployment and remote GitOps updates.
The Setup
First, we need to install K3s on the Jetson and register it with Plural by installing the deployment agent. This can be done via a terminal session like so:
```sh
# necessary on jetson because a lot of drivers are pre-installed,
# and the device toolkit just inherits them

# set up k3s
curl -sfL https://get.k3s.io | sh -

# install plural cli
mv ./plural /usr/local/bin/plural
```
Once this script finishes, you should see a cluster named `k3s-jetson` in your Plural Console.
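Before moving on, you can sanity-check the install from a shell on the Jetson itself. This is just a quick sketch; it uses K3s's bundled kubectl and the default kubeconfig K3s writes to /etc/rancher/k3s/k3s.yaml:

```sh
# confirm the node registered and core workloads (including the Plural agent) are healthy
sudo k3s kubectl get nodes -o wide
sudo k3s kubectl get pods -A
```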
From there, you need to build a GitOps setup of the NVIDIA GPU runtime and Ollama (a simple OSS LLM runtime harness). If you set up your install with `plural up`, add something like `bootstrap/globalservices/nvidia.yaml` with the following:
```yaml
apiVersion: deployments.plural.sh/v1alpha1
kind: GlobalService
metadata:
  name: nvidia-device-plugin # this sets up the nvidia tooling to expose gpus into any k8s cluster
spec:
  tags:
    gpu: nvidia # target only clusters with the gpu: nvidia tag, which we added to the jetson cluster above
  template:
    namespace: nvidia-device-plugin
    helm:
      url: https://nvidia.github.io/k8s-device-plugin
      chart: nvidia-device-plugin
      # jetson is not compatible with gpu operator,
      # if you're running on a more mainstream device, gpu-operator is recommended
      version: 0.17.4
      values:
        nfd:
          enabled: true
        gfd:
          enabled: false
        affinity: {}
---
apiVersion: deployments.plural.sh/v1alpha1
kind: GlobalService
metadata:
  name: gpu-setup
spec:
  tags:
    gpu: nvidia
  template:
    namespace: nvidia-device-plugin
    repositoryRef:
      name: infra
      namespace: infra
    git:
      folder: services/gpu # will set up the nvidia runtime class, for some reason not part of the chart
      ref: main
---
apiVersion: deployments.plural.sh/v1alpha1
kind: GlobalService
metadata:
  name: ollama
spec:
  tags:
    gpu: nvidia
  template:
    namespace: ollama
    helm:
      url: https://helm.otwld.com/
      chart: ollama
      version: 1.x.x
      values:
        runtimeClassName: nvidia
        extraArgs: ["/bin/bash", "-c", "/start_ollama && tail -f /dev/null"]
        securityContext:
          privileged: true
        image:
          repository: dustynv/ollama
          tag: r36.4.0-cu128-24.04 # this is a specific container for ollama that works with jetson
        ollama:
          gpu:
            enabled: true
        lifecycle:
          postStart:
            # the custom ollama container doesn't support the commands used by the chart's pre-built pull declarations
            exec:
              command:
                - "/bin/sh"
                - "-c"
                - |
                  ollama pull gemma3:270m # feel free to use any model you wish, most jetsons have limited memory so 270m is recommended
```
You'll also need the runtime class definition at `services/gpu/runtimeclass.yaml`:
```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
```
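Once those services sync, a quick way to confirm the GPU is actually exposed to Kubernetes (a sketch, assuming kubectl access to the Jetson, either locally or through the Plural proxy shown below):

```sh
# the runtime class the ollama pod will request
kubectl get runtimeclass nvidia

# device plugin pods should be running, and the node should advertise nvidia.com/gpu
kubectl get pods -n nvidia-device-plugin
kubectl describe node | grep -i 'nvidia.com/gpu'
```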
That's it! Once everything converges to a healthy state, you can test it out. Here's a simple way using Open WebUI:
```sh
# get kubeconfig access through our k8s proxy
plural cd clusters kubeconfig @k3s-jetson

# port-forward to ollama running on the jetson
kubectl port-forward svc/ollama-ollama 11434:http -n ollama

# run this in another terminal since the port-forward is blocking
docker run -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data ghcr.io/open-webui/open-webui:main
```
Visit http://localhost:3000, choose the preloaded Ollama gemma3 model, and start chatting.
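If you'd rather skip the UI, you can also hit the Ollama HTTP API directly through the same port-forward. A minimal sketch; the model name matches the gemma3:270m pulled above, and the prompt is just an example:

```sh
# send a one-off, non-streaming prompt to the model running on the Jetson
curl http://localhost:11434/api/generate -d '{
  "model": "gemma3:270m",
  "prompt": "Give me a one-sentence status report.",
  "stream": false
}'
```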
Plural's Kubernetes proxy handles the complex networking between your laptop and the Jetson. As long as both the agent and you can reach your Plural Console, Plural will securely proxy you into the remote cluster.
Why Kubernetes At All?
You're probably wondering: if this is an edge use case, why add the complexity of Kubernetes? Seems like overkill for something that should be simple, right? There are two main reasons.
First, containerization is the way to go, and Kubernetes is the most mature orchestration platform. The benefits of containers over a straight OS install are obvious, especially portability across heterogeneous hardware. Sure, you could just use Docker Compose, but Kubernetes has far more features and a massive ecosystem of operators and Helm charts.
The bigger reason, though, is deployment simplicity. In particular, it comes from realizing that:
```
GitOps pull deployments + edge Kubernetes == trivial over-the-air updates
```
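Concretely, an over-the-air update in this model is just a Git commit. As a sketch (the replacement model tag here is only an example), swapping the model a fleet of Jetsons runs is a one-line change to the GlobalService file from earlier, followed by a push; every cluster tagged gpu: nvidia pulls and applies it on its next sync:

```sh
# swap the model pulled in the ollama GlobalService (replacement tag is hypothetical)
sed -i 's/gemma3:270m/qwen2.5:0.5b/' bootstrap/globalservices/nvidia.yaml

git commit -am "swap edge model"
git push   # every agent converges to the new spec on its next sync
```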
The typical open source approach is installing something like Argo or Flux (with Argo being the most resource-intensive) on each device and having them sync from Git continuously. Plural does this too, but with some notable advantages:
- Built to be an agent from day one: Guaranteed low resource usage, ideal for constrained hardware and unreliable networks.
- No credential management issues: You don't need to distribute Git or Helm credentials to every agent. Everything gets proxied through Plural's management API and cached with a CDN-like architecture for high-performance delivery of GitOps manifests.
- Full UI support: Unlike Flux or Argo in pull mode (which is what you need for edge cases), you get complete visibility into all your GitOps resources.
The main downside of Kubernetes is resource utilization. Distributions like K3s address this by using low-cost datastores like SQLite and providing ways to avoid heavy operations like Docker image pulls via pre-loading them at the node level. That said, every team needs to weigh these tradeoffs for their specific situation—there's no universal right answer.
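To make that last point concrete, here's a sketch based on K3s's documented air-gapped image loading, which lets you pre-stage heavy images like the Jetson Ollama container instead of pulling them over a weak link (the output filename is illustrative):

```sh
# export the image on a machine with good connectivity
docker save dustynv/ollama:r36.4.0-cu128-24.04 | gzip > ollama-jetson.tar.gz

# copy it onto the Jetson; K3s imports tarballs in this directory when the service starts
sudo mkdir -p /var/lib/rancher/k3s/agent/images/
sudo cp ollama-jetson.tar.gz /var/lib/rancher/k3s/agent/images/
```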