You don't need a subscription to run AI. You need a GPU, 16GB of VRAM, and twenty minutes.
Here's what running local LLMs actually looks like — before and after migrating from an RTX 5070 on Windows to an RX 9070 on Ubuntu 26.04. Same models, same quantization, same prompts. Real numbers.
The Setup
Before: RTX 5070, Windows 11, Ollama via CUDA.
After: PowerColor Hellhound RX 9070 (RDNA4), Ubuntu 26.04 LTS, Ollama via ROCm.
One environment variable to make it work: HSA_OVERRIDE_GFX_VERSION=12.0.1 in /etc/environment. That's the entire Linux-specific configuration.
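A quick way to confirm the override is actually in the file is to read it back. This is a minimal sketch, assuming the plain `KEY=VALUE` layout of `/etc/environment`; the helper name `read_override` is hypothetical and not part of ROCm or Ollama.

```python
from typing import Optional


def read_override(env_text: str, key: str = "HSA_OVERRIDE_GFX_VERSION") -> Optional[str]:
    """Return the value assigned to `key` in /etc/environment-style text, or None."""
    value = None
    for raw in env_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        name, sep, rest = line.partition("=")
        if sep and name.strip() == key:
            # Drop optional surrounding quotes; a later assignment overrides an earlier one.
            value = rest.strip().strip("\"'")
    return value
```

Calling `read_override(open("/etc/environment").read())` should return `"12.0.1"` on a machine configured as described above, and `None` if the line is missing.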
Both machines ran identical models at identical quantization. Q4_K_M throughout — the sweet spot between size and quality for 16GB VRAM.
The Numbers
| Model | Platform | Eval tok/s | Prompt tok/s |
|---|---|---|---|
| llama3.1:8b | RTX 5070 · Win · CUDA | 104.16 | 4,665 |
| llama3.1:8b | RX 9070 · Ubuntu · ROCm | 92.69 | 7,489 ↑ |
| gemma3:12b | RTX 5070 · Win · CUDA | 67.32 | 132 |
| gemma3:12b | RX 9070 · Ubuntu · ROCm | 53.71 | 579 ↑ |
| gemma3:12b † | RX 9070 · Ubuntu · ROCm | 52.00 | 10,700 ↑↑ |
Two numbers matter here. Eval rate is how fast the model generates tokens — what you feel as response speed. Prompt eval rate is how fast it ingests your input — what you feel when pasting a long document or a code block.
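Both rates can also be derived from the timing fields Ollama's REST API returns with a completed response: `prompt_eval_count`, `prompt_eval_duration`, `eval_count`, and `eval_duration`, with durations in nanoseconds. The sketch below is one way to compute them. The function names are illustrative, and the guard for a missing or zero duration is an assumption about how you might want to handle incomplete stats.

```python
from typing import Optional


def _rate(count: int, duration_ns: int) -> Optional[float]:
    """Tokens per second from a token count and a duration in nanoseconds."""
    if not duration_ns:
        return None  # no timing reported for this phase
    return count / (duration_ns / 1e9)


def token_rates(stats: dict) -> dict:
    """Compute prompt and eval rates from an Ollama response's timing fields."""
    return {
        "prompt_tok_s": _rate(stats.get("prompt_eval_count", 0),
                              stats.get("prompt_eval_duration", 0)),
        "eval_tok_s": _rate(stats.get("eval_count", 0),
                            stats.get("eval_duration", 0)),
    }
```

Pass it the parsed JSON of a non-streaming response, or the final chunk of a streaming one.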
What the Numbers Mean
Eval rate: CUDA wins. 104 vs 93 tok/s on llama3.1:8b, 67 vs 54 on gemma3:12b (52 on the long-prompt run). The difference is real but hard to notice in normal use: both generate text faster than you can read it.
Prompt eval rate: ROCm wins, and the gap grows with context size.
The Context Scaling Story
This is the part the benchmarks usually miss.
The 10,700 tok/s run used a 1,248 token prompt: a real-world input, not a one-liner. The GPU pipeline had something to chew on. The result: the RX 9070's prompt throughput keeps climbing as the prompt grows, from 579 tok/s on the short prompt to 10,700 on the long one.
At short prompts ROCm is already faster. At longer prompts it becomes a different class of hardware entirely. The RTX 5070's gemma3:12b figure in the table is 132 tok/s. The RX 9070 reached 10,700 on the 1,248 token prompt.
This matters for the workloads that actually justify running a local model:
- Pasting a codebase section for review
- Feeding a long document for summarization
- Multi-turn conversations with deep context history
- Anything using gemma3:12b's 131,072 token native context
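For workloads like these, what you feel is the wait before the first token appears, which is roughly prompt length divided by prompt eval rate. A small sketch using the gemma3:12b rates from the table. It assumes the rate stays constant for a given prompt, which the table itself suggests is only an approximation.

```python
def ingest_seconds(prompt_tokens: int, prompt_tok_s: float) -> float:
    """Approximate time spent ingesting the prompt before generation starts."""
    return prompt_tokens / prompt_tok_s


# gemma3:12b prompt eval rates taken from the table above.
RATES = {
    "RTX 5070 (CUDA)": 132,
    "RX 9070 (ROCm)": 579,
    "RX 9070 (ROCm, long-prompt run)": 10_700,
}

for label, rate in RATES.items():
    print(f"{label}: {ingest_seconds(1248, rate):.2f} s for a 1,248 token prompt")
```

For a 1,248 token prompt that works out to roughly 9.5 s, 2.2 s, and 0.12 s respectively.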
The eval rate — what CUDA wins on — measures how fast you get tokens back once the model starts generating. That gap is 11 tok/s on llama3.1:8b. Perceptible only if you're staring at a stopwatch.
The prompt eval rate — what ROCm dominates — measures how fast the model processes what you sent. At real-world context lengths, that gap is measured in orders of magnitude.
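The size of each gap can be worked out directly from the table. The sketch below uses only the short-prompt rows, since those are the runs with a counterpart on both cards. The dictionary layout and function name are illustrative.

```python
# (eval tok/s, prompt tok/s) per platform, copied from the results table.
RESULTS = {
    "llama3.1:8b": {"cuda": (104.16, 4665), "rocm": (92.69, 7489)},
    "gemma3:12b": {"cuda": (67.32, 132), "rocm": (53.71, 579)},
}


def compare(model: str) -> dict:
    """Eval gap (absolute and relative to CUDA) and ROCm prompt speedup."""
    cuda_eval, cuda_prompt = RESULTS[model]["cuda"]
    rocm_eval, rocm_prompt = RESULTS[model]["rocm"]
    return {
        "eval_gap_tok_s": cuda_eval - rocm_eval,
        "eval_gap_pct": (cuda_eval - rocm_eval) / cuda_eval * 100,
        "prompt_speedup": rocm_prompt / cuda_prompt,
    }


for model in RESULTS:
    c = compare(model)
    print(f"{model}: eval gap {c['eval_gap_tok_s']:.2f} tok/s "
          f"({c['eval_gap_pct']:.1f}%), prompt speedup {c['prompt_speedup']:.2f}x")
```

On these rows the eval gap is about 11% on llama3.1:8b and about 20% on gemma3:12b, while the prompt speedup is about 1.6x and 4.4x. The 10,700 tok/s long-prompt run is about 81x the 132 tok/s figure, though the two come from different prompts.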
What Fits in 16GB
With 16GB of VRAM, memory stops being the constraint for most practical use:
| Model | VRAM Used | Capability |
|---|---|---|
| llama3.1:8b Q4_K_M | 5.5GB | Fast. Tool use. Good for coding. |
| gemma3:12b Q4_K_M | 9.9GB | Vision. 131k context. Better reasoning. |
| Qwen2.5 14B Q4_K_M | ~9GB | Fits clean. Strong at code. |
| llama3 70B Q4_K_M | ~40GB | Needs offloading — partial GPU. |
At 16GB you can run gemma3:12b with room to spare and push the context window well past the 4096 default. That headroom is where the RX 9070 pulls ahead of 12GB cards in real use.
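That headroom gets spent on the KV cache, which grows linearly with context length. Below is a rough estimate for llama3.1:8b, assuming standard grouped-query attention and a 16-bit cache. The architecture figures (32 layers, 8 KV heads, head dimension 128) come from the published model configuration, not from this benchmark. gemma3 mixes local and global attention layers, so this formula does not carry over to it directly.

```python
def kv_cache_bytes(context_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Estimate KV cache size: keys and values for every layer and token."""
    # Leading 2 accounts for storing both keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_tokens


# Assumed llama3.1:8b architecture (from the public model config).
LLAMA31_8B = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for ctx in (4096, 32768, 131072):
    gib = kv_cache_bytes(ctx, **LLAMA31_8B) / 2**30
    print(f"{ctx:>7} tokens -> {gib:.1f} GiB KV cache")
```

Under these assumptions the cache costs 128 KiB per token: 0.5 GiB at 4,096 tokens, 4 GiB at 32,768, and 16 GiB at 131,072. On top of the 5.5GB of weights, a 32k context should fit in 16GB, while the full 128k would not without a smaller cache format or offloading.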
The Vision Demo
gemma3:12b ships with vision capability. Feed it an image locally — no API call, no data leaving the machine.
Private inference, on hardware you own, with no usage limits and no one logging the query.
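One way to send an image to the local model is through Ollama's REST API, which accepts base64-encoded images on `/api/generate`. This is a minimal sketch, assuming the server is listening on its default address `http://localhost:11434`. The function names and the default prompt are illustrative.

```python
import base64
import json
import urllib.request


def build_vision_request(prompt: str, image_bytes: bytes,
                         model: str = "gemma3:12b") -> dict:
    """Build the JSON payload for a single-image, non-streaming request."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # return one JSON object instead of a chunk stream
    }


def describe_image(path: str, prompt: str = "Describe this image.",
                   host: str = "http://localhost:11434") -> str:
    """Send a local image file to the Ollama server and return the reply text."""
    with open(path, "rb") as f:
        payload = build_vision_request(prompt, f.read())
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

With the model pulled and the server running, `describe_image("photo.jpg")` returns the model's description. The request only ever goes to localhost.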
The CLI Is Identical
This matters if you're migrating. Ollama syntax is the same everywhere:
ollama run llama3.1:8b
ollama run gemma3:12b
ollama list
ollama ps
Windows PowerShell, Linux bash, macOS zsh. No relearning. Pull your models once, run them anywhere.
The Honest Summary
CUDA is still faster at pure token generation. If raw eval speed is the only metric, NVIDIA wins — by 11 tok/s on llama3.1:8b.
The performance gap is roughly 11% on llama3.1:8b and 20% on gemma3:12b, on the metric you notice least. At real-world context lengths, AMD wins by orders of magnitude on the metric that matters most.
The RX 9070 on Linux is faster at prompt ingestion, fits the same models, costs less than an RTX 5080, runs on open-source drivers that will never be end-of-life'd, and doesn't require a proprietary kernel module on every boot.
Your queries stay on your machine. Your models run locally. Your hardware works because Linux exists, not because NVIDIA decided it should.
That's worth 11 tokens per second.
Benchmarks run with ollama run --verbose. Hardware: PowerColor Hellhound RX 9070 16GB · Intel Core Ultra 7 265K · 64GB DDR5 5600 · Ubuntu 26.04 LTS kernel 7.0. ROCm via HSA_OVERRIDE_GFX_VERSION=12.0.1. † denotes 1,248 token prompt run.