Local AI on Your Own Hardware — What the Benchmarks Actually Show

You don't need a subscription to run AI. You need a GPU, 16GB of VRAM, and twenty minutes.

Here's what running local LLMs actually looks like — before and after migrating from an RTX 5070 on Windows to an RX 9070 on Ubuntu 26.04. Same models, same quantization, same prompts. Real numbers.

The Setup

Before: RTX 5070, Windows 11, Ollama via CUDA.

After: PowerColor Hellhound RX 9070 (RDNA4), Ubuntu 26.04 LTS, Ollama via ROCm.

One environment variable makes it work: HSA_OVERRIDE_GFX_VERSION=12.0.1 in /etc/environment. That's the entire Linux-specific configuration.
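If you want to confirm the override is actually present in the file, a minimal sketch like this one works. The parser and the sample file contents (including the PATH line) are illustrative assumptions, not output from the benchmark machine.

```python
def parse_environment(text: str) -> dict:
    """Parse KEY=VALUE lines in the style of /etc/environment."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"')
    return env


# Hypothetical file contents; only the HSA line comes from the article.
sample = 'PATH="/usr/local/bin:/usr/bin"\nHSA_OVERRIDE_GFX_VERSION=12.0.1\n'
env = parse_environment(sample)
print(env.get("HSA_OVERRIDE_GFX_VERSION"))
```

On a real machine you would pass in the contents of /etc/environment instead of the sample string.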

Both machines ran identical models at identical quantization. Q4_K_M throughout — the sweet spot between size and quality for 16GB VRAM.

The Numbers

ollama run --verbose · Q4_K_M · verified · † = 1,248 token prompt

Model	Platform	Eval tok/s	Prompt tok/s
llama3.1:8b	RTX 5070 · Win · CUDA	104.16	4,665
llama3.1:8b	RX 9070 · Ubuntu · ROCm	92.69	7,489 ↑
gemma3:12b	RTX 5070 · Win · CUDA	67.32	132
gemma3:12b	RX 9070 · Ubuntu · ROCm	53.71	579 ↑
gemma3:12b †	RX 9070 · Ubuntu · ROCm	52.00	10,700 ↑↑

Two numbers matter here. Eval rate is how fast the model generates tokens — what you feel as response speed. Prompt eval rate is how fast it ingests your input — what you feel when pasting a long document or a code block.
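Both rates are just token counts divided by durations. As a rough sketch, assuming you call Ollama's REST API instead of the CLI, the response carries prompt_eval_count, prompt_eval_duration, eval_count, and eval_duration (durations in nanoseconds), and the two rates fall out like this. The sample timings are made-up values chosen only to show the arithmetic, and the generate helper is defined but not called here.

```python
import json
import urllib.request

NS_PER_S = 1_000_000_000  # Ollama reports durations in nanoseconds


def rates(stats: dict) -> dict:
    """Return generation and prompt-ingestion throughput in tokens per second."""
    return {
        "eval_tok_s": stats["eval_count"] * NS_PER_S / stats["eval_duration"],
        "prompt_tok_s": stats["prompt_eval_count"] * NS_PER_S / stats["prompt_eval_duration"],
    }


def generate(model: str, prompt: str, host: str = "http://localhost:11434") -> dict:
    """Send one non-streaming request to a local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        f"{host}/api/generate", data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Hypothetical timings, not benchmark results.
sample = {
    "prompt_eval_count": 1248,
    "prompt_eval_duration": 2_000_000_000,  # 2 s
    "eval_count": 300,
    "eval_duration": 6_000_000_000,  # 6 s
}
result = rates(sample)
print(result)
```

With a server running, rates(generate("gemma3:12b", "your prompt")) should give figures comparable to what ollama run --verbose prints.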

What the Numbers Mean

Eval rate: CUDA wins. 104 vs 93 tok/s on llama3.1:8b, 67 vs 54 on gemma3:12b. The difference is real but hard to notice in normal use: at either rate the model generates text faster than most people read it.

Prompt eval rate: ROCm wins, and the gap grows with context size.

[Chart: gemma3:12b prompt eval rate, context scaling. RX 9070, short prompt: 579 tok/s. RX 9070, 1,248 token prompt: 10,700 tok/s. RTX 5070 CUDA: 132 tok/s.]

The longer the prompt, the harder AMD pulls ahead. At 1,248 tokens the RX 9070 hit 10,700 tok/s on gemma3:12b, about 81x the 132 tok/s recorded on the RTX 5070.
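The ratios are easy to check yourself. This short sketch only divides the gemma3:12b prompt eval rates from the table; the variable names are mine.

```python
# Prompt eval rates (tok/s) for gemma3:12b, copied from the benchmark table.
rtx_5070 = 132
rx_9070_short = 579
rx_9070_long = 10_700  # the 1,248 token prompt run

short_ratio = rx_9070_short / rtx_5070
long_ratio = rx_9070_long / rtx_5070

print(f"short prompt: {short_ratio:.1f}x")
print(f"1,248 token prompt: {long_ratio:.0f}x")
```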

The Context Scaling Story

This is the part the benchmarks usually miss.

The 10,700 tok/s run used a 1,248 token prompt, a real-world input rather than a one-liner. The GPU pipeline had something to chew on. The result: the RX 9070's lead in prompt processing grows as context grows.

At short prompts ROCm is already faster. At longer prompts it becomes a different class of hardware entirely. The RTX 5070 ingested the same input at 132 tok/s. The RX 9070 did it at 10,700.

This matters for the workloads that actually justify running a local model:

Pasting a codebase section for review
Feeding a long document for summarization
Multi-turn conversations with deep context history
Anything using gemma3:12b's 131,072 token native context

The eval rate — what CUDA wins on — measures how fast you get tokens back once the model starts generating. That gap is 11 tok/s on llama3.1:8b. Perceptible only if you're staring at a stopwatch.

The prompt eval rate — what ROCm dominates — measures how fast the model processes what you sent. At real-world context lengths, that gap is measured in orders of magnitude.
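To see how the two rates trade off, split a response into time spent ingesting the prompt and time spent generating. The sketch below is a simplified model under stated assumptions: it ignores model load time, uses the gemma3:12b rates from the table, takes the article's 132 tok/s figure for the RTX 5070 at this prompt length at face value, and picks 300 output tokens as an arbitrary reply length.

```python
def response_seconds(prompt_tokens, output_tokens, prompt_tok_s, eval_tok_s):
    """Return (ingest_seconds, generate_seconds) for one request."""
    ingest = prompt_tokens / prompt_tok_s
    generate = output_tokens / eval_tok_s
    return ingest, generate


PROMPT_TOKENS = 1248  # the long prompt from the benchmark
OUTPUT_TOKENS = 300   # assumed reply length, for illustration only

rtx = response_seconds(PROMPT_TOKENS, OUTPUT_TOKENS, 132, 67.32)
rx = response_seconds(PROMPT_TOKENS, OUTPUT_TOKENS, 10_700, 52.00)

for name, (ingest, generate) in (("RTX 5070", rtx), ("RX 9070", rx)):
    print(f"{name}: ingest {ingest:.2f}s + generate {generate:.2f}s = {ingest + generate:.2f}s")
```

Under these assumptions the slower eval rate costs the RX 9070 a little over a second on generation, while the prompt eval gap saves it several seconds before the first token appears.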

What Fits in 16GB

With 16GB of VRAM, memory stops being the constraint for most practical use:

Model	VRAM Used	Capability
llama3.1:8b Q4_K_M	5.5GB	Fast. Tool use. Good for coding.
gemma3:12b Q4_K_M	9.9GB	Vision. 131k context. Better reasoning.
Qwen2.5 14B Q4_K_M	~9GB	Fits clean. Strong at code.
llama3 70B Q4_K_M	~40GB	Needs offloading — partial GPU.

At 16GB you can run gemma3:12b with room to spare and push the context window well past the 4096 default. That headroom is where the RX 9070 pulls ahead of 12GB cards in real use.
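Here is a hedged sketch of what using that headroom might look like. It subtracts the VRAM figures in the table from 16GB, then builds a request body for Ollama's /api/generate endpoint with a larger num_ctx option. The 16384 value and the prompt text are arbitrary examples, and whether a given context size actually fits depends on the model's cache growth, which this sketch does not estimate.

```python
import json

VRAM_GB = 16.0
MODEL_GB = {"llama3.1:8b": 5.5, "gemma3:12b": 9.9}  # "VRAM Used" from the table


def headroom_gb(model: str) -> float:
    """VRAM left for a larger context window after the model is loaded."""
    return round(VRAM_GB - MODEL_GB[model], 1)


def request_body(model: str, prompt: str, num_ctx: int) -> str:
    """JSON body for POST /api/generate with a non-default context window."""
    return json.dumps(
        {"model": model, "prompt": prompt, "stream": False, "options": {"num_ctx": num_ctx}}
    )


print(headroom_gb("gemma3:12b"))
body = request_body("gemma3:12b", "Summarize this file.", 16384)
print(body)
```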

The Vision Demo

gemma3:12b ships with vision capability. Feed it an image locally — no API call, no data leaving the machine.

gemma3:12b · local vision inference · RX 9070 · no API · no data leaving machine

Mid-century modern chair. Chrome swivel mechanism. Worn red upholstery. Debris on the floor. Signs of abandonment.

Identified from a local photo. Entirely on-device. Zero cloud. This is the ownit.run thesis in one demo.

Private inference, on hardware you own, with no usage limits and no one logging the query.
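If you would rather script this than use the CLI, Ollama's /api/generate endpoint accepts images as base64 strings in an images list. The following is a minimal sketch of building that request body; the prompt text and the placeholder bytes are assumptions standing in for a real photo, and nothing here sends a request.

```python
import base64
import json


def vision_body(model: str, prompt: str, image_bytes: bytes) -> str:
    """JSON body for POST /api/generate with one attached image."""
    image_b64 = base64.b64encode(image_bytes).decode("ascii")
    return json.dumps(
        {"model": model, "prompt": prompt, "images": [image_b64], "stream": False}
    )


# Placeholder bytes; in practice use open("photo.jpg", "rb").read().
fake_image = b"\x89PNG placeholder bytes"
body = vision_body("gemma3:12b", "Describe this photo.", fake_image)
print(len(body))
```

The body goes to the local server only, so the image never leaves the machine.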

The CLI Is Identical

This matters if you're migrating. Ollama syntax is the same everywhere:

ollama run llama3.1:8b
ollama run gemma3:12b
ollama list
ollama ps

Windows PowerShell, Linux bash, macOS zsh. No relearning. Pull your models once, run them anywhere.

The Honest Summary

CUDA is still faster at pure token generation. If raw eval speed is the only metric, NVIDIA wins — by 11 tok/s on llama3.1:8b.

CUDA's lead is about 11% on llama3.1:8b and about 20% on gemma3:12b, on the metric you notice least. At real-world context lengths, AMD's measured lead reaches 81x on the metric that matters most.

The RX 9070 on Linux is faster at prompt ingestion, fits the same models, costs less than an RTX 5080, runs on open-source drivers that will never be end-of-life'd, and doesn't require a proprietary kernel module on every boot.

Your queries stay on your machine. Your models run locally. Your hardware works because Linux exists, not because NVIDIA decided it should.

That's worth 11 tokens per second.

Benchmarks run with ollama run --verbose. Hardware: PowerColor Hellhound RX 9070 16GB · Intel Core Ultra 7 265K · 64GB DDR5 5600 · Ubuntu 26.04 LTS kernel 7.0. ROCm via HSA_OVERRIDE_GFX_VERSION=12.0.1. † denotes 1,248 token prompt run.