Gemma 4 on RTX 3090: 4B vs 12B vs 26B Benchmark — What Happens When VRAM Runs Out

Three models, one GPU. The 4B and 26B figures are Ollama 0.21.0 API measurements on a live RTX 3090, with no synthetic data and no clean-room conditions; the 12B figures come from a separate llama-bench run on the same card.

Hardware and Software

Spec	Value
GPU	NVIDIA GeForce RTX 3090
VRAM	24 GB
Driver	550.144.03
CUDA	12.4
Compute Capability	8.6
OS	Ubuntu 22.04.5 LTS
Inference Engine	Ollama 0.21.0

VRAM at test time: ~15.1 GB free (9.1 GB occupied by desktop + a background Python process). This is a real-world condition, not a clean benchmark machine.

Models Tested

Model	File Size	Quantization	Parameters
`gemma4:e4b`	9.6 GB	Q4_K_M	4B
`gemma4:12b` (llama-bench)	6.86 GB	Q4_K_M	11.91B
`gemma4:26b`	17.99 GB	Q4_K_M	25.8B

The 12B result comes from a separate llama-bench run on the same GPU with VRAM otherwise clear, so it was measured under different conditions from the two Ollama runs. It is included only for size comparison.

Benchmark Method

For the 4B and 26B models, each test used the Ollama REST API:

curl http://localhost:11434/api/generate -d '{
  "model": "gemma4:26b",
  "prompt": "<test prompt>",
  "stream": false,
  "options": {"num_predict": 128}
}'

Metrics extracted from the response (Ollama reports durations in nanoseconds, hence the division by 1e9):

Prefill (pp): prompt_eval_count / (prompt_eval_duration / 1e9) t/s
Generation (tg): eval_count / (eval_duration / 1e9) t/s
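The two formulas above can be sketched as a small helper. The field names are the ones used in the formulas; the sample numbers are hypothetical and only illustrate the arithmetic, they are not measurements from this benchmark.

```python
def tokens_per_second(count: int, duration_ns: int) -> float:
    """Convert a token count and a nanosecond duration into tokens/second."""
    if duration_ns <= 0:
        raise ValueError("duration must be positive")
    return count / (duration_ns / 1e9)


def throughput(response: dict) -> dict:
    """Return prefill (pp) and generation (tg) speed from a response dict."""
    return {
        "pp": tokens_per_second(response["prompt_eval_count"],
                                response["prompt_eval_duration"]),
        "tg": tokens_per_second(response["eval_count"],
                                response["eval_duration"]),
    }


# Hypothetical response values, chosen only to illustrate the arithmetic.
sample = {
    "prompt_eval_count": 26,
    "prompt_eval_duration": 13_000_000,   # 13 ms in nanoseconds
    "eval_count": 128,
    "eval_duration": 1_000_000_000,       # 1 s in nanoseconds
}
rates = throughput(sample)
print(f"Prefill: {rates['pp']:.1f} t/s | Generation: {rates['tg']:.1f} t/s")
```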

Each model was run once to warm up (load into memory), then measured on the second run.
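A minimal sketch of that warm-up-then-measure pattern, using only the Python standard library. The function names are illustrative and not part of Ollama; the endpoint and request body mirror the curl call above.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # endpoint from the curl call


def build_payload(model: str, prompt: str, num_predict: int = 128) -> bytes:
    """Encode the same JSON body the curl command sends."""
    body = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": num_predict},
    }
    return json.dumps(body).encode("utf-8")


def generate(model: str, prompt: str) -> dict:
    """POST one non-streaming request and return the parsed JSON response."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)


def measure_second_run(model: str, prompt: str) -> dict:
    """Run once to load the model, then return the response of the second run."""
    generate(model, prompt)          # warm-up: result discarded
    return generate(model, prompt)   # measured run
```

Calling `measure_second_run("gemma4:e4b", "What is machine learning? Answer in one paragraph.")` requires a running Ollama server. Note that repeating an identical prompt may let the server reuse cached prompt state, which this sketch does not control for.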

Two prompt lengths tested:

Short: ~26 tokens ("What is machine learning? Answer in one paragraph.")
Long: ~192 tokens (8× repeated passage about AI history)

Results

Prefill Speed (Prompt Processing)

Model	Short Prompt (pp)	Long Prompt (pp)
gemma4:e4b (4B)	2,893 t/s	16,953 t/s
gemma4:12b (llama-bench)	1,100–2,702 t/s	—
gemma4:26b	291 t/s	2,805 t/s

Generation Speed (Token Output)

Model	Short Prompt (tg)	Long Prompt (tg)
gemma4:e4b (4B)	131.0 t/s	129.7 t/s
gemma4:12b (llama-bench)	70.3–70.7 t/s	—
gemma4:26b	15.4 t/s	16.4 t/s

My Take

Honestly the 26B result caught me off guard. I expected slower, but not that slow — 15 t/s on a 3090 feels like running with the handbrake on. Once I saw the VRAM math it made sense, but it's a good reminder that model file size and available VRAM aren't the same thing, especially on a machine that's also running a desktop and background services.

Why the 26B Is 8× Slower Than the 4B

The core issue is simple arithmetic:

Model file size:  17.99 GB
Free VRAM:        15.10 GB
Overflow to RAM:   2.89 GB

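When Ollama loads gemma4:26b, approximately 2.9 GB of model weights cannot fit in VRAM and are placed in system RAM. Layers held there are typically run on the CPU rather than copied to the GPU on every pass, so each generated token has to wait on system memory, which delivers tens of GB/s against the 936 GB/s of the 3090's VRAM. Even if the weights were streamed to the GPU instead, PCIe 4.0 x16 tops out near 32 GB/s, roughly 30× slower than VRAM. The real shortfall is also somewhat larger than 2.9 GB, because the KV cache and compute buffers need VRAM too.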

The result: 15–16 t/s instead of the ~40–50 t/s you would expect with the whole model in VRAM on a 24 GB card (an estimate; that configuration was not measured here).
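As a rough sanity check on those numbers, here is a bandwidth-bound estimate. It is a sketch under loud assumptions: every weight is read exactly once per generated token (a dense model), compute time is ignored, the 936 GB/s figure is the RTX 3090's spec-sheet VRAM bandwidth, and the slow-path bandwidths are illustrative guesses rather than measurements from this machine.

```python
VRAM_BANDWIDTH_GBPS = 936.0   # RTX 3090 spec-sheet memory bandwidth


def overflow_gb(model_gb: float, free_vram_gb: float) -> float:
    """Weights that do not fit in free VRAM (ignores KV cache and buffers)."""
    return max(0.0, model_gb - free_vram_gb)


def estimated_tps(model_gb: float, free_vram_gb: float,
                  slow_path_gbps: float) -> float:
    """Upper-bound tokens/s if each token reads every weight exactly once."""
    spilled = overflow_gb(model_gb, free_vram_gb)
    resident = model_gb - spilled
    seconds_per_token = (resident / VRAM_BANDWIDTH_GBPS
                         + spilled / slow_path_gbps)
    return 1.0 / seconds_per_token


model_gb, free_gb = 17.99, 15.10
print(f"Overflow: {overflow_gb(model_gb, free_gb):.2f} GB")
# With free VRAM equal to the model size nothing spills, so the slow-path
# bandwidth passed here has no effect on the result.
print(f"All in VRAM: {estimated_tps(model_gb, model_gb, 1.0):.0f} t/s")
for gbps in (16.0, 32.0, 50.0):   # assumed slow-path bandwidths, not measured
    tps = estimated_tps(model_gb, free_gb, gbps)
    print(f"Slow path at {gbps:.0f} GB/s: {tps:.1f} t/s")
```

With these inputs the sketch gives an overflow of 2.89 GB, about 52 t/s as the ceiling with everything in VRAM, and about 5.1, 9.4, and 13.5 t/s for the three assumed slow-path bandwidths. It is a crude model, but it shows how a spill of about 16% of the weights can dominate the time per token.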

The 4B and 12B models fit entirely in VRAM and their generation speed scales roughly as expected with model size.

Generation Speed Comparison

gemma4:e4b  (4B)  ████████████████████████████████  131 t/s  ✓ full VRAM
gemma4:12b  (12B) █████████████████                 70  t/s  ✓ full VRAM
gemma4:26b  (26B) ████                              16  t/s  ✗ partial CPU

What This Means for RTX 3090 Users

4B model (gemma4:e4b):

131 t/s, roughly 26–43× faster than a human reading pace of about 3–5 tokens per second
9.6 GB fits comfortably, leaves room for other VRAM users
Best choice for rapid iteration, coding assistants, or multi-model setups

12B model (gemma4:12b):

70 t/s with full VRAM and llama.cpp
Balanced choice: significantly better reasoning than 4B, still fast
Requires ~7 GB VRAM for the weights (6.86 GB file) plus a little more for the KV cache, so it fits easily even with desktop overhead

26B model (gemma4:26b):

If your 3090 has ≥ 20 GB free: expect ~40–50 t/s (full VRAM)
If your 3090 is shared (like this test): 15–16 t/s, barely interactive
Only worth running if VRAM is mostly clear and quality matters more than speed

Practical rule: For an RTX 3090 with a typical desktop environment, the 12B model is the sweet spot. It fits in VRAM alongside the desktop, generates at about 70 t/s, and the quality gap versus 26B (which this benchmark does not measure) is meaningful but not enormous for most tasks.
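The rule can be written down as a small selection helper. This is a sketch, not a tested policy: the file sizes come from the Models Tested table, while the 2 GB headroom for KV cache and buffers is an assumption chosen to match the "≥ 20 GB free" guidance for the 26B, and `pick_model` is a hypothetical name.

```python
# (name, file size in GB) from the Models Tested table,
# ordered by parameter count, largest first.
MODELS = [
    ("gemma4:26b", 17.99),
    ("gemma4:12b", 6.86),
    ("gemma4:e4b", 9.6),
]

HEADROOM_GB = 2.0  # assumed allowance for KV cache and buffers, not measured


def pick_model(free_vram_gb: float, headroom_gb: float = HEADROOM_GB):
    """Return the largest-parameter model that fits in free VRAM, else None."""
    for name, size_gb in MODELS:
        if size_gb + headroom_gb <= free_vram_gb:
            return name
    return None


for free in (8.0, 15.1, 20.0):
    print(f"{free:.1f} GB free -> {pick_model(free)}")
```

With the 15.1 GB that was free during this test the helper picks `gemma4:12b`; at 20 GB free it picks `gemma4:26b`.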

Prefill Speed Note

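The short-prompt 26B prefill (291 t/s) looks artificially low. Cold loading is an unlikely explanation, since each model was warmed up before the measured run; more likely, a 26-token prompt is too short to amortize fixed per-request overhead, and at 291 t/s it still takes only about 0.09 s. The long-prompt result (2,805 t/s) is more representative of actual prefill performance when the model is already in memory. For conversational use, prompt processing completes in well under a second regardless of model size.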
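Converting the measured prefill rates back into wall-clock time makes the "well under a second" point concrete. The token counts and rates below are the approximate figures reported in the method and results sections; nothing else is assumed.

```python
# (model, prompt tokens, measured prefill t/s) from the prefill results table.
PREFILL = [
    ("gemma4:e4b", 26, 2893),
    ("gemma4:e4b", 192, 16953),
    ("gemma4:26b", 26, 291),
    ("gemma4:26b", 192, 2805),
]


def prefill_ms(tokens: int, tokens_per_second: float) -> float:
    """Time to process a prompt, in milliseconds."""
    return 1000.0 * tokens / tokens_per_second


for model, tokens, tps in PREFILL:
    print(f"{model:<11} {tokens:>3} tokens: {prefill_ms(tokens, tps):6.1f} ms")
```

Every case lands under 100 ms, and the short 26B prompt takes longer in absolute terms than the long one, which points to fixed overhead rather than raw throughput.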

Reproduce These Results

# Install Ollama and pull models
curl -fsSL https://ollama.com/install.sh | sh
ollama pull gemma4:e4b
ollama pull gemma4:26b

# Quick speed test via the API (run it twice and keep the second result, so the model is already loaded)
curl -s http://localhost:11434/api/generate -d '{
  "model": "gemma4:26b",
  "prompt": "What is machine learning? Answer in one paragraph.",
  "stream": false,
  "options": {"num_predict": 128}
}' | python3 -c "
import json, sys
d = json.load(sys.stdin)
tg = d['eval_count'] / (d['eval_duration'] / 1e9)
pp = d['prompt_eval_count'] / (d['prompt_eval_duration'] / 1e9)
print(f'Prefill: {pp:.1f} t/s | Generation: {tg:.1f} t/s')
"

Check your free VRAM before running the 26B:

nvidia-smi --query-gpu=memory.free,memory.total --format=csv,noheader

If free VRAM is below roughly 20 GB, expect CPU offloading and slower generation with the 26B: the weights alone take 17.99 GB, before the KV cache and compute buffers.
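The same check can be scripted. This sketch shells out to `nvidia-smi` with the `nounits` format option so the values come back as bare MiB numbers, then converts them to GiB. The function names are illustrative, and the sample line in the usage comment is hypothetical.

```python
import subprocess


def parse_memory_line(line: str) -> tuple[float, float]:
    """Parse a 'free, total' line in MiB into (free_gib, total_gib)."""
    free_mib, total_mib = (float(field) for field in line.split(","))
    return free_mib / 1024, total_mib / 1024


def query_gpu_memory(index: int = 0) -> tuple[float, float]:
    """Ask nvidia-smi for free and total memory of one GPU, in GiB."""
    output = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free,memory.total",
         "--format=csv,noheader,nounits", f"--id={index}"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_memory_line(output.strip().splitlines()[0])


# Usage on a machine with an NVIDIA GPU:
#   free_gib, total_gib = query_gpu_memory()
# Parsing a hypothetical output line without touching the GPU:
free_gib, total_gib = parse_memory_line("15462, 24576")
print(f"{free_gib:.1f} GiB free of {total_gib:.1f} GiB")
```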

Benchmarks collected June 8, 2026. Results vary with VRAM availability, Ollama version, and quantization. The 12B data is from a separate llama-bench session on the same GPU.