Gemma 4 on RTX 3090: 4B vs 12B vs 26B Benchmark — What Happens When VRAM Runs Out
Three models, one GPU. These are Ollama 0.21.0 API measurements on a live RTX 3090 — no synthetic data, no clean-room conditions.
Hardware and Software
| Spec | Value |
|---|---|
| GPU | NVIDIA GeForce RTX 3090 |
| VRAM | 24 GB |
| Driver | 550.144.03 |
| CUDA | 12.4 |
| Compute Capability | 8.6 |
| OS | Ubuntu 22.04.5 LTS |
| Inference Engine | Ollama 0.21.0 |
VRAM at test time: ~15.1 GB free (9.1 GB occupied by desktop + a background Python process). This is a real-world condition, not a clean benchmark machine.
Models Tested
| Model | File Size | Quantization | Parameters |
|---|---|---|---|
| gemma4:e4b | 9.6 GB | Q4_K_M | 4B |
| gemma4:12b (llama-bench) | 6.86 GB | Q4_K_M | 11.91B |
| gemma4:26b | 17.99 GB | Q4_K_M | 25.8B |
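The 12B result is from a separate llama-bench test on the same GPU with a clean VRAM state. It is included for size comparison only: it used a different tool and a different VRAM state, so it is not directly comparable with the Ollama numbers.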
Benchmark Method
For the 4B and 26B models, each test used the Ollama REST API:
curl http://localhost:11434/api/generate -d '{
"model": "gemma4:26b",
"prompt": "<test prompt>",
"stream": false,
"options": {"num_predict": 128}
}'
Metrics extracted from the response:
- Prefill (pp): prompt_eval_count / (prompt_eval_duration / 1e9) t/s
- Generation (tg): eval_count / (eval_duration / 1e9) t/s
Each model was run once to warm up (load into memory), then measured on the second run.
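The warm-up-then-measure procedure can be scripted. The sketch below is one possible way to do it with only the Python standard library. The helper names `throughput`, `generate`, and `measure` are illustrative and not part of Ollama, and the code assumes the server is listening on the default `localhost:11434`.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default endpoint


def throughput(resp: dict) -> tuple[float, float]:
    """Return (prefill t/s, generation t/s) from an /api/generate response.

    Ollama reports durations in nanoseconds, hence the division by 1e9.
    """
    pp = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / 1e9)
    tg = resp["eval_count"] / (resp["eval_duration"] / 1e9)
    return pp, tg


def generate(model: str, prompt: str, num_predict: int = 128) -> dict:
    """Send one non-streaming generate request and return the parsed JSON."""
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": num_predict},
    }).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as r:
        return json.load(r)


def measure(model: str, prompt: str) -> tuple[float, float]:
    """Run once to load the model, then report the second run."""
    generate(model, prompt)  # warm-up run, result discarded
    return throughput(generate(model, prompt))
```

Calling `measure("gemma4:26b", "What is machine learning? Answer in one paragraph.")` returns the prefill and generation rates for the second run. Keeping `throughput` separate from the network call makes the rate calculation easy to check against a saved response.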
Two prompt lengths tested:
- Short: ~26 tokens ("What is machine learning? Answer in one paragraph.")
- Long: ~192 tokens (8× repeated passage about AI history)
Results
Prefill Speed (Prompt Processing)
| Model | Short Prompt (pp) | Long Prompt (pp) |
|---|---|---|
| gemma4:e4b (4B) | 2,893 t/s | 16,953 t/s |
| gemma4:12b (llama-bench) | 1,100–2,702 t/s | — |
| gemma4:26b | 291 t/s | 2,805 t/s |
Generation Speed (Token Output)
| Model | Short Prompt (tg) | Long Prompt (tg) |
|---|---|---|
| gemma4:e4b (4B) | 131.0 t/s | 129.7 t/s |
| gemma4:12b (llama-bench) | 70.3–70.7 t/s | — |
| gemma4:26b | 15.4 t/s | 16.4 t/s |
Why the 26B Is 8× Slower
The core issue is simple arithmetic:
Model file size: 17.99 GB
Free VRAM: 15.10 GB
Overflow to RAM: 2.89 GB
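When Ollama loads gemma4:26b, approximately 2.9 GB of model weights cannot fit in VRAM and are placed in system RAM. Ollama splits the model by layer, so the layers left in system RAM typically run on the CPU and read their weights at system RAM bandwidth, which is far below the RTX 3090's 936 GB/s VRAM bandwidth. Every generated token has to wait for those slower layers. The real overflow is also somewhat larger than 2.9 GB, because the KV cache and compute buffers need VRAM too.
The result: 15–16 t/s instead of the ~40–50 t/s you would expect on a clean 24GB card (an estimate, not measured in this test).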
The 4B and 12B models fit entirely in VRAM and their generation speed scales roughly as expected with model size.
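The overflow arithmetic, plus a rough bandwidth-bound estimate of generation speed, can be sketched as follows. This is a back-of-envelope model, not a description of how Ollama schedules work. It assumes a dense model in which every weight is read once per generated token, uses the RTX 3090's spec-sheet 936 GB/s VRAM bandwidth, and takes 50 GB/s for system RAM purely as an illustrative assumption. The function names are hypothetical.

```python
VRAM_BW_GBPS = 936.0  # RTX 3090 spec-sheet memory bandwidth
RAM_BW_GBPS = 50.0    # ASSUMPTION: illustrative system RAM bandwidth


def overflow_gb(model_gb: float, free_vram_gb: float) -> float:
    """Weights that do not fit in free VRAM (ignores KV cache and buffers)."""
    return max(0.0, model_gb - free_vram_gb)


def est_tokens_per_s(model_gb: float, free_vram_gb: float) -> float:
    """Rough generation speed if each weight is read once per token."""
    spilled = overflow_gb(model_gb, free_vram_gb)
    on_gpu = model_gb - spilled
    # Time per token = time to stream the GPU part + time to stream the RAM part.
    seconds_per_token = on_gpu / VRAM_BW_GBPS + spilled / RAM_BW_GBPS
    return 1.0 / seconds_per_token


print(f"overflow: {overflow_gb(17.99, 15.10):.2f} GB")
print(f"shared GPU estimate: {est_tokens_per_s(17.99, 15.10):.1f} t/s")
print(f"clean GPU estimate:  {est_tokens_per_s(17.99, 24.0):.1f} t/s")
```

Under these assumptions the shared-GPU estimate lands in the low teens and the clean-GPU estimate at roughly 50 t/s, the same ballpark as the measured 15–16 t/s and the ~40–50 t/s expectation. The point of the sketch is that a small spilled fraction dominates the per-token time, not that the model predicts exact numbers.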
Generation Speed Comparison
gemma4:e4b (4B) ████████████████████████████████ 131 t/s ✓ full VRAM
gemma4:12b (12B) ██████████████ 70 t/s ✓ full VRAM
gemma4:26b (26B) ████ 16 t/s ✗ partial CPU
What This Means for RTX 3090 Users
4B model (gemma4:e4b):
- 131 t/s — roughly 26–43× faster than human reading speed
- 9.6 GB fits comfortably, leaves room for other VRAM users
- Best choice for rapid iteration, coding assistants, or multi-model setups
12B model (gemma4:12b):
- 70 t/s with full VRAM and llama.cpp
- Balanced choice: significantly better reasoning than 4B, still fast
- Requires ~7 GB VRAM — easily fits even with desktop overhead
26B model (gemma4:26b):
- If your 3090 has ≥ 20 GB free: expect ~40–50 t/s (full VRAM)
- If your 3090 is shared (like this test): 15–16 t/s, barely interactive
- Only worth running if VRAM is mostly clear and quality matters more than speed
Practical rule: For RTX 3090 with a typical desktop environment, the 12B model is the sweet spot — it fits in VRAM, generates at 70 t/s, and the quality gap versus 26B is meaningful but not enormous for most tasks.
Prefill Speed Note
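The short-prompt 26B prefill (291 t/s) looks artificially low. With only ~26 prompt tokens, fixed per-request overhead (and possibly residual load cost despite the warm-up run) likely dominates the measured duration; the 4B model shows the same short-versus-long gap (2,893 vs 16,953 t/s). The long-prompt result (2,805 t/s) is more representative of actual prefill performance when the model is already in memory. For conversational use, prompt processing completes in well under a second regardless of model size.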
Reproduce These Results
# Install Ollama and pull models
curl -fsSL https://ollama.com/install.sh | sh
ollama pull gemma4:e4b
ollama pull gemma4:26b
# Quick generation speed test via API
curl http://localhost:11434/api/generate -d '{
"model": "gemma4:26b",
"prompt": "What is machine learning? Answer in one paragraph.",
"stream": false,
"options": {"num_predict": 128}
}' | python3 -c "
import json, sys
d = json.load(sys.stdin)
tg = d['eval_count'] / (d['eval_duration'] / 1e9)
pp = d['prompt_eval_count'] / (d['prompt_eval_duration'] / 1e9)
print(f'Prefill: {pp:.1f} t/s | Generation: {tg:.1f} t/s')
"
Check your free VRAM before running the 26B:
nvidia-smi --query-gpu=memory.free,memory.total --format=csv,noheader
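If free VRAM is below 18 GB, the 17.99 GB model file cannot fit, so expect CPU offloading and slower generation. Because the KV cache and compute buffers also need VRAM, aim for about 20 GB free to keep the whole model on the GPU.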
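The same check can be automated before loading a model. The sketch below parses the `nvidia-smi` output shown above and compares it with the model file size. The helper names are hypothetical, the 2 GB headroom for KV cache and buffers is an assumed allowance rather than a measured figure, and the model sizes are treated as decimal GB.

```python
import subprocess


def parse_free_total_mib(line: str) -> tuple[int, int]:
    """Parse one 'free, total' CSV line, e.g. '15462 MiB, 24576 MiB'."""
    free, total = (int(field.split()[0]) for field in line.split(","))
    return free, total


def fits_in_vram(free_mib: int, model_gb: float, headroom_gb: float = 2.0) -> bool:
    """True if the model file plus a headroom allowance fits in free VRAM.

    headroom_gb is an ASSUMED allowance for KV cache and compute buffers.
    """
    free_gb = free_mib * 1024**2 / 1e9  # MiB -> decimal GB
    return free_gb >= model_gb + headroom_gb


def query_gpu(index: int = 0) -> tuple[int, int]:
    """Ask nvidia-smi for free and total memory (MiB) of one GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free,memory.total",
         "--format=csv,noheader", f"--id={index}"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_free_total_mib(out.strip().splitlines()[0])
```

For example, `fits_in_vram(query_gpu()[0], 17.99)` gives a quick yes/no for the 26B model on the first GPU. Splitting the parser from the `subprocess` call keeps the logic checkable on machines without an NVIDIA driver.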
Benchmarks collected June 8, 2026. Results vary with VRAM availability, Ollama version, and quantization. The 12B data is from a separate llama-bench session on the same GPU.