
Gemma 4 on RTX 3090: 4B vs 12B vs 26B Benchmark — What Happens When VRAM Runs Out

Tags: Gemma4, benchmark, RTX 3090, LLM, VRAM, Ollama, local inference

Three models, one GPU. The 4B and 26B numbers are Ollama 0.21.0 API measurements on a live RTX 3090, with no synthetic data and no clean-room conditions; the 12B numbers come from a separate llama-bench run on the same card.


Hardware and Software

| Spec | Value |
|---|---|
| GPU | NVIDIA GeForce RTX 3090 |
| VRAM | 24 GB |
| Driver | 550.144.03 |
| CUDA | 12.4 |
| Compute Capability | 8.6 |
| OS | Ubuntu 22.04.5 LTS |
| Inference Engine | Ollama 0.21.0 |

VRAM at test time: ~15.1 GB free (9.1 GB occupied by desktop + a background Python process). This is a real-world condition, not a clean benchmark machine.


Models Tested

| Model | File Size | Quantization | Parameters |
|---|---|---|---|
| gemma4:e4b | 9.6 GB | Q4_K_M | 4B |
| gemma4:12b (llama-bench) | 6.86 GB | Q4_K_M | 11.91B |
| gemma4:26b | 17.99 GB | Q4_K_M | 25.8B |

The 12B result is from a separate llama-bench test on the same GPU with a clean VRAM state; it is included here for size comparison.


Benchmark Method

For the 4B and 26B models, each test used the Ollama REST API:

curl http://localhost:11434/api/generate -d '{
  "model": "gemma4:26b",
  "prompt": "<test prompt>",
  "stream": false,
  "options": {"num_predict": 128}
}'

Metrics extracted from the response:

  • Prefill (pp): prompt_eval_count / (prompt_eval_duration / 1e9) t/s
  • Generation (tg): eval_count / (eval_duration / 1e9) t/s

Each model was run once to warm up (load into memory), then measured on the second run.
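The warm-up-then-measure procedure can be sketched in Python as follows. This is a minimal sketch rather than the exact script used for the numbers in this post: the helper names (`tokens_per_second`, `summarize`, `generate`, `warm_then_measure`) are hypothetical, and it assumes Ollama is listening on its default local port as in the curl call above.

```python
import json
import urllib.request

# Same endpoint as the curl example; adjust if your Ollama listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def tokens_per_second(count, duration_ns):
    """Convert a token count and a nanosecond duration into tokens/second."""
    return count / (duration_ns / 1e9)


def summarize(response):
    """Compute prefill (pp) and generation (tg) speed from a response dict."""
    return {
        "pp": tokens_per_second(response["prompt_eval_count"],
                                response["prompt_eval_duration"]),
        "tg": tokens_per_second(response["eval_count"],
                                response["eval_duration"]),
    }


def generate(model, prompt, num_predict=128):
    """Send one non-streaming request and return the parsed JSON response."""
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": num_predict},
    }).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)


def warm_then_measure(model, prompt, send=generate):
    """Run once to load the model, discard that result, then measure run two."""
    send(model, prompt)  # warm-up run: loads the weights, result ignored
    return summarize(send(model, prompt))


# Example call (needs a running Ollama server with the model pulled):
# warm_then_measure("gemma4:26b", "What is machine learning? Answer in one paragraph.")
```

The `send` parameter exists only so the timing logic can be exercised without a live server; in normal use the default `generate` is what talks to Ollama.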

Two prompt lengths tested:

  • Short: ~26 tokens ("What is machine learning? Answer in one paragraph.")
  • Long: ~192 tokens (8× repeated passage about AI history)

Results

Prefill Speed (Prompt Processing)

| Model | Short Prompt (pp) | Long Prompt (pp) |
|---|---|---|
| gemma4:e4b (4B) | 2,893 t/s | 16,953 t/s |
| gemma4:12b (llama-bench) | 1,100–2,702 t/s (one range, not split by prompt length) | |
| gemma4:26b | 291 t/s | 2,805 t/s |

Generation Speed (Token Output)

| Model | Short Prompt (tg) | Long Prompt (tg) |
|---|---|---|
| gemma4:e4b (4B) | 131.0 t/s | 129.7 t/s |
| gemma4:12b (llama-bench) | 70.3–70.7 t/s (one range, not split by prompt length) | |
| gemma4:26b | 15.4 t/s | 16.4 t/s |

Why the 26B Is 8× Slower Than the 4B

The core issue is simple arithmetic:

Model file size:  17.99 GB
Free VRAM:        15.10 GB
Overflow to RAM:   2.89 GB

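When Ollama loads gemma4:26b, approximately 2.9 GB of model weights cannot fit in VRAM and are placed in system RAM. Ollama runs those offloaded layers on the CPU, so every generated token has to wait for a slice of the model that is read at system-RAM bandwidth, typically around an order of magnitude below the 3090's 936 GB/s VRAM bandwidth, and computed on far fewer cores than the GPU offers.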
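The overflow arithmetic can be written as a small helper. This is only a sketch: the function name `overflow_gb` is hypothetical, and it compares file size against free VRAM alone, so it ignores the KV cache and compute buffers that also need VRAM. The real spill is therefore likely somewhat larger than this lower bound.

```python
def overflow_gb(model_file_gb, free_vram_gb):
    """Lower bound on the weights that cannot fit in VRAM (0.0 if all fit)."""
    return max(0.0, model_file_gb - free_vram_gb)


FREE_VRAM_GB = 15.10  # free VRAM at test time, from this post

# File sizes from the "Models Tested" table.
for name, size_gb in [("gemma4:e4b", 9.6), ("gemma4:26b", 17.99)]:
    spill = overflow_gb(size_gb, FREE_VRAM_GB)
    share = spill / size_gb
    print(f"{name}: {spill:.2f} GB spills to system RAM ({share:.0%} of the file)")
```

For the 26B this works out to about 16% of the file sitting outside VRAM, which is enough to dominate per-token latency.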

The result: 15–16 t/s instead of the ~40–50 t/s you might expect on a clean 24 GB card (an estimate, since the full-VRAM case was not measured in this test).

The 4B and 12B models fit entirely in VRAM and their generation speed scales roughly as expected with model size.
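The 8× figure in this section's heading can be checked directly from the generation table. The sketch below uses the short-prompt numbers and the low end of the 12B range; the dictionary and function names are illustrative only.

```python
# Generation speeds in t/s, taken from the results table in this post.
# 12B uses the low end of its llama-bench range (70.3-70.7 t/s).
GENERATION_TPS = {
    "gemma4:e4b": 131.0,
    "gemma4:12b": 70.3,
    "gemma4:26b": 15.4,
}


def slowdown(faster_tps, slower_tps):
    """How many times slower the second model generates than the first."""
    return faster_tps / slower_tps


baseline = GENERATION_TPS["gemma4:26b"]
for name in ("gemma4:e4b", "gemma4:12b"):
    ratio = slowdown(GENERATION_TPS[name], baseline)
    print(f"gemma4:26b is {ratio:.1f}x slower than {name}")
```

With the long-prompt numbers (129.7 and 16.4 t/s) the 4B ratio is slightly lower, so "8×" is a reasonable rounding for both cases.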


Generation Speed Comparison

gemma4:e4b  (4B)  ████████████████████████████████  131 t/s  ✓ full VRAM
gemma4:12b  (12B) ██████████████                    70  t/s  ✓ full VRAM
gemma4:26b  (26B) ████                              16  t/s  ✗ partial CPU

What This Means for RTX 3090 Users

4B model (gemma4:e4b):

  • 131 t/s: roughly 26–43× faster than human reading speed (taking that as about 3–5 tokens per second)
  • 9.6 GB fits comfortably, leaves room for other VRAM users
  • Best choice for rapid iteration, coding assistants, or multi-model setups

12B model (gemma4:12b):

  • 70 t/s with full VRAM and llama.cpp
  • Balanced choice: significantly better reasoning than 4B, still fast
  • Requires ~7 GB VRAM — easily fits even with desktop overhead

26B model (gemma4:26b):

  • If your 3090 has ≥ 20 GB free: expect ~40–50 t/s (full VRAM)
  • If your 3090 is shared (like this test): 15–16 t/s, barely interactive
  • Only worth running if VRAM is mostly clear and quality matters more than speed

Practical rule: For RTX 3090 with a typical desktop environment, the 12B model is the sweet spot — it fits in VRAM, generates at 70 t/s, and the quality gap versus 26B is meaningful but not enormous for most tasks.


Prefill Speed Note

The short-prompt 26B prefill (291 t/s) looks artificially low. Because each model was warmed up before measurement, the likelier cause is fixed per-request overhead spread over only ~26 tokens; the 4B shows the same short-versus-long gap (2,893 vs 16,953 t/s). The long-prompt result (2,805 t/s) is more representative of actual prefill performance when the model is already in memory. For conversational use, prompt processing completes in well under a second regardless of model size.
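To put the prefill rates in wall-clock terms, divide the prompt length by the measured rate. The sketch below does this for the four Ollama measurements; the token counts are the approximate values quoted earlier (~26 and ~192), and `prefill_seconds` is a hypothetical helper name.

```python
def prefill_seconds(prompt_tokens, tokens_per_second):
    """Time spent processing the prompt, in seconds."""
    return prompt_tokens / tokens_per_second


# (label, approximate prompt tokens, measured prefill rate in t/s)
MEASUREMENTS = [
    ("gemma4:e4b short", 26, 2893),
    ("gemma4:e4b long", 192, 16953),
    ("gemma4:26b short", 26, 291),
    ("gemma4:26b long", 192, 2805),
]

for label, tokens, rate in MEASUREMENTS:
    ms = prefill_seconds(tokens, rate) * 1000
    print(f"{label}: {ms:.0f} ms")
```

Every case comes out below a tenth of a second, so at these prompt lengths prefill is a small part of the total response time compared with generating 128 tokens.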


Reproduce These Results

# Install Ollama and pull models
curl -fsSL https://ollama.com/install.sh | sh
ollama pull gemma4:e4b
ollama pull gemma4:26b

# Quick generation speed test via API
curl http://localhost:11434/api/generate -d '{
  "model": "gemma4:26b",
  "prompt": "What is machine learning? Answer in one paragraph.",
  "stream": false,
  "options": {"num_predict": 128}
}' | python3 -c "
import json, sys
d = json.load(sys.stdin)
tg = d['eval_count'] / (d['eval_duration'] / 1e9)
pp = d['prompt_eval_count'] / (d['prompt_eval_duration'] / 1e9)
print(f'Prefill: {pp:.1f} t/s | Generation: {tg:.1f} t/s')
"

Check your free VRAM before running the 26B:

nvidia-smi --query-gpu=memory.free,memory.total --format=csv,noheader

If free VRAM is below about 18 GB (the size of the 26B file alone, before KV cache and buffers), expect CPU offloading and slower generation.
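The same check can be scripted. The sketch below wraps the nvidia-smi query shown above and compares free VRAM against the 26B file size. The function names are hypothetical, the sample line in the comment is an illustration of the `csv,noheader` format rather than captured output, and the MiB-to-GB conversion ignores the GiB/GB distinction as an approximation.

```python
import subprocess

MODEL_26B_GB = 17.99  # gemma4:26b file size from this post


def parse_memory_line(line):
    """Parse one 'FREE MiB, TOTAL MiB' line into (free_gb, total_gb)."""
    free_text, total_text = (field.strip() for field in line.split(","))
    free_mib = float(free_text.split()[0])
    total_mib = float(total_text.split()[0])
    return free_mib / 1024, total_mib / 1024


def will_offload(free_gb, model_gb=MODEL_26B_GB):
    """True if the model file alone is larger than the free VRAM."""
    return free_gb < model_gb


def query_vram(gpu_index=0):
    """Run nvidia-smi and return (free_gb, total_gb) for the chosen GPU."""
    output = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free,memory.total",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_memory_line(output.splitlines()[gpu_index])


# Example with an illustrative line in the csv,noheader format:
# parse_memory_line("15462 MiB, 24576 MiB")
# On a machine with an NVIDIA GPU:
# free_gb, total_gb = query_vram()
# print("expect CPU offloading" if will_offload(free_gb) else "should fit in VRAM")
```

Passing this check only means the weights fit; leaving a couple of gigabytes of headroom beyond the file size is a safer target.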


Benchmarks collected June 8, 2026. Results vary with VRAM availability, Ollama version, and quantization. The 12B data is from a separate llama-bench session on the same GPU.