Gemma 4 12B Benchmark: Jetson AGX Thor vs RTX 3090

All numbers in this post come from live hardware testing. No synthetic data, no manufacturer specs — llama-bench against the real model on two different platforms.

What Is Gemma 4 12B?

Gemma 4 12B is Google's 12-billion-parameter model from the Gemma 4 family. Key facts for deployment decisions:

11.91 billion parameters (Q4_K_M quantized: 6.86 GiB on disk)
Multimodal: includes a vision projector (mmproj-gemma-4-12B-it-bf16.gguf) for image understanding
Architecture: gemma4 — requires llama.cpp build ≥ 9000 to load
Quantization tested: Q4_K_M (4-bit, medium quality)

This benchmark covers text-only inference using llama.cpp on both platforms.

Test Environments

Spec	Jetson AGX Thor	RTX 3090 (feolpc)
GPU	NVIDIA Thor	NVIDIA GeForce RTX 3090
VRAM / Unified Memory	122.8 GiB	24 GiB
CUDA Version	13.0	12.1
Compute Capability	11.0	8.6
System	JetPack, Linux kernel 6.8.12-tegra (ARM64)	Ubuntu 22.04 (x86_64)
llama.cpp Build	9159 (5c0e94683)	9496
Model	gemma4 11.91B Q4_K_M	gemma4 11.91B Q4_K_M
Flash Attention	✓	✓
GPU Offload	999 layers (full)	999 layers (full)

Benchmark Command

The same flags were used on both machines (only the model path and LD_LIBRARY_PATH differ):

LD_LIBRARY_PATH=build/bin build/bin/llama-bench \
  -m /path/to/gemma-4-12B-Q4_K_M.gguf \
  -ngl 999 -fa 1 \
  -p 32,128,512 -n 64,128 \
  -r 3

-ngl 999: all layers on GPU
-fa 1: Flash Attention enabled
-p: prompt token counts (prefill test)
-n: output token counts (generation test)
-r 3: 3 repetitions per test, reported as mean ± standard deviation
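
For scripted runs, the same invocation can be assembled programmatically. The following is a minimal Python sketch, not part of the original benchmark: the helper names and the default binary path are hypothetical, and the code only builds the argument list and the test labels (pp32, tg64, and so on) that appear in the result tables.

```python
from typing import List, Sequence


def build_bench_cmd(
    model_path: str,
    prompts: Sequence[int] = (32, 128, 512),
    gens: Sequence[int] = (64, 128),
    reps: int = 3,
    binary: str = "build/bin/llama-bench",  # assumed location, as in the command above
) -> List[str]:
    """Return the llama-bench argument list used in this post."""
    return [
        binary,
        "-m", model_path,
        "-ngl", "999",                       # offload all layers to the GPU
        "-fa", "1",                          # enable Flash Attention
        "-p", ",".join(map(str, prompts)),   # prefill sizes
        "-n", ",".join(map(str, gens)),      # generation sizes
        "-r", str(reps),                     # repetitions per test
    ]


def expected_tests(
    prompts: Sequence[int] = (32, 128, 512),
    gens: Sequence[int] = (64, 128),
) -> List[str]:
    """Test labels: pp<N> for prefill runs, tg<N> for generation runs."""
    return [f"pp{p}" for p in prompts] + [f"tg{n}" for n in gens]


cmd = build_bench_cmd("/path/to/gemma-4-12B-Q4_K_M.gguf")
print(" ".join(cmd))
print(expected_tests())
```

The resulting list can be handed to subprocess.run with LD_LIBRARY_PATH set in the environment; that step is left out here so the sketch runs without the binary present.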

Results

Jetson AGX Thor

Device 0: NVIDIA Thor, compute capability 11.0, VMM: yes, VRAM: 125771 MiB

Model	Size	Backend	Test	Speed (t/s)
gemma4 11.91B Q4_K_M	6.86 GiB	CUDA	pp32	366.81 ± 11.59
gemma4 11.91B Q4_K_M	6.86 GiB	CUDA	pp128	755.17 ± 28.85
gemma4 11.91B Q4_K_M	6.86 GiB	CUDA	pp512	791.97 ± 50.70
gemma4 11.91B Q4_K_M	6.86 GiB	CUDA	tg64	18.48 ± 0.32
gemma4 11.91B Q4_K_M	6.86 GiB	CUDA	tg128	18.65 ± 0.16

RTX 3090

Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24251 MiB

Model	Size	Backend	Test	Speed (t/s)
gemma4 11.91B Q4_K_M	6.86 GiB	CUDA	pp32	1099.89 ± 377.14
gemma4 11.91B Q4_K_M	6.86 GiB	CUDA	pp128	2158.47 ± 135.86
gemma4 11.91B Q4_K_M	6.86 GiB	CUDA	pp512	2702.24 ± 10.83
gemma4 11.91B Q4_K_M	6.86 GiB	CUDA	tg64	70.30 ± 0.54
gemma4 11.91B Q4_K_M	6.86 GiB	CUDA	tg128	70.66 ± 0.22
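
With only three repetitions per test, it helps to check how noisy each measurement is before comparing platforms. The sketch below computes the coefficient of variation (standard deviation divided by mean) for every row of the two tables above. The 10% threshold for flagging a row is an arbitrary assumption chosen for illustration, not a llama-bench convention.

```python
import re

# "mean ± std" cells copied from the two result tables above
RESULTS = {
    "thor": {
        "pp32": "366.81 ± 11.59",
        "pp128": "755.17 ± 28.85",
        "pp512": "791.97 ± 50.70",
        "tg64": "18.48 ± 0.32",
        "tg128": "18.65 ± 0.16",
    },
    "rtx3090": {
        "pp32": "1099.89 ± 377.14",
        "pp128": "2158.47 ± 135.86",
        "pp512": "2702.24 ± 10.83",
        "tg64": "70.30 ± 0.54",
        "tg128": "70.66 ± 0.22",
    },
}

NOISY_CV_PERCENT = 10.0  # assumed threshold, for illustration only


def cv_percent(cell: str) -> float:
    """Coefficient of variation (std / mean) as a percentage."""
    mean, std = (float(x) for x in re.split(r"\s*±\s*", cell))
    return 100.0 * std / mean


noisy = []
for device, tests in RESULTS.items():
    for test, cell in tests.items():
        cv = cv_percent(cell)
        if cv > NOISY_CV_PERCENT:
            noisy.append((device, test))
        print(f"{device:8s} {test:6s} CV = {cv:5.1f}%")

print("rows above threshold:", noisy)
```

The only row above the threshold is pp32 on the RTX 3090, so that figure deserves the most caution in the comparison that follows.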

Head-to-Head Comparison

Test	Jetson AGX Thor	RTX 3090	3090 advantage
pp32	366.81 t/s	1099.89 t/s	3.0×
pp128	755.17 t/s	2158.47 t/s	2.9×
pp512	791.97 t/s	2702.24 t/s	3.4×
tg64	18.48 t/s	70.30 t/s	3.8×
tg128	18.65 t/s	70.66 t/s	3.8×

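The RTX 3090 is roughly 3–4× faster than the Jetson Thor across these tests (2.9× to 3.8×). This is in line with the hardware: the 3090's GDDR6X provides about 936 GB/s of memory bandwidth versus roughly 273 GB/s for Thor's LPDDR5X unified memory, a ratio of about 3.4×, and the 3090 also has a much higher CUDA core count. Note that the 3090's pp32 result has a large spread (± 377.14 t/s), so the 3.0× figure for that test is the least reliable.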
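
The ratios in the comparison table can be recomputed from the mean throughputs. This is a small sketch using only the means reported above; the dictionary and function names are illustrative.

```python
# Mean throughput in tokens per second, from the result tables
THOR = {"pp32": 366.81, "pp128": 755.17, "pp512": 791.97, "tg64": 18.48, "tg128": 18.65}
RTX3090 = {"pp32": 1099.89, "pp128": 2158.47, "pp512": 2702.24, "tg64": 70.30, "tg128": 70.66}


def speedup(fast: dict, slow: dict) -> dict:
    """Per-test ratio of the faster platform's throughput to the slower one's."""
    return {test: fast[test] / slow[test] for test in fast}


ratios = speedup(RTX3090, THOR)
for test, ratio in ratios.items():
    print(f"{test:6s} {ratio:.1f}x")
```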

Analysis

Thor: 18.65 t/s Generation

Human reading speed is 3–5 tokens per second. At 18.65 t/s, the Thor generates Gemma 4 12B output roughly 4–6× faster than a person can read — comfortable for interactive use without any perceptible lag.
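
The reading-speed comparison is simple division. The sketch below takes the 3–5 tokens per second range quoted above as given (it is an assumption of this post, not a measured value) and applies it to the tg128 rates of both platforms.

```python
# Reading-speed range quoted in the text, in tokens per second
READING_TPS = (3.0, 5.0)


def reading_multiple(gen_tps, reading=READING_TPS):
    """Return (low, high) multiples of reading speed for a generation rate."""
    slow_reader, fast_reader = reading
    return gen_tps / fast_reader, gen_tps / slow_reader


# tg128 rates from the result tables
for name, tps in {"Thor": 18.65, "RTX 3090": 70.66}.items():
    low, high = reading_multiple(tps)
    print(f"{name}: {low:.1f}x to {high:.1f}x reading speed")
```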

Compared to previous Thor benchmarks:

Model	Size	Speed
Qwen3.6-35B-A3B FP8 (SGLang)	~36 GB	14.7 t/s
Gemma 4 12B Q4_K_M (llama.cpp)	6.86 GiB	18.65 t/s
Qwen2.5-1.5B Q4_K_M (llama.cpp)	1.04 GB	107–113 t/s

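Gemma 4 12B generates faster than the 35B Qwen model in these runs, and its weights occupy roughly 5× less memory. Treat the comparison as indicative only: the runs differ in runtime (SGLang vs llama.cpp) and quantization (FP8 vs Q4_K_M), and the A3B suffix indicates a mixture-of-experts model that activates only about 3B parameters per token, so weight size alone does not determine generation speed.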

RTX 3090: 70.66 t/s Generation

At 70.66 t/s, the RTX 3090 delivers output roughly 14–24× faster than human reading speed. This makes it suitable for:

High-throughput batch generation
Developer testing and iteration
Multi-user inference serving (with queuing)

The prefill speed of 2702 t/s at 512 tokens means a 512-token prompt is processed in about 190 ms. Longer prompts take proportionally longer, assuming the prefill rate holds.

Memory Footprint

At 6.86 GiB, Gemma 4 12B is memory-efficient on both platforms:

Thor (122.8 GiB unified): model uses 5.6% of available memory — room for simultaneous models, large KV caches, and vision pipelines
RTX 3090 (24 GiB VRAM): model uses about 29% of VRAM — comfortable headroom for long-context generation
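
The footprint percentages follow directly from the model size and the memory totals in the Test Environments table. The sketch below also reports the remaining headroom; it counts model weights only and ignores the KV cache, compute buffers, and anything else resident in memory, so the real free space will be lower.

```python
MODEL_GIB = 6.86  # Q4_K_M file size reported by llama-bench

# Memory totals from the Test Environments table, in GiB
MEMORY_GIB = {"Thor": 122.8, "RTX 3090": 24.0}


def footprint(total_gib, model_gib=MODEL_GIB):
    """Return (percent of memory used by weights, GiB left over)."""
    return 100.0 * model_gib / total_gib, total_gib - model_gib


for name, total in MEMORY_GIB.items():
    pct, free = footprint(total)
    print(f"{name}: weights use {pct:.1f}% of memory, {free:.2f} GiB left")
```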

Prefill Speed

At 512 tokens, prefill runs at 792 t/s on Thor and 2702 t/s on the 3090, so a 512-token prompt is processed in about 650 ms and 190 ms respectively. For short conversational prompts this delay is small on both platforms; it grows roughly linearly with prompt length.
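
To turn these throughput figures into latencies, divide token counts by the measured rates. The sketch below estimates prefill time for a 512-token prompt and the total time for a 512-token prompt followed by a 128-token reply. It assumes the pp512 and tg128 rates apply unchanged and ignores sampling, tokenization, and server overhead, so treat the output as a rough lower bound.

```python
def prefill_ms(prompt_tokens, prefill_tps):
    """Time to process the prompt, in milliseconds."""
    return 1000.0 * prompt_tokens / prefill_tps


def response_seconds(prompt_tokens, output_tokens, prefill_tps, gen_tps):
    """Prefill time plus generation time, in seconds."""
    return prompt_tokens / prefill_tps + output_tokens / gen_tps


# (pp512 rate, tg128 rate) in tokens per second, from the result tables
PLATFORMS = {"Thor": (791.97, 18.65), "RTX 3090": (2702.24, 70.66)}

for name, (pp_tps, tg_tps) in PLATFORMS.items():
    ttft = prefill_ms(512, pp_tps)
    total = response_seconds(512, 128, pp_tps, tg_tps)
    print(f"{name}: prefill {ttft:.0f} ms, 512 in + 128 out in {total:.2f} s")
```

Generation dominates the total on both platforms, which is why the tg figures matter more than the pp figures for chat-style use.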

What This Means for Edge AI

Choose Jetson AGX Thor if you need:

Private, on-device multimodal AI (text + vision)
12B-class reasoning without cloud dependency
Long-context applications (122.8 GiB of unified memory leaves room for very large KV caches)
Production edge deployment with 24/7 uptime

Choose RTX 3090 if you need:

Higher generation throughput for batch applications
Faster development iteration cycles
x86_64 compatibility with existing ML tooling

Model Architecture Note

Gemma 4 12B includes a vision projector (mmproj), making it a vision-language model. This benchmark covers text-only inference — the projector is not loaded during text-only generation.

For vision tasks, you would load the projector alongside the model and pass image embeddings together with the text prompt. Expect extra latency before the first token from image encoding and the longer prompt. This benchmark did not measure vision workloads.

Reproduce These Results

# Download model (requires HuggingFace account with Gemma terms accepted)
# Or use via ollama on a supported version

# Thor
cd ~/kwkthor/llama.cpp
LD_LIBRARY_PATH=build/bin build/bin/llama-bench \
  -m /path/to/gemma-4-12B-Q4_K_M.gguf \
  -ngl 999 -fa 1 -p 32,128,512 -n 64,128 -r 3

# RTX 3090
cd ~/llama.cpp
LD_LIBRARY_PATH=build/lib build/bin/llama-bench \
  -m /path/to/gemma-4-12B-Q4_K_M.gguf \
  -ngl 999 -fa 1 -p 32,128,512 -n 64,128 -r 3
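
To compare your own run against the tables above, llama-bench can emit machine-readable output with -o json. The sketch below parses such output into the pp/tg labels used in this post. The field names (n_prompt, n_gen, avg_ts, stddev_ts) are an assumption about the JSON schema and should be checked against your build, and SAMPLE is a hand-written stand-in built from two rows of the Thor table, not captured program output.

```python
import json

# Hand-written stand-in for `llama-bench ... -o json` output (schema assumed)
SAMPLE = """[
  {"n_prompt": 512, "n_gen": 0, "avg_ts": 791.97, "stddev_ts": 50.70},
  {"n_prompt": 0, "n_gen": 128, "avg_ts": 18.65, "stddev_ts": 0.16}
]"""


def summarize(raw_json: str) -> dict:
    """Map each test label (pp<N> or tg<N>) to (mean t/s, std t/s)."""
    rows = {}
    for rec in json.loads(raw_json):
        if rec["n_prompt"]:
            label = f"pp{rec['n_prompt']}"
        else:
            label = f"tg{rec['n_gen']}"
        rows[label] = (rec["avg_ts"], rec["stddev_ts"])
    return rows


for label, (mean, std) in summarize(SAMPLE).items():
    print(f"{label:6s} {mean:8.2f} ± {std:.2f} t/s")
```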

Benchmark collected on June 4, 2026. Results may vary with different quantization levels, llama.cpp versions, or thermal conditions.