Gemma 4 12B Benchmark: Jetson AGX Thor vs RTX 3090
All numbers in this post come from live hardware testing. No synthetic data, no manufacturer specs — llama-bench against the real model on two different platforms.
What Is Gemma 4 12B?
Gemma 4 12B is Google's 12-billion-parameter model from the Gemma 4 family. Key facts for deployment decisions:
- 11.91 billion parameters (Q4_K_M quantized: 6.86 GiB on disk)
- Multimodal: includes a vision projector (mmproj-gemma-4-12B-it-bf16.gguf) for image understanding
- Architecture: gemma4 (requires llama.cpp build ≥ 9000 to load)
- Quantization tested: Q4_K_M (4-bit, medium quality)
This benchmark covers text-only inference using llama.cpp on both platforms.
Test Environments
| Spec | Jetson AGX Thor | RTX 3090 (feolpc) |
|---|---|---|
| GPU | NVIDIA Thor | NVIDIA GeForce RTX 3090 |
| VRAM / Unified Memory | 122.8 GiB | 24 GiB |
| CUDA Version | 13.0 | 12.1 |
| Compute Capability | 11.0 | 8.6 |
| System | JetPack, Linux kernel 6.8.12-tegra (ARM64) | Ubuntu 22.04 (x86_64) |
| llama.cpp Build | 9159 (5c0e94683) | 9496 |
| Model | gemma4 11.91B Q4_K_M | gemma4 11.91B Q4_K_M |
| Flash Attention | ✓ | ✓ |
| GPU Offload | 999 layers (full) | 999 layers (full) |
Benchmark Command
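The same llama-bench arguments were used on both machines; only the LD_LIBRARY_PATH prefix differs on the RTX 3090 host (see Reproduce These Results):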
LD_LIBRARY_PATH=build/bin build/bin/llama-bench \
-m /path/to/gemma-4-12B-Q4_K_M.gguf \
-ngl 999 -fa 1 \
-p 32,128,512 -n 64,128 \
-r 3
- -ngl 999: all layers on GPU
- -fa 1: Flash Attention enabled
- -p: prompt token counts (prefill test)
- -n: output token counts (generation test)
- -r 3: 3 runs, averaged
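For scripted runs, the same invocation can be assembled and parsed programmatically. The sketch below is a minimal example, assuming llama-bench is asked for JSON with `-o json` and that each result object carries `n_prompt`, `n_gen`, `avg_ts`, and `stddev_ts` fields. The helper names `build_bench_command` and `summarize` are hypothetical, and the sample payload is hand-written for illustration rather than captured from a run.

```python
import json

def build_bench_command(binary, model_path, prompts=(32, 128, 512),
                        gens=(64, 128), reps=3):
    """Assemble the llama-bench argv used in this post, plus JSON output."""
    return [
        binary,
        "-m", model_path,
        "-ngl", "999",   # offload all layers to the GPU
        "-fa", "1",      # Flash Attention enabled
        "-p", ",".join(str(p) for p in prompts),  # prefill sizes
        "-n", ",".join(str(n) for n in gens),     # generation sizes
        "-r", str(reps),                          # repetitions per test
        "-o", "json",    # assumption: JSON instead of the markdown table
    ]

def summarize(json_text):
    """Map labels such as 'pp512' or 'tg128' to (mean t/s, stddev t/s)."""
    summary = {}
    for row in json.loads(json_text):
        if row["n_prompt"] > 0:
            label = f"pp{row['n_prompt']}"
        else:
            label = f"tg{row['n_gen']}"
        summary[label] = (row["avg_ts"], row["stddev_ts"])
    return summary

# Hand-written payload shaped like the assumed JSON output.
sample = json.dumps([
    {"n_prompt": 512, "n_gen": 0, "avg_ts": 791.97, "stddev_ts": 50.70},
    {"n_prompt": 0, "n_gen": 128, "avg_ts": 18.65, "stddev_ts": 0.16},
])

command = build_bench_command("build/bin/llama-bench", "/path/to/model.gguf")
results = summarize(sample)
print(" ".join(command))
print(results)
```

In a real run, `command` would be passed to `subprocess.run(..., capture_output=True, text=True)` with `LD_LIBRARY_PATH` set in the environment, and the captured standard output handed to `summarize`.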
Results
Jetson AGX Thor
Device 0: NVIDIA Thor, compute capability 11.0, VMM: yes, VRAM: 125771 MiB
| Model | Size | Backend | Test | Speed (t/s) |
|---|---|---|---|---|
| gemma4 11.91B Q4_K_M | 6.86 GiB | CUDA | pp32 | 366.81 ± 11.59 |
| gemma4 11.91B Q4_K_M | 6.86 GiB | CUDA | pp128 | 755.17 ± 28.85 |
| gemma4 11.91B Q4_K_M | 6.86 GiB | CUDA | pp512 | 791.97 ± 50.70 |
| gemma4 11.91B Q4_K_M | 6.86 GiB | CUDA | tg64 | 18.48 ± 0.32 |
| gemma4 11.91B Q4_K_M | 6.86 GiB | CUDA | tg128 | 18.65 ± 0.16 |
RTX 3090
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24251 MiB
| Model | Size | Backend | Test | Speed (t/s) |
|---|---|---|---|---|
| gemma4 11.91B Q4_K_M | 6.86 GiB | CUDA | pp32 | 1099.89 ± 377.14 |
| gemma4 11.91B Q4_K_M | 6.86 GiB | CUDA | pp128 | 2158.47 ± 135.86 |
| gemma4 11.91B Q4_K_M | 6.86 GiB | CUDA | pp512 | 2702.24 ± 10.83 |
| gemma4 11.91B Q4_K_M | 6.86 GiB | CUDA | tg64 | 70.30 ± 0.54 |
| gemma4 11.91B Q4_K_M | 6.86 GiB | CUDA | tg128 | 70.66 ± 0.22 |
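The ± figures are worth reading alongside the means, since some rows are far noisier than others. A minimal sketch, assuming the value after ± is the standard deviation across the three runs; the function names are illustrative:

```python
import re

def parse_speed(cell):
    """Split a 'mean ± stddev' table cell into two floats."""
    match = re.fullmatch(r"\s*([\d.]+)\s*±\s*([\d.]+)\s*", cell)
    mean, std = match.groups()
    return float(mean), float(std)

def relative_spread(cell):
    """Standard deviation as a percentage of the mean."""
    mean, std = parse_speed(cell)
    return 100.0 * std / mean

# Selected cells copied from the two results tables.
cells = {
    "Thor pp32": "366.81 ± 11.59",
    "Thor pp512": "791.97 ± 50.70",
    "Thor tg128": "18.65 ± 0.16",
    "3090 pp32": "1099.89 ± 377.14",
    "3090 pp512": "2702.24 ± 10.83",
    "3090 tg128": "70.66 ± 0.22",
}

for name, cell in cells.items():
    print(f"{name}: {relative_spread(cell):.1f}%")
```

By this measure the RTX 3090 pp32 row has a spread of about 34% of its mean, so it is best read as a rough figure, while the tg128 rows on both platforms stay under 1%.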
Head-to-Head Comparison
| Test | Jetson AGX Thor | RTX 3090 | 3090 advantage |
|---|---|---|---|
| pp32 | 366.81 t/s | 1099.89 t/s | 3.0× |
| pp128 | 755.17 t/s | 2158.47 t/s | 2.9× |
| pp512 | 791.97 t/s | 2702.24 t/s | 3.4× |
| tg64 | 18.48 t/s | 70.30 t/s | 3.8× |
| tg128 | 18.65 t/s | 70.66 t/s | 3.8× |
The RTX 3090 is roughly 3–4× faster than the Jetson Thor on every test with this model. This is expected: the 3090's GDDR6X provides about 936 GB/s of memory bandwidth versus roughly 273 GB/s for Thor's LPDDR5X unified memory, and the 3090 has a much higher CUDA core count.
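The advantage column is the ratio of the mean throughputs. A short sketch that recomputes it from the table values (variable names are illustrative):

```python
# Mean throughput in tokens per second, copied from the comparison table.
thor = {"pp32": 366.81, "pp128": 755.17, "pp512": 791.97,
        "tg64": 18.48, "tg128": 18.65}
rtx3090 = {"pp32": 1099.89, "pp128": 2158.47, "pp512": 2702.24,
           "tg64": 70.30, "tg128": 70.66}

# Speedup of the RTX 3090 over Thor for each test.
speedup = {test: rtx3090[test] / thor[test] for test in thor}

for test, ratio in speedup.items():
    print(f"{test}: {ratio:.1f}x")
```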
Analysis
Thor: 18.65 t/s Generation
Typical human reading speed is roughly 3–5 tokens per second. At 18.65 t/s, the Thor generates Gemma 4 12B output roughly 4–6× faster than a person can read, which is comfortable for interactive use: text streams in faster than it can be consumed.
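The reading-speed multiples in this analysis come from dividing generation speed by reading speed. A minimal sketch, assuming the 3–5 tokens-per-second reading range stated above; the names are illustrative:

```python
READING_TPS = (3.0, 5.0)  # reading-speed range used in this post, tokens/s

def reading_multiple(gen_tps, reading_tps=READING_TPS):
    """Return (low, high) multiples of reading speed for a generation rate."""
    slow, fast = reading_tps
    return gen_tps / fast, gen_tps / slow

# tg128 generation speeds from the results tables.
for name, gen_tps in {"Thor": 18.65, "RTX 3090": 70.66}.items():
    low, high = reading_multiple(gen_tps)
    print(f"{name}: {low:.1f}x to {high:.1f}x reading speed")
```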
Compared to previous Thor benchmarks:
| Model | Size | Speed |
|---|---|---|
| Qwen3.6-35B-A3B FP8 (SGLang) | ~36 GB | 14.7 t/s |
| Gemma 4 12B Q4_K_M (llama.cpp) | 6.86 GiB | 18.65 t/s |
| Qwen2.5-1.5B Q4_K_M (llama.cpp) | 1.04 GB | 107–113 t/s |
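Gemma 4 12B generates faster than the 35B Qwen model in part because its weights are roughly 5× smaller, which reduces memory traffic per generated token. The rows are not directly comparable, though: they use different inference engines and quantization formats.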
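One way to sanity-check the bandwidth argument is to ask how much weight traffic each generation result implies. The sketch below assumes a dense model whose quantized weights are all read once per generated token, and it ignores KV-cache and activation traffic. Both are simplifying assumptions, and the function name is illustrative.

```python
GIB = 1024 ** 3  # bytes per GiB

def implied_weight_traffic_gbs(model_gib, tokens_per_s):
    """Weight bytes streamed per second, in GB/s (10^9 bytes per second)."""
    return model_gib * GIB * tokens_per_s / 1e9

# Model size and tg128 speeds from the results tables.
thor_gbs = implied_weight_traffic_gbs(6.86, 18.65)
rtx_gbs = implied_weight_traffic_gbs(6.86, 70.66)
print(f"Thor: {thor_gbs:.0f} GB/s, RTX 3090: {rtx_gbs:.0f} GB/s")
```

If the dense-read assumption holds, the measured speeds correspond to roughly 137 GB/s of sustained weight reads on Thor and roughly 520 GB/s on the RTX 3090.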
RTX 3090: 70.66 t/s Generation
At 70.66 t/s, the RTX 3090 delivers output roughly 14–24× faster than human reading speed. This makes it suitable for:
- High-throughput batch generation
- Developer testing and iteration
- Multi-user inference serving (with queuing)
At a prefill speed of 2702 t/s, a 512-token prompt is processed in about 190 ms, which is negligible for most interactive applications.
Memory Footprint
At 6.86 GiB, Gemma 4 12B is memory-efficient on both platforms:
- Thor (122.8 GiB unified): model uses 5.6% of available memory — room for simultaneous models, large KV caches, and vision pipelines
- RTX 3090 (24 GiB VRAM): model uses about 29% of VRAM, leaving comfortable headroom for long-context generation
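The memory shares can be recomputed from the capacities llama.cpp reports in the device lines of the results. A minimal sketch; the function name is illustrative, and it counts only the model weights, not the KV cache or compute buffers:

```python
MODEL_GIB = 6.86  # Q4_K_M size reported by llama-bench

def memory_share(model_gib, device_mib):
    """Model weights as a percentage of reported device memory."""
    device_gib = device_mib / 1024
    return 100.0 * model_gib / device_gib

thor_pct = memory_share(MODEL_GIB, 125771)  # Thor: VRAM 125771 MiB
rtx_pct = memory_share(MODEL_GIB, 24251)    # RTX 3090: VRAM 24251 MiB
print(f"Thor: {thor_pct:.1f}%  RTX 3090: {rtx_pct:.1f}%")
```

Using the reported capacities, the weights come to about 5.6% of Thor's memory and about 29% of the 3090's.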
Prefill Speed
At 512 tokens, prefill runs at 792 t/s on Thor and 2702 t/s on the 3090, so a 512-token prompt is processed in about 650 ms and 190 ms respectively. For conversational applications with prompts of this size, prompt processing adds well under a second on both platforms.
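The latency figures follow from dividing prompt length by prefill throughput. A minimal sketch (the function name is illustrative); the 4096-token line is an extrapolation that assumes throughput stays flat beyond 512 tokens, which this benchmark did not measure:

```python
def prefill_ms(prompt_tokens, tokens_per_s):
    """Time to process a prompt, in milliseconds."""
    return 1000.0 * prompt_tokens / tokens_per_s

# Measured pp512 throughput from the results tables.
thor_ms = prefill_ms(512, 791.97)
rtx_ms = prefill_ms(512, 2702.24)
print(f"512 tokens: Thor {thor_ms:.0f} ms, RTX 3090 {rtx_ms:.0f} ms")

# Extrapolation only: assumes pp512 throughput also holds at 4096 tokens.
thor_long_ms = prefill_ms(4096, 791.97)
print(f"4096 tokens on Thor (extrapolated): {thor_long_ms / 1000:.1f} s")
```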
What This Means for Edge AI
Choose Jetson AGX Thor if you need:
- Private, on-device multimodal AI (text + vision)
- 12B-class reasoning without cloud dependency
- Long-context applications (122.8 GiB of unified memory handles very large KV caches)
- Production edge deployment with 24/7 uptime
Choose RTX 3090 if you need:
- Higher generation throughput for batch applications
- Faster development iteration cycles
- x86_64 compatibility with existing ML tooling
Model Architecture Note
Gemma 4 12B includes a vision projector (mmproj), making it a vision-language model. This benchmark covers text-only inference — the projector is not loaded during text-only generation.
For vision tasks, you would load the projector alongside the model and pass image tokens with the text. Expect vision+text prompts to run somewhat slower because of the additional projector computation; that overhead was not measured in this benchmark.
Reproduce These Results
# Download model (requires HuggingFace account with Gemma terms accepted)
# Or use via ollama on a supported version
# Thor
cd ~/kwkthor/llama.cpp
LD_LIBRARY_PATH=build/bin build/bin/llama-bench \
-m /path/to/gemma-4-12B-Q4_K_M.gguf \
-ngl 999 -fa 1 -p 32,128,512 -n 64,128 -r 3
# RTX 3090
cd ~/llama.cpp
LD_LIBRARY_PATH=build/lib build/bin/llama-bench \
-m /path/to/gemma-4-12B-Q4_K_M.gguf \
-ngl 999 -fa 1 -p 32,128,512 -n 64,128 -r 3
Benchmark collected on June 4, 2026. Results may vary with different quantization levels, llama.cpp versions, or thermal conditions.