MultimodalFlow

Gemma 4 12B Benchmark: Jetson AGX Thor vs RTX 3090

Tags: Gemma4, benchmark, JetsonThor, RTX 3090, LLM, edge inference, llama.cpp

All numbers in this post come from live hardware testing. No synthetic data, no manufacturer specs — llama-bench against the real model on two different platforms.


What Is Gemma 4 12B?

Gemma 4 12B is Google's 12-billion-parameter model from the Gemma 4 family. Key facts for deployment decisions:

  • 11.91 billion parameters (Q4_K_M quantized: 6.86 GiB on disk)
  • Multimodal: includes a vision projector (mmproj-gemma-4-12B-it-bf16.gguf) for image understanding
  • Architecture: gemma4 — requires llama.cpp build ≥ 9000 to load
  • Quantization tested: Q4_K_M (4-bit, medium quality)
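
As a quick sanity check on the figures above, the file size and parameter count imply an average storage cost per weight. The sketch below is only arithmetic on the two numbers quoted in the list; the helper name `bits_per_weight` is made up for illustration.

```python
GIB = 1024 ** 3  # bytes per GiB

def bits_per_weight(size_gib: float, n_params: float) -> float:
    """Average number of bits stored per parameter in the model file."""
    return size_gib * GIB * 8 / n_params

# 6.86 GiB on disk, 11.91 billion parameters (both from the list above)
bpw = bits_per_weight(6.86, 11.91e9)
print(f"{bpw:.2f} bits per weight")
```

The result lands a little under 5 bits per weight rather than exactly 4, which is consistent with Q4_K_M being a mixed scheme that keeps some tensors at higher precision and stores per-block scales.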

This benchmark covers text-only inference using llama.cpp on both platforms.


Test Environments

| Spec | Jetson AGX Thor | RTX 3090 (feolpc) |
| --- | --- | --- |
| GPU | NVIDIA Thor | NVIDIA GeForce RTX 3090 |
| VRAM / Unified Memory | 122.8 GiB | 24 GiB |
| CUDA Version | 13.0 | 12.1 |
| Compute Capability | 11.0 | 8.6 |
| System | JetPack 6.8.12-tegra (ARM64) | Ubuntu 22.04 (x86_64) |
| llama.cpp Build | 9159 (5c0e94683) | 9496 |
| Model | gemma4 11.91B Q4_K_M | gemma4 11.91B Q4_K_M |
| Flash Attention | Enabled (-fa 1) | Enabled (-fa 1) |
| GPU Offload | 999 layers (full) | 999 layers (full) |

Benchmark Command

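The same llama-bench invocation was used on both machines. Only LD_LIBRARY_PATH differs: the RTX 3090 build keeps its shared libraries in build/lib instead of build/bin.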

LD_LIBRARY_PATH=build/bin build/bin/llama-bench \
  -m /path/to/gemma-4-12B-Q4_K_M.gguf \
  -ngl 999 -fa 1 \
  -p 32,128,512 -n 64,128 \
  -r 3
  • -ngl 999: all layers on GPU
  • -fa 1: Flash Attention enabled
  • -p: prompt token counts (prefill test)
  • -n: output token counts (generation test)
  • -r 3: 3 repetitions per test, reported as mean ± standard deviation
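
The -p and -n values expand into the test labels that appear in the result tables: each prompt size becomes a "pp" (prompt processing) row and each output size a "tg" (token generation) row. A minimal sketch of that mapping follows; `bench_tests` is a hypothetical helper, not part of llama.cpp.

```python
def bench_tests(prompt_sizes, gen_sizes):
    """Return the row labels expected for the given -p and -n values."""
    return [f"pp{p}" for p in prompt_sizes] + [f"tg{n}" for n in gen_sizes]

# Values from the benchmark command: -p 32,128,512 -n 64,128
tests = bench_tests([32, 128, 512], [64, 128])
print(tests)
```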

Results

Jetson AGX Thor

Device 0: NVIDIA Thor, compute capability 11.0, VMM: yes, VRAM: 125771 MiB
| Model | Size | Backend | Test | Speed (t/s) |
| --- | --- | --- | --- | --- |
| gemma4 11.91B Q4_K_M | 6.86 GiB | CUDA | pp32 | 366.81 ± 11.59 |
| gemma4 11.91B Q4_K_M | 6.86 GiB | CUDA | pp128 | 755.17 ± 28.85 |
| gemma4 11.91B Q4_K_M | 6.86 GiB | CUDA | pp512 | 791.97 ± 50.70 |
| gemma4 11.91B Q4_K_M | 6.86 GiB | CUDA | tg64 | 18.48 ± 0.32 |
| gemma4 11.91B Q4_K_M | 6.86 GiB | CUDA | tg128 | 18.65 ± 0.16 |

RTX 3090

Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24251 MiB
| Model | Size | Backend | Test | Speed (t/s) |
| --- | --- | --- | --- | --- |
| gemma4 11.91B Q4_K_M | 6.86 GiB | CUDA | pp32 | 1099.89 ± 377.14 |
| gemma4 11.91B Q4_K_M | 6.86 GiB | CUDA | pp128 | 2158.47 ± 135.86 |
| gemma4 11.91B Q4_K_M | 6.86 GiB | CUDA | pp512 | 2702.24 ± 10.83 |
| gemma4 11.91B Q4_K_M | 6.86 GiB | CUDA | tg64 | 70.30 ± 0.54 |
| gemma4 11.91B Q4_K_M | 6.86 GiB | CUDA | tg128 | 70.66 ± 0.22 |
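
The ± figures in the two result tables are easier to compare as a percentage of the mean. The sketch below parses a "mean ± stddev" cell and reports the relative spread; the function names are illustrative, and the cells are copied from the tables above.

```python
import re

def parse_speed(cell: str):
    """Split a 'mean ± stddev' cell into a (mean, stddev) pair of floats."""
    mean, std = (float(x) for x in re.split(r"\s*±\s*", cell.strip()))
    return mean, std

def rel_spread(cell: str) -> float:
    """Standard deviation as a percentage of the mean."""
    mean, std = parse_speed(cell)
    return 100.0 * std / mean

cells = {
    "Thor pp32": "366.81 ± 11.59",
    "Thor tg128": "18.65 ± 0.16",
    "3090 pp32": "1099.89 ± 377.14",
    "3090 tg128": "70.66 ± 0.22",
}
for name, cell in cells.items():
    print(f"{name}: {rel_spread(cell):.1f}% spread")
```

With only three repetitions per test, the short-prompt rows are the noisiest, so the pp32 comparison deserves less weight than pp512 or the generation rows.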

Head-to-Head Comparison

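| Test | Jetson AGX Thor | RTX 3090 | 3090 advantage |
| --- | --- | --- | --- |
| pp32 | 366.81 t/s | 1099.89 t/s | 3.0× |
| pp128 | 755.17 t/s | 2158.47 t/s | 2.9× |
| pp512 | 791.97 t/s | 2702.24 t/s | 3.4× |
| tg64 | 18.48 t/s | 70.30 t/s | 3.8× |
| tg128 | 18.65 t/s | 70.66 t/s | 3.8× |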
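
The advantage column is simply the ratio of the two mean speeds. A short sketch that recomputes it from the measured means, with variable names chosen for illustration:

```python
# Mean speeds in tokens/s, copied from the result tables
thor = {"pp32": 366.81, "pp128": 755.17, "pp512": 791.97,
        "tg64": 18.48, "tg128": 18.65}
rtx3090 = {"pp32": 1099.89, "pp128": 2158.47, "pp512": 2702.24,
           "tg64": 70.30, "tg128": 70.66}

# Ratio of RTX 3090 speed to Thor speed for each test
speedup = {test: rtx3090[test] / thor[test] for test in thor}

for test, ratio in speedup.items():
    print(f"{test}: {ratio:.1f}x")
```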

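The RTX 3090 is roughly 3–4× faster than the Jetson Thor on this model, with measured ratios between 2.9× and 3.8×. This is expected: the 3090's GDDR6X provides about 936 GB/s of memory bandwidth versus roughly 273 GB/s for Thor's unified LPDDR5X, and the 3090 has a much higher CUDA core count. Token generation at this model size is largely limited by memory bandwidth, which is why the generation gap tracks the bandwidth gap so closely.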
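
One way to connect generation speed to memory traffic is to ask how many bytes of weights must be streamed per second to sustain the measured rate. The sketch below assumes, as a simplification, that every generated token reads the full 6.86 GiB of weights exactly once and ignores KV-cache and activation traffic; it is a rough estimate, not a measurement.

```python
GIB = 1024 ** 3
MODEL_BYTES = 6.86 * GIB  # Q4_K_M file size from the result tables

def implied_read_gbps(tokens_per_s: float, model_bytes: float = MODEL_BYTES) -> float:
    """GB/s of weight reads implied by a generation rate.

    Assumption: each token reads every weight byte once.
    """
    return tokens_per_s * model_bytes / 1e9

thor_gbps = implied_read_gbps(18.65)  # tg128 on Thor
rtx_gbps = implied_read_gbps(70.66)   # tg128 on RTX 3090
print(f"Thor: ~{thor_gbps:.0f} GB/s, RTX 3090: ~{rtx_gbps:.0f} GB/s")
```

Under that assumption, the sustained read rates differ by the same factor as the tg128 speeds, since the model size cancels out of the ratio.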


Analysis

Thor: 18.65 t/s Generation

Human reading speed is 3–5 tokens per second. At 18.65 t/s, the Thor generates Gemma 4 12B output roughly 4–6× faster than a person can read — comfortable for interactive use without any perceptible lag.
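
The reading-speed comparison is a simple ratio, sketched below using the 3–5 tokens per second range stated above. The 300-token reply length is a hypothetical example, not something measured in this benchmark.

```python
READ_MIN, READ_MAX = 3.0, 5.0  # reading speed range in tokens/s, from the text

def reading_multiple(gen_tps: float):
    """Return (low, high) multiples of reading speed for a generation rate."""
    return gen_tps / READ_MAX, gen_tps / READ_MIN

def reply_seconds(n_tokens: int, gen_tps: float) -> float:
    """Time to generate a reply of n_tokens at a steady rate."""
    return n_tokens / gen_tps

low, high = reading_multiple(18.65)  # Thor tg128
print(f"Thor: {low:.1f}x to {high:.1f}x reading speed")
print(f"300-token reply: {reply_seconds(300, 18.65):.1f} s")
```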

Compared to previous Thor benchmarks:

| Model | Size | Speed |
| --- | --- | --- |
| Qwen3.6-35B-A3B FP8 (SGLang) | ~36 GB | 14.7 t/s |
| Gemma 4 12B Q4_K_M (llama.cpp) | 6.86 GiB | 18.65 t/s |
| Qwen2.5-1.5B Q4_K_M (llama.cpp) | 1.04 GB | 107–113 t/s |

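Gemma 4 12B generates faster than the 35B Qwen model while occupying roughly 5× less memory, which reduces memory bandwidth pressure during generation. The rows use different runtimes and quantization formats, so treat the comparison as indicative rather than like-for-like.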

RTX 3090: 70.66 t/s Generation

At 70.66 t/s, the RTX 3090 delivers output roughly 14–24× faster than human reading speed. This makes it suitable for:

  • High-throughput batch generation
  • Developer testing and iteration
  • Multi-user inference serving (with queuing)

The prefill speed of 2702 t/s at 512 tokens means a 512-token prompt is processed in about 190 ms, which is negligible for most interactive applications.

Memory Footprint

At 6.86 GiB, Gemma 4 12B is memory-efficient on both platforms:

  • Thor (122.8 GiB unified): model uses 5.6% of available memory — room for simultaneous models, large KV caches, and vision pipelines
  • RTX 3090 (24 GiB VRAM): model uses about 29% of VRAM, leaving comfortable headroom for long-context generation
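
The percentages above follow directly from the model size and the memory totals in the test environment table. A small sketch, with `footprint` as an illustrative helper, that also reports the memory left over for KV cache and other workloads:

```python
MODEL_GIB = 6.86

def footprint(model_gib: float, total_gib: float):
    """Return (share of memory in percent, GiB remaining after loading)."""
    share = 100.0 * model_gib / total_gib
    return share, total_gib - model_gib

thor_share, thor_free = footprint(MODEL_GIB, 122.8)
rtx_share, rtx_free = footprint(MODEL_GIB, 24.0)
print(f"Thor: {thor_share:.1f}% used, {thor_free:.1f} GiB free")
print(f"RTX 3090: {rtx_share:.1f}% used, {rtx_free:.1f} GiB free")
```

These figures count model weights only. Actual usage at runtime is higher once the KV cache and compute buffers are allocated.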

Prefill Speed

The prefill speed of 792 t/s (Thor) and 2702 t/s (3090) at 512 tokens means a 512-token prompt is processed in about 650 ms and 190 ms respectively. For conversational applications with prompts of this size, prompt processing adds well under a second before the first token on both platforms.
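
Prompt-processing latency is the prompt length divided by the prefill rate. The sketch below reproduces the two figures from the pp512 means; it assumes the pp512 rate holds for a prompt of exactly that size and makes no claim about longer prompts.

```python
def prefill_ms(prompt_tokens: int, pp_tps: float) -> float:
    """Milliseconds needed to process a prompt at a given prefill rate."""
    return 1000.0 * prompt_tokens / pp_tps

thor_ms = prefill_ms(512, 791.97)   # Thor pp512 mean
rtx_ms = prefill_ms(512, 2702.24)   # RTX 3090 pp512 mean
print(f"Thor: {thor_ms:.0f} ms, RTX 3090: {rtx_ms:.0f} ms")
```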


What This Means for Edge AI

Choose Jetson AGX Thor if you need:

  • Private, on-device multimodal AI (text + vision)
  • 12B-class reasoning without cloud dependency
  • Long-context applications (122.8 GiB of unified memory handles very large KV caches)
  • Production edge deployment with 24/7 uptime

Choose RTX 3090 if you need:

  • Higher generation throughput for batch applications
  • Faster development iteration cycles
  • x86_64 compatibility with existing ML tooling

Model Architecture Note

Gemma 4 12B includes a vision projector (mmproj), making it a vision-language model. This benchmark covers text-only inference — the projector is not loaded during text-only generation.

For vision tasks, you would load the projector alongside the model and pass image tokens with the text. Vision performance was not measured here. Expect extra latency before the first token, because each image must be encoded by the projector and its tokens added to the prompt.


Reproduce These Results

# Download model (requires HuggingFace account with Gemma terms accepted)
# Alternatively, run the model through an ollama version that supports it

# Thor
cd ~/kwkthor/llama.cpp
LD_LIBRARY_PATH=build/bin build/bin/llama-bench \
  -m /path/to/gemma-4-12B-Q4_K_M.gguf \
  -ngl 999 -fa 1 -p 32,128,512 -n 64,128 -r 3

# RTX 3090
cd ~/llama.cpp
LD_LIBRARY_PATH=build/lib build/bin/llama-bench \
  -m /path/to/gemma-4-12B-Q4_K_M.gguf \
  -ngl 999 -fa 1 -p 32,128,512 -n 64,128 -r 3
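
To compare your own runs against the tables, it can help to have llama-bench emit machine-readable output. The sketch below assumes the commands above are run with `-o json` added and that each result object carries `n_prompt`, `n_gen`, `avg_ts`, and `stddev_ts` fields; check these names against your build. The sample string is hand-written from numbers in this post, not captured tool output.

```python
import json

def summarize(bench_json: str):
    """Turn llama-bench JSON output into (test, mean t/s, stddev) rows."""
    rows = []
    for r in json.loads(bench_json):
        # Assumption: prompt tests have n_gen == 0, generation tests n_prompt == 0
        test = f"pp{r['n_prompt']}" if r["n_gen"] == 0 else f"tg{r['n_gen']}"
        rows.append((test, r["avg_ts"], r["stddev_ts"]))
    return rows

# Hand-written sample using two RTX 3090 results from the tables
sample = json.dumps([
    {"n_prompt": 512, "n_gen": 0, "avg_ts": 2702.24, "stddev_ts": 10.83},
    {"n_prompt": 0, "n_gen": 128, "avg_ts": 70.66, "stddev_ts": 0.22},
])

rows = summarize(sample)
for test, mean, std in rows:
    print(f"{test}: {mean:.2f} ± {std:.2f} t/s")
```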

Benchmark collected on June 4, 2026. Results may vary with different quantization levels, llama.cpp versions, or thermal conditions.