Qwen3 30B vs Gemma 4 26B on RTX 3090: Full VRAM Showdown
Two 30B-class models, one RTX 3090, full VRAM. These are live Ollama 0.21.0 measurements — no shared memory, no CPU offloading.
Why This Comparison Matters
Both models are Q4_K_M quantized and sit in the same size bracket:
| Model | File Size | Parameters | Quantization |
|---|---|---|---|
| qwen3:30b | 18 GB | 30B | Q4_K_M |
| gemma4:26b | 17 GB | 25.8B | Q4_K_M |
With a clean RTX 3090 (23.8 GB free), both load entirely into VRAM. This is the test that shows what these models actually do when given the hardware they need.
Context: In an earlier test with 9 GB occupied by background processes, Gemma 4 26B dropped to just 15–16 t/s due to VRAM overflow. Here, with full VRAM, it runs at 123 t/s — an 8× difference. VRAM availability matters enormously.
Test Environment
| Spec | Value |
|---|---|
| GPU | NVIDIA GeForce RTX 3090 |
| VRAM | 24 GB (23.8 GB free at test time) |
| Driver | 550.144.03 / CUDA 12.4 |
| OS | Ubuntu 22.04.5 LTS |
| Inference Engine | Ollama 0.21.0 |
Benchmark Method
Three prompt types were tested via the Ollama REST API, each with a warm-up run followed by a measured run:
- Short (~26 tokens): "What is machine learning? Answer in one paragraph."
- Long (~230 tokens): Repeated AI history passage
- Reasoning (~30 tokens): "A farmer has 17 sheep. All but 9 die. How many sheep are left? Think step by step."
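Each run generates 128 output tokens. Metrics are computed from the response fields, whose durations are reported in nanoseconds:
- pp (prefill speed) = prompt_eval_count / (prompt_eval_duration / 1e9)
- tg (generation speed) = eval_count / (eval_duration / 1e9)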
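As a minimal sketch, the two formulas can be written as a small Python helper. The function name `throughput` and the sample numbers are illustrative assumptions, not values from the benchmark; the field names are those of the Ollama response.

```python
def throughput(resp: dict) -> tuple[float, float]:
    """Return (pp, tg) in tokens per second from an Ollama /api/generate response."""
    ns_per_s = 1e9  # Ollama reports durations in nanoseconds
    pp = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / ns_per_s)
    tg = resp["eval_count"] / (resp["eval_duration"] / ns_per_s)
    return pp, tg


# Made-up sample response, for illustration only (not a measured result)
sample = {
    "prompt_eval_count": 26,
    "prompt_eval_duration": 10_000_000,   # 10 ms
    "eval_count": 128,
    "eval_duration": 1_000_000_000,       # 1 s
}
pp, tg = throughput(sample)
print(f"pp={pp:.0f} t/s  tg={tg:.1f} t/s")
```

With the sample above, 26 tokens in 10 ms gives 2,600 t/s of prefill and 128 tokens in 1 s gives 128 t/s of generation.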
Results
Generation Speed (tg)
| Prompt Type | Qwen3 30B | Gemma 4 26B | Winner |
|---|---|---|---|
| Short | 141.9 t/s | 123.4 t/s | Qwen3 +15% |
| Long | 140.9 t/s | 123.0 t/s | Qwen3 +15% |
| Reasoning | 141.5 t/s | 123.6 t/s | Qwen3 +15% |
Prefill Speed (pp)
| Prompt Type | Qwen3 30B | Gemma 4 26B | Winner |
|---|---|---|---|
| Short | 2,354 t/s | 2,798 t/s | Gemma4 +19% |
| Long | 10,728 t/s | 10,433 t/s | Qwen3 +3% |
| Reasoning | 4,090 t/s | 4,394 t/s | Gemma4 +7% |
Analysis
Generation Speed: Qwen3 Wins Consistently
Qwen3 30B generates at about 141 t/s across all prompt types, varying by less than 1%. Gemma 4 26B sits at about 123 t/s, equally stable. The gap is roughly 15% and holds regardless of prompt length or type.
Qwen3 30B ████████████████████████████████████ 141 t/s
Gemma4 26B ████████████████████████████████ 123 t/s
At 141 t/s, Qwen3 30B produces output roughly 28–47× faster than human reading speed (which implies a reading rate of about 3–5 tokens per second), well above what any interactive application needs. Even Gemma 4 at 123 t/s is far beyond any interactive threshold.
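The gap cannot be explained by file size alone. Qwen3 30B's file is slightly larger (18 GB vs 17 GB), yet it runs faster. In fact, reading 18 GB for each of 141 tokens per second would take about 2.5 TB/s, far above the 3090's 936 GB/s, so neither model can be reading all of its weights for every token. qwen3:30b is a mixture-of-experts model (Qwen3-30B-A3B, roughly 3B parameters active per token), and Gemma 4's 123 t/s implies that it also touches only a fraction of its 17 GB per token. Generation speed therefore depends on the bytes actually read per token, not on total file size.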
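A quick sanity check on the bandwidth argument: the sketch below multiplies each file size by its generation speed to get the bandwidth that would be needed if every token read the whole weight file. File sizes come from the model table and speeds are the averages of the three measured runs; treating the file size as the bytes read per token is a simplifying assumption.

```python
PEAK_BANDWIDTH_GB_S = 936  # RTX 3090 memory bandwidth quoted in the text

# (file size in GB, average measured generation speed in t/s)
models = {
    "qwen3:30b": (18, 141.4),
    "gemma4:26b": (17, 123.3),
}

required = {}
for name, (size_gb, tg) in models.items():
    # Assumption: a dense model reads its whole weight file once per token
    required[name] = size_gb * tg
    ratio = required[name] / PEAK_BANDWIDTH_GB_S
    print(f"{name}: {required[name]:.0f} GB/s needed, {ratio:.1f}x the peak")
```

Both figures come out above 936 GB/s, so under this assumption neither model can be reading its full weight file for every generated token.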
Prefill Speed: Gemma 4 Leads on Short Prompts
For short prompts, Gemma 4 prefills at 2,798 t/s vs 2,354 t/s for Qwen3, a 19% advantage. On a 26-token prompt this is a difference of under 2 ms (about 9.3 ms vs 11.0 ms), which is imperceptible in any real application.
For long prompts (~230 tokens), the gap nearly disappears: Qwen3 edges ahead at 10,728 vs 10,433 t/s. At these rates, both would process a 512-token prompt in under 50 ms.
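The latency figures follow directly from the prefill rates. The sketch below is a worked calculation; the helper name `prefill_ms` is hypothetical, and the 512-token case assumes the long-prompt rates still hold at that length, which the benchmark did not measure.

```python
def prefill_ms(tokens: int, pp: float) -> float:
    """Time in milliseconds to prefill `tokens` prompt tokens at `pp` tokens/s."""
    return tokens / pp * 1000


# Short prompt: 26 tokens at the measured short-prompt rates
short_qwen = prefill_ms(26, 2354)
short_gemma = prefill_ms(26, 2798)
print(f"short: qwen3 {short_qwen:.1f} ms, gemma4 {short_gemma:.1f} ms, "
      f"gap {short_qwen - short_gemma:.1f} ms")

# Hypothetical 512-token prompt, assuming the long-prompt rates still hold
for name, pp in [("qwen3", 10728), ("gemma4", 10433)]:
    print(f"512 tokens, {name}: {prefill_ms(512, pp):.1f} ms")
```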
Consistency
Both models show near-zero variance across prompt types in generation speed, a sign that both fit fully in VRAM with no partial CPU offloading. The stability of these numbers (141.4 ± 0.5 and 123.3 ± 0.3 t/s) is what full-VRAM inference looks like.
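For reference, the spread can be recomputed from the three measured generation speeds per model. This is a small sketch that reports the midpoint of the observed range plus or minus half its width; with only three runs per model it describes the range, not a statistical variance.

```python
# Generation speeds (t/s) for the short, long, and reasoning prompts
tg = {
    "qwen3:30b": [141.9, 140.9, 141.5],
    "gemma4:26b": [123.4, 123.0, 123.6],
}

spread = {}
for name, runs in tg.items():
    mid = (max(runs) + min(runs)) / 2    # midpoint of the observed range
    half = (max(runs) - min(runs)) / 2   # half-width of the observed range
    spread[name] = (mid, half)
    print(f"{name}: {mid:.1f} ± {half:.1f} t/s ({half / mid:.2%} of the midpoint)")
```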
Which to Choose for RTX 3090
Choose Qwen3 30B if:
- Generation throughput is your bottleneck (batch processing, long outputs)
- You're running an inference server and want to maximize requests/second
- You prefer Alibaba's training data mix and instruction-following style
Choose Gemma 4 26B if:
- You have many short, prompt-heavy workloads where prefill matters more
- You prefer Google's model family and its safety/instruction tuning
- You're already using Gemma 4 in a pipeline and want size consistency
For most users, either model will feel identical in interactive chat. The 15% generation gap becomes meaningful at scale: across 1000 sequential requests of 128 output tokens each, Qwen3 saves about 2 minutes of generation time.
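The batch estimate can be reproduced with a few lines of arithmetic. This sketch assumes requests run one at a time, each generating 128 tokens as in the benchmark, and it ignores prefill and model load time; the speeds are the averages of the measured runs.

```python
REQUESTS = 1000
TOKENS_PER_REQUEST = 128  # same output length as the benchmark runs


def batch_seconds(tg: float) -> float:
    """Generation time for the whole batch, run one request at a time."""
    return REQUESTS * TOKENS_PER_REQUEST / tg


qwen_s = batch_seconds(141.4)
gemma_s = batch_seconds(123.3)
saved_s = gemma_s - qwen_s
print(f"qwen3 {qwen_s:.0f} s, gemma4 {gemma_s:.0f} s, saved {saved_s / 60:.1f} min")
```

Longer outputs scale the saving proportionally, since the per-token gap is constant.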
VRAM Is the Deciding Factor
Both models require roughly 17–18 GB to load. On an RTX 3090, this leaves ~6 GB of headroom — enough for a desktop environment, but not for another large model or a memory-hungry background process.
If your available VRAM drops below ~19 GB, expect partial CPU offloading and speeds similar to what we measured under memory pressure: 15–16 t/s instead of 123–141 t/s.
Check before loading:
nvidia-smi --query-gpu=memory.free --format=csv,noheader
# Need at least 18000 MiB for either model
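The same check can be scripted. The following is a sketch, assuming `nvidia-smi` is on the PATH and prints one line per GPU in the form `24250 MiB`; the helper names and the sample string are hypothetical, and the 18000 MiB threshold is the one from the comment above.

```python
import subprocess

REQUIRED_MIB = 18000  # threshold from the shell comment above


def parse_free_mib(smi_output: str) -> list[int]:
    """Parse lines like '24250 MiB' (one per GPU) into integers."""
    return [int(line.split()[0]) for line in smi_output.splitlines() if line.strip()]


def has_room(free_mib: int) -> bool:
    """True if a GPU has enough free memory for either model."""
    return free_mib >= REQUIRED_MIB


def query_free_mib() -> list[int]:
    """Run the nvidia-smi query shown above and return free MiB per GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_free_mib(out)


# Illustrative output string, not a measured value
sample = "24250 MiB\n"
print([has_room(mib) for mib in parse_free_mib(sample)])
```

On a machine with an NVIDIA driver installed, calling `query_free_mib()` returns the live values in place of the sample string.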
Reproduce These Results
ollama pull qwen3:30b
ollama pull gemma4:26b
# Benchmark via API (one run per model; repeat and keep the later result to discard warm-up)
for MODEL in qwen3:30b gemma4:26b; do
  echo "=== $MODEL ==="
  # Non-streaming request; the response JSON carries the timing fields
  curl -s http://localhost:11434/api/generate -d "{
    \"model\": \"$MODEL\",
    \"prompt\": \"What is machine learning? Answer in one paragraph.\",
    \"stream\": false,
    \"options\": {\"num_predict\": 128}
  }" | python3 -c "
import json, sys
d = json.load(sys.stdin)
# Durations are in nanoseconds, hence the 1e9 divisor
print(f'pp={d[\"prompt_eval_count\"]/(d[\"prompt_eval_duration\"]/1e9):.0f} t/s tg={d[\"eval_count\"]/(d[\"eval_duration\"]/1e9):.1f} t/s')
"
done
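The loop above measures a single run. As a rough Python sketch of the warm-up plus measured-run method described earlier, the following uses only the standard library. The function names are hypothetical, and it assumes the warm-up uses the same prompt as the measured run, which the article does not state. If the server reuses cached prompt state between identical requests, the prefill figure from the second run may not be comparable.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # local endpoint used above


def build_payload(model: str, prompt: str, num_predict: int = 128) -> bytes:
    """Encode a non-streaming generate request as JSON bytes."""
    body = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": num_predict},
    }
    return json.dumps(body).encode("utf-8")


def generate(model: str, prompt: str) -> dict:
    """Send one request and return the parsed JSON response."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def benchmark(model: str, prompt: str) -> tuple[float, float]:
    """Warm-up run (discarded), then one measured run; returns (pp, tg) in t/s."""
    generate(model, prompt)      # warm-up: loads the model into VRAM
    d = generate(model, prompt)  # measured run
    pp = d["prompt_eval_count"] / (d["prompt_eval_duration"] / 1e9)
    tg = d["eval_count"] / (d["eval_duration"] / 1e9)
    return pp, tg
```

With an Ollama server running locally, `benchmark("qwen3:30b", "What is machine learning? Answer in one paragraph.")` returns the two rates for the measured run.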
Benchmarks collected June 8, 2026 on Ollama 0.21.0. Results vary with VRAM availability — both models require ~18 GB free for full GPU inference.