Qwen3 30B vs Gemma 4 26B on RTX 3090: Full VRAM Showdown

Two 30B-class models, one RTX 3090, full VRAM. These are live Ollama 0.21.0 measurements — no shared memory, no CPU offloading.

Why This Comparison Matters

Both models are Q4_K_M quantized and sit in the same size bracket:

Model	File Size	Parameters	Quantization
`qwen3:30b`	18 GB	30B	Q4_K_M
`gemma4:26b`	17 GB	25.8B	Q4_K_M

With a clean RTX 3090 (23.8 GB free), both load entirely into VRAM. This is the test that shows what these models actually do when given the hardware they need.

Context: In an earlier test with 9 GB occupied by background processes, Gemma 4 26B dropped to just 15–16 t/s due to VRAM overflow. Here, with full VRAM, it runs at 123 t/s — an 8× difference. VRAM availability matters enormously.

Test Environment

Spec	Value
GPU	NVIDIA GeForce RTX 3090
VRAM	24 GB (23.8 GB free at test time)
Driver	550.144.03 / CUDA 12.4
OS	Ubuntu 22.04.5 LTS
Inference Engine	Ollama 0.21.0

Benchmark Method

Three prompt types were tested via the Ollama REST API, each with a warm-up run followed by a measured run:

Short (~26 tokens): "What is machine learning? Answer in one paragraph."
Long (~230 tokens): Repeated AI history passage
Reasoning (~30 tokens): "A farmer has 17 sheep. All but 9 die. How many sheep are left? Think step by step."

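Each run generates 128 output tokens. Throughput in tokens per second is computed from the counters in the API response, whose durations are reported in nanoseconds: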

pp = prompt_eval_count / (prompt_eval_duration / 1e9)
tg = eval_count / (eval_duration / 1e9)
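
The two formulas above can be wrapped in a small helper. This is a minimal sketch: the function name `throughput` and the sample values are illustrative assumptions, not taken from the benchmark logs; only the four field names come from the formulas.

```python
def throughput(response: dict) -> tuple[float, float]:
    """Return (pp, tg) in tokens/second from a generate response."""
    ns_per_s = 1e9  # durations in the response are in nanoseconds
    pp = response["prompt_eval_count"] / (response["prompt_eval_duration"] / ns_per_s)
    tg = response["eval_count"] / (response["eval_duration"] / ns_per_s)
    return pp, tg


# Illustrative values only (hypothetical, not measured)
sample = {
    "prompt_eval_count": 26,
    "prompt_eval_duration": 10_000_000,  # 10 ms
    "eval_count": 128,
    "eval_duration": 1_000_000_000,      # 1 s
}
pp, tg = throughput(sample)
print(f"pp={pp:.0f} t/s  tg={tg:.1f} t/s")
```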

Results

Generation Speed (tg)

Prompt Type	Qwen3 30B	Gemma 4 26B	Winner
Short	141.9 t/s	123.4 t/s	Qwen3 +15%
Long	140.9 t/s	123.0 t/s	Qwen3 +15%
Reasoning	141.5 t/s	123.6 t/s	Qwen3 +15%

Prefill Speed (pp)

Prompt Type	Qwen3 30B	Gemma 4 26B	Winner
Short	2,354 t/s	2,798 t/s	Gemma4 +19%
Long	10,728 t/s	10,433 t/s	Qwen3 +3%
Reasoning	4,090 t/s	4,394 t/s	Gemma4 +7%
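
The Winner columns can be recomputed from the two tables. A minimal sketch, assuming the lead is expressed as the faster rate divided by the slower rate, minus one; the helper name `lead` is hypothetical.

```python
# Values copied from the two results tables (tokens/second): (Qwen3, Gemma 4)
tg = {"Short": (141.9, 123.4), "Long": (140.9, 123.0), "Reasoning": (141.5, 123.6)}
pp = {"Short": (2354, 2798), "Long": (10728, 10433), "Reasoning": (4090, 4394)}


def lead(qwen: float, gemma: float) -> tuple[str, float]:
    """Return the faster model and its lead over the slower one, in percent."""
    if qwen >= gemma:
        return "Qwen3", (qwen / gemma - 1) * 100
    return "Gemma4", (gemma / qwen - 1) * 100


for metric, table in (("tg", tg), ("pp", pp)):
    for prompt, (q, g) in table.items():
        winner, pct = lead(q, g)
        print(f"{metric} {prompt:<9} {winner} +{pct:.1f}%")
```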

My Take

I ran this test right after seeing Gemma 4 26B crawl at 15 t/s under memory pressure — so watching both models hit triple digits with clean VRAM felt like a completely different machine. The 15% gap between Qwen3 and Gemma 4 is real but honestly not something you'd notice in everyday chat. Where I'd actually pick Qwen3 is for batch jobs where I'm generating a lot of text overnight.

Analysis

Generation Speed: Qwen3 Wins Consistently

Qwen3 30B generates at about 141 t/s across all prompt types, varying by only 1 t/s. Gemma 4 26B sits at about 123 t/s, also stable. The gap is roughly 15% (14.5% to 15.0% across the three prompts) and holds regardless of prompt length or type.

Qwen3  30B  ████████████████████████████████████  141 t/s
Gemma4 26B  ████████████████████████████████      123 t/s

At 141 t/s, Qwen3 30B produces output roughly 28–47× faster than a human reads, which corresponds to a reading speed of about 3–5 tokens per second. That is well above what any interactive application needs. Even Gemma 4 at 123 t/s is far beyond the interactive threshold.

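The generation speed difference is not explained by file size: Qwen3 30B at 18 GB is slightly larger than Gemma 4 26B at 17 GB, yet runs faster. Both also exceed what a dense model of that size could reach on the 3090's 936 GB/s of memory bandwidth, since reading all 18 GB for every token would cap generation at about 52 t/s. `qwen3:30b` is a mixture-of-experts model (Qwen3-30B-A3B, roughly 3B parameters active per token), so only a fraction of the weights is read per token, and Gemma 4's 123 t/s suggests the same holds for it. The 15% gap therefore more likely reflects how much data each model touches per token than how efficiently it uses the bandwidth.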
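
A rough bandwidth ceiling helps put these figures in context. The sketch below assumes a dense model that reads every weight once per generated token and ignores KV-cache traffic and compute limits; the function name `dense_ceiling` is hypothetical, and the sizes and measured speeds are the rounded values quoted in this article.

```python
BANDWIDTH_GB_S = 936  # RTX 3090 memory bandwidth, as quoted in the text


def dense_ceiling(model_gb: float, bandwidth_gb_s: float = BANDWIDTH_GB_S) -> float:
    """Upper bound on tokens/s if all weights were read once per token."""
    return bandwidth_gb_s / model_gb


for name, size_gb, measured in (("qwen3:30b", 18, 141), ("gemma4:26b", 17, 123)):
    ceiling = dense_ceiling(size_gb)
    print(f"{name}: dense ceiling {ceiling:.0f} t/s, measured {measured} t/s")
```

Measured speeds well above this ceiling indicate that far less than the full file is read per token.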

Prefill Speed: Gemma 4 Leads on Short Prompts

For short prompts, Gemma 4 prefills at 2,798 t/s vs 2,354 t/s for Qwen3, a 19% advantage. On a 26-token prompt this is about 9 ms versus 11 ms, a difference of under 2 ms that is imperceptible in any real application.

For long prompts (~230 tokens), the gap nearly disappears: Qwen3 edges ahead at 10,728 vs 10,433 t/s, so both prefill the prompt in about 21–22 ms. At these rates, even a 512-token prompt would take under 50 ms.
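
Prefill latency follows directly from the token count and the prefill rate. A minimal sketch, assuming the measured rate stays constant for a given prompt length; `prefill_ms` is a hypothetical helper name.

```python
def prefill_ms(tokens: int, rate_tps: float) -> float:
    """Time to process a prompt, in milliseconds, at a fixed prefill rate."""
    return tokens / rate_tps * 1000


# (label, prompt tokens, Qwen3 rate, Gemma 4 rate), rates from the prefill table
cases = (
    ("short", 26, 2354, 2798),
    ("long", 230, 10728, 10433),
    ("512-token", 512, 10728, 10433),  # extrapolated at the long-prompt rate
)
for label, tokens, qwen_rate, gemma_rate in cases:
    q, g = prefill_ms(tokens, qwen_rate), prefill_ms(tokens, gemma_rate)
    print(f"{label}: Qwen3 {q:.1f} ms, Gemma 4 {g:.1f} ms, diff {abs(q - g):.1f} ms")
```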

Consistency

Both models show very little variance across prompt types in generation speed, a sign that both fit fully in VRAM with no partial CPU offloading. The stability of these numbers (141.4 ± 0.5 t/s and 123.3 ± 0.3 t/s) is what full-VRAM inference looks like.
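
The spread can be recomputed from the three generation-speed measurements per model. A minimal sketch, taking the centre as the mean and the spread as half the range; that choice is an assumption, since the text does not say how the ± figures were derived.

```python
from statistics import mean

# Generation speeds from the results table (Short, Long, Reasoning), in t/s
runs = {
    "qwen3:30b": [141.9, 140.9, 141.5],
    "gemma4:26b": [123.4, 123.0, 123.6],
}


def summarize(values: list[float]) -> tuple[float, float]:
    """Return (mean, half-range) of a list of measurements."""
    return mean(values), (max(values) - min(values)) / 2


for model, values in runs.items():
    centre, half_range = summarize(values)
    print(f"{model}: {centre:.1f} ± {half_range:.1f} t/s")
```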

Which to Choose for RTX 3090

Choose Qwen3 30B if:

Generation throughput is your bottleneck (batch processing, long outputs)
You're running an inference server and want to maximize requests/second
You prefer Alibaba's training data mix and instruction following style

Choose Gemma 4 26B if:

You have many short, prompt-heavy workloads where prefill matters more
You prefer Google's model family and its safety/instruction tuning
You're already using Gemma 4 in a pipeline and want size consistency

For most users, either model will feel identical in interactive chat. The 15% generation gap becomes meaningful at scale: across 1,000 sequential batch requests of 128 output tokens each, Qwen3 saves about 2 minutes of generation time.
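
The batch estimate can be reproduced with a few lines of arithmetic. This sketch assumes sequential requests of 128 output tokens each (the benchmark's output length) and ignores prefill time and per-request overhead; `batch_seconds` is a hypothetical helper name.

```python
def batch_seconds(requests: int, tokens_per_request: int, tg_tps: float) -> float:
    """Total generation time for sequential requests at a fixed tg rate."""
    return requests * tokens_per_request / tg_tps


qwen = batch_seconds(1000, 128, 141)   # Qwen3 30B at ~141 t/s
gemma = batch_seconds(1000, 128, 123)  # Gemma 4 26B at ~123 t/s
saved_min = (gemma - qwen) / 60
print(f"Qwen3: {qwen:.0f} s, Gemma 4: {gemma:.0f} s, saved: {saved_min:.1f} min")
```

Longer outputs scale the saving proportionally, so jobs that generate thousands of tokens per request see a correspondingly larger difference.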

VRAM Is the Deciding Factor

Both models require roughly 17–18 GB to load. On an RTX 3090, this leaves ~6 GB of headroom — enough for a desktop environment, but not for another large model or a memory-hungry background process.

If your available VRAM drops below ~19 GB, expect partial CPU offloading and speeds similar to what we measured under memory pressure: 15–16 t/s instead of 123–141 t/s.

Check before loading:

nvidia-smi --query-gpu=memory.free --format=csv,noheader
# Need at least 18000 MiB for either model
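
The same check can be scripted. A minimal sketch, assuming `nvidia-smi` is on the PATH; it adds `nounits` to the format string so the output is a bare number in MiB. The helper names and the `REQUIRED_MIB` constant are hypothetical, with the threshold taken from the comment above.

```python
import subprocess

REQUIRED_MIB = 18000  # threshold from the shell comment above


def parse_free_mib(smi_output: str) -> list[int]:
    """Parse csv,noheader,nounits output: one integer (MiB) per GPU line."""
    return [int(line.strip()) for line in smi_output.splitlines() if line.strip()]


def free_vram_mib() -> list[int]:
    """Query free VRAM per GPU in MiB via nvidia-smi."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_free_mib(result.stdout)


def has_headroom(free_mib: int, required_mib: int = REQUIRED_MIB) -> bool:
    """True if the GPU has at least the required free memory."""
    return free_mib >= required_mib
```

Calling `has_headroom(free_vram_mib()[0])` before loading a model gives a quick yes/no answer for the first GPU.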

Reproduce These Results

ollama pull qwen3:30b
ollama pull gemma4:26b

# Benchmark via API
for MODEL in qwen3:30b gemma4:26b; do
  echo "=== $MODEL ==="
  curl -s http://localhost:11434/api/generate -d "{
    \"model\": \"$MODEL\",
    \"prompt\": \"What is machine learning? Answer in one paragraph.\",
    \"stream\": false,
    \"options\": {\"num_predict\": 128}
  }" | python3 -c "
import json, sys
d = json.load(sys.stdin)
print(f'pp={d[\"prompt_eval_count\"]/(d[\"prompt_eval_duration\"]/1e9):.0f} t/s  tg={d[\"eval_count\"]/(d[\"eval_duration\"]/1e9):.1f} t/s')
"
done
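
The shell loop above takes a single measurement per model. A Python sketch of the warm-up-then-measure procedure described under Benchmark Method might look like the following. It assumes the warm-up uses the same prompt as the measured run, which the text does not specify, and the function names are hypothetical; the endpoint and request fields are the ones used in the curl call.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # same endpoint as the curl loop


def build_payload(model: str, prompt: str, num_predict: int = 128) -> bytes:
    """Encode a non-streaming generate request body."""
    body = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": num_predict},
    }
    return json.dumps(body).encode("utf-8")


def generate(model: str, prompt: str) -> dict:
    """Send one request and return the parsed JSON response."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def measure(model: str, prompt: str) -> tuple[float, float]:
    """Warm-up run, then one measured run; returns (pp, tg) in tokens/s."""
    generate(model, prompt)      # warm-up: result discarded
    d = generate(model, prompt)  # measured run
    pp = d["prompt_eval_count"] / (d["prompt_eval_duration"] / 1e9)
    tg = d["eval_count"] / (d["eval_duration"] / 1e9)
    return pp, tg
```

With an Ollama server running locally, `measure("qwen3:30b", "What is machine learning? Answer in one paragraph.")` returns the two throughput figures for one prompt type.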

Benchmarks collected June 8, 2026 on Ollama 0.21.0. Results vary with VRAM availability — both models require ~18 GB free for full GPU inference.