
Qwen3 30B vs Gemma 4 26B on RTX 3090: Full VRAM Showdown

Tags: Qwen3, Gemma4, benchmark, RTX 3090, LLM, Ollama, local inference

Two 30B-class models, one RTX 3090, full VRAM. These are live Ollama 0.21.0 measurements — no shared memory, no CPU offloading.


Why This Comparison Matters

Both models are Q4_K_M quantized and sit in the same size bracket:

| Model | File Size | Parameters | Quantization |
|---|---|---|---|
| qwen3:30b | 18 GB | 30B | Q4_K_M |
| gemma4:26b | 17 GB | 25.8B | Q4_K_M |

With a clean RTX 3090 (23.8 GB free), both load entirely into VRAM. This is the test that shows what these models actually do when given the hardware they need.

Context: In an earlier test with 9 GB occupied by background processes, Gemma 4 26B dropped to just 15–16 t/s due to VRAM overflow. Here, with full VRAM, it runs at 123 t/s — an 8× difference. VRAM availability matters enormously.


Test Environment

| Spec | Value |
|---|---|
| GPU | NVIDIA GeForce RTX 3090 |
| VRAM | 24 GB (23.8 GB free at test time) |
| Driver | 550.144.03 / CUDA 12.4 |
| OS | Ubuntu 22.04.5 LTS |
| Inference Engine | Ollama 0.21.0 |

Benchmark Method

Three prompt types were tested via the Ollama REST API, each with a warm-up run followed by one measured run:

  • Short (~26 tokens): "What is machine learning? Answer in one paragraph."
  • Long (~230 tokens): Repeated AI history passage
  • Reasoning (~30 tokens): "A farmer has 17 sheep. All but 9 die. How many sheep are left? Think step by step."

Each run generates 128 output tokens. Both metrics are computed from fields in the API response, where durations are reported in nanoseconds:

  • pp = prompt_eval_count / (prompt_eval_duration / 1e9)
  • tg = eval_count / (eval_duration / 1e9)
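The two formulas can be applied directly to the parsed JSON response. This is a minimal sketch: the helper name `throughput` and the sample numbers are made up for illustration and are not measurements from this post.

```python
def throughput(response: dict) -> tuple[float, float]:
    """Return (pp, tg) in tokens per second from an Ollama /api/generate response."""
    # Durations are in nanoseconds, so divide by 1e9 to get seconds.
    pp = response["prompt_eval_count"] / (response["prompt_eval_duration"] / 1e9)
    tg = response["eval_count"] / (response["eval_duration"] / 1e9)
    return pp, tg


# Made-up sample values, for illustration only.
sample = {
    "prompt_eval_count": 26,
    "prompt_eval_duration": 10_000_000,   # 10 ms
    "eval_count": 128,
    "eval_duration": 1_000_000_000,       # 1 s
}
pp, tg = throughput(sample)
print(f"pp={pp:.0f} t/s  tg={tg:.1f} t/s")
```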

Results

Generation Speed (tg)

| Prompt Type | Qwen3 30B | Gemma 4 26B | Winner |
|---|---|---|---|
| Short | 141.9 t/s | 123.4 t/s | Qwen3 +15% |
| Long | 140.9 t/s | 123.0 t/s | Qwen3 +15% |
| Reasoning | 141.5 t/s | 123.6 t/s | Qwen3 +14% |

Prefill Speed (pp)

| Prompt Type | Qwen3 30B | Gemma 4 26B | Winner |
|---|---|---|---|
| Short | 2,354 t/s | 2,798 t/s | Gemma4 +19% |
| Long | 10,728 t/s | 10,433 t/s | Qwen3 +3% |
| Reasoning | 4,090 t/s | 4,394 t/s | Gemma4 +7% |
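The Winner column can be recomputed from the raw numbers. The sketch below copies the values from the two tables; the function name `winner` is only illustrative, and the lead is expressed relative to the slower model.

```python
# Values copied from the two results tables (tokens per second): (Qwen3 30B, Gemma 4 26B).
results = {
    "tg": {"Short": (141.9, 123.4), "Long": (140.9, 123.0), "Reasoning": (141.5, 123.6)},
    "pp": {"Short": (2354, 2798), "Long": (10728, 10433), "Reasoning": (4090, 4394)},
}


def winner(qwen: float, gemma: float) -> tuple[str, float]:
    """Return the faster model and its lead over the slower one, in percent."""
    if qwen >= gemma:
        return "Qwen3", (qwen / gemma - 1) * 100
    return "Gemma4", (gemma / qwen - 1) * 100


for metric, rows in results.items():
    for prompt, (qwen, gemma) in rows.items():
        name, lead = winner(qwen, gemma)
        print(f"{metric} {prompt:<9} {name} +{lead:.1f}%")
```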

Analysis

Generation Speed: Qwen3 Wins Consistently

Qwen3 30B generates at about 141 t/s across all prompt types, varying by only 1 t/s between them. Gemma 4 26B sits at about 123 t/s and is equally stable. The gap is roughly 15% (between 14.5% and 15.0%) and holds regardless of prompt length or type.

Qwen3  30B  ████████████████████████████████████  141 t/s
Gemma4 26B  ████████████████████████████████      123 t/s

At 141 t/s, Qwen3 30B produces output roughly 28–47× faster than human reading speed (a range that implies a reader takes in about 3 to 5 tokens per second), well above what any interactive application needs. Even Gemma 4 at 123 t/s is far beyond the interactive threshold.

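The generation speed difference is not explained by file size alone. If every weight were read once per generated token, the 3090's 936 GB/s of bandwidth would cap an 18 GB model at about 52 t/s and a 17 GB model at about 55 t/s. Both models run well above those ceilings, so neither reads its full weight file per token. That is expected for Qwen3 30B, a mixture-of-experts model that activates only about 3B of its 30B parameters per token, and the measured speed implies Gemma 4 26B is similarly sparse. The 15% gap therefore more likely reflects how much data each model touches per token and how efficiently its kernels run than the 1 GB difference in file size.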
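The measured speeds can be sanity-checked against the bandwidth figure with a back-of-the-envelope calculation. This sketch assumes decoding is purely memory-bandwidth bound and ignores KV-cache reads and compute, so it is a rough bound rather than a model of the hardware; the function names are illustrative, and the measured speeds are the per-model averages of the generation table.

```python
BANDWIDTH_GB_S = 936.0  # RTX 3090 memory bandwidth quoted in the text


def dense_ceiling(model_gb: float) -> float:
    """Upper bound on tokens/s if all weights were read once per generated token."""
    return BANDWIDTH_GB_S / model_gb


def max_gb_per_token(tokens_per_s: float) -> float:
    """Most data that can be read per token at a given generation speed."""
    return BANDWIDTH_GB_S / tokens_per_s


for name, size_gb, measured in [("qwen3:30b", 18, 141.4), ("gemma4:26b", 17, 123.3)]:
    print(
        f"{name}: dense ceiling {dense_ceiling(size_gb):.0f} t/s, "
        f"measured {measured} t/s, "
        f"at most {max_gb_per_token(measured):.1f} GB read per token"
    )
```

Both measured speeds exceed the dense ceiling, which is only possible if a fraction of each weight file is read per token.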

Prefill Speed: Gemma 4 Leads on Short Prompts

For short prompts, Gemma 4 prefills at 2,798 t/s vs 2,354 t/s for Qwen3, a 19% advantage. On a 26-token prompt that is about 9.3 ms versus 11.0 ms, a difference of under 2 ms, which is imperceptible in any real application.

For long prompts (~230 tokens), the gap nearly disappears: Qwen3 edges ahead at 10,728 vs 10,433 t/s. Extrapolating from these rates, both would process a 512-token prompt in under 50 ms.
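Prefill latency is just prompt length divided by prefill rate. The sketch below uses the rates from the prefill table; the 512-token figure is an extrapolation that assumes the long-prompt rate still holds at that length, which was not measured here.

```python
def prefill_ms(prompt_tokens: int, pp_tokens_per_s: float) -> float:
    """Time to process the prompt, in milliseconds."""
    return prompt_tokens / pp_tokens_per_s * 1000


# Short prompt (~26 tokens), rates from the prefill table.
qwen_short = prefill_ms(26, 2354)
gemma_short = prefill_ms(26, 2798)
short_gap = qwen_short - gemma_short
print(f"26 tokens: Qwen3 {qwen_short:.1f} ms, Gemma 4 {gemma_short:.1f} ms, gap {short_gap:.1f} ms")

# Hypothetical 512-token prompt at the long-prompt rates (extrapolated, not measured).
qwen_512 = prefill_ms(512, 10728)
gemma_512 = prefill_ms(512, 10433)
print(f"512 tokens: Qwen3 {qwen_512:.1f} ms, Gemma 4 {gemma_512:.1f} ms")
```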

Consistency

Both models show near-zero variance across prompt types in generation speed, which is consistent with both fitting fully in VRAM with no partial CPU offloading. The stability of these numbers (141.4 ± 0.5 and 123.3 ± 0.3 t/s) is what full-VRAM inference looks like.
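The spread can be read straight off the generation table. In this sketch the "±" is taken as half the range between the fastest and slowest prompt type, which is one reasonable reading with only three data points per model; `summarize` is an illustrative name.

```python
from statistics import mean

# Generation speeds from the results table (Short, Long, Reasoning), in t/s.
tg = {
    "Qwen3 30B": [141.9, 140.9, 141.5],
    "Gemma 4 26B": [123.4, 123.0, 123.6],
}


def summarize(values: list[float]) -> tuple[float, float]:
    """Return (mean, half_range) of the measurements."""
    return mean(values), (max(values) - min(values)) / 2


for model, values in tg.items():
    avg, spread = summarize(values)
    print(f"{model}: {avg:.1f} ± {spread:.1f} t/s")
```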


Which to Choose for RTX 3090

Choose Qwen3 30B if:

  • Generation throughput is your bottleneck (batch processing, long outputs)
  • You're running an inference server and want to maximize requests/second
  • You prefer Alibaba's training data mix and instruction following style

Choose Gemma 4 26B if:

  • You have many short, prompt-heavy workloads where prefill matters more
  • You prefer Google's model family and its safety/instruction tuning
  • You're already using Gemma 4 in a pipeline and want size consistency

For most users, either model will feel identical in interactive chat. The 15% generation gap becomes meaningful at scale: running 1,000 sequential requests of 128 output tokens each, Qwen3 saves about 2 minutes of generation time.
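The batch estimate follows from the generation speeds. The sketch assumes the requests run one after another, each producing 128 output tokens as in the benchmark, and counts generation time only (no prefill, no model loading).

```python
def batch_seconds(requests: int, tokens_per_request: int, tg: float) -> float:
    """Total generation time for a sequential batch, in seconds."""
    return requests * tokens_per_request / tg


# Average generation speeds from the results table.
qwen_s = batch_seconds(1000, 128, 141.4)
gemma_s = batch_seconds(1000, 128, 123.3)
saved = gemma_s - qwen_s
print(f"Qwen3: {qwen_s / 60:.1f} min, Gemma 4: {gemma_s / 60:.1f} min, saved: {saved / 60:.1f} min")
```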


VRAM Is the Deciding Factor

Both models require roughly 17–18 GB to load. On an RTX 3090, this leaves ~6 GB of headroom — enough for a desktop environment, but not for another large model or a memory-hungry background process.

If your available VRAM drops below ~19 GB, expect partial CPU offloading and speeds closer to what we measured for Gemma 4 26B under memory pressure: 15–16 t/s instead of 123–141 t/s.

Check before loading:

nvidia-smi --query-gpu=memory.free --format=csv,noheader
# Aim for roughly 19000 MiB free: 17-18 GB of weights plus KV cache and runtime overhead
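The same check can be scripted so a benchmark refuses to start under memory pressure. This is a sketch: `REQUIRED_MIB` is an assumption based on the ~19 GB threshold mentioned above and should be adjusted for your context length, and the helper names are illustrative. It parses the `nvidia-smi` output format used by the command above, one line such as `24000 MiB` per GPU.

```python
import shutil
import subprocess

REQUIRED_MIB = 19000  # assumption: the ~19 GB threshold from the text


def parse_free_mib(output: str) -> list[int]:
    """Parse `nvidia-smi --query-gpu=memory.free --format=csv,noheader` output."""
    return [int(line.split()[0]) for line in output.splitlines() if line.strip()]


def free_vram_mib() -> list[int]:
    """Return free memory in MiB for each GPU visible to nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_free_mib(out)


if shutil.which("nvidia-smi"):
    try:
        for idx, free in enumerate(free_vram_mib()):
            status = "ok" if free >= REQUIRED_MIB else "likely to offload"
            print(f"GPU {idx}: {free} MiB free ({status})")
    except (subprocess.CalledProcessError, OSError) as exc:
        print(f"nvidia-smi failed: {exc}")
```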

Reproduce These Results

ollama pull qwen3:30b
ollama pull gemma4:26b

# Benchmark via API
for MODEL in qwen3:30b gemma4:26b; do
  echo "=== $MODEL ==="
  curl -s http://localhost:11434/api/generate -d "{
    \"model\": \"$MODEL\",
    \"prompt\": \"What is machine learning? Answer in one paragraph.\",
    \"stream\": false,
    \"options\": {\"num_predict\": 128}
  }" | python3 -c "
import json, sys
d = json.load(sys.stdin)
print(f'pp={d[\"prompt_eval_count\"]/(d[\"prompt_eval_duration\"]/1e9):.0f} t/s  tg={d[\"eval_count\"]/(d[\"eval_duration\"]/1e9):.1f} t/s')
"
done
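The loop above measures a single run per model. For something closer to the method described earlier (one warm-up, then one measured run), a Python version might look like the sketch below. It uses only the standard library and the same endpoint and request fields as the shell script; the function names and the injectable `post` parameter are illustrative choices, not part of Ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # same local endpoint as the script above


def post_json(url: str, payload: dict) -> dict:
    """POST a JSON payload and return the parsed JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def benchmark(model: str, prompt: str, post=post_json) -> tuple[float, float]:
    """Run one warm-up and one measured request; return (pp, tg) in tokens/s."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": 128},
    }
    post(OLLAMA_URL, payload)      # warm-up run, result discarded
    d = post(OLLAMA_URL, payload)  # measured run
    pp = d["prompt_eval_count"] / (d["prompt_eval_duration"] / 1e9)
    tg = d["eval_count"] / (d["eval_duration"] / 1e9)
    return pp, tg


# Usage, with an Ollama server running locally:
#   for model in ("qwen3:30b", "gemma4:26b"):
#       pp, tg = benchmark(model, "What is machine learning? Answer in one paragraph.")
#       print(f"{model}: pp={pp:.0f} t/s  tg={tg:.1f} t/s")
```

Passing `post` as a parameter keeps the timing logic testable without a running server.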

Benchmarks collected June 8, 2026 on Ollama 0.21.0. Results vary with VRAM availability: both models need roughly 18 to 19 GB free for full GPU inference.