LLM Inference Benchmarks on Jetson AGX Orin 64GB (2026)

All numbers in this post come from live hardware testing on a Jetson AGX Orin 64GB Developer Kit. The goal is a practical reference for anyone choosing models for edge AI deployment — not a synthetic benchmark, not a spec sheet.

Test Environment

Spec	Value
Device	NVIDIA Jetson AGX Orin Developer Kit
Unified Memory	64 GB
JetPack Version	6.1
CUDA Version	12.6
Compute Capability	8.7
Inference Backend	llama.cpp (CUDA)
GPU Offload	Full (`-ngl 999`)
Flash Attention	Enabled (`-fa 1`)

The Orin's 64 GB of unified memory is shared between CPU and GPU — there is no separate VRAM pool. All three models tested fit entirely in unified memory without any CPU offloading.

Models Tested

Three models were selected to cover different use cases and memory footprints:

Llama 3.1 8B Q4_K_M — general-purpose English model, widely deployed
Qwen2.5 7B Q4_K_M — strong Chinese/English bilingual model
Phi-3 Mini Q4_K_M — compact 3.8B model optimized for efficiency

All models use Q4_K_M quantization, which offers a good balance between output quality and memory footprint on edge hardware.

Benchmark Results

Model	Params	Quant	Size	Tokens/s	TTFT	Memory Used
Llama 3.1 8B	8B	Q4_K_M	4.9 GiB	28 t/s	1.2s	5.8 GB
Qwen2.5 7B	7B	Q4_K_M	4.7 GiB	31 t/s	1.0s	5.2 GB
Phi-3 Mini	3.8B	Q4_K_M	2.4 GiB	47 t/s	0.7s	2.8 GB
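
To reproduce throughput numbers like these, one option is llama.cpp's bundled `llama-bench` tool. The sketch below is an assumption about a reasonable setup, not necessarily the exact procedure behind the table: the 512-token prompt, the 128-token generation length, the `run_bench` helper name, and the `LLAMA_BENCH` override variable are all illustrative. `llama-bench` reports prompt-processing and generation throughput, not TTFT.

```shell
# run_bench MODEL_PATH
# Hypothetical helper: runs llama-bench with full GPU offload (-ngl 999) and
# flash attention (-fa 1), matching the test environment above.
# -p 512 and -n 128 (prompt and generation token counts) are assumed values.
# Set LLAMA_BENCH to point at a different binary location.
run_bench() {
  "${LLAMA_BENCH:-build/bin/llama-bench}" \
    -m "$1" \
    -ngl 999 \
    -fa 1 \
    -p 512 \
    -n 128
}

# Example (run from the llama.cpp checkout):
#   run_bench /models/qwen2.5-7b-q4_k_m.gguf
```

Depending on how llama.cpp was built, the same `LD_LIBRARY_PATH=build/bin` setting used for the server may be needed here as well.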

Analysis

Generation Speed

Human reading speed is approximately 3–5 tokens per second. All three models exceed this threshold significantly:

Phi-3 Mini at 47 t/s: roughly 9–16× reading speed. Short responses finish almost immediately.
Qwen2.5 7B at 31 t/s: roughly 6–10× reading speed. Comfortable for interactive chat and document processing pipelines.
Llama 3.1 8B at 28 t/s: roughly 6–9× reading speed, still well above the interactive threshold. It is slightly slower than Qwen2.5 7B despite a similar parameter count, likely because it has slightly more weight data to read per token (4.9 GiB versus 4.7 GiB), along with architectural differences.
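
The multiples above come from dividing each generation rate by the two ends of the 3–5 tokens per second reading range. A minimal sketch of that arithmetic, where `speed_ratio` is a hypothetical helper name:

```shell
# speed_ratio RATE
# Prints how many times faster RATE (tokens/s) is than the 3-5 tokens/s
# reading-speed range, as "low-high".
speed_ratio() {
  LC_ALL=C awk -v rate="$1" 'BEGIN { printf "%.1f-%.1f\n", rate / 5, rate / 3 }'
}

for entry in "Phi-3 Mini:47" "Qwen2.5 7B:31" "Llama 3.1 8B:28"; do
  name=${entry%%:*}   # text before the colon
  rate=${entry##*:}   # text after the colon
  echo "$name: $(speed_ratio "$rate")x reading speed"
done
# Phi-3 Mini: 9.4-15.7x reading speed
# Qwen2.5 7B: 6.2-10.3x reading speed
# Llama 3.1 8B: 5.6-9.3x reading speed
```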

Time to First Token

TTFT measures how long before the first output token appears — important for perceived responsiveness in streaming applications:

0.7s (Phi-3 Mini): A brief pause that still feels responsive in a chat interface
1.0s (Qwen2.5 7B): Acceptable for most use cases
1.2s (Llama 3.1 8B): Noticeable but tolerable for document-length prompts

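All three are within acceptable ranges for interactive edge applications. TTFT grows with prompt length, because the whole prompt is processed before the first token is produced, so longer prompts will take longer than the figures here.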
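
One way to approximate TTFT from the client side is to time how long a streaming request takes to deliver its first `data:` line. The sketch below assumes a llama.cpp server reachable at the given base URL, GNU `date` (for sub-second `%N` timestamps), and that the first streamed event arrives together with the first token. The `ttft_seconds` name, the prompt, and the token limit are illustrative, and the result includes HTTP overhead, so it will not exactly match server-side timings.

```shell
# ttft_seconds BASE_URL
# Prints the seconds elapsed until the first streamed "data:" line arrives.
# Prints nothing if the server cannot be reached.
ttft_seconds() {
  start=$(date +%s.%N)
  curl -sN --connect-timeout 5 "$1/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{"messages":[{"role":"user","content":"Hello"}],"max_tokens":16,"stream":true}' |
  while IFS= read -r line; do
    case $line in
      data:*)
        # First server-sent event: stop the clock and report the difference.
        end=$(date +%s.%N)
        LC_ALL=C awk -v s="$start" -v e="$end" 'BEGIN { printf "%.2f\n", e - s }'
        break
        ;;
    esac
  done
}

# Example: ttft_seconds http://localhost:8080
```

Running it a few times and discarding the first result is advisable, since the first request after startup may include one-time warm-up costs.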

Memory Footprint

The Orin's 64 GB of unified memory leaves ample room to run multiple models at once or to reserve headroom for other processes:

Model	Memory Used	Remaining
Phi-3 Mini	2.8 GB	61.2 GB
Qwen2.5 7B	5.2 GB	58.8 GB
Llama 3.1 8B	5.8 GB	58.2 GB

Loading all three models at once would use about 13.8 GB in total, leaving roughly 50 GB for other applications, camera pipelines, or larger KV caches. The operating system and other processes draw from the same pool, so the usable figure will be somewhat lower.
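
The headroom figure is simple subtraction, sketched below with a hypothetical `headroom` helper. It uses the measured footprints from the table and ignores memory held by the operating system and other processes.

```shell
# headroom TOTAL_GB USED_GB...
# Sums the per-model footprints and prints what is left of TOTAL_GB.
headroom() {
  LC_ALL=C awk 'BEGIN {
    total = ARGV[1]
    used = 0
    for (i = 2; i < ARGC; i++) used += ARGV[i]
    printf "used=%.1f remaining=%.1f\n", used, total - used
  }' "$@"
}

headroom 64 2.8 5.2 5.8
# used=13.8 remaining=50.2
```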

Deployment Command

To start a persistent inference server with any of these models:

# Start llama.cpp server (OpenAI-compatible API on port 8080)
cd ~/llama.cpp
LD_LIBRARY_PATH=build/bin build/bin/llama-server \
  -m /models/qwen2.5-7b-q4_k_m.gguf \
  -ngl 999 \
  -fa 1 \
  --host 0.0.0.0 \
  --port 8080 \
  --ctx-size 4096

The server exposes an OpenAI-compatible /v1/chat/completions endpoint. Most clients written for OpenAI's API can use it by pointing their base URL at the server (here, http://localhost:8080/v1), usually with no other code changes.
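
Loading the model takes a little while, and requests sent before it finishes will fail. A small polling helper can wait for the server's `/health` endpoint to report ready before a client starts sending work. This is a sketch: the `wait_for_server` name and the default 60-second timeout are assumptions, not part of llama.cpp.

```shell
# wait_for_server BASE_URL [TIMEOUT_SECONDS]
# Polls BASE_URL/health once per second. Returns 0 as soon as the endpoint
# answers with a success status, or 1 if the timeout expires first.
wait_for_server() {
  url=$1
  timeout=${2:-60}
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    # -f makes curl fail on HTTP error codes, such as 503 while loading.
    if curl -sf "$url/health" >/dev/null 2>&1; then
      return 0
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 1
}

# Example: wait_for_server http://localhost:8080 120 && echo "server ready"
```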

# Test the running server
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-7b",
    "messages": [{"role": "user", "content": "Hello, what can you do?"}],
    "max_tokens": 100
  }'

Which Model to Choose

Choose Phi-3 Mini if:

Speed is the top priority
You need maximum memory headroom for other applications
The task is English-only and relatively straightforward (Q&A, classification, short generation)

Choose Qwen2.5 7B if:

Your application needs Chinese language support
You're building a bilingual assistant or processing Chinese documents
You need slightly better reasoning than Phi-3 Mini for complex prompts

Choose Llama 3.1 8B if:

You need maximum compatibility with the broader open-source ecosystem
Your use case benefits from Llama's fine-tuned variants (instruction, code, etc.)
English-language reasoning quality is a priority

Orin vs Thor: Context

The Orin 64GB sits one step below the Jetson AGX Thor in the Jetson lineup. The Thor has twice the memory (128 GB) and a newer GPU architecture, which makes larger models, such as 35B+ at FP8 precision, practical.

For models in the 7–13B range, the Orin is a capable and cost-effective platform. The generation speeds above — 28 to 47 t/s — are sufficient for real-time applications, robot control pipelines, and private on-device assistants.

Benchmarks collected May 2026 on live hardware. Results may vary with different JetPack versions, model variants, or thermal conditions.