MultimodalFlow
← Back to Blog

LLM Inference Benchmarks on Jetson AGX Orin

JetsonLLMedge inferencebenchmark

Test Environment

  • Device: NVIDIA Jetson AGX Orin 64GB
  • JetPack: 6.1
  • Inference backend: llama.cpp (CUDA)

Results

| Model | Quant | Token/s | TTFT | VRAM | |-------|-------|---------|------|------| | Llama 3.1 8B | Q4_K_M | 28 | 1.2s | 5.8 GB | | Qwen2.5 7B | Q4_K_M | 31 | 1.0s | 5.2 GB | | Phi-3 Mini | Q4_K_M | 47 | 0.7s | 2.8 GB |

Takeaways

Phi-3 Mini leads in speed and memory efficiency, making it a solid choice for conversational use cases. Qwen2.5 7B performs better in Chinese and is the preferred model for Chinese knowledge-base Q&A.

# Start inference server
./llama-server -m qwen2.5-7b-q4_k_m.gguf -ngl 999 --host 0.0.0.0 --port 8080

Comparisons with Jetson AGX Thor and DGX Spark will be added later.