LocateAnything-3B Benchmark: RTX 3090 vs Jetson AGX Thor

NVIDIA's LocateAnything-3B is a 3-billion-parameter vision-language model that handles object detection, phrase grounding, text detection, and GUI pointing through natural language prompts. Unlike classical detection models, it takes a text description and returns bounding boxes — no class-specific training required.

I ran it on two pieces of hardware I had available: an RTX 3090 workstation and a Jetson AGX Thor edge device. Same model, same code, real numbers.

Hardware and Software

	RTX 3090	Jetson AGX Thor
GPU	GeForce RTX 3090	NVIDIA Thor (integrated)
VRAM / Memory	24 GB GDDR6X	128 GB unified memory
CUDA	12.4	13.0
PyTorch	2.6.0+cu124	2.12.0+cu130
Transformers	4.57.1	4.57.1
Model dtype	bfloat16	bfloat16
Model path	local NVMe	local NVMe

The model was loaded from local NVMe storage on both machines, so the load times below do not include any network download.

What LocateAnything-3B Does

The model takes an image and a natural language query, and returns structured bounding box coordinates. The supported task types are:

Object detection — "Locate all instances matching: person</c>car</c>bicycle"
Phrase grounding — "Locate a single instance: red handbag on the left side"
Text detection — "Detect all text in box format"
Pointing — "Point to: the submit button"
GUI grounding — useful for screen interaction and UI automation
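
The prompt strings above follow a fixed pattern, so they are easy to build programmatically. The sketch below assumes the templates are exactly as quoted in the list, with `</c>` separating categories; the helper names are hypothetical and not part of the model's API.

```python
# Prompt templates copied from the task list; helper names are illustrative only.
CATEGORY_SEPARATOR = "</c>"
TEXT_DETECTION_PROMPT = "Detect all text in box format"


def detection_prompt(categories):
    """Build a detection prompt for one or more category names."""
    if not categories:
        raise ValueError("at least one category is required")
    return "Locate all instances matching: " + CATEGORY_SEPARATOR.join(categories)


def grounding_prompt(phrase):
    """Build a single-instance phrase grounding prompt."""
    return f"Locate a single instance: {phrase}"


def pointing_prompt(target):
    """Build a pointing prompt."""
    return f"Point to: {target}"


print(detection_prompt(["person", "car", "bicycle"]))
print(grounding_prompt("red handbag on the left side"))
print(pointing_prompt("the submit button"))
```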

Output format is structured token sequences like <ref>label</ref><box><x1><y1><x2><y2></box>, with coordinates normalized to 0–1000. The model has three generation modes: fast (multi-token prediction), slow (standard next-token), and hybrid (adaptive).
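
To use the boxes downstream, the 0–1000 coordinates need to be scaled back to pixels. The following is a minimal parsing sketch, not the model's own post-processing. It assumes each coordinate is serialized as an integer inside angle brackets (for example `<box><100><250><500><750></box>`) and that each `<ref>` is followed by exactly one `<box>`; adjust the pattern if the real output differs.

```python
import re

# Assumed serialization: <ref>label</ref><box><x1><y1><x2><y2></box>,
# with every coordinate an integer in the 0-1000 range.
BOX_PATTERN = re.compile(
    r"<ref>(?P<label>.*?)</ref>\s*<box><(\d+)><(\d+)><(\d+)><(\d+)></box>"
)


def parse_boxes(text, width, height, scale=1000):
    """Return a list of (label, (x1, y1, x2, y2)) with coordinates in pixels."""
    results = []
    for match in BOX_PATTERN.finditer(text):
        x1, y1, x2, y2 = (int(match.group(i)) for i in range(2, 6))
        box_px = (
            x1 * width / scale,
            y1 * height / scale,
            x2 * width / scale,
            y2 * height / scale,
        )
        results.append((match.group("label"), box_px))
    return results


# Hand-written sample string for illustration, not real model output.
sample = "<ref>person</ref><box><100><250><500><750></box>"
print(parse_boxes(sample, width=640, height=480))
```

Scaling by the image width and height separately matters for non-square inputs such as the 640×480 test image.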

Benchmark Results

All tests used a 640×480 synthetic test image. Each task was timed after a warm-up pass, so one-time initialization costs such as JIT compilation are excluded from the reported latencies. The throughput test ran 10 iterations of the same single-category detection query.
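
The warm-up-then-measure procedure can be sketched as a small timing helper. This is an illustration of the method, not the benchmark script itself: `time_call` and its parameters are hypothetical names, and the optional `sync` hook is where a GPU benchmark would pass something like `torch.cuda.synchronize` so that queued GPU work is included in each measurement.

```python
import time


def time_call(fn, warmup=1, runs=10, sync=None):
    """Return per-run latencies in milliseconds, measured after untimed warm-up calls.

    fn   -- zero-argument callable that performs one inference
    sync -- optional zero-argument callable that blocks until the device is idle
    """
    def call_once():
        fn()
        if sync is not None:
            sync()

    for _ in range(warmup):
        call_once()  # untimed: absorbs one-time initialization cost

    latencies_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        call_once()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


# Stand-in workload so the sketch runs without a GPU or the model.
latencies = time_call(lambda: sum(range(10_000)), warmup=1, runs=10)
print(len(latencies), "timed runs")
```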

Model Load Time

	RTX 3090	Jetson AGX Thor
Load time	8.1 s	20.4 s
GPU memory after load	7,406 MB	7,406 MB
GPU memory peak	7,938 MB	7,938 MB

Load time on Thor is 2.5× longer (20.4 s vs 8.1 s). I did not profile the cause; slower reads of the checkpoint shards from storage into memory are the most likely explanation. Memory footprint is identical. Because Thor uses unified memory, there is no separate VRAM allocation: the 7.4 GB comes out of the shared 128 GB pool.

Task Inference Latency (single run, hybrid mode)

Task	RTX 3090	Jetson AGX Thor	Ratio
Detection (2 categories)	4,150 ms	5,118 ms	1.2×
Phrase grounding (single)	298 ms	642 ms	2.2×
Text detection	299 ms	643 ms	2.2×
Pointing	300 ms	626 ms	2.1×

The detection task with two output categories is much slower than single-result tasks on both devices — the model generates significantly more tokens for multi-label outputs. Grounding and pointing queries that produce one short result converge quickly and show the clearest performance gap between devices.

Generation Mode Comparison (detection task)

Mode	RTX 3090	Jetson AGX Thor
Hybrid	4,150 ms	5,118 ms
Fast (MTP)	2,020 ms	2,694 ms
Slow (NTP)	966 ms	1,399 ms

Despite its name, slow mode (standard next-token prediction) was the fastest for the detection task on both devices. One possible explanation is that multi-token prediction in fast mode carries overhead that only pays off on longer sequences, but I did not profile this. For production deployments where you're running fixed query types, testing all three modes is worthwhile.

Throughput (10 runs, single-category detection, hybrid mode)

	RTX 3090	Jetson AGX Thor
Average latency	301 ms	674 ms
Throughput	3.32 fps	1.48 fps
Consistency (std dev)	~3 ms	~120 ms

The 3090 is remarkably consistent — all 10 runs landed within 297–307 ms. Thor shows more variance (596–1004 ms), likely due to thermal management and unified memory bandwidth sharing with other system processes.
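
The figures in the throughput table are simple summary statistics over the per-run latencies. A sketch of that computation is below; the function name is hypothetical and the sample list is made up purely to exercise the code, it is not the measured data.

```python
import statistics


def summarize(latencies_ms):
    """Summarize per-run latencies (ms) as mean, sample std dev, range, and fps."""
    mean_ms = statistics.mean(latencies_ms)
    std_ms = statistics.stdev(latencies_ms) if len(latencies_ms) > 1 else 0.0
    return {
        "mean_ms": mean_ms,
        "std_ms": std_ms,
        "min_ms": min(latencies_ms),
        "max_ms": max(latencies_ms),
        "fps": 1000.0 / mean_ms,  # throughput for back-to-back single queries
    }


# Illustrative values only, NOT the measured runs.
example_runs = [298.0, 300.0, 301.0, 303.0, 299.0]
for key, value in summarize(example_runs).items():
    print(f"{key}: {value:.2f}")
```

Throughput here is just the reciprocal of the mean latency, which is how 301 ms corresponds to 3.32 fps and 674 ms to 1.48 fps.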

Key Findings

1. Full bf16 runs on Thor with no compression

The model loads and runs at full bfloat16 precision on Thor without quantization, pruning, or any model surgery. Memory headroom is substantial — only 6% of Thor's 128 GB unified memory is used. This leaves room to run perception, LLM inference, and other tasks simultaneously.

2. The 3090 is ~2.2× faster on throughput

The 3090 averages 3.32 fps against Thor's 1.48 fps. For batch inspection workflows that don't require real-time speed, Thor's rate is still usable: at 674 ms per query it processes about 5,300 images per hour, which is practical for quality control, dataset labeling, or document analysis pipelines running unattended.
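
The hourly figure follows directly from the average latency. A short worked version, assuming queries run back to back with no idle time or I/O overhead between them (the function names are illustrative):

```python
def images_per_hour(latency_ms):
    """Images processed per hour at a fixed per-image latency."""
    return 3_600_000.0 / latency_ms  # milliseconds in one hour / ms per image


def keeps_up(latency_ms, arrival_fps):
    """True if sequential inference is at least as fast as the frame arrival rate."""
    return 1000.0 / latency_ms >= arrival_fps


for device, latency in (("RTX 3090", 301), ("Jetson AGX Thor", 674)):
    print(f"{device}: {images_per_hour(latency):,.0f} images/hour")

print("Thor keeps up with 1 fps:", keeps_up(674, 1.0))
print("Thor keeps up with 2 fps:", keeps_up(674, 2.0))
```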

3. Generation mode matters more than device

On the 3090, switching from hybrid to slow mode cuts detection latency from 4,150 ms to 966 ms — a 4.3× improvement from mode selection alone, larger than the device gap. On Thor, the same switch gives 5,118 ms → 1,399 ms (3.7×). Task-specific mode tuning should be done before hardware optimization.
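
The comparison between mode choice and device choice can be checked with a few lines of arithmetic over the detection latencies reported earlier. The helper names are illustrative.

```python
# Detection latencies in milliseconds, copied from the generation mode table.
DETECTION_MS = {
    "RTX 3090": {"hybrid": 4150, "fast": 2020, "slow": 966},
    "Jetson AGX Thor": {"hybrid": 5118, "fast": 2694, "slow": 1399},
}


def mode_speedup(latencies, baseline="hybrid"):
    """Return (fastest mode, speedup of that mode over the baseline mode)."""
    best = min(latencies, key=latencies.get)
    return best, latencies[baseline] / latencies[best]


def device_gap(mode):
    """How many times slower Thor is than the 3090 in the given mode."""
    return DETECTION_MS["Jetson AGX Thor"][mode] / DETECTION_MS["RTX 3090"][mode]


for device, latencies in DETECTION_MS.items():
    best, speedup = mode_speedup(latencies)
    print(f"{device}: fastest mode = {best}, {speedup:.1f}x faster than hybrid")

for mode in ("hybrid", "fast", "slow"):
    print(f"{mode}: Thor / 3090 = {device_gap(mode):.2f}x")
```

For this task the device gap stays between roughly 1.2× and 1.45× in every mode, well below the gain from switching modes on either device.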

4. Thor is a viable deployment target for non-real-time vision grounding

At 1.48 fps, Thor handles inspection and grounding workloads where frames arrive at about 1 fps or slower, or where batched processing is acceptable. As tested, it is not suitable for real-time video grounding; quantization (INT8 or INT4) may narrow the gap, but I did not test it.

5. Output correctness is identical

Both devices produced bit-identical bounding box coordinates across all task types. The model's detection quality doesn't degrade on Thor's ARM+CUDA stack.

Warm-up Behavior

First-inference latency was notably different from steady-state:

	First inference (warm-up)	Steady-state
RTX 3090	6,760 ms	300 ms
Jetson AGX Thor	4,696 ms	640 ms

Interestingly, Thor's warm-up was faster (4.7 s vs 6.8 s on the 3090) despite being slower at steady state. I did not profile the difference; one plausible factor is that the two software stacks (PyTorch 2.6 with CUDA 12.4 on the 3090, PyTorch 2.12 with CUDA 13.0 on Thor) do different amounts of one-time kernel compilation and library initialization. In production, keep the model loaded and warm between inference calls.
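
One way to keep the first-call cost away from real requests is to pay it once at service startup. The wrapper below is a minimal sketch of that idea; `WarmModel`, `infer_fn`, and `warmup_input` are hypothetical names, and `infer_fn` stands in for whatever function actually runs the model.

```python
import time


class WarmModel:
    """Wrap an inference callable and run one throwaway call at construction time."""

    def __init__(self, infer_fn, warmup_input):
        self._infer = infer_fn
        start = time.perf_counter()
        self._infer(warmup_input)  # result discarded; absorbs first-call overhead
        self.warmup_ms = (time.perf_counter() - start) * 1000.0

    def __call__(self, request):
        return self._infer(request)


# Stand-in inference function so the sketch runs without the model.
def fake_infer(prompt):
    return f"handled: {prompt}"


model = WarmModel(fake_infer, warmup_input="Point to: the submit button")
print(f"warm-up took {model.warmup_ms:.3f} ms")
print(model("Detect all text in box format"))
```

The wrapper should live for the lifetime of the process; rebuilding it per request would reintroduce both the load time and the warm-up cost.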

Practical Deployment Guide

Scenario	Recommended hardware	Notes
Real-time video (≥10 fps)	RTX 3090 or better	Measured 3.32 fps (3090) and 1.48 fps (Thor); both need further optimization such as quantization (untested)
Batch inspection (≤1.5 fps)	Either	Thor is the lower-power option
Edge deployment, no cloud	Jetson AGX Thor	Full precision, no compression
Camera-side grounding	Jetson AGX Thor	Local inference, data stays on device
Interactive demo	RTX 3090	~300 ms responses for single-result queries
Multi-task pipeline	Thor (128 GB shared)	Run grounding + LLM together

Environment Notes

LocateAnything-3B requires Python 3.10+, torch, transformers ≥4.51 (for Qwen3 support), peft, and lmdb. On Jetson (aarch64), decord has no prebuilt wheel — a stub package satisfies the import check since the model only uses decord for video loading, which grounding tasks don't require.
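
As an alternative to installing a stub package, a placeholder module can be registered in-process before the model code is imported. This is a sketch of that variant, not necessarily what was done for this benchmark. It assumes the import check only needs `import decord` to succeed, and the attribute names `VideoReader` and `cpu` are guesses at what the model code imports; a check based on installed package metadata would not be satisfied by this approach.

```python
import importlib.machinery
import importlib.util
import sys
import types


def install_decord_stub():
    """Register a placeholder `decord` module if the real package is missing.

    Returns True if the stub was installed, False if decord was already available.
    """
    if "decord" in sys.modules or importlib.util.find_spec("decord") is not None:
        return False

    stub = types.ModuleType("decord")
    stub.__spec__ = importlib.machinery.ModuleSpec("decord", loader=None)

    def _unavailable(*args, **kwargs):
        raise RuntimeError("decord stub: video loading is not available here")

    # Assumed names; adjust to whatever the model code actually imports.
    stub.VideoReader = _unavailable
    stub.cpu = _unavailable

    sys.modules["decord"] = stub
    return True


# Call this before loading the model so the import check passes.
installed = install_decord_stub()
print("stub installed" if installed else "real decord found")
```

Failing loudly inside the stub makes it obvious if a video code path is ever reached by accident.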

The magi_attention kernel (NVIDIA's custom attention implementation) was not available in the tested environment on either device — both fell back to PyTorch's SDPA. Enabling magi_attention would likely reduce latency further on both machines.

Benchmark Script

The full benchmark script used for this test is available. It measures load time, per-task latency across all three generation modes, and 10-run throughput — and saves results as JSON.

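# Key parameters used
import torch
from transformers import AutoModel

# MODEL_PATH must point at the local checkpoint directory (value not shown here).
# The model is loaded in bfloat16, moved to the GPU, and put in inference mode.
model = AutoModel.from_pretrained(
    MODEL_PATH, torch_dtype=torch.bfloat16, trust_remote_code=True
).cuda().eval()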

# Tasks tested
tasks = ["detection", "grounding-single", "text-detection", "pointing"]
modes = ["hybrid", "fast", "slow"]
throughput_runs = 10
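
The sweep over tasks and modes, and the JSON output, could be organized along these lines. This is a sketch rather than the actual script: `run_sweep`, `save_results`, and the `measure_ms` callback are hypothetical names, and the callback is where the real model call and timing would go.

```python
import json

TASKS = ["detection", "grounding-single", "text-detection", "pointing"]
MODES = ["hybrid", "fast", "slow"]


def run_sweep(measure_ms, tasks=TASKS, modes=MODES):
    """Call measure_ms(task, mode) for every combination; return a nested dict."""
    return {task: {mode: measure_ms(task, mode) for mode in modes} for task in tasks}


def save_results(results, path):
    """Write the results dictionary to disk as indented JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2)


# Placeholder measurement so the sketch runs without the model;
# a real callback would run inference and return the latency in ms.
results = run_sweep(lambda task, mode: 0.0)
print(json.dumps(results, indent=2))
```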

Summary

Metric	RTX 3090	Jetson AGX Thor
Model load	8.1 s	20.4 s
Memory used	7.4 GB	7.4 GB / 128 GB
Single query (hybrid)	~300 ms	~640 ms
Throughput	3.32 fps	1.48 fps
Best detection mode	Slow: 966 ms	Slow: 1,399 ms
Real-time capable?	Borderline	No (batch only)
Quantization needed?	No	For real-time
Power envelope	~350 W peak	~60 W

LocateAnything-3B runs cleanly on both a desktop workstation GPU and an edge AI device with no modifications. The 3090 delivers ~2.2× the throughput, but Thor's lower power envelope (~60 W vs ~350 W), unified memory architecture, and edge deployment advantages make it a compelling platform for inspection and grounding workloads that don't require real-time response rates.