LocateAnything-3B Benchmark: RTX 3090 vs Jetson AGX Thor
NVIDIA's LocateAnything-3B is a 3-billion-parameter vision-language model that handles object detection, phrase grounding, text detection, and GUI pointing through natural language prompts. Unlike classical detection models, it takes a text description and returns bounding boxes — no class-specific training required.
I ran it on two pieces of hardware I had available: an RTX 3090 workstation and a Jetson AGX Thor edge device. Same model, same code, real numbers.
Hardware and Software
| | RTX 3090 | Jetson AGX Thor |
|---|---|---|
| GPU | GeForce RTX 3090 | NVIDIA Thor (integrated) |
| VRAM / Memory | 24 GB GDDR6X | 128 GB unified memory |
| CUDA | 12.4 | 13.0 |
| PyTorch | 2.6.0+cu124 | 2.12.0+cu130 |
| Transformers | 4.57.1 | 4.57.1 |
| Model dtype | bfloat16 | bfloat16 |
| Model path | local NVMe | local NVMe |
The model was loaded from local storage on both machines — no network latency in the inference numbers.
What LocateAnything-3B Does
The model takes an image and a natural language query, and returns structured bounding box coordinates. The supported task types are:
- Object detection: "Locate all instances matching: person</c>car</c>bicycle"
- Phrase grounding: "Locate a single instance: red handbag on the left side"
- Text detection: "Detect all text in box format"
- Pointing: "Point to: the submit button"
- GUI grounding: useful for screen interaction and UI automation
Output format is structured token sequences like <ref>label</ref><box><x1><y1><x2><y2></box>, with coordinates normalized to 0–1000. The model has three generation modes: fast (multi-token prediction), slow (standard next-token), and hybrid (adaptive).
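To make the output format concrete, here is a minimal parsing sketch. It assumes each coordinate token is an integer wrapped in angle brackets (for example `<120>`), which is an interpretation of the `<x1><y1><x2><y2>` placeholders above and not something confirmed against the model's tokenizer. The helper name `parse_boxes` is made up for illustration.

```python
import re

# Assumed concrete form of one result:
#   <ref>person</ref><box><100><250><500><1000></box>
# The integer-in-angle-brackets coordinate syntax is an assumption.
BOX_PATTERN = re.compile(
    r"<ref>(?P<label>.*?)</ref><box><(\d+)><(\d+)><(\d+)><(\d+)></box>"
)


def parse_boxes(text, width, height, scale=1000):
    """Return (label, (x1, y1, x2, y2)) tuples in pixel coordinates."""
    results = []
    for match in BOX_PATTERN.finditer(text):
        label = match.group("label")
        # groups() returns the label first, then the four coordinates.
        x1, y1, x2, y2 = (int(v) for v in match.groups()[1:])
        results.append((
            label,
            (x1 * width / scale, y1 * height / scale,
             x2 * width / scale, y2 * height / scale),
        ))
    return results


sample = "<ref>person</ref><box><100><250><500><1000></box>"
boxes = parse_boxes(sample, width=640, height=480)
print(boxes)
```

Dividing by the 0–1000 scale and multiplying by the image size maps the normalized values back to pixels, so the same parser works for any input resolution.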
Benchmark Results
All tests used a 640×480 synthetic test image. Each task was timed after a warm-up pass so that one-time costs such as JIT compilation are excluded from the measurement. The throughput test ran 10 iterations of the same single-category detection query.
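The methodology above (warm-up pass, then timed iterations) can be sketched as a small harness. This is not the author's script. The function name `benchmark` and its parameters are illustrative, and the workload here is a `time.sleep` stand-in. For GPU inference you would pass `torch.cuda.synchronize` as `sync`, so that asynchronously queued kernels are included in each timing.

```python
import statistics
import time


def benchmark(fn, runs=10, warmup=1, sync=None):
    """Time fn() over `runs` iterations after `warmup` untimed calls.

    `sync` is an optional callable (e.g. torch.cuda.synchronize) invoked
    before the clock is read so that queued GPU work is counted.
    """
    for _ in range(warmup):
        fn()
    if sync:
        sync()

    latencies_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        if sync:
            sync()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    mean_ms = statistics.mean(latencies_ms)
    return {
        "mean_ms": mean_ms,
        "std_ms": statistics.stdev(latencies_ms) if runs > 1 else 0.0,
        "min_ms": min(latencies_ms),
        "max_ms": max(latencies_ms),
        # Guard against a zero mean on very coarse clocks.
        "fps": 1000.0 / mean_ms if mean_ms > 0 else float("inf"),
    }


# Stand-in workload; replace with a real model call.
stats = benchmark(lambda: time.sleep(0.001), runs=10, warmup=1)
print(stats)
```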
Model Load Time
| | RTX 3090 | Jetson AGX Thor |
|---|---|---|
| Load time | 8.1 s | 20.4 s |
| GPU memory after load | 7,406 MB | 7,406 MB |
| GPU memory peak | 7,938 MB | 7,938 MB |
Loading takes about 2.5× as long on Thor (20.4 s vs 8.1 s). The cause was not profiled, but slower reads of checkpoint shards from storage are a likely contributor. Memory footprint is identical on both devices. Because Thor uses unified memory, there is no separate VRAM allocation: the 7.4 GB comes out of the shared 128 GB pool.
Task Inference Latency (single run, hybrid mode)
| Task | RTX 3090 | Jetson AGX Thor | Ratio |
|---|---|---|---|
| Detection (2 categories) | 4,150 ms | 5,118 ms | 1.2× |
| Phrase grounding (single) | 298 ms | 642 ms | 2.2× |
| Text detection | 299 ms | 643 ms | 2.2× |
| Pointing | 300 ms | 626 ms | 2.1× |
The detection task with two output categories is much slower than single-result tasks on both devices — the model generates significantly more tokens for multi-label outputs. Grounding and pointing queries that produce one short result converge quickly and show the clearest performance gap between devices.
Generation Mode Comparison (detection task)
| Mode | RTX 3090 | Jetson AGX Thor |
|---|---|---|
| Hybrid | 4,150 ms | 5,118 ms |
| Fast (MTP) | 2,020 ms | 2,694 ms |
| Slow (NTP) | 966 ms | 1,399 ms |
Slow mode (standard next-token prediction) was the fastest option for this detection task on both devices, and hybrid was the slowest. A plausible explanation is that multi-token prediction adds per-step overhead that only pays off on longer output sequences, though this was not profiled. For production deployments that run fixed query types, it is worth testing all three modes.
Throughput (10 runs, single-category detection, hybrid mode)
| | RTX 3090 | Jetson AGX Thor |
|---|---|---|
| Average latency | 301 ms | 674 ms |
| Throughput | 3.32 fps | 1.48 fps |
| Consistency (std dev) | ~3 ms | ~120 ms |
The 3090 is remarkably consistent — all 10 runs landed within 297–307 ms. Thor shows more variance (596–1004 ms), likely due to thermal management and unified memory bandwidth sharing with other system processes.
Key Findings
1. Full bf16 runs on Thor with no compression
The model loads and runs at full bfloat16 precision on Thor without quantization, pruning, or any model surgery. Memory headroom is substantial — only 6% of Thor's 128 GB unified memory is used. This leaves room to run perception, LLM inference, and other tasks simultaneously.
2. The 3090 is ~2.2× faster on throughput
For batch inspection workflows that don't require real-time speed, Thor's 1.48 fps is usable. At 674 ms per query, you can process ~5,300 images per hour — which is practical for quality control, dataset labeling, or document analysis pipelines running unattended.
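The hourly figure follows directly from the average latency. A short sketch of the arithmetic, using the averages from the throughput table, is below. The helper names are illustrative, and the calculation assumes strictly sequential processing of one query per image.

```python
def images_per_hour(latency_ms):
    """Sequential throughput for one query per image."""
    return 3600.0 / (latency_ms / 1000.0)


def keeps_up(latency_ms, arrival_fps):
    """True if the average processing rate meets the frame arrival rate."""
    return 1000.0 / latency_ms >= arrival_fps


thor_hourly = images_per_hour(674)   # average Thor latency from the table
rtx_hourly = images_per_hour(301)    # average 3090 latency from the table
print(round(thor_hourly), round(rtx_hourly))

# Thor keeps pace with a 1 fps source on average, but not with 2 fps.
print(keeps_up(674, 1.0), keeps_up(674, 2.0))
```

Because these are averages, a source that delivers frames close to the average rate still needs a queue to absorb the slower runs seen on Thor.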
3. Generation mode matters more than device
On the 3090, switching from hybrid to slow mode cuts detection latency from 4,150 ms to 966 ms — a 4.3× improvement from mode selection alone, larger than the device gap. On Thor, the same switch gives 5,118 ms → 1,399 ms (3.7×). Task-specific mode tuning should be done before hardware optimization.
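The ratios quoted above can be reproduced from the generation-mode table. The snippet below only restates that arithmetic. The dictionary layout is an arbitrary choice for illustration.

```python
# Detection latencies (ms) copied from the generation-mode table.
latency_ms = {
    "RTX 3090": {"hybrid": 4150, "fast": 2020, "slow": 966},
    "Jetson AGX Thor": {"hybrid": 5118, "fast": 2694, "slow": 1399},
}

# Speedup of the best mode over hybrid, per device.
mode_speedup = {}
for device, modes in latency_ms.items():
    best_mode = min(modes, key=modes.get)
    mode_speedup[device] = (best_mode, modes["hybrid"] / modes[best_mode])
    print(f"{device}: best mode = {best_mode}, "
          f"{mode_speedup[device][1]:.1f}x faster than hybrid")

# Device gap when both run their best mode.
device_gap = latency_ms["Jetson AGX Thor"]["slow"] / latency_ms["RTX 3090"]["slow"]
print(f"Thor vs 3090 in slow mode: {device_gap:.2f}x")
```

With both devices in slow mode the device gap is about 1.45×, well below the 3.7× to 4.3× gained from mode selection.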
4. Thor is a viable deployment target for non-real-time vision grounding
At 1.48 fps, Thor handles inspection and grounding workloads where frames arrive at about 1 fps or where batched processing is acceptable. It is not suitable for real-time video grounding without model quantization (INT8 or INT4).
5. Output correctness is identical
Both devices produced identical bounding box coordinates across all task types in these tests, so Thor's ARM+CUDA stack did not change the model's output.
Warm-up Behavior
First-inference latency was notably different from steady-state:
| Device | First inference (warm-up) | Steady-state |
|---|---|---|
| RTX 3090 | 6,760 ms | 300 ms |
| Jetson AGX Thor | 4,696 ms | 640 ms |
Interestingly, Thor's warm-up was faster (4.7 s vs 6.8 s on the 3090) despite being slower at steady state. One possible cause is more extensive CUDA kernel compilation or loading on the desktop stack. The two machines also run different CUDA and PyTorch versions, so the warm-up numbers are not directly comparable. In production, keep the model loaded and warm between inference calls.
Practical Deployment Guide
| Scenario | Recommended hardware | Notes |
|---|---|---|
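| Real-time video (≥10 fps) | RTX 3090 or better | Not reached at bf16 in these tests (3090: ~3.3 fps); Thor needs INT8 quantization |
| Batch inspection (1–3 fps) | Either | Thor is cost-effective up to ~1.5 fps; the 3090 covers up to ~3.3 fps |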
| Edge deployment, no cloud | Jetson AGX Thor | Full precision, no compression |
| Camera-side grounding | Jetson AGX Thor | Local inference, data stays on device |
| Interactive demo | RTX 3090 | ~300 ms responses |
| Multi-task pipeline | Thor (128 GB shared) | Run grounding + LLM together |
Environment Notes
LocateAnything-3B requires Python 3.10+, torch, transformers ≥4.51 (for Qwen3 support), peft, and lmdb. On Jetson (aarch64), decord has no prebuilt wheel — a stub package satisfies the import check since the model only uses decord for video loading, which grounding tasks don't require.
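One way to satisfy the import check is sketched below. It is an in-process variant of the stub idea, not the stub package described above: it registers an empty placeholder module under the name `decord` when the real package cannot be imported. The helper name `install_stub` is made up for illustration, and the stub must be installed before the model's remote code is loaded.

```python
import importlib
import importlib.machinery
import sys
import types


def install_stub(name):
    """Register an empty placeholder module if `name` cannot be imported."""
    try:
        return importlib.import_module(name)
    except ImportError:
        stub = types.ModuleType(name)
        # Give the stub a spec so importlib-based checks do not choke on it.
        stub.__spec__ = importlib.machinery.ModuleSpec(name, loader=None)
        stub.__doc__ = f"Placeholder for {name}; video loading is unavailable."
        sys.modules[name] = stub
        return stub


install_stub("decord")
import decord  # resolves to the real package if present, else the stub

print(decord.__name__)
```

Any attempt to use a real decord feature on the stub (for example `decord.VideoReader`) raises `AttributeError`, which is the desired failure mode if a video path is hit by accident.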
The magi_attention kernel (NVIDIA's custom attention implementation) was not available in the tested environment on either device — both fell back to PyTorch's SDPA. Enabling magi_attention would likely reduce latency further on both machines.
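Before benchmarking, it can help to record whether the optional kernel is importable, so results from different environments are not mixed up. The check below is a sketch that assumes the package is importable under the module name `magi_attention`. That name and the `has_module` helper are assumptions, and the snippet does not reflect how the model itself selects its attention backend.

```python
import importlib.util


def has_module(name):
    """True if a top-level module called `name` can be located."""
    return importlib.util.find_spec(name) is not None


# "magi_attention" as an import name is an assumption.
backend = "magi_attention" if has_module("magi_attention") else "sdpa (fallback)"
print(f"attention backend expected: {backend}")
```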
Benchmark Script
The full benchmark script used for this test is available. It measures load time, per-task latency across all three generation modes, and 10-run throughput — and saves results as JSON.
# Key parameters used
import torch
from transformers import AutoModel

MODEL_PATH = "/path/to/LocateAnything-3B"  # placeholder: local checkpoint directory

model = AutoModel.from_pretrained(
    MODEL_PATH, torch_dtype=torch.bfloat16, trust_remote_code=True
).cuda().eval()

# Tasks tested
tasks = ["detection", "grounding-single", "text-detection", "pointing"]
modes = ["hybrid", "fast", "slow"]
throughput_runs = 10
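For readers who want to build something similar, here is a rough sketch of how a task-by-mode grid could be timed and serialized to JSON. It is not the author's script. `run_task` is a hypothetical stand-in for a real model call, and the JSON layout is an arbitrary choice.

```python
import json
import time

tasks = ["detection", "grounding-single", "text-detection", "pointing"]
modes = ["hybrid", "fast", "slow"]


def run_task(task, mode):
    """Hypothetical stand-in for a real model call."""
    time.sleep(0.001)


def run_grid(run_fn, tasks, modes):
    """Time run_fn(task, mode) once for every task/mode pair, in ms."""
    results = {}
    for task in tasks:
        results[task] = {}
        for mode in modes:
            start = time.perf_counter()
            run_fn(task, mode)
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            results[task][mode] = round(elapsed_ms, 2)
    return results


results = run_grid(run_task, tasks, modes)
report = json.dumps({"latency_ms": results}, indent=2)
print(report)  # write this string to a file to keep the run
```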
Summary
| Metric | RTX 3090 | Jetson AGX Thor |
|---|---|---|
| Model load | 8.1 s | 20.4 s |
| Memory used | 7.4 GB | 7.4 GB / 128 GB |
| Single query (hybrid) | ~300 ms | ~640 ms |
| Throughput | 3.32 fps | 1.48 fps |
| Best detection mode | Slow: 966 ms | Slow: 1,399 ms |
| Real-time capable? | Borderline | No (batch only) |
| Quantization needed? | No | For real-time |
| Power envelope | ~350 W peak | ~60 W |
LocateAnything-3B runs cleanly on both a desktop workstation GPU and an edge AI device with no modifications. The 3090 delivers ~2.2× the throughput, but Thor's lower power draw (~60 W vs ~350 W peak), unified memory architecture, and edge deployment advantages make it a compelling platform for inspection and grounding workloads that don't require real-time response rates.
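To put the power comparison on a per-query basis, the sketch below multiplies each device's average latency by its quoted power envelope. The power values are the approximate envelopes from the summary table, not measurements taken during the benchmark, so the result is a rough estimate only.

```python
# Figures from the summary and throughput tables. Power values are the
# approximate envelopes quoted in the text, not measured draw.
devices = {
    "RTX 3090": {"latency_s": 0.301, "power_w": 350},
    "Jetson AGX Thor": {"latency_s": 0.674, "power_w": 60},
}

# Energy per query in joules = power (W) x time (s).
energy_j = {
    name: d["latency_s"] * d["power_w"] for name, d in devices.items()
}
for name, joules in energy_j.items():
    print(f"{name}: ~{joules:.0f} J per query (rough estimate)")

ratio = energy_j["RTX 3090"] / energy_j["Jetson AGX Thor"]
print(f"3090 / Thor energy per query: {ratio:.1f}x")
```

Under these assumptions Thor uses roughly 2.6× less energy per query, even though each query takes about twice as long.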