MultimodalFlow

LocateAnything-3B Benchmark: RTX 3090 vs Jetson AGX Thor

Tags: locateanything, nvidia, jetson thor, rtx 3090, benchmark, edge ai, vision language model, object detection

NVIDIA's LocateAnything-3B is a 3-billion-parameter vision-language model that handles object detection, phrase grounding, text detection, and GUI pointing through natural language prompts. Unlike classical detection models, it takes a text description and returns bounding boxes — no class-specific training required.

I ran it on two pieces of hardware I had available: an RTX 3090 workstation and a Jetson AGX Thor edge device. Same model, same code, real numbers.


Hardware and Software

| | RTX 3090 | Jetson AGX Thor |
|---|---|---|
| GPU | GeForce RTX 3090 | NVIDIA Thor (integrated) |
| VRAM / Memory | 24 GB GDDR6X | 128 GB unified memory |
| CUDA | 12.4 | 13.0 |
| PyTorch | 2.6.0+cu124 | 2.12.0+cu130 |
| Transformers | 4.57.1 | 4.57.1 |
| Model dtype | bfloat16 | bfloat16 |
| Model path | local NVMe | local NVMe |

The model was loaded from local storage on both machines — no network latency in the inference numbers.


What LocateAnything-3B Does

The model takes an image and a natural language query, and returns structured bounding box coordinates. The supported task types are:

  • Object detection: "Locate all instances matching: person</c>car</c>bicycle"
  • Phrase grounding: "Locate a single instance: red handbag on the left side"
  • Text detection: "Detect all text in box format"
  • Pointing: "Point to: the submit button"
  • GUI grounding: useful for screen interaction and UI automation

Output format is structured token sequences like <ref>label</ref><box><x1><y1><x2><y2></box>, with coordinates normalized to 0–1000. The model has three generation modes: fast (multi-token prediction), slow (standard next-token), and hybrid (adaptive).
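To make the output format concrete, here is a minimal parsing sketch that turns such a token sequence into pixel-space boxes. It assumes each coordinate is emitted as an integer token in angle brackets (for example `<250>`) and that a label may be followed by one or more boxes; the exact serialization may differ, and `Box` and `parse_boxes` are hypothetical names, not part of the model's API.

```python
import re
from typing import NamedTuple


class Box(NamedTuple):
    label: str
    x1: float
    y1: float
    x2: float
    y2: float


# Assumed serialization: <ref>label</ref> followed by one or more
# <box><x1><y1><x2><y2></box> groups, coordinates as integers in 0..1000.
_REF = re.compile(r"<ref>(.*?)</ref>((?:\s*<box>.*?</box>)+)", re.S)
_BOX = re.compile(r"<box>\s*<(\d+)>\s*<(\d+)>\s*<(\d+)>\s*<(\d+)>\s*</box>")


def parse_boxes(text: str, width: int, height: int) -> list[Box]:
    """Convert normalized 0-1000 coordinates to pixels for a width x height image."""
    boxes = []
    for label, group in _REF.findall(text):
        for x1, y1, x2, y2 in _BOX.findall(group):
            boxes.append(Box(
                label.strip(),
                int(x1) * width / 1000,
                int(y1) * height / 1000,
                int(x2) * width / 1000,
                int(y2) * height / 1000,
            ))
    return boxes


if __name__ == "__main__":
    sample = "<ref>person</ref><box><250><100><500><900></box>"
    print(parse_boxes(sample, width=640, height=480))
```

Multiplying before dividing keeps the conversion exact for common image sizes. Anything that does not match the assumed pattern is silently skipped, so a production parser would want to log unmatched output.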


Benchmark Results

All tests used a 640×480 synthetic test image. Each task was timed after a warm-up pass so that one-time costs such as JIT compilation stay out of the numbers. The throughput test ran 10 iterations of the same single-category detection query.
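The measurement procedure can be sketched as a small timing harness. This is an illustration of the warm-up-then-measure pattern, not the author's script: `benchmark` and its parameters are hypothetical names, and on a GPU you would pass `torch.cuda.synchronize` as `sync` so that asynchronous kernels are finished before the clock is read.

```python
import statistics
import time
from typing import Callable, Optional


def benchmark(fn: Callable[[], object], warmup: int = 1, runs: int = 10,
              sync: Optional[Callable[[], None]] = None) -> dict:
    """Call fn() `warmup` times untimed, then time `runs` calls in milliseconds."""

    def timed_call() -> float:
        if sync is not None:
            sync()  # make sure earlier GPU work is not billed to this call
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for this call's GPU work to finish
        return (time.perf_counter() - start) * 1000.0

    for _ in range(warmup):
        fn()

    latencies = [timed_call() for _ in range(runs)]
    mean_ms = statistics.mean(latencies)
    return {
        "latencies_ms": latencies,
        "mean_ms": mean_ms,
        "std_ms": statistics.stdev(latencies) if runs > 1 else 0.0,
        "fps": 1000.0 / mean_ms,
    }


if __name__ == "__main__":
    # Stand-in workload; replace with a closure around the model call.
    stats = benchmark(lambda: time.sleep(0.005), warmup=1, runs=5)
    print(f"{stats['mean_ms']:.1f} ms avg, {stats['fps']:.1f} fps")
```

Reporting the standard deviation alongside the mean matters here because the two devices differ far more in run-to-run variance than in memory footprint.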

Model Load Time

| | RTX 3090 | Jetson AGX Thor |
|---|---|---|
| Load time | 8.1 s | 20.4 s |
| GPU memory after load | 7,406 MB | 7,406 MB |
| GPU memory peak | 7,938 MB | 7,938 MB |

Loading takes about 2.5× as long on Thor (20.4 s vs 8.1 s), most likely because reading the checkpoint shards from storage into memory is slower there. The memory footprint is identical on both devices. Because Thor uses unified memory, there is no separate VRAM allocation: the 7.4 GB comes out of the shared 128 GB pool.

Task Inference Latency (single run, hybrid mode)

| Task | RTX 3090 | Jetson AGX Thor | Ratio (Thor / 3090) |
|---|---|---|---|
| Detection (2 categories) | 4,150 ms | 5,118 ms | 1.2× |
| Phrase grounding (single) | 298 ms | 642 ms | 2.2× |
| Text detection | 299 ms | 643 ms | 2.2× |
| Pointing | 300 ms | 626 ms | 2.1× |

The two-category detection task is much slower than the single-result tasks on both devices because the model generates significantly more tokens for multi-label output. Grounding and pointing queries that produce one short result finish in a few hundred milliseconds and show the clearest performance gap between devices.

Generation Mode Comparison (detection task)

| Mode | RTX 3090 | Jetson AGX Thor |
|---|---|---|
| Hybrid | 4,150 ms | 5,118 ms |
| Fast (MTP) | 2,020 ms | 2,694 ms |
| Slow (NTP) | 966 ms | 1,399 ms |

Despite its name, slow mode (standard next-token prediction) is the fastest option for this detection task on both devices, and hybrid is the slowest. A plausible explanation is that multi-token prediction carries overhead that does not pay off for this query, though this benchmark does not isolate the cause. For production deployments that run fixed query types, it is worth testing all three modes.
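As a small worked example, the mode comparison can be reduced to picking the fastest mode per device and computing its speedup over hybrid. The latencies are copied from the generation mode table; `best_mode` is a hypothetical helper name.

```python
# Detection latencies in milliseconds, from the generation mode table.
LATENCY_MS = {
    "RTX 3090": {"hybrid": 4150, "fast": 2020, "slow": 966},
    "Jetson AGX Thor": {"hybrid": 5118, "fast": 2694, "slow": 1399},
}


def best_mode(latencies: dict) -> tuple:
    """Return (fastest mode, speedup of that mode relative to hybrid)."""
    mode = min(latencies, key=latencies.get)
    return mode, latencies["hybrid"] / latencies[mode]


if __name__ == "__main__":
    for device, latencies in LATENCY_MS.items():
        mode, speedup = best_mode(latencies)
        print(f"{device}: {mode} mode, {speedup:.1f}x faster than hybrid")
```

The same selection would need to be repeated per query type, since these numbers cover only the two-category detection task.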

Throughput (10 runs, single-category detection, hybrid mode)

| | RTX 3090 | Jetson AGX Thor |
|---|---|---|
| Average latency | 301 ms | 674 ms |
| Throughput | 3.32 fps | 1.48 fps |
| Consistency (std dev) | ~3 ms | ~120 ms |

The 3090 is remarkably consistent — all 10 runs landed within 297–307 ms. Thor shows more variance (596–1004 ms), likely due to thermal management and unified memory bandwidth sharing with other system processes.


Key Findings

1. Full bf16 runs on Thor with no compression

The model loads and runs at full bfloat16 precision on Thor without quantization, pruning, or any model surgery. Memory headroom is substantial — only 6% of Thor's 128 GB unified memory is used. This leaves room to run perception, LLM inference, and other tasks simultaneously.

2. The 3090 is ~2.2× faster on throughput

The 3090 averaged 3.32 fps against Thor's 1.48 fps. For batch inspection workflows that don't require real-time speed, Thor's throughput is still usable: at 674 ms per query it processes about 5,300 images per hour, which is practical for quality control, dataset labeling, or document analysis pipelines running unattended.
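The throughput figures follow directly from the average latencies. A short sketch of the arithmetic, using the measured averages from the throughput test (the function names are illustrative):

```python
def fps(avg_latency_ms: float) -> float:
    """Queries per second for sequential, single-image inference."""
    return 1000.0 / avg_latency_ms


def images_per_hour(avg_latency_ms: float) -> float:
    """Images processed in one hour at the given average latency."""
    return 3_600_000 / avg_latency_ms


if __name__ == "__main__":
    rtx_ms, thor_ms = 301, 674  # average latency from the 10-run test
    print(f"RTX 3090: {fps(rtx_ms):.2f} fps, {images_per_hour(rtx_ms):,.0f} images/hour")
    print(f"Thor:     {fps(thor_ms):.2f} fps, {images_per_hour(thor_ms):,.0f} images/hour")
    print(f"3090 speed advantage: {thor_ms / rtx_ms:.1f}x")
```

These numbers assume one image at a time with no batching or pipelining, which matches how the benchmark was run.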

3. Generation mode matters more than device

On the 3090, switching from hybrid to slow mode cuts detection latency from 4,150 ms to 966 ms — a 4.3× improvement from mode selection alone, larger than the device gap. On Thor, the same switch gives 5,118 ms → 1,399 ms (3.7×). Task-specific mode tuning should be done before hardware optimization.

4. Thor is a viable deployment target for non-real-time vision grounding

At 1.48 fps, Thor handles inspection and grounding workloads where frames arrive at 1–2 fps or where batched processing is acceptable. At full precision it is not suitable for real-time video grounding. Quantization (INT8 or INT4) might narrow the gap, but it was not part of this benchmark.

5. Output correctness is identical

Both devices produced bit-identical bounding box coordinates across all task types. The model's detection quality doesn't degrade on Thor's ARM+CUDA stack.


Warm-up Behavior

First-inference latency was notably different from steady-state:

| | First inference (warm-up) | Steady-state |
|---|---|---|
| RTX 3090 | 6,760 ms | 300 ms |
| Jetson AGX Thor | 4,696 ms | 640 ms |

Interestingly, Thor's warm-up was faster (4.7 s vs 6.8 s on the 3090) even though it is slower at steady state. One possible explanation is that first-call CUDA kernel compilation and initialization are more extensive on the desktop CUDA stack. In production, keep the model loaded and warm between inference calls.


Practical Deployment Guide

| Scenario | Recommended hardware | Notes |
|---|---|---|
| Real-time video (≥10 fps) | RTX 3090 or better | Neither device reached 10 fps in bf16 here; Thor would likely need INT8 quantization |
| Batch inspection (1–3 fps) | Either | Thor is cost-effective |
| Edge deployment, no cloud | Jetson AGX Thor | Full precision, no compression |
| Camera-side grounding | Jetson AGX Thor | Local inference, data stays on device |
| Interactive demo | RTX 3090 | ~300 ms responses |
| Multi-task pipeline | Thor (128 GB shared) | Run grounding + LLM together |

Environment Notes

LocateAnything-3B requires Python 3.10+, torch, transformers ≥4.51 (for Qwen3 support), peft, and lmdb. On Jetson (aarch64), decord has no prebuilt wheel — a stub package satisfies the import check since the model only uses decord for video loading, which grounding tasks don't require.
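One way to get the same effect as a stub package without installing anything is to register an empty stand-in module at runtime. This is a sketch of that alternative, not the approach described above, which used an installed stub package; `ensure_importable` is a hypothetical helper. It also assumes the model code only needs `import decord` to succeed. If it imports specific names from the package, the stub would need matching placeholders.

```python
import importlib
import importlib.machinery
import sys
import types


def ensure_importable(name: str) -> bool:
    """Register an empty stand-in module if `name` cannot be imported.

    Returns True if a stub was installed, False if the import already works.
    """
    try:
        importlib.import_module(name)
        return False
    except ImportError:
        stub = types.ModuleType(name)
        stub.__doc__ = f"Empty stand-in for {name}; its features are unavailable."
        # A spec keeps importlib.util.find_spec() from raising on the stub.
        stub.__spec__ = importlib.machinery.ModuleSpec(name, loader=None)
        sys.modules[name] = stub
        return True


# Call this before loading the model so that the import of decord succeeds.
ensure_importable("decord")
```

The stub lives only in the current process, so it has to run at the top of every script that loads the model. Any attempt to use video loading will fail with an AttributeError, which is the intended behavior for image-only grounding.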

The magi_attention kernel (NVIDIA's custom attention implementation) was not available in the tested environment on either device — both fell back to PyTorch's SDPA. Enabling magi_attention would likely reduce latency further on both machines.
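Before benchmarking, it can help to confirm which of these dependencies are present. The sketch below checks the Python version and package availability using only the standard library. The import names, in particular `magi_attention`, are assumed to match the package names mentioned above and may differ in practice.

```python
import importlib.util
import sys

REQUIRED = ["torch", "transformers", "peft", "lmdb"]
OPTIONAL = ["decord", "magi_attention"]  # import names are assumptions


def is_importable(name: str) -> bool:
    """True if the top-level module can be located, without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


def check_environment(required=REQUIRED, optional=OPTIONAL) -> dict:
    """Summarize the Python version check and any missing packages."""
    return {
        "python_ok": sys.version_info >= (3, 10),
        "missing_required": [n for n in required if not is_importable(n)],
        "missing_optional": [n for n in optional if not is_importable(n)],
    }


if __name__ == "__main__":
    print(check_environment())
```

A missing optional package is not an error: as described above, the model falls back to PyTorch's SDPA when the custom attention kernel is absent.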


Benchmark Script

The full benchmark script used for this test is available. It measures load time, per-task latency across all three generation modes, and 10-run throughput — and saves results as JSON.

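```python
import torch
from transformers import AutoModel

# Placeholder: point this at your local checkpoint directory.
MODEL_PATH = "/path/to/LocateAnything-3B"

# Key parameters used
model = AutoModel.from_pretrained(
    MODEL_PATH, torch_dtype=torch.bfloat16, trust_remote_code=True
).cuda().eval()

# Tasks tested
tasks = ["detection", "grounding-single", "text-detection", "pointing"]
modes = ["hybrid", "fast", "slow"]
throughput_runs = 10
```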

Summary

| Metric | RTX 3090 | Jetson AGX Thor |
|---|---|---|
| Model load | 8.1 s | 20.4 s |
| Memory used | 7.4 GB | 7.4 GB / 128 GB |
| Single query (hybrid) | ~300 ms | ~640 ms |
| Throughput | 3.32 fps | 1.48 fps |
| Best detection mode | Slow: 966 ms | Slow: 1,399 ms |
| Real-time capable? | Borderline | No (batch only) |
| Quantization needed? | No | For real-time |
| Power envelope | ~350 W peak | ~60 W |

LocateAnything-3B runs cleanly on both a desktop workstation GPU and an edge AI device with no changes to the model. The 3090 delivers ~2.2× the throughput, but Thor's power efficiency (~60 W vs ~350 W), unified memory architecture, and edge deployment advantages make it a compelling platform for inspection and grounding workloads that don't require real-time response rates.