🎯 GRASP

Bounding Boxes as Goals: Language-Conditioned Grasping via Neuro-Symbolic Planning

Allison Andreyev, Landon Eum, Nestor Tiglao, Romel Gomez
University of Maryland, Cyber-Physical Systems Engineering

Figure 2. The GRASP main loop. GroundingDINO detects objects from shelf and desk frames; the Goal State Similarity Module computes IoU and center distance; Roll-Pitch-Yaw adjustments close the loop until the similarity threshold is met.

Abstract

For robotics to be effectively integrated into household or industrial environments, machines must adapt to natural-language prompts in real time. Although vision-language models (VLMs) have enabled zero-shot generalization in robot task and motion planning (TAMP), current state-of-the-art approaches often remain computationally heavyweight or require extensive training on thousands of demonstrations. In this paper, we present GRASP (Grounded Reasoning and Symbolic Planning), a framework designed as a step toward open-vocabulary tabletop manipulation. Our approach leverages a pretrained VLM to translate natural-language queries into neuro-symbolic goal states, which are then grounded in the physical world via a specialized bounding-box detection pipeline. Unlike previous methods that rely on fixed color lists or hard-coded coordinates, GRASP enables robots to interpret abstract spatial concepts—such as top shelf—and execute tasks without additional fine-tuning. Experimental results demonstrate that our system achieves high instruction compliance and precision, offering a scalable solution for general-purpose robotic sorting and arrangement.

Neuro-Symbolic Goal States

Natural language instructions are compiled into explicit symbolic goal states, enabling interpretable and verifiable task execution.

Closed-Loop FSM Control

Closed-loop evaluation of the symbolic goal replaces policy fine-tuning, enabling zero-shot execution of small-scale open-world rearrangement.

Zero-Shot Grounding

A lightweight grounding and proportional control pipeline links open-vocabulary detection to continuous motion without learned action policies.

System Pipeline

At each timestep, GRASP closes the loop between language, perception, and action.

1. User Input: "Place blue objects on top shelf"
2. LLM: JSON goal state
3. G.DINO: Bounding boxes
4. Goal Similarity: IoU + center distance
5. FSM + RPY: Closed-loop actuation

Closed loop: re-evaluates every frame until the goal is reached.

Given a natural language instruction, GRASP grounds it into a symbolic goal state, executes the manipulation task, and evaluates completion via bounding box similarity.
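
The loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `detect`, `similarity`, and `actuate` are hypothetical callables standing in for GroundingDINO detection, the Goal State Similarity Module, and the RPY actuation step, and the `threshold` and `max_steps` values are assumptions. The demo uses a toy one-dimensional "scene" so that the block runs on its own.

```python
def run_closed_loop(detect, similarity, actuate, goal, threshold=0.8, max_steps=100):
    """Re-evaluate the scene every step until the similarity threshold is met.

    All callables and default values are illustrative assumptions.
    Returns (reached, steps_taken, last_score).
    """
    score = 0.0
    for step in range(max_steps):
        observed = detect()                 # perception: current detection
        score = similarity(goal, observed)  # compare against the goal state
        if score >= threshold:
            return True, step, score        # goal reached, stop actuating
        actuate(goal, observed)             # otherwise adjust and look again
    return False, max_steps, score


# Toy 1-D stand-in for the real pipeline: the "object" moves halfway
# toward the goal position on every actuation.
state = {"pos": 0.0}

def toy_detect():
    return state["pos"]

def toy_similarity(goal, observed):
    return 1.0 - abs(goal - observed)

def toy_actuate(goal, observed):
    state["pos"] += 0.5 * (goal - observed)

reached, steps, final_score = run_closed_loop(
    toy_detect, toy_similarity, toy_actuate, goal=1.0
)
```

In the toy run the position goes 0.0, 0.5, 0.75, 0.875, so the loop stops on the fourth evaluation, once the score first exceeds the assumed threshold of 0.8.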

Method

Finite State Machine


States and transition predicates:

LOOKING → GRASPED when: closet
GRASPED → PLACING when: det(claw)
PLACING → PLACED when: in_goal ∧ release

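
One way to encode this state machine is a table of (predicate, next state) pairs. The sketch below is an assumption-laden illustration: the observation keys `closet`, `det_claw`, `in_goal`, and `release` are hypothetical names copied from the diagram labels, and the page does not define how each predicate is computed (in particular, the meaning of `closet` is not given here).

```python
from enum import Enum, auto


class State(Enum):
    LOOKING = auto()
    GRASPED = auto()
    PLACING = auto()
    PLACED = auto()


# Each non-terminal state maps to (predicate over observations, next state).
# The dictionary keys in `obs` are hypothetical names for the diagram labels.
TRANSITIONS = {
    State.LOOKING: (lambda obs: obs["closet"], State.GRASPED),
    State.GRASPED: (lambda obs: obs["det_claw"], State.PLACING),
    State.PLACING: (lambda obs: obs["in_goal"] and obs["release"], State.PLACED),
}


def step(state, obs):
    """Advance one tick: move on only if the current state's predicate holds."""
    if state not in TRANSITIONS:
        return state  # PLACED is terminal
    predicate, next_state = TRANSITIONS[state]
    return next_state if predicate(obs) else state


# Example trace with hand-written observations.
obs_sequence = [
    {"closet": False, "det_claw": False, "in_goal": False, "release": False},
    {"closet": True, "det_claw": False, "in_goal": False, "release": False},
    {"closet": True, "det_claw": True, "in_goal": False, "release": False},
    {"closet": True, "det_claw": True, "in_goal": True, "release": False},
    {"closet": True, "det_claw": True, "in_goal": True, "release": True},
]
trace = [State.LOOKING]
for obs in obs_sequence:
    trace.append(step(trace[-1], obs))
```

Keeping the predicates in a table rather than in nested conditionals makes each transition easy to inspect and test in isolation.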

LLM Prompt → Goal State

A structured prompt instructs the LLM to return a JSON goal state; GroundingDINO candidates are used to ground each label to a bounding box in the scene.


Figure 1. User input and LLM prompt (left) produce a structured JSON goal state with per-label bounding box coordinates, which are then visualized on the shelf (right).
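
The exact JSON schema is not reproduced on this page, so the sketch below assumes a hypothetical one: a top-level `goals` list whose entries carry a `label` and a `bbox` given as `[x1, y1, x2, y2]` in normalized image coordinates. Under that assumption, an LLM response could be parsed and validated like this before being handed to the rest of the pipeline.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class GoalBox:
    label: str
    bbox: tuple  # (x1, y1, x2, y2), assumed normalized to [0, 1]


def parse_goal_state(raw):
    """Parse an LLM response into validated goal boxes (hypothetical schema)."""
    data = json.loads(raw)
    boxes = []
    for item in data["goals"]:
        x1, y1, x2, y2 = (float(v) for v in item["bbox"])
        # Reject boxes that are inverted, empty, or outside the image.
        if not (0.0 <= x1 < x2 <= 1.0 and 0.0 <= y1 < y2 <= 1.0):
            raise ValueError(f"invalid bbox for {item['label']!r}: {item['bbox']}")
        boxes.append(GoalBox(item["label"], (x1, y1, x2, y2)))
    return boxes


# Illustrative response for the instruction "Place blue objects on top shelf".
example = '{"goals": [{"label": "blue objects", "bbox": [0.1, 0.05, 0.9, 0.3]}]}'
goal_state = parse_goal_state(example)
```

Validating the boxes up front means a malformed LLM response fails at parse time instead of producing a meaningless similarity score later.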

Interactive Similarity Score

The similarity score combines the IoU between the detection box and the goal box with the distance between their centers (dist), clamped to [0, 1]: S = max(0, min(1, (IoU + (1 − dist)) / 2))
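
A minimal sketch of this score in Python follows. The formula itself is taken from the text; how `dist` is normalized is not stated on this page, so the code assumes boxes in normalized `[0, 1]` coordinates and divides the center distance by the unit-square diagonal (√2) so that it also lies in `[0, 1]`.

```python
import math


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def center_distance(a, b):
    """Distance between box centers, scaled by sqrt(2) (an assumption)."""
    acx, acy = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    bcx, bcy = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    return math.hypot(acx - bcx, acy - bcy) / math.sqrt(2)


def similarity(goal, detection):
    """S = max(0, min(1, (IoU + (1 - dist)) / 2))."""
    raw = (iou(goal, detection) + (1.0 - center_distance(goal, detection))) / 2
    return max(0.0, min(1.0, raw))


same = similarity((0.2, 0.2, 0.4, 0.4), (0.2, 0.2, 0.4, 0.4))
far = similarity((0.0, 0.0, 0.1, 0.1), (0.9, 0.9, 1.0, 1.0))
```

Identical boxes give IoU = 1 and dist = 0, so S = 1. When IoU and dist both lie in `[0, 1]`, the unclamped value already falls in `[0, 1]`; the clamp only matters if the distance is measured in a way that can exceed 1.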

Example 1: "place the metal things to the right side of the 2 bottom shelves"
Goal state labels: metal things, metal things
Shelf state labels: metal things, metal things
Score: 0.266

Example 2: "put the basketball on top of the red cube that's on the top shelf"
Goal state labels: basketball
Shelf state labels: basketball
Score: 0.783

Results

Closed-Loop Grasping Benchmark

90 total trials: 3 difficulty levels × 3 objects × 10 trials each.


Figure 6. Representative grasping scenarios for each difficulty level. Easy: single object, no visual occlusion. Medium: some visual occlusion with distractors. Hard: heavy visual occlusion and multiple objects per frame.

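| Difficulty | Conditions | Object | Successes |
|---|---|---|---|
| Easy | Single object, minimal occlusion | Brown Block | 7/10 |
| Easy | Single object, minimal occlusion | Bubble Wrap | 10/10 |
| Easy | Single object, minimal occlusion | Pink Bottle | 9/10 |
| Easy | Single object, minimal occlusion | Overall | 86.7% |
| Medium | Distractors, partial occlusion | Brown Rect Prism | 8/10 |
| Medium | Distractors, partial occlusion | Yellow Tape | 8/10 |
| Medium | Distractors, partial occlusion | Letter 'T' | 7/10 |
| Medium | Distractors, partial occlusion | Overall | 76.7% |
| Hard | Heavy clutter, high occlusion | Blk/Ylw Screwdriver | 7/10 |
| Hard | Heavy clutter, high occlusion | Magenta Marker | 5/10 |
| Hard | Heavy clutter, high occlusion | Lime Green Scissors | 5/10 |
| Hard | Heavy clutter, high occlusion | Overall | 56.7% |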
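
The overall percentages follow directly from the per-object counts: each difficulty level has 3 objects × 10 trials = 30 trials. The short script below reproduces that arithmetic from the numbers reported above; the pooled figure across all 90 trials is computed here from the same counts and is not separately reported on this page.

```python
TRIALS_PER_OBJECT = 10

# Successes per object, copied from the benchmark results above.
results = {
    "Easy": {"Brown Block": 7, "Bubble Wrap": 10, "Pink Bottle": 9},
    "Medium": {"Brown Rect Prism": 8, "Yellow Tape": 8, "Letter 'T'": 7},
    "Hard": {"Blk/Ylw Screwdriver": 7, "Magenta Marker": 5, "Lime Green Scissors": 5},
}


def success_rate(counts):
    """Percentage of successful trials across all objects in one level."""
    return 100 * sum(counts.values()) / (len(counts) * TRIALS_PER_OBJECT)


rates = {level: round(success_rate(counts), 1) for level, counts in results.items()}

total_successes = sum(sum(counts.values()) for counts in results.values())
total_trials = sum(len(counts) for counts in results.values()) * TRIALS_PER_OBJECT
pooled_rate = round(100 * total_successes / total_trials, 1)
```

For example, Easy is (7 + 10 + 9) / 30 = 26 / 30 ≈ 86.7%.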

User Study — Goal State Quality (Likert 1–5)

31 participants aged 15–45. Average rating: 4.18 (σ = 1.14). Over 50% of responses in every category rated the visualization a 5.

Participants rated the goal-state visualization for one example prompt in each of four categories:

Two-Group Cross Constraints: "Move all large bottles to the left side and all small cubes to the right side" (labels: large bottles, small cubes)
Triple-Attribute Filtering: "Place all small red metal objects on the top shelf" (labels: small red metal)
Three-Way Spatial Partition: "Place red objects on top, metal objects in the middle, and striped objects on the bottom" (labels: red, metal, striped)
Overlapping Constraints: "Move all smooth blue objects to the right side and all textured blue objects to the left side" (labels: textured blue, smooth blue)

Training Data Comparison

Per-task training data requirements across language-conditioned manipulation systems.

| Paper | Demos / Task | Training Type |
|---|---|---|
| GEM | 10–20 | Imitation |
| LEMMA | 800 | Imitation |
| RFST | 1,000 | Imitation |
| ERRA | 300 | Inference learning |
| RoboMamba | ~500 | Sim fine-tune |
| MOO | ~11.8K | Teleop imitation |
| LoHoRavens | 20K | Primitive IL |
| GRASP (Ours) | 0 | Zero-shot |

Hardware Setup

GRASP robot platform (labeled components: Differential Claw, PiCam (end-effector), Robot Arm)

Custom-designed differential claw arm, 3 degrees of freedom, controlled via Python on Raspberry Pi 4B

Global Scene Camera

Logitech Brio 100 — 1080p @ 30fps, 58° diagonal FOV, fixed focus, USB-A, mounted at top of shelf looking down over workspace

End-Effector Camera

Raspberry Pi Camera Module v2 — 8MP Sony IMX219, 3280×2464 still / 1080p @ 30fps video, 62.2° H × 48.8° V FOV, fixed focus, CSI ribbon cable, mounted on end-effector

Compute

Raspberry Pi 4B (Linux) — onboard control & actuation
MacBook Air — G.DINO inference over WiFi
GPT-4o — cloud API

Network Architecture

Pi captures frames → transmits to MacBook over WiFi → MacBook runs G.DINO → returns annotated frames to Pi. "Network latency" in ablations refers to round-trip time.

Ablation Study

Three conditions evaluated against the full GRASP system.

No Smoothing & Deadband

Motion planning without exponential smoothing or deadband filtering. Isolates the contribution of noise-reduction to stable grasping.
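
To make the ablated components concrete, the sketch below shows one common way to combine exponential smoothing, a deadband, and a proportional command on a single axis. It is an illustration under stated assumptions, not the authors' controller: the class name `SmoothedAxis`, the parameters `alpha`, `deadband`, and `gain`, and their values are all hypothetical, and `error` stands for whatever per-axis alignment error the system measures.

```python
class SmoothedAxis:
    """Exponentially smoothed error with a deadband and proportional output.

    All parameter names and default values are illustrative assumptions.
    """

    def __init__(self, alpha=0.3, deadband=0.02, gain=0.5):
        self.alpha = alpha        # smoothing factor in (0, 1]; higher reacts faster
        self.deadband = deadband  # smoothed errors smaller than this are ignored
        self.gain = gain          # proportional gain applied to the smoothed error
        self._ema = None

    def update(self, error):
        """Feed one raw error sample and return the command for this axis."""
        if self._ema is None:
            self._ema = error  # seed the filter with the first sample
        else:
            self._ema = self.alpha * error + (1 - self.alpha) * self._ema
        if abs(self._ema) < self.deadband:
            return 0.0  # inside the deadband: hold still instead of jittering
        return self.gain * self._ema


axis = SmoothedAxis(alpha=0.5, deadband=0.05, gain=2.0)
commands = [axis.update(e) for e in (0.2, 0.0, 0.0, 0.0)]
```

In this run the smoothed error decays 0.2, 0.1, 0.05, 0.025, so the command shrinks gradually and drops to zero once the smoothed error falls inside the deadband. Removing the smoothing would pass detection noise straight to the actuators; removing the deadband would keep issuing tiny corrections near the target.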

Open-Loop vs. Closed-Loop

Compares alignment quality when the system executes a fixed plan without re-evaluating scene state versus full closed-loop feedback.

Random Logit Selection

Replaces selection of the highest-confidence G.DINO logit with random bounding-box selection, isolating the contribution of confidence-ranked targeting.
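
The selection strategies compared in this ablation can be sketched as a single function. Detections are represented here as plain `(bbox, score)` tuples, which is an assumption made for illustration rather than the GroundingDINO output format; the `"first"` strategy corresponds to the "First" row in the results table below.

```python
import random


def select_box(detections, strategy="highest", rng=None):
    """Pick one detection from a list of (bbox, score) tuples.

    The tuple layout and strategy names are illustrative assumptions.
    """
    if not detections:
        return None
    if strategy == "highest":
        return max(detections, key=lambda d: d[1])  # confidence-ranked target
    if strategy == "random":
        return (rng or random).choice(detections)   # ablation: ignore confidence
    if strategy == "first":
        return detections[0]                        # ablation: detector order
    raise ValueError(f"unknown strategy: {strategy!r}")


detections = [
    ((0.10, 0.10, 0.30, 0.30), 0.42),
    ((0.50, 0.20, 0.70, 0.40), 0.81),
    ((0.20, 0.60, 0.40, 0.80), 0.35),
]
best = select_box(detections, "highest")
first = select_box(detections, "first")
sampled = select_box(detections, "random", rng=random.Random(0))
```

Passing a seeded `random.Random` instance keeps the random-selection condition reproducible across trials.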

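| Loop | Smoothing | Deadband | Selection | Success Rate | Net Latency (s) | Inference (s) | Total (s) |
|---|---|---|---|---|---|---|---|
| Open | | | Highest logit | 4/10 | 4.421 | 5.495 | 12.416 |
| Closed | | | Highest logit | 5/10 | 4.097 | 4.984 | 7.072 |
| Closed | | | Random | 3/10 | 3.560 | 3.377 | 6.001 |
| Closed | | | First | 4/10 | 3.537 | 3.244 | 4.134 |
| Closed | | | Highest logit | 8/10 | 3.539 | 3.444 | 4.036 |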

Highlighted row is the full GRASP system. All timing metrics averaged over 10 trials.

BibTeX

@misc{andreyev2026grasp,
  title     = {Bounding Boxes as Goals: Language-Conditioned Grasping via Neuro-Symbolic Planning},
  author    = {Allison Andreyev and Landon Eum and Nestor Tiglao and Romel Gomez},
  year      = {2026},
  eprint    = {2606.12910},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO},
  url       = {https://arxiv.org/abs/2606.12910},
}