GRASP: Grounded Reasoning and Symbolic Planning

Figure 2. The GRASP main loop. GroundingDINO detects objects from shelf and desk frames; the Goal State Similarity Module computes IoU and center distance; Roll-Pitch-Yaw adjustments close the loop until the similarity threshold is met.

Abstract

For robots to be integrated effectively into household or industrial environments, they must respond to natural-language prompts in real time. Although vision-language models (VLMs) have enabled zero-shot generalization in robot task and motion planning (TAMP), current state-of-the-art approaches often remain computationally heavy or require training on thousands of demonstrations. In this paper, we present GRASP (Grounded Reasoning and Symbolic Planning), a framework designed as a step toward open-vocabulary tabletop manipulation. Our approach uses a pretrained VLM to translate natural-language queries into neuro-symbolic goal states, which are then grounded in the physical world via a specialized bounding-box detection pipeline. Unlike previous methods that rely on fixed color lists or hard-coded coordinates, GRASP enables robots to interpret abstract spatial concepts, such as "top shelf", and to execute tasks without additional fine-tuning. In a 90-trial closed-loop grasping benchmark, the system succeeds in 86.7% of easy, 76.7% of medium, and 56.7% of hard trials, and a 31-participant user study rates its goal-state visualizations 4.18 out of 5 on average, pointing toward a scalable approach to general-purpose robotic sorting and arrangement.

Neuro-Symbolic Goal States

Natural-language instructions are compiled into explicit symbolic goal states, enabling interpretable and verifiable task execution.

Closed-Loop FSM Control

Closed-loop symbolic goal evaluation replaces policy fine-tuning for small-scale open-world rearrangement with zero-shot execution.

Zero-Shot Grounding

A lightweight grounding and proportional control pipeline links open-vocabulary detection to continuous motion without learned action policies.

System Pipeline

At each timestep, GRASP closes the loop between language, perception, and action.

User Input ("Place blue objects on top shelf") → LLM (JSON goal state) → G.DINO (bounding boxes) → Goal Similarity (IoU + center distance) → FSM + RPY (closed-loop actuation)

The loop is closed: the scene is re-evaluated every frame until the goal is reached.

GRASP teaser: user input, task execution, and goal state

Given a natural language instruction, GRASP grounds it into a symbolic goal state, executes the manipulation task, and evaluates completion via bounding box similarity.

Method

Finite State Machine

The FSM has four states. Each transition fires when its predicate holds:

LOOKING → GRASPED: close_t
GRASPED → PLACING: det(claw)
PLACING → PLACED: in_goal ∧ release
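The state sequence above can be sketched as a small transition function. This is a minimal illustration, not the authors' controller: it assumes the three listed transitions are the only ones, and it treats `close_t`, `det(claw)`, `in_goal`, and `release` as opaque booleans sampled once per frame, since the page does not define them. The names `State`, `Predicates`, and `step` are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum, auto


class State(Enum):
    LOOKING = auto()
    GRASPED = auto()
    PLACING = auto()
    PLACED = auto()


@dataclass
class Predicates:
    """Per-frame boolean predicates; names follow the diagram labels."""
    close_t: bool = False    # guard on LOOKING -> GRASPED
    det_claw: bool = False   # det(claw): guard on GRASPED -> PLACING
    in_goal: bool = False    # must hold together with release
    release: bool = False    # in_goal AND release: guard on PLACING -> PLACED


def step(state: State, p: Predicates) -> State:
    """Take at most one transition; stay in place if the guard is false."""
    if state is State.LOOKING and p.close_t:
        return State.GRASPED
    if state is State.GRASPED and p.det_claw:
        return State.PLACING
    if state is State.PLACING and p.in_goal and p.release:
        return State.PLACED
    return state


# Hypothetical frame-by-frame trace of one pick-and-place episode.
frames = [
    Predicates(close_t=True),
    Predicates(det_claw=True),
    Predicates(in_goal=True),                 # not released yet: stays PLACING
    Predicates(in_goal=True, release=True),
]
state = State.LOOKING
for p in frames:
    state = step(state, p)
print(state.name)  # PLACED
```

Keeping the guards as plain booleans makes each transition easy to log and check, which matches the page's emphasis on interpretable, verifiable execution.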

LLM Prompt → Goal State

A structured prompt instructs the LLM to return a JSON goal state; GroundingDINO candidates are used to ground each label to a bounding box in the scene.

LLM prompt structure and resulting goal state JSON

Figure 1. User input and LLM prompt (left) produce a structured JSON goal state with per-label bounding box coordinates, which are then visualized on the shelf (right).
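As an illustration of what consuming such a goal state might look like, the sketch below parses and validates an LLM reply. The JSON schema is an assumption: the page only says the goal state carries per-label bounding box coordinates, so the `goals`, `label`, and `bbox` keys, the normalized `[x1, y1, x2, y2]` layout, and the helper `parse_goal_state` are all hypothetical.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GoalBox:
    label: str
    x1: float
    y1: float
    x2: float
    y2: float


def parse_goal_state(raw: str) -> List[GoalBox]:
    """Parse an LLM reply under the assumed schema and reject malformed boxes."""
    data = json.loads(raw)
    goals = []
    for item in data["goals"]:
        x1, y1, x2, y2 = (float(v) for v in item["bbox"])
        # Assumed convention: normalized image coordinates with x1 < x2, y1 < y2.
        if not (0.0 <= x1 < x2 <= 1.0 and 0.0 <= y1 < y2 <= 1.0):
            raise ValueError(f"invalid bbox for {item['label']!r}: {item['bbox']}")
        goals.append(GoalBox(str(item["label"]), x1, y1, x2, y2))
    return goals


# Hypothetical reply for "Place blue objects on top shelf".
example_reply = '{"goals": [{"label": "blue objects", "bbox": [0.10, 0.05, 0.90, 0.30]}]}'
goals = parse_goal_state(example_reply)
print(goals[0].label)  # blue objects
```

Validating the reply before use is a defensive choice: a malformed box is rejected at parse time instead of surfacing later as a confusing similarity score.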

Interactive Similarity Score

The similarity score combines IoU and center distance and is clamped to [0, 1]: S = max(0, min(1, (IoU + (1 − dist)) / 2))
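The formula above can be sketched in a few lines of Python. The page does not say how `dist` is normalized, so this sketch assumes boxes are given as `(x1, y1, x2, y2)` in normalized image coordinates and that `dist` is the Euclidean distance between box centers in those units. The function names are illustrative.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed normalized


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def center_distance(a: Box, b: Box) -> float:
    """Euclidean distance between box centers (assumed normalization)."""
    ax, ay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    bx, by = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    return math.hypot(ax - bx, ay - by)


def similarity(goal: Box, detected: Box) -> float:
    """S = max(0, min(1, (IoU + (1 - dist)) / 2))."""
    score = (iou(goal, detected) + (1.0 - center_distance(goal, detected))) / 2.0
    return max(0.0, min(1.0, score))


print(similarity((0.0, 0.0, 0.5, 0.5), (0.0, 0.0, 0.5, 0.5)))  # 1.0
```

Under this normalization the center distance can exceed 1 for boxes in opposite corners, which is one case where the outer clamp matters.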


Example scores comparing the goal state with the observed shelf state:

Prompt	Label	Score
"place the metal things to the right side of the 2 bottom shelves"	metal things	0.266
"put the basketball on top of the red cube that's on the top shelf"	basketball	0.783

Results

Closed-Loop Grasping Benchmark

90 total trials: 3 difficulty levels × 3 objects × 10 trials each.

Difficulty condition examples: Easy, Medium, Hard

Figure 6. Representative grasping scenarios for each difficulty level. Easy: single object, no visual occlusion. Medium: some visual occlusion with distractors. Hard: heavy visual occlusion and multiple objects per frame.

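Difficulty	Conditions	Object	Successes
Easy	Single object, minimal occlusion	Brown Block	7/10
Easy	Single object, minimal occlusion	Bubble Wrap	10/10
Easy	Single object, minimal occlusion	Pink Bottle	9/10
Easy	Single object, minimal occlusion	Overall	86.7%
Medium	Distractors, partial occlusion	Brown Rect Prism	8/10
Medium	Distractors, partial occlusion	Yellow Tape	8/10
Medium	Distractors, partial occlusion	Letter 'T'	7/10
Medium	Distractors, partial occlusion	Overall	76.7%
Hard	Heavy clutter, high occlusion	Blk/Ylw Screwdriver	7/10
Hard	Heavy clutter, high occlusion	Magenta Marker	5/10
Hard	Heavy clutter, high occlusion	Lime Green Scissors	5/10
Hard	Heavy clutter, high occlusion	Overall	56.7%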
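The per-level percentages follow directly from the per-object counts above. As a quick arithmetic check, the snippet below recomputes them. It also computes the pooled rate across all 90 trials, which the page does not report and is derived here only from the listed counts.

```python
TRIALS_PER_OBJECT = 10

# Successes per object, copied from the benchmark results.
results = {
    "Easy": {"Brown Block": 7, "Bubble Wrap": 10, "Pink Bottle": 9},
    "Medium": {"Brown Rect Prism": 8, "Yellow Tape": 8, "Letter 'T'": 7},
    "Hard": {"Blk/Ylw Screwdriver": 7, "Magenta Marker": 5, "Lime Green Scissors": 5},
}


def success_rate(successes_by_object: dict) -> float:
    """Fraction of successful trials within one difficulty level."""
    total_trials = TRIALS_PER_OBJECT * len(successes_by_object)
    return sum(successes_by_object.values()) / total_trials


per_level = {level: success_rate(objs) for level, objs in results.items()}
total_successes = sum(sum(objs.values()) for objs in results.values())
total_trials = TRIALS_PER_OBJECT * sum(len(objs) for objs in results.values())
pooled = total_successes / total_trials

for level, rate in per_level.items():
    print(f"{level}: {rate:.1%}")
print(f"Pooled: {total_successes}/{total_trials} = {pooled:.1%}")
```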

User Study — Goal State Quality (Likert 1–5)

31 participants aged 15–45. Average rating: 4.18 (σ = 1.14). Over 50% of responses in every category rated the visualization a 5.

Two-Group Cross Constraints

"Move all large bottles to the left side and all small cubes to the right side"

Labels: large bottles, small cubes

Triple-Attribute Filtering

"Place all small red metal objects on the top shelf"

Label: small red metal

Three-Way Spatial Partition

"Place red objects on top, metal objects in the middle, and striped objects on the bottom"

Labels: red, metal, striped

Overlapping Constraints

"Move all smooth blue objects to the right side and all textured blue objects to the left side"

Labels: textured blue, smooth blue

Training Data Comparison

Per-task training data requirements across language-conditioned manipulation systems.

Paper	Demos / Task	Training Type
GEM	10–20	Imitation
LEMMA	800	Imitation
RFST	1,000	Imitation
ERRA	300	Inference learning
RoboMamba	~500	Sim fine-tune
MOO	~11.8K	Teleop imitation
LoHoRavens	20K	Primitive IL
GRASP (Ours)	0	Zero-shot

Hardware Setup

Differential Claw

PiCam (end-effector)

Robot Arm

Custom-designed differential claw arm, 3 degrees of freedom, controlled via Python on Raspberry Pi 4B

Global Scene Camera

Logitech Brio 100 — 1080p @ 30fps, 58° diagonal FOV, fixed focus, USB-A, mounted at top of shelf looking down over workspace

End-Effector Camera

Raspberry Pi Camera Module v2 — 8MP Sony IMX219, 3280×2464 still / 1080p @ 30fps video, 62.2° H × 48.8° V FOV, fixed focus, CSI ribbon cable, mounted on end-effector

Compute

Raspberry Pi 4B (Linux) — onboard control & actuation
MacBook Air — G.DINO inference over WiFi
GPT-4o — cloud API

Network Architecture

Pi captures frames → transmits to MacBook over WiFi → MacBook runs G.DINO → returns annotated frames to Pi. "Network latency" in ablations refers to round-trip time.

Ablation Study

Three conditions evaluated against the full GRASP system.

No Smoothing & Deadband

Motion planning without exponential smoothing or deadband filtering. Isolates the contribution of noise-reduction to stable grasping.
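To make the two filters concrete, here is a minimal sketch of one control axis that combines them with a proportional gain. It assumes "exponential smoothing" means an exponential moving average over the alignment error and "deadband" means suppressing commands while the smoothed error is below a threshold. The class name and the values of `alpha`, `deadband`, and `kp` are hypothetical, not taken from the system.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SmoothedAxis:
    alpha: float = 0.3      # weight on the newest error sample (assumed value)
    deadband: float = 0.02  # smoothed errors below this produce no motion (assumed)
    kp: float = 0.5         # proportional gain (assumed)
    _ema: Optional[float] = field(default=None, init=False, repr=False)

    def command(self, error: float) -> float:
        """Return a motion command for one noisy error measurement."""
        # Exponential smoothing: blend the new sample into the running average.
        if self._ema is None:
            self._ema = error
        else:
            self._ema = self.alpha * error + (1.0 - self.alpha) * self._ema
        # Deadband: ignore residual error too small to be worth chasing.
        if abs(self._ema) < self.deadband:
            return 0.0
        # Proportional control on the smoothed error.
        return self.kp * self._ema


axis = SmoothedAxis(alpha=0.5, deadband=0.1, kp=1.0)
commands = [axis.command(e) for e in (1.0, 0.0, 0.0, 0.0, 0.0)]
print(commands)  # [1.0, 0.5, 0.25, 0.125, 0.0]
```

In this toy trace the smoothed error decays gradually after the measurement drops to zero, and the command is cut to exactly zero once it falls inside the deadband, which is the behavior that would prevent jitter around the target.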

Open-Loop vs. Closed-Loop

Compares alignment quality when the system executes a fixed plan without re-evaluating scene state versus full closed-loop feedback.
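The difference between the two modes can be shown with a toy one-dimensional model. This is not the robot's dynamics: it simply assumes an actuator that under-travels, completing only a fixed fraction of each commanded move. The functions `actuate`, `open_loop`, and `closed_loop` and the `efficiency` value are hypothetical.

```python
def actuate(position: float, commanded_delta: float, efficiency: float = 0.8) -> float:
    """Toy actuator: only `efficiency` of each commanded move actually happens."""
    return position + efficiency * commanded_delta


def open_loop(start: float, goal: float) -> float:
    """Plan once from the initial measurement and execute without re-checking."""
    return actuate(start, goal - start)


def closed_loop(start: float, goal: float, tol: float = 0.01, max_steps: int = 50) -> float:
    """Re-measure the error every step and keep correcting until within tolerance."""
    position = start
    for _ in range(max_steps):
        error = goal - position
        if abs(error) <= tol:
            break
        position = actuate(position, error)
    return position


open_error = abs(1.0 - open_loop(0.0, 1.0))
closed_error = abs(1.0 - closed_loop(0.0, 1.0))
print(f"open-loop error: {open_error:.3f}, closed-loop error: {closed_error:.3f}")
```

With this model the open-loop run stops with the actuator's full shortfall as residual error, while the closed-loop run shrinks the error on every re-evaluation until it falls within the tolerance.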

Random Logit Selection

Replaces highest-confidence G.DINO logit selection with random bounding box selection, isolating confidence-ranked targeting contribution.
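The selection strategies compared in this ablation can be sketched as a single function over candidate detections. The `Detection` tuple and `select` function are hypothetical; they model each G.DINO candidate as a label, a box, and a confidence logit, and make no claim about the detector's actual output format.

```python
import random
from typing import NamedTuple, Optional, Sequence, Tuple


class Detection(NamedTuple):
    label: str
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2)
    logit: float                            # detector confidence


def select(
    detections: Sequence[Detection],
    strategy: str = "highest",
    rng: Optional[random.Random] = None,
) -> Optional[Detection]:
    """Pick one target box using the named strategy, or None if there are none."""
    if not detections:
        return None
    if strategy == "highest":
        return max(detections, key=lambda d: d.logit)
    if strategy == "first":
        return detections[0]
    if strategy == "random":
        return (rng or random).choice(list(detections))
    raise ValueError(f"unknown strategy: {strategy!r}")


candidates = [
    Detection("blue cube", (0.1, 0.1, 0.3, 0.3), 0.41),
    Detection("blue cube", (0.5, 0.2, 0.7, 0.4), 0.87),
    Detection("blue cube", (0.6, 0.6, 0.8, 0.8), 0.35),
]
print(select(candidates, "highest").logit)  # 0.87
```

Passing a seeded `random.Random` instance as `rng` makes the random condition reproducible across trials.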

Loop	Smoothing	Deadband	Selection	Success Rate	Net Latency (s)	Inference (s)	Total (s)
Open	✓	✓	Highest logit	4/10	4.421	5.495	12.416
Closed	—	—	Highest logit	5/10	4.097	4.984	7.072
Closed	✓	✓	Random	3/10	3.560	3.377	6.001
Closed	✓	✓	First	4/10	3.537	3.244	4.134
Closed	✓	✓	Highest logit	8/10	3.539	3.444	4.036

The final row (closed loop, smoothing and deadband enabled, highest-logit selection) is the full GRASP system. All timing metrics are averaged over 10 trials.

BibTeX

@misc{andreyev2026grasp,
  title     = {Bounding Boxes as Goals: Language-Conditioned Grasping via Neuro-Symbolic Planning},
  author    = {Allison Andreyev and Landon Eum and Nestor Tiglao and Romel Gomez},
  year      = {2026},
  eprint    = {2606.12910},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO},
  url       = {https://arxiv.org/abs/2606.12910},
}
