Robotics × LLM

Building a hierarchical VLM+RL manipulation system

Notes on building RoboLLM: a language-driven robotic manipulation system where a VLM decomposes instructions into sub-tasks and RL-trained policies execute each step in MuJoCo simulation.


Robotic manipulation is hard because it sits at the intersection of perception, planning, and precise motor control. End-to-end learning can handle simple tasks, but multi-step instructions like "stack the red block on the blue one, then move the green sphere to the left" require structured decomposition.

RoboLLM implements the hierarchical approach from SayCan, RT-2, and Code-as-Policies: a vision-language model handles high-level task understanding, an object grounder maps descriptions to poses, and RL-trained motor policies execute each primitive. The whole system is scoped to run on a single T4 GPU.

Why hierarchical

The core insight is separation of concerns. Language understanding and task decomposition are fundamentally different problems from precise motor control.

The practical benefit: clean interfaces between components let you swap the VLM (MockVLM for testing, PaliGemma-3B for production) without retraining the motor policies, and vice versa.

System architecture

pipeline
flowchart TB
    Instruction["Natural language instruction"] --> VLM["VLM Planner<br/>Decompose → sub-tasks"]
    VLM --> Parser["Task Parser<br/>Validate + normalise"]
    Parser --> Grounder["Object Grounder<br/>Text → object pose"]
    Grounder --> Executor["Hierarchical Executor<br/>Per-subtask policies"]
    Executor --> MuJoCo["MuJoCo Environment<br/>7-DoF arm + objects"]
    MuJoCo --> Executor
    style VLM fill:#eff6ff,stroke:#2563eb,color:#0f172a
    style Grounder fill:#eff6ff,stroke:#2563eb,color:#0f172a
    style Executor fill:#eff6ff,stroke:#2563eb,color:#0f172a

The pipeline has five stages: (1) the VLM decomposes a natural language instruction into a JSON sequence of primitives, (2) the task parser validates ordering and normalises color/shape references, (3) the object grounder maps text descriptions to MuJoCo object poses, (4) the hierarchical executor runs the appropriate policy per sub-task, and (5) MuJoCo simulates the physics.
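Stage (2) can be illustrated with a small validator. This is a minimal sketch, not RoboLLM's actual parser: the JSON schema (`action`/`target` keys), the `validate_plan` name, and the ordering rule (a `place` needs a preceding `pick`) are assumptions made for the example. Only the primitive names and the three color synonyms come from this post.

```python
import json

# Assumed primitive set and synonym table (hypothetical; the real ones are larger).
PRIMITIVES = {"move_to", "pick", "place"}
COLOR_SYNONYMS = {"crimson": "red", "azure": "blue", "jade": "green"}


def validate_plan(raw_json: str) -> list[dict]:
    """Parse a JSON list of sub-tasks, normalise colors, and check ordering."""
    steps = json.loads(raw_json)
    holding = False  # tracks whether the gripper is assumed to hold an object
    plan = []
    for i, step in enumerate(steps):
        action = step["action"]
        if action not in PRIMITIVES:
            raise ValueError(f"step {i}: unknown primitive {action!r}")
        target = step["target"].lower()
        target = COLOR_SYNONYMS.get(target, target)  # e.g. crimson -> red
        if action == "pick":
            if holding:
                raise ValueError(f"step {i}: pick while already holding an object")
            holding = True
        elif action == "place":
            if not holding:
                raise ValueError(f"step {i}: place without a preceding pick")
            holding = False
        plan.append({"action": action, "target": target})
    return plan


raw = (
    '[{"action": "move_to", "target": "Crimson"},'
    ' {"action": "pick", "target": "crimson"},'
    ' {"action": "place", "target": "azure"}]'
)
print(validate_plan(raw))
```

Rejecting a malformed plan before grounding keeps planner mistakes from reaching the motor policies, which is the point of putting a parser between the two.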

MuJoCo environment design

The base environment models a Franka-inspired 7-DOF arm on a 60×60 cm tabletop.

VLM planner and grounding

The planner uses an abstract VLM interface with two backends: MockVLM (deterministic keyword matching for testing) and TransformersVLM (HuggingFace models for real inference). MockVLM achieves 100% accuracy on 20+ test scenarios, which is sufficient for pipeline integration testing without a GPU.
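To make the "abstract interface, swappable backend" idea concrete, here is a minimal sketch. The class names `VLM`, `KeywordVLM`, and `SketchPlanner`, the `decompose` method, and the three-color vocabulary are all hypothetical stand-ins; the real MockVLM and Planner in the repo may look quite different.

```python
from abc import ABC, abstractmethod


class VLM(ABC):
    """Hypothetical backend interface: instruction in, (primitive, target) list out."""

    @abstractmethod
    def decompose(self, instruction: str) -> list[tuple[str, str]]:
        ...


class KeywordVLM(VLM):
    """Deterministic stand-in in the spirit of a mock backend: plain keyword matching."""

    COLORS = ("red", "blue", "green")  # assumed vocabulary for the sketch

    def decompose(self, instruction: str) -> list[tuple[str, str]]:
        words = instruction.lower().replace(",", " ").split()
        colors = [w for w in words if w in self.COLORS]
        if not colors:
            return []
        steps = [("move_to", colors[0])]
        if "pick" in words:
            steps.append(("pick", colors[0]))
        if "place" in words and len(colors) > 1:
            steps.append(("place", colors[1]))
        return steps


class SketchPlanner:
    """Depends only on the VLM interface, so any backend can be dropped in."""

    def __init__(self, vlm: VLM):
        self.vlm = vlm

    def plan(self, instruction: str) -> list[tuple[str, str]]:
        return self.vlm.decompose(instruction)


plan = SketchPlanner(KeywordVLM()).plan(
    "pick up the red block and place it on the blue one"
)
print(plan)
```

A model-backed class would implement the same `decompose` method, so tests written against the keyword backend exercise the rest of the pipeline unchanged.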

Object grounding maps text descriptions like "crimson cube" to actual object poses in the scene. The SimGrounder uses score-based matching: color match (+1.0), shape match (+0.5), name match (+0.2). It handles 15+ color synonyms (crimson→red, azure→blue, jade→green) and 9 shape aliases (block/cube/brick→box).

# Example pipeline execution
planner = Planner(MockVLM())
plan = planner.plan("pick up the red block and place it on the blue one")
# → TaskPlan: [move_to(red), pick(red), place(blue)]

grounder = SimGrounder()
result = grounder.ground("red block", scene_info)
# → GroundingResult(matched=obj_red_box, confidence=1.5)
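The score-based matching described above can be sketched in a few lines. The weights (+1.0, +0.5, +0.2) and the synonym examples come from the text; the `SceneObject` dataclass, the `score` and `ground` functions, the scene contents, and the rule that a zero score means "no match" are assumptions for illustration.

```python
from dataclasses import dataclass

# Small excerpts of the synonym tables mentioned in the text.
COLOR_SYNONYMS = {"crimson": "red", "azure": "blue", "jade": "green"}
SHAPE_ALIASES = {"block": "box", "cube": "box", "brick": "box"}


@dataclass
class SceneObject:
    """Hypothetical scene entry; the real scene_info structure is not shown in the post."""
    name: str
    color: str
    shape: str
    position: tuple[float, float, float]


def score(description: str, obj: SceneObject) -> float:
    """Add up the weights for color, shape, and name matches."""
    words = description.lower().split()
    colors = {COLOR_SYNONYMS.get(w, w) for w in words}
    shapes = {SHAPE_ALIASES.get(w, w) for w in words}
    total = 0.0
    if obj.color in colors:
        total += 1.0
    if obj.shape in shapes:
        total += 0.5
    if obj.name in words:
        total += 0.2
    return total


def ground(description: str, scene: list[SceneObject]):
    """Return (best object, score), or (None, 0.0) when nothing matches."""
    if not scene:
        return None, 0.0
    best = max(scene, key=lambda o: score(description, o))
    best_score = score(description, best)
    return (best, best_score) if best_score > 0 else (None, 0.0)


scene = [
    SceneObject("obj_red_box", "red", "box", (0.10, 0.00, 0.02)),
    SceneObject("obj_blue_box", "blue", "box", (-0.10, 0.05, 0.02)),
    SceneObject("obj_green_sphere", "green", "sphere", (0.00, -0.15, 0.02)),
]
match, confidence = ground("crimson cube", scene)
print(match.name, confidence)  # obj_red_box 1.5 (color 1.0 + shape 0.5)
```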

RL policies and scripted baselines

Motor control uses SAC (Soft Actor-Critic) with automatic entropy tuning. The actor is a small MLP (obs→256→256→action) with squashed Gaussian output. Training runs at ~170 FPS on Apple Silicon (CPU-only).
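The "squashed Gaussian output" is the part of SAC that is easiest to get subtly wrong, so here is a dependency-free sketch of it. The MLP itself is omitted: `mean` and `log_std` stand in for what the actor network would output, and the 7-dimensional action is an assumption. This follows the standard tanh change-of-variables formula, not necessarily the exact code in the repo.

```python
import math
import random


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def tanh_log_det(u: float) -> float:
    """log(1 - tanh(u)^2), in the stable form 2 * (log 2 - u - softplus(-2u))."""
    return 2.0 * (math.log(2.0) - u - softplus(-2.0 * u))


def sample_squashed_gaussian(mean, log_std, rng):
    """Sample a = tanh(u) with u ~ N(mean, std^2); return (a, log pi(a))."""
    action, log_prob = [], 0.0
    for mu, ls in zip(mean, log_std):
        eps = rng.gauss(0.0, 1.0)
        u = mu + math.exp(ls) * eps  # reparameterised pre-squash sample
        # Gaussian log-density of u, expressed through eps.
        log_prob += -0.5 * eps * eps - ls - 0.5 * math.log(2.0 * math.pi)
        # Change of variables for the tanh squash.
        log_prob -= tanh_log_det(u)
        action.append(math.tanh(u))
    return action, log_prob


rng = random.Random(0)
action, log_prob = sample_squashed_gaussian([0.0] * 7, [-1.0] * 7, rng)
print(action, log_prob)
```

The tanh keeps every action component bounded, and the log-density correction is what the entropy term in SAC's objective relies on.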

Scripted baselines provide comparison points.

Benchmark results

100 episodes per task/policy pair, randomized initial placement, 200-step max:

| Task | Random (SR) | Scripted (SR) | Notes |
|---|---|---|---|
| L1: Pick & Place | 0.0% | 0.0% | Scripted gets 2.6× better returns |
| L2: Color Pick | 0.0% | | Multi-object color selection |
| L3: Stack | 0.0% | | Requires sequential precision |
| L4: Sort | 0.0% | | N objects → N zones |
| L5: Language | 4.0% | | Random meets some conditions by chance |
| Move To | 0.0% | 20.0% | P-controller reaches targets |

The honest take: multi-phase manipulation tasks (L1-L5) are hard. The scripted 8-phase state machine can't complete a full pick-and-place in 200 steps (0% success rate, SR), though it achieves 2.6× better returns than random. On the MoveTo task, a simple P-controller reaches the target only some of the time (20% SR). Meaningful success rates will likely require trained RL policies with 500K+ environment steps or curriculum learning.
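For readers unfamiliar with the term, a P-controller simply commands motion proportional to the remaining position error. The sketch below runs one on a toy point mass with the same 200-step budget; the gain, step clip, tolerance, and function names are assumptions, and the toy has no arm dynamics, joint limits, or contact, so it says nothing about the 20% figure above.

```python
import math


def p_controller(position, target, gain=0.5, max_step=0.02):
    """Per-step displacement proportional to the error, clipped in norm."""
    command = [gain * (t - p) for p, t in zip(position, target)]
    norm = math.sqrt(sum(c * c for c in command))
    if norm > max_step:
        command = [c * max_step / norm for c in command]
    return command


def run_episode(start, target, max_steps=200, tolerance=0.01):
    """Move a toy point toward the target; True if it gets within tolerance."""
    position = list(start)
    for _ in range(max_steps):
        if math.dist(position, target) < tolerance:
            return True
        command = p_controller(position, target)
        position = [p + c for p, c in zip(position, command)]
    return math.dist(position, target) < tolerance


# Hypothetical start and target positions in metres.
print(run_episode((0.0, 0.0, 0.2), (0.2, -0.1, 0.05)))
```

On the real arm the same error signal has to pass through inverse kinematics and actuator dynamics, which is one plausible reason a controller this simple falls well short of 100%.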

The pipeline infrastructure works end-to-end: instruction → VLM decomposition → object grounding → policy execution → MuJoCo feedback. The bottleneck is motor skill, not planning.

What I learned

The full codebase is at github.com/ajliouat/robollm — 288 tests, 10 releases from scaffold to stable, honest benchmark numbers from real evaluation runs.