Building a hierarchical VLM+RL manipulation system

Robotic manipulation is hard because it sits at the intersection of perception, planning, and precise motor control. End-to-end learning can handle simple tasks, but multi-step instructions like "stack the red block on the blue one, then move the green sphere to the left" require structured decomposition.

RoboLLM implements a hierarchical approach in the spirit of SayCan, RT-2, and Code-as-Policies: a vision-language model handles high-level task understanding, an object grounder maps descriptions to poses, and RL-trained motor policies execute each primitive. The system is scoped to run on a single T4 GPU.

Why hierarchical

The core insight is separation of concerns. Language understanding and task decomposition are fundamentally different problems from precise motor control:

Scripted planners are brittle — they can't handle novel instructions or paraphrases.
End-to-end RL is data-hungry and struggles with multi-step credit assignment.
A hierarchical design lets each component do what it is best at: VLMs for language, RL for control.

The practical benefit: because the interfaces between components are clean, you can swap the VLM (MockVLM for testing, PaliGemma-3B for production) without retraining the motor policies, and vice versa.

System architecture

pipeline

flowchart TB
    Instruction["Natural language instruction"] --> VLM["VLM Planner<br/>Decompose → sub-tasks"]
    VLM --> Parser["Task Parser<br/>Validate + normalise"]
    Parser --> Grounder["Object Grounder<br/>Text → object pose"]
    Grounder --> Executor["Hierarchical Executor<br/>Per-subtask policies"]
    Executor --> MuJoCo["MuJoCo Environment<br/>7-DoF arm + objects"]
    MuJoCo --> Executor
    style VLM fill:#eff6ff,stroke:#2563eb,color:#0f172a
    style Grounder fill:#eff6ff,stroke:#2563eb,color:#0f172a
    style Executor fill:#eff6ff,stroke:#2563eb,color:#0f172a

The pipeline has five stages: (1) the VLM decomposes a natural language instruction into a JSON sequence of primitives, (2) the task parser validates ordering and normalises color/shape references, (3) the object grounder maps text descriptions to MuJoCo object poses, (4) the hierarchical executor runs the appropriate policy per sub-task, and (5) MuJoCo simulates the physics.
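
To make stage (2) concrete, here is a minimal sketch of what validating and normalising a plan could look like. The JSON schema (`action`, `target`), the primitive names, and the `parse_plan` helper are assumptions made for illustration; the actual parser in the repo may be structured differently.

```python
import json

# Hypothetical primitive names and synonym table, chosen for illustration only.
PRIMITIVES = {"move_to", "pick", "place"}
COLOR_SYNONYMS = {"crimson": "red", "azure": "blue", "jade": "green"}


def parse_plan(raw_json: str) -> list[dict]:
    """Validate a VLM-produced plan and normalise its color references."""
    steps = json.loads(raw_json)
    holding = False  # tracks whether the gripper currently holds an object
    plan = []
    for i, step in enumerate(steps):
        action = step["action"]
        if action not in PRIMITIVES:
            raise ValueError(f"step {i}: unknown primitive {action!r}")
        if action == "pick":
            if holding:
                raise ValueError(f"step {i}: pick while already holding an object")
            holding = True
        elif action == "place":
            if not holding:
                raise ValueError(f"step {i}: place without a preceding pick")
            holding = False
        color = step["target"].lower()
        plan.append({"action": action, "target": COLOR_SYNONYMS.get(color, color)})
    return plan


raw = (
    '[{"action": "move_to", "target": "crimson"},'
    ' {"action": "pick", "target": "crimson"},'
    ' {"action": "place", "target": "azure"}]'
)
plan = parse_plan(raw)
```

Rejecting an ill-ordered plan before grounding keeps planner mistakes from surfacing later as confusing motor-policy failures.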

MuJoCo environment design

The base environment models a Franka-inspired 7-DOF arm on a 60×60 cm tabletop. Key design choices:

Delta-EE control: Actions are 4D (delta x/y/z + gripper). Jacobian pseudoinverse with damping (λ=0.01) converts to joint velocities. This gives the RL agent a natural action space without needing to learn inverse kinematics.
Dynamic object spawning: Each reset generates new objects via XML injection — 3 shapes (box, cylinder, sphere) × 6 colors, with non-overlapping placement. The MuJoCo model is rebuilt every episode in ~5ms.
Task hierarchy: Five complexity levels from single pick-place (L1) to multi-step language instructions (L5). Each level adds new challenges: color selection, stacking order, zone sorting, condition checking.
200-step truncation: All episodes cap at 200 steps (10s sim time at 20Hz). Without this, random baselines run forever during evaluation.
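
The delta-EE mapping can be sketched with a damped least-squares solve. This is a sketch under assumptions: the function name is hypothetical, the damping is applied as λ² on the diagonal (the common textbook form, which may differ from the repo), and a random matrix stands in for the arm's positional Jacobian. In MuJoCo the real Jacobian would typically come from a call such as `mj_jacSite`.

```python
import numpy as np


def delta_ee_to_joint_vel(
    jacobian: np.ndarray, delta_xyz: np.ndarray, damping: float = 0.01
) -> np.ndarray:
    """Damped least-squares map from an EE displacement to joint velocities.

    Computes J^T (J J^T + damping^2 I)^-1 dx.
    """
    jjt = jacobian @ jacobian.T
    regularised = jjt + (damping ** 2) * np.eye(jjt.shape[0])
    return jacobian.T @ np.linalg.solve(regularised, delta_xyz)


# Stand-in 3x7 positional Jacobian; a real one would come from the simulator.
rng = np.random.default_rng(0)
J = rng.standard_normal((3, 7))
dx = np.array([0.01, 0.0, -0.02])  # x/y/z part of the 4D action
dq = delta_ee_to_joint_vel(J, dx)

# Near a singularity the damping term keeps the solve well-posed.
dq_singular = delta_ee_to_joint_vel(np.zeros((3, 7)), dx)
```

The fourth action component (gripper) bypasses this mapping and would be sent to the gripper actuator directly.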

VLM planner and grounding

The planner uses an abstract VLM interface with two backends: MockVLM (deterministic keyword matching for testing) and TransformersVLM (HuggingFace models for real inference). The MockVLM achieves 100% accuracy on 20+ test scenarios, which is sufficient for pipeline integration testing without a GPU.
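
As a rough illustration of the two-backend idea, the sketch below defines an abstract interface and a keyword-matching stand-in. The class names `VLMBackend` and `KeywordMockVLM`, the `generate_plan` method, and the JSON shape are hypothetical and do not reproduce the repo's MockVLM.

```python
import json
from abc import ABC, abstractmethod


class VLMBackend(ABC):
    """Hypothetical interface: turn an instruction into a JSON plan string."""

    @abstractmethod
    def generate_plan(self, instruction: str) -> str:
        ...


class KeywordMockVLM(VLMBackend):
    """Deterministic stand-in that matches keywords instead of running a model."""

    COLORS = ("red", "blue", "green")

    def generate_plan(self, instruction: str) -> str:
        words = instruction.lower().replace(",", " ").split()
        colors = [w for w in words if w in self.COLORS]
        steps = []
        if "pick" in words and colors:
            steps.append({"action": "move_to", "target": colors[0]})
            steps.append({"action": "pick", "target": colors[0]})
        if "place" in words and len(colors) > 1:
            steps.append({"action": "place", "target": colors[1]})
        return json.dumps(steps)


backend: VLMBackend = KeywordMockVLM()
plan_json = backend.generate_plan(
    "pick up the red block and place it on the blue one"
)
```

Because callers only depend on the interface, a model-backed implementation can replace the mock without touching the parser or executor.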

Object grounding maps text descriptions like "crimson cube" to actual object poses in the scene. The SimGrounder uses score-based matching: color match (+1.0), shape match (+0.5), name match (+0.2). It handles 15+ color synonyms (crimson→red, azure→blue, jade→green) and 9 shape aliases (block/cube/brick→box).

# Example pipeline execution
planner = Planner(MockVLM())
plan = planner.plan("pick up the red block and place it on the blue one")
# → TaskPlan: [move_to(red), pick(red), place(blue)]

grounder = SimGrounder()
result = grounder.ground("red block", scene_info)
# → GroundingResult(matched=obj_red_box, confidence=1.5)
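
The scoring rule described above can be sketched in a few lines. The weights come from the text; the `SceneObject` dataclass, the `score` and `ground` helpers, and the shortened synonym tables are assumptions for illustration rather than the SimGrounder source.

```python
from dataclasses import dataclass

# Small excerpts of the synonym tables; the full lists are not reproduced here.
COLOR_SYNONYMS = {"crimson": "red", "azure": "blue", "jade": "green"}
SHAPE_ALIASES = {"block": "box", "cube": "box", "brick": "box"}


@dataclass
class SceneObject:
    name: str
    color_name: str
    shape: str


def score(description: str, obj: SceneObject) -> float:
    """Sum the weights from the text: color +1.0, shape +0.5, name +0.2."""
    text = description.lower()
    # Map each word to its canonical color or shape where a synonym exists.
    canonical = {COLOR_SYNONYMS.get(w, SHAPE_ALIASES.get(w, w)) for w in text.split()}
    total = 0.0
    if obj.color_name in canonical:
        total += 1.0
    if obj.shape in canonical:
        total += 0.5
    if obj.name in text:
        total += 0.2
    return total


def ground(description: str, objects: list[SceneObject]) -> tuple[SceneObject, float]:
    """Return the best-scoring object and its score."""
    best = max(objects, key=lambda o: score(description, o))
    return best, score(description, best)


scene = [
    SceneObject("obj_red_box", "red", "box"),
    SceneObject("obj_blue_cylinder", "blue", "cylinder"),
]
match, confidence = ground("crimson cube", scene)
```

Weighting color above shape means an ambiguous description like "the red one" still resolves to a red object before any shape cue is considered.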

RL policies and scripted baselines

Motor control uses SAC (Soft Actor-Critic) with automatic entropy tuning. The actor is a small MLP (obs→256→256→action) with squashed Gaussian output. Training runs at ~170 FPS on Apple Silicon (CPU-only).
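
For readers unfamiliar with the squashed Gaussian output, the sketch below shows the sampling step and the tanh log-probability correction in plain Python. It assumes the MLP has already produced a per-dimension mean and log standard deviation; the function name and the small epsilon are illustrative choices, not taken from the repo.

```python
import math
import random


def sample_squashed_gaussian(mean, log_std, rng):
    """Sample a = tanh(u), u ~ N(mean, std^2), and return (a, log_prob(a))."""
    action, log_prob = [], 0.0
    for mu, ls in zip(mean, log_std):
        std = math.exp(ls)
        u = mu + std * rng.gauss(0.0, 1.0)
        # Gaussian log-density of the pre-squash sample.
        log_prob += -0.5 * ((u - mu) / std) ** 2 - ls - 0.5 * math.log(2 * math.pi)
        a = math.tanh(u)
        # Change-of-variables correction for the tanh squash.
        log_prob -= math.log(1.0 - a * a + 1e-6)
        action.append(a)
    return action, log_prob


rng = random.Random(0)
# Four dimensions to match the delta x/y/z + gripper action space.
action, log_prob = sample_squashed_gaussian([0.0] * 4, [-1.0] * 4, rng)
```

The tanh keeps every action component inside (-1, 1), and the corrected log-probability is what the entropy term in SAC operates on.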

Scripted baselines provide comparison points:

ScriptedPickPlace: 8-phase state machine (approach → descend → grasp → lift → move → lower → release → done). Uses privileged info (exact object/goal positions).
ScriptedMoveTo: P-controller that drives the EE toward the target object. It reaches a 20% success rate, so proportional control alone solves some simple approach tasks but leaves most episodes unfinished.
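
A proportional controller of the kind ScriptedMoveTo describes can be sketched as follows. The gain, the clipping range, the neutral gripper value, and the toy integration loop are all assumptions for illustration; in the real environment, contact and arm dynamics make the problem harder than this idealised loop suggests.

```python
def p_controller_action(ee_pos, target_pos, gain=5.0, max_delta=1.0):
    """Return a 4D action (dx, dy, dz, gripper) that moves the EE toward the target."""
    deltas = [gain * (t - e) for e, t in zip(ee_pos, target_pos)]
    clipped = [max(-max_delta, min(max_delta, d)) for d in deltas]
    return clipped + [0.0]  # gripper command left at 0.0 (assumed neutral)


# Idealised rollout: each step moves the EE by 0.05 * action, with no physics.
ee = [0.0, 0.0, 0.3]
target = [0.2, -0.1, 0.05]
for _ in range(50):
    action = p_controller_action(ee, target)
    ee = [e + 0.05 * a for e, a in zip(ee, action[:3])]

distance = sum((e - t) ** 2 for e, t in zip(ee, target)) ** 0.5
```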

Benchmark results

100 episodes per task/policy pair, randomized initial placement, 200-step max:

Task	Random	Scripted	Notes
L1 — Pick & Place	0.0%	0.0%	Scripted gets 2.6× better returns
L2 — Color Pick	0.0%	—	Multi-object color selection
L3 — Stack	0.0%	—	Requires sequential precision
L4 — Sort	0.0%	—	N objects → N zones
L5 — Language	4.0%	—	Random meets some conditions by chance
Move To	0.0%	20.0%	P-controller reaches targets
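
The evaluation protocol (fixed episode count, hard step cap, success rate as the metric) can be sketched as below. `ToyReachEnv` is a one-dimensional stand-in invented for this example, not one of the MuJoCo tasks, and the `evaluate` signature is hypothetical.

```python
import random


class ToyReachEnv:
    """Stand-in 1-D environment, not the MuJoCo task: succeed when |x| < 0.05."""

    def __init__(self, rng):
        self.rng = rng
        self.x = 0.0

    def reset(self):
        self.x = self.rng.uniform(-1.0, 1.0)
        return self.x

    def step(self, action):
        self.x += 0.1 * max(-1.0, min(1.0, action))
        success = abs(self.x) < 0.05
        return self.x, success


def evaluate(env, policy, episodes=100, max_steps=200):
    """Return the fraction of episodes that succeed within max_steps."""
    successes = 0
    for _ in range(episodes):
        obs = env.reset()
        for _ in range(max_steps):  # hard cap: the episode is truncated here
            obs, success = env.step(policy(obs))
            if success:
                successes += 1
                break
    return successes / episodes


rng = random.Random(0)
env = ToyReachEnv(rng)
random_sr = evaluate(env, lambda obs: rng.uniform(-1.0, 1.0))
p_control_sr = evaluate(env, lambda obs: -10.0 * obs)
```

The inner `range(max_steps)` loop is what guarantees that every policy, including a random one, terminates.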

The honest take: multi-phase manipulation tasks (L1-L5) are hard. The scripted 8-phase state machine can't complete a full pick-and-place in 200 steps (0% SR), though it achieves 2.6× better returns than random. The MoveTo task is only partly solved by a simple P-controller (20% SR). Higher success rates will likely require trained RL policies with 500K+ environment steps or curriculum learning.

The pipeline infrastructure works end-to-end: instruction → VLM decomposition → object grounding → policy execution → MuJoCo feedback. The bottleneck is motor skill, not planning.

What I learned

Dataclass field names matter. The ObjectSpec uses color_name not color. A one-field mismatch caused 14 test failures across 3 modules. Always check the actual source, never assume.
Episode truncation is critical. Without a step limit, random agent evaluations run forever. Adding max_episode_steps=200 to the base class fixed all downstream environments.
Mock backends unlock testing. MockVLM (keyword matching) + SimGrounder (privileged info) let the full pipeline be tested without GPU, model downloads, or non-determinism. 288 tests run in 17 seconds.
Honest benchmarks over impressive numbers. 0% pick-and-place success is real. The architecture is correct and tested — the motor skills need more training budget.
Jacobian pseudoinverse works well. Damped least squares (λ=0.01) handles singularities smoothly. Delta-EE control at 20Hz gives the RL agent a natural action space.

The full codebase is at github.com/ajliouat/robollm — 288 tests, 10 releases from scaffold to stable, honest benchmark numbers from real evaluation runs.