Current robotic systems use either brittle hand-coded planners or data-hungry end-to-end policies.
RoboLLM combines the strengths of both: a small VLM (PaliGemma-3B, 4-bit quantized) handles high-level
task understanding and decomposition, while SAC-trained motor policies execute precise low-level control.
This follows the SayCan / RT-2 / Code-as-Policies research direction, scoped to be reproducible on a single T4 GPU.
Tech stack: MuJoCo · PaliGemma-3B (4-bit) · DINOv2 · SAC · PPO · PyTorch
Task complexity levels
| Level | Task | Objects | Success metric |
|---|---|---|---|
| L1 | Pick and place | 1 | Object at target ± 2 cm |
| L2 | Color-conditioned pick | 3 | Correct object at target |
| L3 | Block stacking | 2–3 | Stable stack, correct order |
| L4 | Color sorting into bins | 4–6 | All in correct bins |
| L5 | Complex language instruction | 3+ | All sub-tasks completed |
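The L1 metric ("object at target ± 2 cm") reduces to a small distance predicate. A minimal sketch, assuming poses are (x, y, z) in meters and success is checked in the XY plane; the function name is illustrative, not from the codebase:

```python
import numpy as np

def l1_success(obj_pos, target_pos, tol=0.02):
    """L1 check: object center within ±2 cm (0.02 m) of the target in XY."""
    err = np.linalg.norm(np.asarray(obj_pos)[:2] - np.asarray(target_pos)[:2])
    return bool(err <= tol)

print(l1_success([0.30, 0.10, 0.02], [0.31, 0.10, 0.00]))  # 1 cm off -> True
print(l1_success([0.30, 0.10, 0.02], [0.35, 0.10, 0.00]))  # 5 cm off -> False
```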
Technical approach
- Environment: 7-DoF Franka Panda arm on 60×60 cm tabletop in MuJoCo. 20 Hz control, randomized object spawning.
- VLM planner: Overhead camera image + instruction → JSON sub-task sequence. Prompt-engineered for primitive actions (pick, place, move_to, place_on).
- Object grounding: DINOv2-small features + nearest-neighbor matching to map descriptions to object poses.
- RL policies: SAC with automatic entropy tuning. One policy per primitive. 500K env steps per policy (~4 hrs on T4).
- Baselines: Scripted planner + PD controller (upper bound on planning), random policy (sanity check), end-to-end RL (no hierarchy).
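The grounding step (DINOv2 features + nearest-neighbor matching) boils down to a cosine-similarity argmax over per-object feature vectors. A sketch with toy 3-D vectors standing in for 384-D DINOv2-small embeddings; function and variable names are illustrative:

```python
import numpy as np

def ground_object(query_feat, object_feats, object_ids):
    """Return the object whose feature is most cosine-similar to the query."""
    q = np.asarray(query_feat, dtype=float)
    q /= np.linalg.norm(q)
    feats = np.asarray(object_feats, dtype=float)
    feats /= np.linalg.norm(feats, axis=1, keepdims=True)
    sims = feats @ q                      # cosine similarity per object
    best = int(np.argmax(sims))
    return object_ids[best], float(sims[best])

# Toy features standing in for DINOv2-small embeddings of object crops.
feats = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
ids = ["red_cube", "blue_cube", "green_ball"]
print(ground_object([0.9, 0.2, 0.0], feats, ids)[0])  # red_cube
```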
Benchmarks
100 eval episodes per task, randomized initial placement. Max 200 steps per episode. 95% Wilson CI.
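The 95% Wilson interval used in the tables can be computed directly from the success count; a minimal sketch:

```python
import math

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score interval for a binomial success rate."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

lo, hi = wilson_ci(20, 100)  # e.g. scripted MoveTo at 20/100
print(f"{lo:.3f}-{hi:.3f}")  # 0.133-0.289
```

Unlike the normal approximation, the Wilson interval stays inside [0, 1] even at 0/100, which is why the zero-success rows still get a nonzero error bar.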
| Task | Random SR | Scripted SR | Scripted mean return |
|---|---|---|---|
| L1 — Pick & Place | 0.0% ± 1.8% | 0.0% ± 1.8% | -53.2 (2.6× random) |
| L2 — Color Pick | 0.0% ± 1.8% | — | — |
| L3 — Stack | 0.0% ± 1.8% | — | — |
| L4 — Sort | 0.0% ± 1.8% | — | — |
| L5 — Language | 4.0% ± 4.1% | — | — |
| Move To | 0.0% ± 1.8% | 20.0% ± 7.8% | -260.3 (2.8× random) |
| Pipeline Component | Accuracy | Coverage |
|---|---|---|
| VLM decomposition (MockVLM) | 100% (20+ scenarios) | 6 primitive types |
| Object grounding (SimGrounder) | 100% (20+ queries) | 15+ color aliases, 9 shape aliases |
| Hierarchical executor | End-to-end tested | L1–L5 tasks |
Scripted MoveTo reaches a 20% success rate with a P-controller. The multi-phase tasks (L1–L5) need trained RL policies or curriculum learning to reach higher success rates. 288 tests pass across the full codebase.
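The scripted MoveTo baseline's P-controller amounts to a speed-clipped proportional velocity command toward the target. A sketch with illustrative gains (not the tuned values from the codebase):

```python
import numpy as np

def p_control(ee_pos, target_pos, kp=5.0, max_speed=0.5):
    """Proportional control: velocity toward the target, clipped to max_speed (m/s)."""
    v = kp * (np.asarray(target_pos, float) - np.asarray(ee_pos, float))
    speed = np.linalg.norm(v)
    if speed > max_speed:
        v *= max_speed / speed
    return v

v = p_control([0.0, 0.0, 0.2], [0.3, 0.0, 0.2])
print(v)  # [0.5 0.  0. ] -- saturated at max_speed along +x
```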
System diagram
```mermaid
flowchart TB
    User["User instruction<br/>'Stack red on blue'"] --> VLM["VLM Planner<br/>PaliGemma-3B (4-bit)"]
    Camera["Overhead camera<br/>128×128 RGB"] --> VLM
    VLM --> Tasks["Sub-task sequence<br/>[pick(red), place_on(blue)]"]
    Tasks --> Policy["RL Policy Head<br/>SAC-trained per primitive"]
    Policy --> Env["MuJoCo Simulation<br/>7-DoF Franka Panda"]
    Env --> Camera
    style VLM fill:#eff6ff,stroke:#2563eb,color:#0f172a
    style Policy fill:#eff6ff,stroke:#2563eb,color:#0f172a
    style Env fill:#eff6ff,stroke:#2563eb,color:#0f172a
```