Most current robotic manipulation systems rely on either brittle hand-coded planners or data-hungry end-to-end policies.
RoboLLM combines the strengths of both: a small VLM (PaliGemma-3B, 4-bit quantized) handles high-level
task understanding and decomposition, while SAC-trained motor policies execute precise low-level control.
This follows the SayCan / RT-2 / Code-as-Policies research direction, scoped to be reproducible on a single T4 GPU.
Stack: MuJoCo · PaliGemma-3B (4-bit) · DINOv2 · SAC · PPO · PyTorch
## Task complexity levels
| Level | Task | Objects | Success metric |
|---|---|---|---|
| L1 | Pick and place | 1 | Object at target ± 2 cm |
| L2 | Color-conditioned pick | 3 | Correct object at target |
| L3 | Block stacking | 2–3 | Stable stack, correct order |
| L4 | Color sorting into bins | 4–6 | All in correct bins |
| L5 | Complex language instruction | 3+ | All sub-tasks completed |
## Technical approach
- Environment: 7-DoF Franka Panda arm on a 60×60 cm tabletop in MuJoCo; 20 Hz control with randomized object spawning.
- VLM planner: overhead camera image + instruction → JSON sub-task sequence, prompt-engineered to emit only the primitives (pick, place, move_to, place_on); see the interface sketch after this list.
- Object grounding: DINOv2-small features + nearest-neighbor matching map object descriptions to object poses (sketched below).
- RL policies: SAC with automatic entropy tuning, one policy per primitive, 500K env steps per policy (~4 hrs on a T4); a training sketch follows.
- Baselines: scripted planner + PD controller (upper bound on planning), random policy (sanity check), and end-to-end RL (no hierarchy).
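For concreteness, here is a minimal sketch of the planner interface. The prompt wording, the JSON schema, and the `parse_plan` helper are illustrative assumptions, not the repo's exact code.

```python
import json

# Assumed prompt template; the real prompt engineering lives in the repo.
PLAN_PROMPT = (
    "You control a robot arm. Decompose the instruction into a JSON list of "
    "steps using only these primitives: pick, place, move_to, place_on.\n"
    "Instruction: {instruction}\nJSON:"
)

def parse_plan(vlm_output: str) -> list[dict]:
    """Validate the VLM's JSON sub-task sequence, e.g.
    [{"action": "pick", "target": "red"}, {"action": "place_on", "target": "blue"}]."""
    plan = json.loads(vlm_output)
    allowed = {"pick", "place", "move_to", "place_on"}
    for step in plan:
        if step.get("action") not in allowed:
            raise ValueError(f"unknown primitive: {step.get('action')}")
    return plan
```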
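The grounding step can be sketched as nearest-neighbor matching in DINOv2 feature space. `dinov2_vits14` is the small model from the official hub release; the crop-and-prototype pipeline (a bank of unit-normalized exemplar embeddings per description) is an assumed design, not necessarily the repo's.

```python
import torch
import torch.nn.functional as F

# DINOv2-small (ViT-S/14) from the official facebookresearch hub release.
dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

@torch.no_grad()
def embed(crops: torch.Tensor) -> torch.Tensor:
    """(N, 3, 224, 224) normalized object crops -> (N, 384) unit feature vectors."""
    return F.normalize(dino(crops), dim=-1)

def ground(prototypes: dict[str, torch.Tensor], crops: torch.Tensor) -> dict[str, int]:
    """Map each description (e.g. 'red block') to the index of its best-matching
    crop by cosine similarity; prototypes are (384,) unit vectors built offline
    from labeled exemplar crops."""
    feats = embed(crops)  # (N, 384)
    return {name: int((feats @ p).argmax()) for name, p in prototypes.items()}
```

The matched crop index is then looked up against the simulator's object poses.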
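Training one primitive policy with automatic entropy tuning could look like the following. Stable-Baselines3 is an assumed library choice (`ent_coef="auto"` is its switch for entropy auto-tuning), and `Pendulum-v1` stands in for the project-specific per-primitive Franka env.

```python
import gymnasium as gym
from stable_baselines3 import SAC

# Stand-in continuous-control env; the real per-primitive MuJoCo env is project-specific.
env = gym.make("Pendulum-v1")

model = SAC(
    "MlpPolicy",
    env,
    ent_coef="auto",      # automatic entropy-coefficient tuning
    learning_rate=3e-4,
    verbose=1,
)
model.learn(total_timesteps=500_000)  # budget above: 500K env steps per primitive
model.save("sac_pick")
```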
## Benchmarks
100 evaluation episodes per task with randomized initial object placement; episodes are capped at 200 steps. A sketch of the evaluation loop follows the table.
| Task | Scripted | End-to-End RL | RoboLLM (ours) |
|---|---|---|---|
| L1 — Pick & Place | — | — | — |
| L2 — Color Pick | — | — | — |
| L3 — Stack | — | — | — |
| L4 — Sort | — | — | — |
| L5 — Language | N/A | — | — |
Results will be populated from real evaluation runs; demo videos will be committed to the repository.
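The protocol above maps to a simple loop. The Gymnasium-style env API and the `info["success"]` flag are assumptions about the task environments.

```python
def evaluate(env, policy, episodes: int = 100, max_steps: int = 200) -> float:
    """Success rate over `episodes` rollouts; reset() randomizes object placement.
    Assumes the env reports task completion via info["success"]."""
    successes = 0
    for _ in range(episodes):
        obs, info = env.reset()
        for _ in range(max_steps):
            obs, reward, terminated, truncated, info = env.step(policy(obs))
            if terminated or truncated:
                break
        successes += bool(info.get("success", False))
    return successes / episodes
```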
## System diagram
```mermaid
flowchart TB
    User["User instruction<br/>'Stack red on blue'"] --> VLM["VLM Planner<br/>PaliGemma-3B (4-bit)"]
    Camera["Overhead camera<br/>128×128 RGB"] --> VLM
    VLM --> Tasks["Sub-task sequence<br/>[pick(red), place_on(blue)]"]
    Tasks --> Policy["RL Policy Head<br/>SAC-trained per primitive"]
    Policy --> Env["MuJoCo Simulation<br/>7-DoF Franka Panda"]
    Env --> Camera
    style VLM fill:#eff6ff,stroke:#2563eb,color:#0f172a
    style Policy fill:#eff6ff,stroke:#2563eb,color:#0f172a
    style Env fill:#eff6ff,stroke:#2563eb,color:#0f172a
```