LLM × Robotics × GPU

RoboLLM

A hierarchical system for language-driven robotic manipulation: a vision-language model decomposes natural language instructions into sub-tasks, and RL-trained policies execute each step in MuJoCo simulation.

MuJoCo · PaliGemma-3B · SAC · PyTorch
Current robotic systems use either brittle hand-coded planners or data-hungry end-to-end policies. RoboLLM combines the strengths of both: a small VLM (PaliGemma-3B, 4-bit quantized) handles high-level task understanding and decomposition, while SAC-trained motor policies execute precise low-level control. This follows the SayCan / RT-2 / Code-as-Policies research direction, scoped to be reproducible on a single T4 GPU.
MuJoCo · PaliGemma-3B (4-bit) · DINOv2 · SAC · PPO · PyTorch

Task complexity levels

| Level | Task | Objects | Success metric |
|---|---|---|---|
| L1 | Pick and place | 1 | Object at target ± 2 cm |
| L2 | Color-conditioned pick | 3 | Correct object at target |
| L3 | Block stacking | 2–3 | Stable stack, correct order |
| L4 | Color sorting into bins | 4–6 | All in correct bins |
| L5 | Complex language instruction | 3+ | All sub-tasks completed |
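The L1 criterion ("object at target ± 2 cm") can be sketched as a simple distance check. The function name and the (x, y) pose representation here are hypothetical, not taken from the codebase:

```python
import numpy as np

def l1_success(obj_pos, target_pos, tol=0.02):
    """L1 success check: object center within 2 cm of the target.

    obj_pos / target_pos: hypothetical (x, y) positions in metres.
    """
    dist = np.linalg.norm(np.asarray(obj_pos) - np.asarray(target_pos))
    return bool(dist <= tol)

print(l1_success([0.30, 0.10], [0.31, 0.10]))  # 1 cm off → True
print(l1_success([0.30, 0.10], [0.35, 0.10]))  # 5 cm off → False
```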

Technical approach

  • Environment: 7-DoF Franka Panda arm on 60×60 cm tabletop in MuJoCo. 20 Hz control, randomized object spawning.
  • VLM planner: Overhead camera image + instruction → JSON sub-task sequence. Prompt-engineered for primitive actions (pick, place, move_to, place_on).
  • Object grounding: DINOv2-small features + nearest-neighbor matching to map descriptions to object poses.
  • RL policies: SAC with automatic entropy tuning. One policy per primitive. 500K env steps per policy (~4 hrs on T4).
  • Baselines: Scripted planner + PD controller (upper bound on planning), random policy (sanity check), end-to-end RL (no hierarchy).
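The planner's contract (instruction in, JSON sub-task sequence out, restricted to the four primitives) suggests validating the VLM output before execution. A minimal sketch, assuming a hypothetical `{"action": ..., "target": ...}` schema — the real prompt format may differ:

```python
import json

# Primitive action set from the planner prompt spec
PRIMITIVES = {"pick", "place", "move_to", "place_on"}

def parse_plan(vlm_output: str):
    """Parse the VLM's JSON sub-task sequence, rejecting unknown primitives."""
    plan = json.loads(vlm_output)
    for step in plan:
        if step["action"] not in PRIMITIVES:
            raise ValueError(f"unknown primitive: {step['action']}")
    return plan

raw = '[{"action": "pick", "target": "red block"},' \
      ' {"action": "place_on", "target": "blue block"}]'
for step in parse_plan(raw):
    print(step["action"], "->", step["target"])
```

Validating up front keeps a hallucinated action name from reaching the policy layer as an unhandled dispatch.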

Benchmarks

100 eval episodes per task with randomized initial placement; max 200 steps per episode. Success rates are reported with 95% Wilson confidence intervals.

| Task | Random | Scripted | Mean Return (Scripted) |
|---|---|---|---|
| L1 — Pick & Place | 0.0% ± 1.8% | 0.0% ± 1.8% | -53.2 (2.6× random) |
| L2 — Color Pick | 0.0% ± 1.8% | | |
| L3 — Stack | 0.0% ± 1.8% | | |
| L4 — Sort | 0.0% ± 1.8% | | |
| L5 — Language | 4.0% ± 4.1% | | |
| Move To | 0.0% ± 1.8% | 20.0% ± 7.8% | -260.3 (2.8× random) |

| Pipeline Component | Accuracy | Coverage |
|---|---|---|
| VLM decomposition (MockVLM) | 100% (20+ scenarios) | 6 primitive types |
| Object grounding (SimGrounder) | 100% (20+ queries) | 15+ color aliases, 9 shape aliases |
| Hierarchical executor | End-to-end tested | L1–L5 tasks |
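The grounding step (DINOv2 features + nearest-neighbor matching) reduces to a cosine-similarity argmax over per-object feature vectors. A sketch with toy 2-D embeddings standing in for DINOv2-small features — names and shapes here are illustrative only:

```python
import numpy as np

def ground(query_emb, object_embs, object_poses):
    """Map a description embedding to an object pose via cosine nearest neighbor."""
    q = query_emb / np.linalg.norm(query_emb)
    objs = object_embs / np.linalg.norm(object_embs, axis=1, keepdims=True)
    idx = int(np.argmax(objs @ q))  # index of highest cosine similarity
    return object_poses[idx]

# Toy stand-ins: two objects with orthogonal features
embs = np.array([[1.0, 0.0], [0.0, 1.0]])
poses = [(0.2, 0.1), (0.4, -0.1)]
print(ground(np.array([0.9, 0.1]), embs, poses))  # → (0.2, 0.1)
```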

Scripted MoveTo achieves a 20% success rate with a P-controller. Multi-phase tasks (L1–L5) require trained RL policies or curriculum learning to reach higher success rates. 288 tests pass across the full codebase.
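The reported intervals follow the Wilson score formula; e.g. 20/100 successes gives roughly 21.1% ± 7.8%, matching the MoveTo row. A minimal sketch:

```python
from math import sqrt

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score interval for a binomial success rate."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

lo, hi = wilson_ci(20, 100)  # scripted MoveTo: 20/100 episodes
print(f"[{lo:.3f}, {hi:.3f}]")
```

Note the interval is centred slightly above the raw rate, which is why a 0/100 result reports as 0.0% ± 1.8% rather than a point estimate of zero.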

System diagram

```mermaid
flowchart TB
    User["User instruction<br/>'Stack red on blue'"] --> VLM["VLM Planner<br/>PaliGemma-3B (4-bit)"]
    Camera["Overhead camera<br/>128×128 RGB"] --> VLM
    VLM --> Tasks["Sub-task sequence<br/>[pick(red), place_on(blue)]"]
    Tasks --> Policy["RL Policy Head<br/>SAC-trained per primitive"]
    Policy --> Env["MuJoCo Simulation<br/>7-DoF Franka Panda"]
    Env --> Camera
    style VLM fill:#eff6ff,stroke:#2563eb,color:#0f172a
    style Policy fill:#eff6ff,stroke:#2563eb,color:#0f172a
    style Env fill:#eff6ff,stroke:#2563eb,color:#0f172a
```