LLM × Robotics × GPU

RoboLLM

A hierarchical system for language-driven robotic manipulation: a vision-language model decomposes natural language instructions into sub-tasks, and RL-trained policies execute each step in MuJoCo simulation.

MuJoCo · PaliGemma-3B · SAC · PyTorch
Current robotic systems use either brittle hand-coded planners or data-hungry end-to-end policies. RoboLLM combines the strengths of both: a small VLM (PaliGemma-3B, 4-bit quantized) handles high-level task understanding and decomposition, while SAC-trained motor policies execute precise low-level control. This follows the SayCan / RT-2 / Code-as-Policies research direction, scoped to be reproducible on a single T4 GPU.
MuJoCo · PaliGemma-3B (4-bit) · DINOv2 · SAC · PPO · PyTorch

Task complexity levels

| Level | Task | Objects | Success metric |
|-------|------|---------|----------------|
| L1 | Pick and place | 1 | Object at target ± 2 cm |
| L2 | Color-conditioned pick | 3 | Correct object at target |
| L3 | Block stacking | 2–3 | Stable stack, correct order |
| L4 | Color sorting into bins | 4–6 | All objects in correct bins |
| L5 | Complex language instruction | 3+ | All sub-tasks completed |

Technical approach

  • Environment: 7-DoF Franka Panda arm on a 60×60 cm tabletop in MuJoCo. 20 Hz control, randomized object spawning.
  • VLM planner: Overhead camera image + instruction → JSON sub-task sequence. Prompt-engineered for primitive actions (pick, place, move_to, place_on).
  • Object grounding: DINOv2-small features + nearest-neighbor matching to map descriptions to object poses.
  • RL policies: SAC with automatic entropy tuning. One policy per primitive. 500K env steps per policy (~4 hrs on T4).
  • Baselines: Scripted planner + PD controller (upper bound on planning), random policy (sanity check), end-to-end RL (no hierarchy).
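Before executing a plan, the VLM's JSON output should be validated against the primitive vocabulary so a malformed response triggers a re-prompt rather than an unsafe rollout. A minimal sketch, assuming a schema like `[{"action": ..., "target": ...}]` (the exact schema and `parse_plan` helper are illustrative, not the project's actual interface):

```python
import json

# Primitive vocabulary from the planner prompt.
PRIMITIVES = {"pick", "place", "move_to", "place_on"}

def parse_plan(vlm_output: str) -> list[dict]:
    """Parse and validate the VLM's JSON sub-task sequence.

    Raises ValueError on a malformed plan or unknown action so the
    caller can re-prompt the planner instead of executing a bad plan.
    """
    plan = json.loads(vlm_output)
    if not isinstance(plan, list):
        raise ValueError("plan must be a JSON list of sub-tasks")
    for step in plan:
        if not isinstance(step, dict) or step.get("action") not in PRIMITIVES:
            raise ValueError(f"unknown primitive in step: {step!r}")
    return plan

# Example plan for "Stack red on blue".
raw = '[{"action": "pick", "target": "red block"}, {"action": "place_on", "target": "blue block"}]'
plan = parse_plan(raw)
```

Rejecting unknown actions at this boundary keeps the policy heads decoupled from whatever free-form text the VLM might emit.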
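The nearest-neighbor grounding step reduces to cosine similarity between a query feature and per-object crop features. A sketch with toy 3-dimensional vectors standing in for DINOv2-small embeddings (the feature values and function name are illustrative):

```python
import numpy as np

def ground_object(query_feat, object_feats, object_ids):
    """Match a query feature to the closest object crop feature.

    Cosine similarity over L2-normalized features; the real system
    would use DINOv2-small embeddings of per-object image crops.
    """
    q = query_feat / np.linalg.norm(query_feat)
    F = object_feats / np.linalg.norm(object_feats, axis=1, keepdims=True)
    sims = F @ q  # one similarity score per candidate object
    return object_ids[int(np.argmax(sims))]

# Toy features: one row per detected object on the table.
feats = np.array([[1.0, 0.1, 0.0],   # red_block
                  [0.0, 1.0, 0.2],   # blue_block
                  [0.1, 0.0, 1.0]])  # green_block
ids = ["red_block", "blue_block", "green_block"]
print(ground_object(np.array([0.05, 0.0, 0.9]), feats, ids))  # -> green_block
```

The matched object ID is then looked up against simulator state to get the pose passed to the low-level policy.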
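The automatic entropy tuning used by SAC learns the temperature α so that policy entropy tracks a target, commonly `-action_dim` for continuous control. A minimal PyTorch sketch of the temperature loss (the target, learning rate, and batch values below are assumptions, not the project's settings):

```python
import torch

action_dim = 7  # assumed: one command per Franka Panda joint
target_entropy = -float(action_dim)

# Optimize log(alpha) for unconstrained positivity of alpha itself.
log_alpha = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([log_alpha], lr=3e-4)

def alpha_loss(log_probs: torch.Tensor) -> torch.Tensor:
    # Standard SAC temperature objective: gradient raises alpha when
    # entropy (-log_probs) falls below target, lowers it otherwise.
    return -(log_alpha.exp() * (log_probs + target_entropy).detach()).mean()

# One illustrative update with dummy log-probs from a batch of actions.
log_probs = torch.full((256,), -5.0)  # entropy ~5, above the -7 target
opt.zero_grad()
alpha_loss(log_probs).backward()
opt.step()  # alpha is nudged down, weakening the entropy bonus
```

One such temperature is maintained per primitive policy, since each primitive sees a different state distribution.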

Benchmarks

100 evaluation episodes per task with randomized initial object placement; maximum 200 steps per episode.

| Task | Scripted | End-to-End RL | RoboLLM (ours) |
|------|----------|---------------|----------------|
| L1 — Pick & Place | TBD | TBD | TBD |
| L2 — Color Pick | TBD | TBD | TBD |
| L3 — Stack | TBD | TBD | TBD |
| L4 — Sort | TBD | TBD | TBD |
| L5 — Language | N/A | TBD | TBD |

Results will be populated from real evaluation runs; demo videos will be committed to the repository.
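With 100 Bernoulli trials per cell, each success rate carries a confidence interval of up to roughly ±10 points, worth reporting alongside the headline number. A sketch using the normal approximation (the example count of 87 successes is illustrative only):

```python
import math

def success_rate_ci(successes: int, episodes: int = 100, z: float = 1.96):
    """Point estimate and normal-approximation 95% CI for a success rate.

    Returns (rate, lower, upper), clipped to [0, 1].
    """
    p = successes / episodes
    half = z * math.sqrt(p * (1 - p) / episodes)
    return p, max(0.0, p - half), min(1.0, p + half)

rate, lower, upper = success_rate_ci(87)  # e.g. 87/100 successes
```

For rates near 0 or 1 the Wilson interval would be a better choice; the normal approximation is shown here for brevity.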

System diagram

```mermaid
flowchart TB
    User["User instruction<br/>'Stack red on blue'"] --> VLM["VLM Planner<br/>PaliGemma-3B (4-bit)"]
    Camera["Overhead camera<br/>128×128 RGB"] --> VLM
    VLM --> Tasks["Sub-task sequence<br/>[pick(red), place_on(blue)]"]
    Tasks --> Policy["RL Policy Head<br/>SAC-trained per primitive"]
    Policy --> Env["MuJoCo Simulation<br/>7-DoF Franka Panda"]
    Env --> Camera
    style VLM fill:#eff6ff,stroke:#2563eb,color:#0f172a
    style Policy fill:#eff6ff,stroke:#2563eb,color:#0f172a
    style Env fill:#eff6ff,stroke:#2563eb,color:#0f172a
```