Current robotic systems use either brittle hand-coded planners or data-hungry end-to-end policies.
RoboLLM combines the strengths of both: a small VLM (PaliGemma-3B, 4-bit quantized) handles high-level
task understanding and decomposition, while SAC-trained motor policies execute precise low-level control.
This follows the SayCan / RT-2 / Code-as-Policies research direction, scoped to be reproducible on a single T4 GPU.
Tech stack: MuJoCo · PaliGemma-3B (4-bit) · DINOv2 · SAC · PPO · PyTorch
Task complexity levels
| Level | Task | Objects | Success metric |
|---|---|---|---|
| L1 | Pick and place | 1 | Object at target ± 2 cm |
| L2 | Color-conditioned pick | 3 | Correct object at target |
| L3 | Block stacking | 2–3 | Stable stack, correct order |
| L4 | Color sorting into bins | 4–6 | All in correct bins |
| L5 | Complex language instruction | 3+ | All sub-tasks completed |
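The L1 metric ("object at target ± 2 cm") reduces to a small distance predicate. A minimal sketch, assuming poses are (x, y, z) in meters and success is checked in the XY plane; the function name is illustrative, not from the codebase:

```python
import numpy as np

def l1_success(obj_pos, target_pos, tol=0.02):
    """L1 check: object center within ±2 cm (0.02 m) of the target in XY."""
    err = np.linalg.norm(np.asarray(obj_pos)[:2] - np.asarray(target_pos)[:2])
    return bool(err <= tol)

print(l1_success([0.30, 0.10, 0.02], [0.31, 0.10, 0.00]))  # 1 cm off -> True
print(l1_success([0.30, 0.10, 0.02], [0.35, 0.10, 0.00]))  # 5 cm off -> False
```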
Technical approach
- Environment: 7-DoF Franka Panda arm on 60×60 cm tabletop in MuJoCo. 20 Hz control, randomized object spawning.
- VLM planner: Overhead camera image + instruction → JSON sub-task sequence. Prompt-engineered for primitive actions (pick, place, move_to, place_on).
- Object grounding: DINOv2-small features + nearest-neighbor matching to map descriptions to object poses.
- RL policies: SAC with automatic entropy tuning. One policy per primitive. 500K env steps per policy (~4 hrs on T4).
- Baselines: Scripted planner + PD controller (upper bound on planning), random policy (sanity check), end-to-end RL (no hierarchy).
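The grounding step (DINOv2 features + nearest-neighbor matching) boils down to a cosine-similarity argmax over per-object feature vectors. A sketch with toy 3-D vectors standing in for 384-D DINOv2-small embeddings; function and variable names are illustrative:

```python
import numpy as np

def ground_object(query_feat, object_feats, object_ids):
    """Return the object whose feature is most cosine-similar to the query."""
    q = np.asarray(query_feat, dtype=float)
    q /= np.linalg.norm(q)
    feats = np.asarray(object_feats, dtype=float)
    feats /= np.linalg.norm(feats, axis=1, keepdims=True)
    sims = feats @ q                      # cosine similarity per object
    best = int(np.argmax(sims))
    return object_ids[best], float(sims[best])

# Toy features standing in for DINOv2-small embeddings of object crops.
feats = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
ids = ["red_cube", "blue_cube", "green_ball"]
print(ground_object([0.9, 0.2, 0.0], feats, ids)[0])  # red_cube
```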
Benchmarks
100 eval episodes per task, randomized initial placement. Max 200 steps per episode. 95% Wilson CI.
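The 95% Wilson interval used in the tables can be computed directly from the success count; a minimal sketch:

```python
import math

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score interval for a binomial success rate."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

lo, hi = wilson_ci(20, 100)  # e.g. scripted MoveTo at 20/100
print(f"{lo:.3f}-{hi:.3f}")  # 0.133-0.289
```

Unlike the normal approximation, the Wilson interval stays inside [0, 1] even at 0/100, which is why the zero-success rows still get a nonzero error bar.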
| Task | Random SR | Scripted SR | Scripted mean return |
|---|---|---|---|
| L1 — Pick & Place | 0.0% ± 1.8% | 0.0% ± 1.8% | -53.2 (2.6× random) |
| L2 — Color Pick | 0.0% ± 1.8% | — | — |
| L3 — Stack | 0.0% ± 1.8% | — | — |
| L4 — Sort | 0.0% ± 1.8% | — | — |
| L5 — Language | 4.0% ± 4.1% | — | — |
| Move To | 0.0% ± 1.8% | 20.0% ± 7.8% | -260.3 (2.8× random) |
| Pipeline Component | Accuracy | Coverage |
|---|---|---|
| VLM decomposition (MockVLM) | 100% (20+ scenarios) | 6 primitive types |
| Object grounding (SimGrounder) | 100% (20+ queries) | 15+ color aliases, 9 shape aliases |
| Hierarchical executor | End-to-end tested | L1–L5 tasks |
Scripted MoveTo reaches a 20% success rate with a P-controller. The multi-phase tasks (L1–L5) need trained RL policies or curriculum learning to reach higher success rates. 288 tests pass across the full codebase.
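The scripted MoveTo baseline's P-controller amounts to a speed-clipped proportional velocity command toward the target. A sketch with illustrative gains (not the tuned values from the codebase):

```python
import numpy as np

def p_control(ee_pos, target_pos, kp=5.0, max_speed=0.5):
    """Proportional control: velocity toward the target, clipped to max_speed (m/s)."""
    v = kp * (np.asarray(target_pos, float) - np.asarray(ee_pos, float))
    speed = np.linalg.norm(v)
    if speed > max_speed:
        v *= max_speed / speed
    return v

v = p_control([0.0, 0.0, 0.2], [0.3, 0.0, 0.2])
print(v)  # [0.5 0.  0. ] -- saturated at max_speed along +x
```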
System diagram
```mermaid
flowchart TB
    User["User instruction<br/>'Stack red on blue'"] --> VLM["VLM Planner<br/>PaliGemma-3B (4-bit)"]
    Camera["Overhead camera<br/>128×128 RGB"] --> VLM
    VLM --> Tasks["Sub-task sequence<br/>[pick(red), place_on(blue)]"]
    Tasks --> Policy["RL Policy Head<br/>SAC-trained per primitive"]
    Policy --> Env["MuJoCo Simulation<br/>7-DoF Franka Panda"]
    Env --> Camera
    style VLM fill:#eff6ff,stroke:#2563eb,color:#0f172a
    style Policy fill:#eff6ff,stroke:#2563eb,color:#0f172a
    style Env fill:#eff6ff,stroke:#2563eb,color:#0f172a
```