Multimodal Robot Learning with Vision-Language Models
Built with: PyTorch · ROS2 · SAM · LLMs · Isaac Gym
A system integrating vision-language models, segmentation, and reinforcement learning for robotic manipulation. It takes natural language commands to control a robot while interpreting visual scenes and adapting to new environments.
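At a high level, the pieces fit together in a perception-to-action loop. The sketch below is purely illustrative; every helper name in it is hypothetical, not an API from this repo:

```python
# Hypothetical end-to-end control loop tying the components together;
# every helper name here (camera, robot, vlm, segmenter, ...) is illustrative.
def control_loop(command: str, camera, robot, policy, vlm, segmenter, text_encoder):
    lang_emb = text_encoder(command)          # embed the natural-language command
    while not robot.task_done():
        frame = camera.read()
        scene = vlm.describe(frame, command)  # GPT-4V-style scene understanding
        masks = segmenter.segment(frame)      # SAM object masks
        state = robot.observe(scene, masks)   # fuse perception with proprioception
        action = policy(state, lang_emb)      # language-conditioned RL policy
        robot.apply(action)                   # closes the visual feedback loop
```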
Features
- GPT-4V integration for scene understanding (see the vision-language call sketched after this list)
- Segment Anything Model (SAM) for real-time object segmentation (loading sketched below)
- Custom RL policy conditioned on language embeddings (see the policy sketch below)
- Real-time visual feedback loop
- Sim2real transfer with domain adaptation (see the gradient-reversal sketch below)
- Custom CUDA kernels for real-time inference
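The scene-understanding step can be driven through the OpenAI Python client. A minimal sketch follows; the prompt text and model name are illustrative assumptions, not values taken from this repo:

```python
# Sketch of the vision-language scene-understanding call
# (prompt and model name are assumptions).
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("frame.jpg", "rb") as f:
    b64_image = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # any GPT-4-class vision model would work here
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "List the graspable objects in this scene and their rough positions."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64_image}"}},
        ],
    }],
)
scene_description = response.choices[0].message.content
```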
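The segmentation stage can be wired up with the public `segment_anything` package. A minimal sketch, assuming the publicly released ViT-H checkpoint and a single camera frame on disk:

```python
# Sketch of the SAM segmentation stage (checkpoint path and frame source are assumptions).
import cv2
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

# Load a ViT-H SAM checkpoint; "sam_vit_h_4b8939.pth" is the public release filename.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to("cuda")

mask_generator = SamAutomaticMaskGenerator(sam)

# Read one camera frame; SAM expects an RGB HxWx3 uint8 array.
frame = cv2.cvtColor(cv2.imread("frame.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(frame)  # list of dicts: "segmentation", "area", "bbox", ...

# Keep the largest masks as object candidates for downstream control.
objects = sorted(masks, key=lambda m: m["area"], reverse=True)[:5]
```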
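A language-conditioned policy can be as simple as concatenating a frozen text embedding with the proprioceptive state before the actor MLP. This is a minimal PyTorch sketch; all dimensions are placeholders:

```python
# Sketch of a language-conditioned actor network (all dimensions are placeholders).
import torch
import torch.nn as nn

class LanguageConditionedPolicy(nn.Module):
    def __init__(self, state_dim: int = 32, lang_dim: int = 768, action_dim: int = 7):
        super().__init__()
        # Project the (frozen) language embedding into a compact conditioning vector.
        self.lang_proj = nn.Linear(lang_dim, 64)
        self.backbone = nn.Sequential(
            nn.Linear(state_dim + 64, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        self.action_head = nn.Linear(256, action_dim)  # e.g. 7-DoF arm deltas

    def forward(self, state: torch.Tensor, lang_emb: torch.Tensor) -> torch.Tensor:
        cond = torch.relu(self.lang_proj(lang_emb))
        features = self.backbone(torch.cat([state, cond], dim=-1))
        return torch.tanh(self.action_head(features))  # actions in [-1, 1]

policy = LanguageConditionedPolicy()
action = policy(torch.randn(1, 32), torch.randn(1, 768))
```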
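For sim2real transfer, one common domain-adaptation recipe is a DANN-style gradient reversal layer that pushes the feature extractor toward features indistinguishable between simulated and real frames. This is a generic sketch of that technique, not necessarily this repo's exact method:

```python
# Sketch of a DANN-style gradient reversal layer for sim/real feature alignment
# (a standard technique; not confirmed as this repo's exact approach).
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale: float):
        ctx.scale = scale
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        # Identity on the forward pass, negated (scaled) gradient on the backward pass.
        return -ctx.scale * grad_output, None

class DomainDiscriminator(nn.Module):
    """Predicts sim vs. real from features seen through the reversal layer."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, features: torch.Tensor, scale: float = 1.0) -> torch.Tensor:
        return self.net(GradReverse.apply(features, scale))
```

Training the discriminator to classify the domain while the reversed gradient flows into the shared encoder drives the encoder toward domain-invariant features, which is what makes the policy transfer from simulation to hardware.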