Multimodal Robot Learning with Vision-Language Models
Built with: PyTorch · ROS2 · SAM · LLMs · Isaac Gym
A system integrating vision-language models, segmentation, and reinforcement learning for robotic manipulation. It takes natural language commands to control a robot while interpreting visual scenes and adapting to new environments.
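At a high level, the pieces fit together in a perception-to-action loop. The sketch below is purely illustrative; every helper name in it is hypothetical, not an API from this repo:

```python
# Hypothetical end-to-end control loop tying the components together;
# every helper name here (camera, robot, vlm, segmenter, ...) is illustrative.
def control_loop(command: str, camera, robot, policy, vlm, segmenter, text_encoder):
    lang_emb = text_encoder(command)          # embed the natural-language command
    while not robot.task_done():
        frame = camera.read()
        scene = vlm.describe(frame, command)  # GPT-4V-style scene understanding
        masks = segmenter.segment(frame)      # SAM object masks
        state = robot.observe(scene, masks)   # fuse perception with proprioception
        action = policy(state, lang_emb)      # language-conditioned RL policy
        robot.apply(action)                   # closes the visual feedback loop
```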
Features
- GPT-4V integration for scene understanding (see the vision-language call sketched after this list)
- Segment Anything Model (SAM) for real-time object segmentation (loading sketched below)
- Custom RL policy conditioned on language embeddings (see the policy sketch below)
- Real-time visual feedback loop
- Sim2real transfer with domain adaptation (see the gradient-reversal sketch below)
- Custom CUDA kernels for real-time inference
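The scene-understanding step can be driven through the OpenAI Python client. A minimal sketch follows; the prompt text and model name are illustrative assumptions, not values taken from this repo:

```python
# Sketch of the vision-language scene-understanding call
# (prompt and model name are assumptions).
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("frame.jpg", "rb") as f:
    b64_image = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # any GPT-4-class vision model would work here
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "List the graspable objects in this scene and their rough positions."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64_image}"}},
        ],
    }],
)
scene_description = response.choices[0].message.content
```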
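The segmentation stage can be wired up with the public `segment_anything` package. A minimal sketch, assuming the publicly released ViT-H checkpoint and a single camera frame on disk:

```python
# Sketch of the SAM segmentation stage (checkpoint path and frame source are assumptions).
import cv2
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

# Load a ViT-H SAM checkpoint; "sam_vit_h_4b8939.pth" is the public release filename.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to("cuda")

mask_generator = SamAutomaticMaskGenerator(sam)

# Read one camera frame; SAM expects an RGB HxWx3 uint8 array.
frame = cv2.cvtColor(cv2.imread("frame.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(frame)  # list of dicts: "segmentation", "area", "bbox", ...

# Keep the largest masks as object candidates for downstream control.
objects = sorted(masks, key=lambda m: m["area"], reverse=True)[:5]
```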
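A language-conditioned policy can be as simple as concatenating a frozen text embedding with the proprioceptive state before the actor MLP. This is a minimal PyTorch sketch; all dimensions are placeholders:

```python
# Sketch of a language-conditioned actor network (all dimensions are placeholders).
import torch
import torch.nn as nn

class LanguageConditionedPolicy(nn.Module):
    def __init__(self, state_dim: int = 32, lang_dim: int = 768, action_dim: int = 7):
        super().__init__()
        # Project the (frozen) language embedding into a compact conditioning vector.
        self.lang_proj = nn.Linear(lang_dim, 64)
        self.backbone = nn.Sequential(
            nn.Linear(state_dim + 64, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        self.action_head = nn.Linear(256, action_dim)  # e.g. 7-DoF arm deltas

    def forward(self, state: torch.Tensor, lang_emb: torch.Tensor) -> torch.Tensor:
        cond = torch.relu(self.lang_proj(lang_emb))
        features = self.backbone(torch.cat([state, cond], dim=-1))
        return torch.tanh(self.action_head(features))  # actions in [-1, 1]

policy = LanguageConditionedPolicy()
action = policy(torch.randn(1, 32), torch.randn(1, 768))
```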
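For sim2real transfer, one common domain-adaptation recipe is a DANN-style gradient reversal layer that pushes the feature extractor toward features indistinguishable between simulated and real frames. This is a generic sketch of that technique, not necessarily this repo's exact method:

```python
# Sketch of a DANN-style gradient reversal layer for sim/real feature alignment
# (a standard technique; not confirmed as this repo's exact approach).
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale: float):
        ctx.scale = scale
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        # Identity on the forward pass, negated (scaled) gradient on the backward pass.
        return -ctx.scale * grad_output, None

class DomainDiscriminator(nn.Module):
    """Predicts sim vs. real from features seen through the reversal layer."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, features: torch.Tensor, scale: float = 1.0) -> torch.Tensor:
        return self.net(GradReverse.apply(features, scale))
```

Training the discriminator to classify the domain while the reversed gradient flows into the shared encoder drives the encoder toward domain-invariant features, which is what makes the policy transfer from simulation to hardware.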