GenRL: Multimodal-foundation world models
for generalization in embodied agents
Pietro Mazzaglia, Tim Verbelen, Bart Dhoedt, Aaron Courville, Sai Rajeswar
NeurIPS 2024
Multimodal foundation world models ground language and video prompts in embodied domains by turning them into sequences of latent world-model states.
These latent state sequences can be decoded with the model's decoder, allowing the expected behavior to be visualized before the agent is trained to execute it.
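The pipeline above can be sketched conceptually: a prompt is embedded by a frozen multimodal encoder, grounded into an initial world-model latent, rolled forward through latent dynamics, and decoded back to observation space for visualization. The sketch below is purely illustrative: the function names, random linear maps, and dimensions (`EMB_DIM`, `LATENT_DIM`, `OBS_DIM`, `HORIZON`) are hypothetical stand-ins, not GenRL's actual encoder, connector, or decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not GenRL's real sizes).
EMB_DIM, LATENT_DIM, OBS_DIM, HORIZON = 512, 64, 1024, 16

def encode_prompt(prompt: str) -> np.ndarray:
    # Stand-in for a frozen multimodal encoder (e.g. a CLIP-style model):
    # here just a deterministic hash of the prompt bytes into a unit vector.
    feats = np.zeros(EMB_DIM)
    for i, ch in enumerate(prompt.encode()):
        feats[(ch * (i + 1)) % EMB_DIM] += 1.0
    return feats / (np.linalg.norm(feats) + 1e-8)

# Stand-in "connector" and latent dynamics: random linear maps.
W_ground = rng.normal(size=(LATENT_DIM, EMB_DIM)) * 0.1
W_dyn = rng.normal(size=(LATENT_DIM, LATENT_DIM)) * 0.1

def prompt_to_latent_sequence(prompt: str) -> np.ndarray:
    # Ground the prompt embedding into an initial latent state,
    # then roll the latent dynamics forward for HORIZON steps.
    z = np.tanh(W_ground @ encode_prompt(prompt))
    states = [z]
    for _ in range(HORIZON - 1):
        z = np.tanh(W_dyn @ z)
        states.append(z)
    return np.stack(states)  # shape: (HORIZON, LATENT_DIM)

# Stand-in decoder: maps each latent state back to observation space,
# so the expected behavior can be inspected before training the agent.
W_dec = rng.normal(size=(OBS_DIM, LATENT_DIM)) * 0.1

def decode(states: np.ndarray) -> np.ndarray:
    return states @ W_dec.T  # shape: (HORIZON, OBS_DIM)

states = prompt_to_latent_sequence("walk forward")
frames = decode(states)
print(states.shape, frames.shape)
```

An agent could then be trained to reach these target latent states; the decoded `frames` would correspond to the rendered visualization of the prompted behavior.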
Article
Code
🤗 Demo
Models
Datasets