Multimodal foundation world models
for generalist embodied agents

Pietro Mazzaglia, Tim Verbelen, Bart Dhoedt, Aaron Courville, Sai Rajeswar
Multimodal foundation world models allow grounding language and video prompts in embodied domains by turning them into sequences of latent world model states.
These latent state sequences can be decoded with the model's decoder, making it possible to visualize the expected behavior before training the agent to execute it.
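To give a rough picture of the grounding step (not the actual implementation from the article), the sketch below maps a single prompt embedding, as produced by a multimodal foundation model, into a short sequence of latent world model states through a small connector network. All dimensions, module names, and the connector architecture are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Illustrative dimensions (assumptions, not taken from the article).
PROMPT_DIM = 512    # size of the foundation model's prompt embedding
LATENT_DIM = 64     # size of a single world-model latent state
SEQ_LEN = 16        # number of latent states the prompt is grounded into


class PromptConnector(nn.Module):
    """Hypothetical connector: maps one prompt embedding to a sequence
    of latent world-model states. The real model is more elaborate."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(PROMPT_DIM, 1024),
            nn.ReLU(),
            nn.Linear(1024, SEQ_LEN * LATENT_DIM),
        )

    def forward(self, prompt_emb: torch.Tensor) -> torch.Tensor:
        # (batch, PROMPT_DIM) -> (batch, SEQ_LEN, LATENT_DIM)
        flat = self.net(prompt_emb)
        return flat.view(-1, SEQ_LEN, LATENT_DIM)


if __name__ == "__main__":
    # Stand-in for the embedding of a prompt such as "doing a cartwheel".
    prompt_emb = torch.randn(1, PROMPT_DIM)
    latents = PromptConnector()(prompt_emb)
    print(latents.shape)  # torch.Size([1, 16, 64])
```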
Article · Code · 🤗 Demo · Models · Datasets

  Task behaviors

  Behavior retrieval

The agent is asked to solve tasks that are contained in its training set. Tasks are inferred by the agent from text prompts, without any access to the real reward functions.
See the article for the list of tasks and results.
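One way to picture how behaviors can be learned without the real reward functions is to score the agent's own latent trajectory against the latent sequence obtained from the prompt, for example with a per-step cosine similarity. This is only a hedged sketch of the general idea; the exact matching objective used in the article may differ.

```python
import torch
import torch.nn.functional as F


def prompt_alignment_reward(agent_latents: torch.Tensor,
                            target_latents: torch.Tensor) -> torch.Tensor:
    """Per-step reward proxy: cosine similarity between the latent states
    visited by the agent and the latent states inferred from the prompt.

    agent_latents, target_latents: (seq_len, latent_dim)
    Returns a (seq_len,) tensor of reward values in [-1, 1].
    """
    return F.cosine_similarity(agent_latents, target_latents, dim=-1)


# Example with random stand-in latents (no real model involved).
agent_latents = torch.randn(16, 64)
target_latents = torch.randn(16, 64)
rewards = prompt_alignment_reward(agent_latents, target_latents)
print(rewards.shape)  # torch.Size([16])
```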



  Multitask generalization

The agent is asked to solve new tasks that are not contained in the training set. Tasks are inferred by the agent from text prompts, without any access to the real reward functions.
See the article for the list of tasks and results.


  Language prompts decoded

Multimodal foundation world models allow grounding language prompts in the embodied domain.
The world model makes it possible to visualize how a language prompt is interpreted by the model, by decoding the latent states that correspond to the prompt.
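To make the decoding step concrete, the sketch below turns a sequence of latent states into image frames with a small transposed-convolution decoder and stacks them into a video tensor that can be inspected. The decoder architecture and frame size are assumptions made for the example, not the decoder used by the model.

```python
import torch
import torch.nn as nn


class LatentDecoder(nn.Module):
    """Toy decoder (assumed architecture): latent state -> 64x64 RGB frame."""

    def __init__(self, latent_dim: int = 64):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 256 * 4 * 4)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1),  # 4x4 -> 8x8
            nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),   # 8x8 -> 16x16
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),    # 16x16 -> 32x32
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),     # 32x32 -> 64x64
            nn.Sigmoid(),
        )

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        # (seq_len, latent_dim) -> (seq_len, 3, 64, 64)
        x = self.fc(latents).view(-1, 256, 4, 4)
        return self.deconv(x)


# Decode a stand-in latent sequence, e.g. one produced for a language prompt.
latents = torch.randn(16, 64)
frames = LatentDecoder()(latents)  # (16, 3, 64, 64) video to visualize
print(frames.shape)
```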


boxing
doing a cartwheel
crawling
crunch abs
doing a backflip
doing the splits
downward facing dog (yoga pose)
sitting
walking on the knees
karate kick
laying down and kicking
lean backwards
doing the moonwalk
doing push ups

  Video prompts decoded

Multimodal foundation world models allow grounding visual prompts in the embodied domain.
The world model makes it possible to visualize how a video prompt is interpreted by the model, by decoding the latent states that correspond to the prompt.
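Video prompts can be handled analogously to language prompts: the clip is embedded by the multimodal foundation model into the same space as the text embeddings, so the same kind of connector can ground it into latent states. The toy encoder, pooling, and shapes below are assumptions for illustration, not the foundation model's actual video branch.

```python
import torch
import torch.nn as nn

PROMPT_DIM = 512  # shared embedding size of the multimodal foundation model


class ToyVideoEncoder(nn.Module):
    """Stand-in for the foundation model's video branch: per-frame features
    are mean-pooled into a single clip embedding."""

    def __init__(self):
        super().__init__()
        self.frame_net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1),   # 64x64 -> 32x32
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1),  # 32x32 -> 16x16
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, PROMPT_DIM),
        )

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (num_frames, 3, 64, 64) -> clip embedding: (1, PROMPT_DIM)
        return self.frame_net(video).mean(dim=0, keepdim=True)


# A random stand-in for a short 64x64 video prompt.
video_prompt = torch.rand(16, 3, 64, 64)
clip_emb = ToyVideoEncoder()(video_prompt)
print(clip_emb.shape)  # torch.Size([1, 512]); ground it with the same connector
```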