The agent is asked to solve tasks that are part of its training set. The agent infers the tasks from text prompts, without any access to the real reward functions.
See the article for the list of tasks and results.
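To make the idea of prompt-based task inference concrete, here is a minimal sketch of how a text prompt could stand in for the true reward function, by scoring latent states against a prompt embedding. Every name in it (`encode_prompt`, `prompt_reward`, the placeholder encoders) is a hypothetical illustration under these assumptions, not the actual implementation.

```python
# Hedged sketch: a text prompt used in place of a reward function.
# All components are hypothetical placeholders, not the real system.
import numpy as np

rng = np.random.default_rng(0)

def encode_prompt(prompt: str, dim: int = 32) -> np.ndarray:
    """Hypothetical multimodal encoder: maps a text prompt to an embedding."""
    seed = abs(hash(prompt)) % (2**32)  # deterministic pseudo-embedding for the sketch
    return np.random.default_rng(seed).standard_normal(dim)

def prompt_reward(prompt: str, latent_trajectory: np.ndarray) -> np.ndarray:
    """Proxy reward: cosine similarity between the prompt embedding and each
    latent state, used instead of the (inaccessible) true reward function."""
    target = encode_prompt(prompt, dim=latent_trajectory.shape[-1])
    target = target / np.linalg.norm(target)
    states = latent_trajectory / np.linalg.norm(latent_trajectory, axis=-1, keepdims=True)
    return states @ target

# Toy usage: score a random latent trajectory against a task prompt.
trajectory = rng.standard_normal((10, 32))  # 10 latent states of dimension 32
print(prompt_reward("walk forward", trajectory).shape)  # (10,)
```

The same scoring scheme would apply unchanged to new prompts, which is why no task-specific reward needs to be provided.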
The agent is asked to solve new tasks that are not contained in the training set. The agent infers the tasks from text prompts, without any access to the real reward functions.
See the article for the list of tasks and results.
Multimodal foundation world models allow grounding language prompts into the embodied domain.
The world model makes it possible to visualize how the language prompt is interpreted by the model, by decoding the latent states that correspond to the prompt.
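As a rough illustration of this visualization step, the sketch below embeds a language prompt, maps the embedding to a short sequence of latent states, and decodes them into frames. The encoder, connector, and decoder here are hypothetical placeholders standing in for the world model's components.

```python
# Hedged sketch: visualizing how a language prompt is interpreted.
# Encoder, connector, and decoder are toy stand-ins, not the real model.
import numpy as np

LATENT_DIM, FRAME_SHAPE = 32, (64, 64, 3)
rng = np.random.default_rng(0)
DECODER = rng.standard_normal((LATENT_DIM, int(np.prod(FRAME_SHAPE))))  # toy decoder weights

def embed_text(prompt: str) -> np.ndarray:
    """Hypothetical multimodal encoder: text prompt -> shared embedding."""
    seed = abs(hash(prompt)) % (2**32)
    return np.random.default_rng(seed).standard_normal(LATENT_DIM)

def embedding_to_latents(embedding: np.ndarray, horizon: int) -> np.ndarray:
    """Hypothetical connector: maps the prompt embedding to a short sequence
    of world-model latent states (a trivial repeat-with-noise stand-in)."""
    return embedding[None, :] + 0.1 * rng.standard_normal((horizon, LATENT_DIM))

def decode_latents(latents: np.ndarray) -> np.ndarray:
    """Hypothetical world-model decoder: latent states -> video frames."""
    frames = np.tanh(latents @ DECODER)
    return frames.reshape((len(latents), *FRAME_SHAPE))

# Toy usage: decode the latent states corresponding to a prompt into frames.
latents = embedding_to_latents(embed_text("stand on one leg"), horizon=8)
print(decode_latents(latents).shape)  # (8, 64, 64, 3): the model's interpretation of the prompt
```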
Multimodal foundation world models allow grounding visual prompts into the embodied domain.
The world model makes it possible to visualize how the video prompt is interpreted by the model, by decoding the latent states that correspond to the prompt.
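The same kind of sketch applies to video prompts, except that the embedding is computed from observed frames rather than from text before being mapped to latent states and decoded. As before, every component below is a hypothetical placeholder, not the actual implementation.

```python
# Hedged sketch: grounding a video prompt and decoding its interpretation.
# All components are toy stand-ins for the world model's encoder/decoder.
import numpy as np

LATENT_DIM, FRAME_SHAPE = 32, (64, 64, 3)
rng = np.random.default_rng(0)
ENCODER = rng.standard_normal((int(np.prod(FRAME_SHAPE)), LATENT_DIM))  # toy encoder weights
DECODER = rng.standard_normal((LATENT_DIM, int(np.prod(FRAME_SHAPE))))  # toy decoder weights

def embed_video(frames: np.ndarray) -> np.ndarray:
    """Hypothetical video encoder: average per-frame embeddings into one vector."""
    flat = frames.reshape(len(frames), -1)
    return (flat @ ENCODER).mean(axis=0)

def interpret_video_prompt(prompt_frames: np.ndarray, horizon: int = 8) -> np.ndarray:
    """Embed the video prompt, map it to latent states, and decode them back
    into frames to inspect how the model interprets the prompt."""
    embedding = embed_video(prompt_frames)
    latents = embedding[None, :] + 0.1 * rng.standard_normal((horizon, LATENT_DIM))
    return np.tanh(latents @ DECODER).reshape((horizon, *FRAME_SHAPE))

# Toy usage: interpret a random 16-frame video prompt.
video_prompt = rng.random((16, *FRAME_SHAPE))
print(interpret_video_prompt(video_prompt).shape)  # (8, 64, 64, 3)
```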