Motivation
- Enhance Alfred’s capabilities to perform long-horizon tasks that are currently not possible.
- Understanding long-horizon tasks and translating them into robot actions is challenging.
- We build on the VLMaps (Visual-Language Maps) implementation, which hacks the LSeg architecture.
- Generate factually grounded navigation commands through a rich understanding of the scene.
Literature Review
Most methods that handle open-vocabulary, long-horizon navigation/manipulation planning fall into one of two categories:
- End-to-end approaches:
- Systems based approaches
- arxiv.org/pdf/1711.07280.pdf: Generate navigation graphs
- VLMaps
- Hacks LSeg to generate a queryable scene representation that can be used by LLM-generated code policies.
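A rough sketch of what "queryable" means here (the function name, shapes, and threshold are our own illustration, not the VLMaps API): each map cell stores an LSeg-style visual-language embedding, and a free-text query is scored against every cell by cosine similarity with the paired text encoder's embedding.

```python
import numpy as np

def query_map(map_features: np.ndarray, text_embedding: np.ndarray,
              threshold: float = 0.5) -> np.ndarray:
    """Boolean mask over map cells that match a text query.

    map_features:   (H, W, D) per-cell embeddings, e.g. LSeg pixel
                    features projected into a top-down map.
    text_embedding: (D,) embedding of the query string from the
                    paired CLIP-style text encoder.
    """
    # Cosine similarity between each cell embedding and the query.
    cells = map_features / np.linalg.norm(map_features, axis=-1, keepdims=True)
    query = text_embedding / np.linalg.norm(text_embedding)
    similarity = cells @ query          # (H, W) similarity scores
    return similarity > threshold       # cells matching the query
```

An LLM-generated code policy can then call something like `query_map` to resolve the object names in its plan to map locations.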
Lacking in Current Methods

Backlinks to LLM
One of the key challenges in generating long-horizon plans with an LLM is that there is no information backlink: the LLM ingests no description of the scene it is planning over. This causes several problems, made concrete in the sketch after this list:
- Adversarial queries:
  - e.g., if the environment contains a knife, butter, and a spoon, but the user queries “get me a bag”, the LLM will generate code such as:
    pick(bag)
  - This is not grounded: the bag does not exist in the environment, so the generated plan cannot execute.
- Instance ambiguity:
  - If multiple instances of an object are present in the scene, querying the scene can return several matches.
  - This leaves it ambiguous which instance the robot should navigate towards.
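A minimal, hypothetical sketch of both failure modes (the `pick` policy and `scene_objects` index below are our own illustration, not an actual VLMaps API):

```python
# Hypothetical object index extracted from the scene representation.
scene_objects = {
    "knife":  [(1.2, 0.4)],
    "butter": [(0.8, 0.3)],
    "spoon":  [(1.5, 0.6), (2.1, 0.9)],  # two instances of the same object
}

def pick(name: str) -> tuple[float, float]:
    """Resolve an object name to a single map location to navigate to."""
    locations = scene_objects.get(name)
    if locations is None:
        # Adversarial query: the LLM emitted pick("bag") for an object
        # that does not exist in the scene, so the plan fails at runtime.
        raise KeyError(f"{name!r} is not present in the scene")
    if len(locations) > 1:
        # Instance ambiguity: several matches, and nothing in the plan
        # says which instance the robot should navigate towards.
        raise ValueError(f"{name!r} has {len(locations)} instances: {locations}")
    return locations[0]

for query in ("bag", "spoon", "knife"):
    try:
        print(query, "->", pick(query))
    except (KeyError, ValueError) as err:
        print(query, "-> failed:", err)
```

Feeding a scene summary (which objects exist, and how many instances of each) back into the LLM is exactly the missing backlink that would let it avoid emitting such calls in the first place.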
Incomplete Representation