Navigation with Large Language Models: Semantic Guesswork as a Heuristic for Planning: Related Work

cover
18 Apr 2024

This is paper is available on arxiv under CC 4.0 DEED license.

Authors:

(1) Dhruv Shah, UC Berkeley and he contributed equally;

(2) Michael Equi, UC Berkeley and he contributed equally;

(3) Blazej Osinski, University of Warsaw;

(4) Fei Xia, Google DeepMind;

(5) Brian Ichter, Google DeepMind;

(6) Sergey Levine, UC Berkeley and Google DeepMind.

Vision-based navigation: Navigation is conventionally approached as a largely geometric problem, where the aim is to map an environment and use that map to find a path to a goal location [1]. Learning-based approaches can exploit patterns in the training environments, particularly by learning vision-based navigation strategies through reinforcement or imitation [2–7]. Our work is also related to PONI [7], which uses a learned potential function to prioritize frontier points to explore; instead, we use a language model to rank these points. Notably, these methods do not benefit from prior semantic knowledge (e.g., from the web), and must rely entirely on patterns discovered from offline or online navigational data. Our aim is specifically to bring semantic knowledge into navigation, to enable robots to more effectively search for a goal in a new environment.

Semantic knowledge-guided navigation: Prior knowledge about the semantics of indoor environments can provide significantly richer guidance. With the advent of effective open vocabulary vision models [8, 9], some works have recently explored incorporating their semantic knowledge into models for navigation and other robotic tasks with the express aim of improving performance at instruction following [10–14]. In general within robotics, such methods have either utilized pretrained vision-language representations [15–17], or used language models directly to make decisions [18–23]. Our aim is somewhat different: while we also focus on language-specified goals, we are primarily concerned with utilizing the semantics in pre-trained language models to help a robot figure out how to actually reach the goal, rather than utilizing the language models to more effectively interpret a language instruction. While language models can output reasonable substeps for temporally extended tasks in some settings [24, 25], there is contradictory evidence about their ability to actually plan [26], and because they are unaware of the observations and layout in a particular environment, their “plans” depend entirely on the context that is provided to them. In contrast to prior work, our approach does not rely on the language model producing a good plan, but merely a heuristic that can bias a dedicated planner to reach a goal more effectively. In this way, we use the language models more to produce suggestions rather than actual plans.

LLM-guided navigation: Some works have sought to combine predictions from language models with either planning or probabilistic inference [14, 27], so as to not rely entirely on forward prediction from the language model to take actions. However, these methods are more aimed at filtering out infeasible decisions, for example by disallowing actions that a robot is incapable of performing, and still focus largely on being able to interpret and process instructions, rather than using the language model as a source of semantic hints. In contrast, by incorporating language model suggestions as heuristics into a heuristic planner, our approach can completely override the language model predictions if they are incorrect, while still making use of them if they point the way to the goal.

Another branch of recent research [28–30] has taken a different approach to ground language models, by making it possible for them to read in image observations directly. While this represents a promising alternative approach to make language models more useful for embodied decision making, we believe it is largely orthogonal and complementary to our work: although vision-language models can produce more grounded inferences about the actions a robot should take, they are still limited only to guessing when placed in unfamiliar environments. Therefore, although we use ungrounded language-only models in our evaluation, we expect that our method could be combined with vision-language models easily, and would provide complementary benefits.