Navigation with Large Language Models: LFG: Scoring Subgoals by Polling LLMs

18 Apr 2024

This paper is available on arXiv under a CC 4.0 DEED license.

Authors:

(1) Dhruv Shah, UC Berkeley (equal contribution);

(2) Michael Equi, UC Berkeley (equal contribution);

(3) Blazej Osinski, University of Warsaw;

(4) Fei Xia, Google DeepMind;

(5) Brian Ichter, Google DeepMind;

(6) Sergey Levine, UC Berkeley and Google DeepMind.

4 LFG: Scoring Subgoals by Polling LLMs

To overcome these challenges, LFG uses a novel approach to extract task-relevant likelihoods from LLMs. Given candidate subgoal images, LFG uses a VLM to obtain a textual subgoal descriptor s_i for each one, which must then be scored with the LLM. LFG polls the LLM by sampling the most likely subgoal n_s times, conditioned on a task-relevant prompt. We then use these samples to empirically estimate the likelihood of each subgoal. To get informative and robust likelihood estimates, we use chain-of-thought (CoT) prompting [32] to improve the quality and interpretability of the scores, and a combination of positive and negative prompts to gather unbiased likelihood estimates. Figure 2 outlines our scoring technique, with the full prompt provided in Appendix B. We now describe the details of our scoring technique.
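The polling step described above can be sketched as a simple frequency count. This is a minimal illustration, not the authors' implementation: `estimate_likelihoods`, `sample_subgoal`, and `num_samples` are hypothetical names, and the sampler is a stand-in for one LLM call that returns the index of the chosen subgoal (0-based here, by assumption).

```python
from collections import Counter
from typing import Callable, List


def estimate_likelihoods(
    sample_subgoal: Callable[[], int],  # hypothetical stand-in for one LLM poll
    num_subgoals: int,
    num_samples: int,
) -> List[float]:
    """Poll the sampler num_samples times and return, for each subgoal
    index, the fraction of polls in which it was chosen."""
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    counts = Counter(sample_subgoal() for _ in range(num_samples))
    return [counts[i] / num_samples for i in range(num_subgoals)]


# Deterministic stub in place of a real LLM: ten pre-recorded choices.
_recorded = iter([0, 2, 0, 0, 1, 0, 2, 0, 0, 0])
likelihoods = estimate_likelihoods(lambda: next(_recorded), num_subgoals=3, num_samples=10)
```

With the stub above, subgoal 0 is chosen in 7 of 10 polls, so it receives the highest estimated likelihood. In practice the sampler would call the LLM with a nonzero temperature so that repeated polls can disagree.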

Structured query: We rely on in-context learning by providing an example of a structured query-response pair to the LLM, and ask it to pick the most likely subgoal that satisfies the query. To sample a subgoal from S using a language model, we prompt it to generate a structured response, ending with “Answer: i”. This structure allows us to always sample a valid subgoal, without having to ground LLM generations in the environment [24].
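To make the structured response concrete, the following sketch extracts the subgoal index from a reply that ends with "Answer: i". The function name `parse_answer`, the 1-based indexing, and the choice to return `None` for malformed or out-of-range replies (so the caller can resample) are assumptions for illustration; the actual prompt and response format are given in Appendix B.

```python
import re
from typing import Optional

# Matches a trailing "Answer: <integer>" at the very end of the response.
ANSWER_PATTERN = re.compile(r"Answer:\s*(\d+)\s*$")


def parse_answer(response: str, num_subgoals: int) -> Optional[int]:
    """Return the 1-based subgoal index from a structured response,
    or None if the response is malformed or the index is out of range."""
    match = ANSWER_PATTERN.search(response.strip())
    if match is None:
        return None
    index = int(match.group(1))
    return index if 1 <= index <= num_subgoals else None


# Example reply with free-form reasoning followed by the structured answer.
reply = "A sink is usually found in a kitchen, so the kitchen door is promising.\nAnswer: 2"
chosen = parse_answer(reply, num_subgoals=3)
```

Because the answer is an index into the candidate set rather than free text, every accepted sample maps directly to one of the candidate subgoals.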

Positives and negatives: We find that using only positive prompts (e.g., “which subgoal is most likely to reach the goal”) leads to uninformative likelihood estimates in cases where the LLM is not confident about any subgoal. To overcome this, we also use negative prompts (e.g., “which subgoal is least likely to be relevant for the goal”), which allow us to score subgoals by eliminating or down-weighting those that are clearly irrelevant. We then use the difference between the positive and negative likelihoods to rank subgoals.
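As a small worked example of the ranking rule, the sketch below subtracts the negative-prompt likelihoods from the positive-prompt likelihoods and sorts subgoals by the result. The function name `rank_subgoals` and the numbers are made up for illustration; only the "positive minus negative" rule comes from the text.

```python
from typing import List, Tuple


def rank_subgoals(
    positive: List[float], negative: List[float]
) -> Tuple[List[float], List[int]]:
    """Score each subgoal as (positive likelihood - negative likelihood)
    and return the scores plus subgoal indices sorted best-first."""
    if len(positive) != len(negative):
        raise ValueError("positive and negative must have the same length")
    scores = [p - n for p, n in zip(positive, negative)]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return scores, order


# Illustrative numbers: the positive poll is nearly flat (the LLM is unsure),
# but the negative poll clearly singles out subgoal 1 as irrelevant.
positive = [0.4, 0.3, 0.3]
negative = [0.1, 0.8, 0.1]
scores, order = rank_subgoals(positive, negative)
```

Here the positive estimates alone barely separate the three subgoals, while the combined scores push subgoal 1 to the bottom of the ranking, which is the effect the negative prompt is meant to provide.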

Figure 3: Overview of LFG for language-guided exploration. Based on the pose and observations, LFG builds an episodic memory (topological or metric), which is used by the heuristic-based exploration policy to rank adjacent clusters, or subgoal frontiers. Navigation to the subgoal frontier is completed by a low-level policy.

Chain-of-thought prompting: A crucial component of getting interpretable and reliable likelihood estimates is to encourage the LLM to justify its choice through chain-of-thought prompting. As demonstrated in prior works, CoT elicits interpretability and reasoning capabilities in LLMs, and while we don’t explicitly use the generated reasoning in our approach (a great direction for future work), we find that CoT improves the quality and consistency of the likelihood estimates. It also helps maintain interpretability, making it easier to understand why the LFG-equipped agent makes certain decisions.

In summary, we score subgoals by sampling the LLM multiple times and empirically estimating the likelihood of each subgoal. We use a combination of positive and negative prompts to get unbiased likelihood estimates, and use chain-of-thought prompting to improve the quality and interpretability of the scores (Figure 2). We will now discuss how these scores can be incorporated into a navigation system as search heuristics.


[1] Most notably, OpenAI’s Chat API for GPT-3.5 and GPT-4, Google’s PaLM API, and Anthropic’s Claude API all do not support logprobs.