Navigation with Large Language Models: System Evaluation

18 Apr 2024

This paper is available on arXiv under the CC 4.0 DEED license.

Authors:

(1) Dhruv Shah, UC Berkeley (equal contribution);

(2) Michael Equi, UC Berkeley (equal contribution);

(3) Blazej Osinski, University of Warsaw;

(4) Fei Xia, Google DeepMind;

(5) Brian Ichter, Google DeepMind;

(6) Sergey Levine, UC Berkeley and Google DeepMind.

6 System Evaluation

We now evaluate the performance of LFG for the task of goal-directed exploration in real-world environments, and benchmark its performance against baselines. We instantiate two systems with LFG: a real-world system that uses a topological map and a learned control policy, and a simulated agent that uses a geometric map and a deterministic control policy. Our experiments show that both systems outperform existing LLM-based exploration algorithms by a wide margin, owing to the high-quality scores incorporated as search heuristics.

Figure 4: Qualitative example of a negative score influencing the agent’s decision. LFG discourages the agent from exploring the bedroom and living room, leading to fast convergence toward the goal, whereas FBE fails. The CoT reasoning given by the LLM is shown in purple, justifying its score.

Figure 5: Qualitative example of LFG in the real world. LFG reasons about floor plans in the environment it is searching. In this apartment experiment, LFG believes that a bathroom is more likely to be found near a bedroom than near a kitchen, and guides the robot towards the bedroom, successfully reaching the goal.

6.1 Benchmarking ObjectNav Performance

We benchmark the performance of LFG for the task of object-goal navigation on the Habitat ObjectNav Challenge [36], where the agent is placed into a simulated environment with photorealistic graphics, and is tasked with finding a query object from one of 10 categories (e.g., “toilet”, “bed”, “couch”). The simulated agent has access to egocentric RGBD observations and accurate pose information. We run 10 evaluation episodes per scene and report two metrics: the average success rate, and success weighted by optimal path length (SPL), the default metrics for the benchmark. Since LFG requires no training, we do not use the training scenes from HM3D.
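The two metrics above can be illustrated with a minimal sketch. It uses the commonly cited definition of SPL, in which each successful episode contributes the ratio of the optimal path length to the larger of the agent's path length and the optimal length. The `Episode` container, its field names, and the example numbers are hypothetical and are not taken from the benchmark code or from the paper's results.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    """Hypothetical per-episode record (not the benchmark's own data type)."""
    success: bool          # did the agent reach the query object?
    shortest_path: float   # length of the optimal path, assumed > 0
    agent_path: float      # length of the path the agent actually took


def success_rate(episodes: Sequence[Episode]) -> float:
    """Fraction of episodes in which the goal was reached."""
    return sum(e.success for e in episodes) / len(episodes)


def spl(episodes: Sequence[Episode]) -> float:
    """Success weighted by (normalized inverse) path length, averaged over episodes."""
    total = 0.0
    for e in episodes:
        if e.success:
            # Failed episodes contribute zero; detours shrink the contribution.
            total += e.shortest_path / max(e.agent_path, e.shortest_path)
    return total / len(episodes)


# Illustrative numbers only.
episodes = [
    Episode(success=True, shortest_path=5.0, agent_path=5.0),
    Episode(success=True, shortest_path=4.0, agent_path=8.0),
    Episode(success=False, shortest_path=6.0, agent_path=20.0),
    Episode(success=True, shortest_path=3.0, agent_path=4.0),
]
print(success_rate(episodes), spl(episodes))
```

Because SPL penalizes detours, an agent can have a high success rate and still a low SPL if it wanders before finding the object.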

We compare to three classes of published baselines: (i) learning-based baselines that learn navigation behavior from demonstrations or online experience in the training scenes [37] on up to 2.5B frames of experience, (ii) search-based baselines [33, 38], and (iii) LLM-based baselines that do not use the training data directly, but leverage the semantic knowledge inside foundation models to guide embodied tasks [18, 39].

Evaluating LFG on the HM3D benchmark, we find that it significantly outperforms search and LLM-based baselines (Table 1). Greedy LLM struggles on the task due to several LLM planning failures, causing the episodes to fail. LFG significantly outperforms the vanilla FBE baseline by leveraging semantic priors from LLMs to score subgoals intelligently. Comparing to learning-based baselines, we find that LFG outperforms most of them and closely matches the state-of-the-art on the task, demonstrating the competence of our polling and heuristic approach. Figure 4 shows an example of the LFG agent successfully reaching the goal by using chain-of-thought and negative prompting.

L3MVN [39], which uses a combination of LLMs and search, performs slightly better than FBE, but is unable to fully leverage the semantics in LLMs. While being similar to LFG, it suffers from two key limitations: (i) it uses a small language model (GPT-2), which likely does not contain strong semantic priors for the agent to leverage, and (ii) it uses a simple likelihood-based scoring scheme, which we show below is not very effective. Another closely related work, LGX [18], uses a variant of greedy LLM scoring, and hence fails to perform reliably on the benchmark.

Probing deeper into the strong performance of LFG, we ablated various components of our scoring pipeline and studied the change in performance. Note that LGX (Greedy) and L3MVN (No CoT, Logprobs) can be seen as ablations of LFG. Table 2 shows that modifying either the prompting or the scoring mechanism used by LFG leads to a large drop in performance. Most notably, scoring via polling (+7.8%) and CoT (+6.6%) are both essential to the strong performance of LFG. Furthermore, we find that using only positive prompts also hurts performance (−4.7%). Popular approaches for using LLMs for planning are significantly outperformed by LFG: Greedy (−14.5%) and Logprobs (−8.5%).
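To make the difference between polling and greedy scoring concrete, the following is a minimal sketch of how repeated LLM picks might be turned into per-subgoal scores. It assumes the LLM is queried several times with a positive prompt (which subgoal is most likely to lead to the goal) and a negative prompt (which is least likely), and that each reply has already been parsed into a subgoal name. The function names, the simple "positive minus negative" combination, and the example votes are all assumptions for illustration, not the paper's implementation.

```python
from collections import Counter
from typing import Dict, List


def poll_scores(votes: List[str], subgoals: List[str]) -> Dict[str, float]:
    """Turn repeated LLM picks into an empirical vote fraction per subgoal."""
    counts = Counter(v for v in votes if v in subgoals)  # drop unparseable replies
    n = max(len(votes), 1)                               # avoid division by zero
    return {s: counts[s] / n for s in subgoals}


def combined_scores(
    pos_votes: List[str], neg_votes: List[str], subgoals: List[str]
) -> Dict[str, float]:
    """Assumed combination: positive vote share minus negative vote share."""
    pos = poll_scores(pos_votes, subgoals)
    neg = poll_scores(neg_votes, subgoals)
    return {s: pos[s] - neg[s] for s in subgoals}


# Hypothetical replies from five samples per prompt, for the goal "toilet".
subgoals = ["bedroom", "kitchen", "hallway"]
pos_votes = ["bedroom", "bedroom", "hallway", "bedroom", "hallway"]
neg_votes = ["kitchen", "kitchen", "kitchen", "hallway", "kitchen"]

scores = combined_scores(pos_votes, neg_votes, subgoals)
best = max(scores, key=scores.get)
print(scores, best)
```

A greedy scheme would correspond to taking a single reply and following it directly, so one bad sample decides the episode; polling spreads that risk across samples, and the negative votes let the planner actively avoid unpromising subgoals.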

Setup: For these experiments, we mimic the semantic mapping pipeline of the best-performing baseline on the benchmark [33, 38], and integrate LFG with the geometric map. The simulated agent builds a 2D semantic map of its environment, where grid cells represent both occupancy and semantic labels corresponding to objects detected by the agent. Prior work has shown that state-of-the-art vision models, such as DETIC, work poorly in simulation due to rendering artifacts [33]; hence, we use ground-truth semantic information for all simulated baselines to analyze navigation performance under perfect perception.
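As a rough illustration of the kind of map described above, the sketch below stores an occupancy value and an optional semantic label per grid cell, and extracts frontiers using the standard frontier-based exploration notion (free cells bordering unknown space). The class name, cell encoding, and the `labels_near` helper are hypothetical; the actual pipeline in [33, 38] may differ substantially.

```python
from typing import List, Optional, Set, Tuple

UNKNOWN, FREE, OCCUPIED = -1, 0, 1  # assumed cell encoding


class SemanticGrid:
    """Hypothetical 2D map: each cell has an occupancy value and an optional label."""

    def __init__(self, height: int, width: int) -> None:
        self.occupancy = [[UNKNOWN] * width for _ in range(height)]
        self.labels: List[List[Optional[str]]] = [[None] * width for _ in range(height)]

    def update(self, row: int, col: int, occupancy: int, label: Optional[str] = None) -> None:
        self.occupancy[row][col] = occupancy
        if label is not None:
            self.labels[row][col] = label

    def frontiers(self) -> List[Tuple[int, int]]:
        """Free cells with at least one 4-connected unknown neighbour."""
        h, w = len(self.occupancy), len(self.occupancy[0])
        result = []
        for r in range(h):
            for c in range(w):
                if self.occupancy[r][c] != FREE:
                    continue
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nr, nc = r + dr, c + dc
                    if 0 <= nr < h and 0 <= nc < w and self.occupancy[nr][nc] == UNKNOWN:
                        result.append((r, c))
                        break
        return result

    def labels_near(self, row: int, col: int, radius: int = 1) -> Set[str]:
        """Semantic labels within a square window, e.g. to describe a frontier to an LLM."""
        h, w = len(self.labels), len(self.labels[0])
        found: Set[str] = set()
        for r in range(max(0, row - radius), min(h, row + radius + 1)):
            for c in range(max(0, col - radius), min(w, col + radius + 1)):
                label = self.labels[r][c]
                if label is not None:
                    found.add(label)
        return found


grid = SemanticGrid(3, 3)
grid.update(1, 0, FREE)
grid.update(1, 1, FREE)
grid.update(1, 2, OCCUPIED, label="couch")
print(grid.frontiers(), grid.labels_near(1, 1))
```

Pairing each frontier with nearby object labels is one plausible way to give a language model enough context to score it.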

6.2 Real-world Exploration with LFG

To show the versatility of the LFG scoring framework, we further integrated it with a heuristic-based exploration framework that uses topological graphs for episodic memory [34]. We compare against two published baselines: a language-agnostic FBE baseline [40], and an LLM-based baseline that uses the language model to greedily pick the frontier [18].

We evaluate this system in two challenging real-world environments: a cluttered cafeteria and an apartment building (shown in Figures 3 and 5). In each environment, the robot is tasked with reaching an object described by a textual string (e.g., “kitchen sink” or “oven”), and we measure the success rate and efficiency of reaching the goal. Episodes that take longer than 30 minutes are marked as failures. While we only tested our system with goal strings corresponding to the 20,000 classes supported by our object detector [35], this can be extended to more flexible goal specifications with the rapid progress in vision-language models.

We find that the complexity of real-world environments causes the language-agnostic FBE baseline to time out, i.e., the robot is unable to reach the goal by randomly exploring the environment. Both LLM baselines are able to leverage the stored semantic knowledge to guide the exploration in novel environments, but LFG achieves 16% better performance. Figure 5 shows an example rollout in a real apartment, where the robot uses LFG to reach the toilet successfully.

Setup: We instantiate LFG in the real world using a previously published topological navigation framework [34] that builds a topological map of the environment, where nodes correspond to the robot’s visual observations and edges correspond to paths traversed in the environment. This system relies on omnidirectional RGB observations and does not attempt to make a dense geometric map of the environment. To obtain “semantic frontiers” from the omnidirectional camera, we generate n_v = 4 views and run an off-the-shelf object detector [35] to generate rich semantic labels describing objects in these views. The robot maintains a topological graph of these views and semantic labels, and picks the frontier view with the highest score (Algorithm 2, Line 21) according to LFG. The robot then uses a Transformer-based policy [6, 41] to reach this subgoal. For more implementation details, see Appendix A.3.
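The frontier-selection step in this setup can be sketched as follows. This is not a reproduction of Algorithm 2: the `ViewNode` and `TopoGraph` types, the `explored` flag used to mark frontier views, and the example scores are hypothetical, and the scores are assumed to come from some external LLM-based scoring routine.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ViewNode:
    """Hypothetical graph node: one camera view plus its detected object labels."""
    node_id: int
    labels: List[str]
    explored: bool = False  # unexplored views are treated as frontiers


@dataclass
class TopoGraph:
    nodes: Dict[int, ViewNode] = field(default_factory=dict)
    edges: Dict[int, Set[int]] = field(default_factory=dict)

    def add_node(self, node: ViewNode) -> None:
        self.nodes[node.node_id] = node
        self.edges.setdefault(node.node_id, set())

    def add_edge(self, a: int, b: int) -> None:
        # Edges are undirected: a traversed path can be followed both ways.
        self.edges[a].add(b)
        self.edges[b].add(a)

    def frontier_nodes(self) -> List[ViewNode]:
        return [n for n in self.nodes.values() if not n.explored]


def pick_subgoal(graph: TopoGraph, scores: Dict[int, float]) -> int:
    """Return the id of the frontier view with the highest score (unscored views get 0)."""
    frontier = graph.frontier_nodes()
    if not frontier:
        raise ValueError("no frontier views left to explore")
    return max(frontier, key=lambda n: scores.get(n.node_id, 0.0)).node_id


graph = TopoGraph()
graph.add_node(ViewNode(0, ["sofa"], explored=True))
graph.add_node(ViewNode(1, ["bed", "nightstand"]))
graph.add_node(ViewNode(2, ["stove", "fridge"]))
graph.add_edge(0, 1)
graph.add_edge(0, 2)

# Illustrative scores for the goal "toilet"; in practice these would come from the LLM.
subgoal = pick_subgoal(graph, {1: 0.6, 2: -0.8})
print(subgoal)
```

The chosen node id would then be handed to the low-level policy as the next subgoal, and the node marked as explored once reached.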

Table 1: LFG outperforms all LLM-based baselines on the HM3D ObjectNav benchmark, and can achieve close to SOTA performance without any pre-training.

Table 2: We find that CoT prompting with positives and negatives, combined with polling, is essential to achieve the best performance.