MARPLE: A Benchmark for Long-Horizon Inference

Department of Computer Science, Department of Psychology



Abstract

Reconstructing past events requires reasoning across long time horizons, drawing upon diverse evidence such as visual, language, and auditory cues, as well as prior knowledge about the world and human behavior. We introduce MARPLE, a benchmark for evaluating long-horizon inference capabilities using multi-modal evidence. Our benchmark features agents interacting in simulated households, supporting visual, language, and auditory stimuli, as well as procedurally generated environments and agent behaviors. Inspired by classic whodunit stories, we ask AI models and human participants to infer which agent caused a change in the environment based on a step-by-step replay of what actually happened. The goal is to correctly identify the culprit as early as possible. Our findings show that human participants outperform both traditional Monte Carlo simulation methods and an LLM baseline (GPT-4) on this task. Compared to humans, traditional inference models demonstrate lower robustness and performance, while GPT-4 exhibits difficulties in comprehending environmental changes. We further analyze factors that influence inference performance and ablate different modes of evidence, finding that all modes are valuable in improving performance. Overall, our experiments demonstrate that the long-horizon, multimodal inference tasks in our benchmark present a challenge to current models.


MARPLE Overview

MARPLE (in reference to Agatha Christie's Miss Marple) is a benchmark for long-horizon inference based on multimodal evidence. The main goal of MARPLE is to test a model's ability to answer “whodunit”-style questions in daily household scenarios, such as “who turned on the laundry?” The inference problem requires choosing the correct agent from two potential suspects, given knowledge about their prior behaviors and the state of the environment.

Inference Scenario Setup. Two agents, A and B, each perform a mission, such as “do laundry” and “change clothes.” To complete its mission, each agent must interact with the environment, causing changes in the world and leaving evidence of its activity. A “whodunit” question is constructed by selecting a state that is unique to one agent’s trajectory. In this example, the state “laundry is on” is unique to agent A’s trajectory, so we pose the question: “Which agent turned on the laundry?”
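As a concrete sketch of this setup (the data layout and helper below are hypothetical illustrations, not the benchmark’s actual format), a query can be constructed by finding a state predicate that appears in one agent’s trajectory but not the other’s:

```python
# Hypothetical sketch: trajectories as ordered lists of symbolic state predicates.
# These names are illustrative and do not reflect MARPLE's actual data format.

def construct_whodunit_query(traj_a, traj_b):
    """Pick a state predicate unique to one agent's trajectory as the query state."""
    states_a = {pred for state in traj_a for pred in state}
    states_b = {pred for state in traj_b for pred in state}

    unique_to_a = states_a - states_b   # evidence only agent A produces
    unique_to_b = states_b - states_a
    if unique_to_a:
        return {"query_state": sorted(unique_to_a)[0], "culprit": "A"}
    if unique_to_b:
        return {"query_state": sorted(unique_to_b)[0], "culprit": "B"}
    return None  # the two missions leave no discriminative evidence


# Example: agent A does laundry, agent B changes clothes.
traj_a = [{"picked_up(clothes)"}, {"laundry_on"}]
traj_b = [{"picked_up(clothes)"}, {"wearing(new_outfit)"}]
print(construct_whodunit_query(traj_a, traj_b))
# {'query_state': 'laundry_on', 'culprit': 'A'}
```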

To answer “whodunit” questions, models must leverage evidence in the form of multimodal observations from each agent’s activity history.

Inference Process

Evaluating Performance. Inference ability is measured by the probability of correctly choosing the agent responsible for the query state. We are interested in how much evidence is needed to make the correct inference: stronger models require less evidence and achieve high inference accuracy earlier.
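As a minimal sketch of this protocol (the array layout and function names are our own assumptions, not the benchmark’s evaluation code), accuracy can be tracked as a function of the fraction of each trajectory observed, and a model’s strength summarized by how little evidence it needs to reach a target accuracy:

```python
import numpy as np

def accuracy_curve(p_correct):
    """Inference accuracy as a function of the fraction of evidence observed.

    p_correct: array of shape (num_trajectories, num_steps) holding a model's
    probability assigned to the true culprit after observing each step
    (a hypothetical layout, not MARPLE's actual format).
    """
    correct = np.asarray(p_correct) > 0.5      # choosing the true culprit
    return correct.mean(axis=0)                # accuracy at each evidence level

def evidence_needed(curve, target=0.8):
    """Smallest fraction of the trajectory at which accuracy first reaches `target`."""
    curve = np.asarray(curve)
    hits = np.flatnonzero(curve >= target)
    return None if hits.size == 0 else (hits[0] + 1) / len(curve)
```

Under this view, a stronger model has an accuracy curve that rises earlier, i.e. a smaller `evidence_needed` value.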

Benchmark Overview

The MARPLE Benchmark features 10 diverse, long-horizon missions, which are paired to create 5 challenging inference scenarios that balance the complexity and diversity introduced by pairing missions. Each mission is accompanied by two train datasets, each containing 5000 agent trajectories (one for evaluating in-distribution performance and the other for out-of-distribution performance), and a test dataset with 500 diverse agent trajectories.

Household Simulator

To support our benchmark, we introduce the MARPLE Household Simulator, designed to support complex scenarios and generate diverse data with the following key components:
Multimodal Environment: fast, procedural generation of environments with visual, language, and auditory stimuli.
Hierarchical Agent Planner: procedural generation of diverse agent behaviors.
Human User Interface: an intuitive UI to support cognitive science experiments with human participants.

Simulator Backend

Inference Methods

Mental Simulation with Learned Agent Models. We combine Monte Carlo Tree Search (MCTS) with learned agent policy models for mental simulation. Agent policies are learned through imitation learning on past behaviors, and they are used during inference to predict actions for Monte Carlo rollouts. Different variations leverage visual, audio, and/or language evidence.
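The sketch below illustrates the core rollout idea under simplifying assumptions: it replaces the full MCTS procedure with plain Monte Carlo rollouts, and the `policy.sample_action` and `simulator.step` interfaces are hypothetical stand-ins rather than the benchmark’s actual API:

```python
def prob_reaches_query(policy, simulator, start_state, query_state,
                       n_rollouts=100, horizon=50):
    """Estimate how likely an agent is to eventually produce the query state
    by rolling out its learned policy from the last observed state."""
    hits = 0
    for _ in range(n_rollouts):
        state = start_state
        for _ in range(horizon):
            action = policy.sample_action(state)   # learned imitation policy (hypothetical interface)
            state = simulator.step(state, action)  # simulated world transition (hypothetical interface)
            if query_state in state:               # state assumed to be a set of predicates
                hits += 1
                break
    return hits / n_rollouts


def whodunit_posterior(p_a, p_b):
    """Turn the two per-agent rollout estimates into P(agent A caused the query state)."""
    return 0.5 if p_a + p_b == 0 else p_a / (p_a + p_b)
```

In this sketch, the estimates would be recomputed each time a new step of evidence is revealed, and the agent with the higher probability is chosen as the culprit.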

LLM. We ask GPT-4 to predict which agent is more likely to have caused the query state, given visual observations of both agents at two consecutive timesteps. GPT-4 must reason about the changes between the consecutive states and consider how each agent might reach the query state.
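A rough sketch of how such a query could be posed, assuming the consecutive observations have first been converted into text descriptions (the prompt wording and helper name are hypothetical, not the exact prompt used in our experiments):

```python
def build_whodunit_prompt(obs_a_prev, obs_a_curr, obs_b_prev, obs_b_curr, query_state):
    """Assemble a text prompt asking GPT-4 which agent is more likely to cause the query state."""
    return (
        "Two agents, A and B, are acting in a simulated household.\n"
        f"Agent A's previous observation: {obs_a_prev}\n"
        f"Agent A's current observation: {obs_a_curr}\n"
        f"Agent B's previous observation: {obs_b_prev}\n"
        f"Agent B's current observation: {obs_b_curr}\n"
        f"Which agent is more likely to eventually cause the state '{query_state}'? "
        "Reason about what changed between the observations, then answer 'A' or 'B'."
    )
```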

Human Baseline. Human participants answer the inference question, given side-by-side visual observations of agent trajectories, presented one step at a time. This allows participants to build an incremental understanding of agent trajectories and compare behaviors within the scenario.

Benchmarking Experiments

We run experiments on all 5 inference scenarios and find that MARPLE is very challenging for all baselines. We focus our evaluation on how early the methods make the correct inference, rather than on convergence itself, and we observe that:
Mental Simulation Models: generally achieve higher accuracy and consistency than GPT-4, demonstrating the benefit of explicitly performing step-by-step mental simulations.
GPT-4: performs competitively but sometimes fails to converge due to its bias toward changes in the agents' states rather than changes in the environment.
Human Participants: provide a strong upper bound on performance. They outperform all models and achieve higher accuracy with less evidence, even without significant training.

Inference Accuracy

Performance of each baseline across scenarios. Inference scenarios are presented in order of increasing difficulty from left to right, top to bottom. Error bands correspond to 95% confidence intervals across tested trajectories.

Generalization Capabilities of Mental Simulation. Multimodal observations improve the mental simulation models’ in-distribution performance, but the models struggle to generalize to novel environments. The performance gap between humans and the best mental simulation method increases out-of-distribution, from humans requiring 10% less evidence in-distribution to 33% less, highlighting significant room for improvement in building robust and generalizable inference models.

Generalization Accuracy

Conclusion

We introduced MARPLE, a novel benchmark for evaluating long-horizon, multimodal inference capabilities. We find that current AI models, including Monte Carlo tree search and LLM methods, still fall short of humans in leveraging multimodal stimuli and performing long-horizon inference. We hope that MARPLE facilitates further AI and cognitive science research to bridge the gap between artificial and human cognitive abilities in complex, real-world inference scenarios.

Acknowledgements

This work was in part supported by a grant from the Stanford Institute for Human-Centered Artificial Intelligence (HAI), NSF CCRI #2120095, and ONR MURI N00014-22-1-2740.