
Pinpointing the Culprit: A Guide to Automated Failure Attribution in LLM Multi-Agent Systems

Last updated: 2026-05-09 07:55:23 · Science & Space

Overview

Multi-agent systems powered by large language models (LLMs) are increasingly used to tackle complex tasks through collaborative workflows. Yet when a multi-agent system fails, as it often does, developers face the daunting challenge of identifying which agent caused the failure and at what point in the process. Manually inspecting lengthy interaction logs is like searching for a needle in a haystack: time-consuming and error-prone. To address this, researchers from Penn State University and Duke University, in collaboration with Google DeepMind, the University of Washington, Meta, Nanyang Technological University, and Oregon State University, have formally introduced the problem of Automated Failure Attribution. They created the first benchmark dataset for the task, named Who&When, and developed several automated attribution methods. Their work, accepted as a Spotlight presentation at ICML 2025, aims to improve the reliability and debuggability of LLM multi-agent systems. This tutorial walks through the core concepts, prerequisites, and practical steps to understand and apply automated failure attribution, with code examples and best practices.

Source: syncedreview.com

Prerequisites

Before diving into automated failure attribution, ensure you have a solid foundation in the following areas:

  • Understanding of LLM Multi-Agent Systems: Familiarity with how multiple LLM agents collaborate (e.g., via message passing, shared memory, or tool use) is essential. Concepts like agent roles, task decomposition, and inter-agent communication are fundamental.
  • Basic Machine Learning Knowledge: Knowledge of classification, evaluation metrics (precision, recall, F1-score), and dataset construction will help you appreciate the attribution methods.
  • Python Programming: The provided open-source code is in Python. You should be comfortable with data structures, libraries like PyTorch or Transformers, and basic scripting.
  • Access to the Who&When Dataset: Download the dataset from the Hugging Face repository; a download snippet follows this list. The dataset contains logs of multi-agent interactions along with ground-truth labels of which agent failed and at what step.
  • Hardware Requirements: A machine with at least 16GB RAM and a GPU (recommended) for running LLM-based attribution models efficiently.
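
If you prefer to script the download, the huggingface_hub client can fetch the whole dataset in one call. A minimal sketch, assuming the repository ID below (verify it against the project README):

from huggingface_hub import snapshot_download

# Assumed repo ID -- check the project README for the authoritative location
snapshot_download(
    repo_id="Kevin355/Who_and_When",
    repo_type="dataset",
    local_dir="data/raw",
)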

Step-by-Step Instructions

This section walks you through the process of performing automated failure attribution using the Who&When dataset and the methods described in the paper. The steps are organized under relevant subsections.

1. Setting Up the Environment

Clone the official repository from GitHub:

git clone https://github.com/mingyin1/Agents_Failure_Attribution.git
cd Agents_Failure_Attribution

Create a Python virtual environment (Python 3.8+ recommended) and install dependencies:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt

Download the Who&When dataset and place it in the data/ directory. The dataset is organized into folds for cross-validation.

2. Understanding the Dataset Structure

The Who&When dataset contains JSON files representing multi-agent task episodes. Each episode includes:

  • agents: List of agent IDs (e.g., Agent_1, Agent_2).
  • steps: Chronological sequence of actions and messages.
  • failure_label: Ground truth indicating whether the task failed.
  • responsible_agent: ID of the agent that caused the failure (if any).
  • failure_step: Index of the step where the failure occurred (if any).

Example snippet:

{
  "episode_id": "ep_001",
  "agents": ["Agent_1", "Agent_2", "Agent_3"],
  "steps": [
    {"step": 0, "agent": "Agent_1", "action": "propose_plan", "content": "..."},
    ...
  ],
  "failure_label": 1,
  "responsible_agent": "Agent_2",
  "failure_step": 4
}
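
Loading these episodes in Python is straightforward. A small sketch, assuming one JSON episode per file under data/raw/ (the actual layout of the release may differ):

import json
from pathlib import Path

# Read every episode file into a list of dicts
episodes = [
    json.loads(path.read_text())
    for path in sorted(Path("data/raw").glob("*.json"))
]
print(f"Loaded {len(episodes)} episodes")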

3. Preprocessing the Data

Before feeding data into attribution models, you need to convert raw logs into a structured format. The repository includes a preprocessing script. Run:

python preprocess.py --data_path data/raw --output_path data/processed

This script extracts relevant features (e.g., agent utterances, step indices) and splits data into training/validation/test sets. It also generates embeddings using a pre-trained LLM if required by the attribution method.
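
To make the tokenization input concrete, here is a simplified illustration of the kind of flattening such a script performs; the actual preprocess.py may extract different features:

def flatten_episode(episode):
    """Concatenate an episode's steps into one string with agent markers."""
    return " ".join(
        f"[{step['agent']}] ({step['action']}) {step['content']}"
        for step in episode['steps']
    )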

4. Implementing a Baseline Attribution Method

The paper proposes several baseline approaches. Here we demonstrate a simple pattern-based method: Last Agent to Speak. The assumption is that the agent who sent the last message before failure is likely responsible.

def last_agent_to_speak(episode):
    """Baseline: blame whichever agent sent the last message of a failed episode."""
    steps = episode['steps']
    # Only failed episodes carry attribution labels; skip successful or empty ones
    if episode['failure_label'] != 1 or not steps:
        return None, None
    last_step = steps[-1]
    # Predict both "who" (the final speaker) and "when" (the final step index)
    return last_step['agent'], last_step['step']

Evaluate this method on the test set, measuring accuracy separately for the who (responsible agent) and when (failure step) predictions.
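
A minimal evaluation loop for any (agent, step) predictor, assuming the field names shown in the dataset snippet above:

def evaluate_attribution(episodes, predict_fn):
    """Return (who_accuracy, when_accuracy) over failed episodes."""
    who_hits = when_hits = total = 0
    for ep in episodes:
        if ep['failure_label'] != 1:
            continue  # only failed episodes carry attribution labels
        total += 1
        agent, step = predict_fn(ep)
        who_hits += int(agent == ep['responsible_agent'])
        when_hits += int(step == ep['failure_step'])
    if total == 0:
        return 0.0, 0.0
    return who_hits / total, when_hits / total

For example, evaluate_attribution(test_episodes, last_agent_to_speak) returns both accuracies for the baseline above.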

5. Advanced Attribution with LLM-based Classifiers

The paper’s main contributions are LLM-based methods that contextualize the entire episode. One approach fine-tunes a pre-trained language model (e.g., RoBERTa or T5) to predict both the responsible agent and the failure step simultaneously. The training code is provided in train_llm.py.

To fine-tune a model:

python train_llm.py --model_name roberta-base --num_epochs 10 --batch_size 16 --learning_rate 2e-5

This script:

  1. Loads episodes from the processed dataset.
  2. Tokenizes the concatenated step descriptions (with agent markers).
  3. Trains a multi-task classification model with two output heads, one for the agent ID and one for the step index (a sketch follows this list).
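
A compressed sketch of what such a two-head classifier can look like in PyTorch. This is an assumed architecture for illustration, not necessarily the repository's implementation; num_agents and max_steps are hypothetical caps:

import torch.nn as nn
from transformers import AutoModel

class AttributionModel(nn.Module):
    def __init__(self, model_name="roberta-base", num_agents=8, max_steps=64):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.who_head = nn.Linear(hidden, num_agents)   # responsible agent
        self.when_head = nn.Linear(hidden, max_steps)   # failure step index

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]  # first-token representation
        return self.who_head(pooled), self.when_head(pooled)

Training then minimizes the sum of two cross-entropy losses, one per head.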

After training, evaluate:

python evaluate.py --checkpoint checkpoints/best_model.ckpt --test_path data/processed/test.json

The output will report metrics like accuracy, precision, recall, and F1 for both attribution tasks.

6. Interpreting Attribution Results

Once you have predictions, you can analyze failure patterns. The repository also includes a visualization tool. Run:

python visualize.py --results results.csv

This generates a heatmap showing which agents are most frequently implicated at which steps. Use this to identify systemic issues, such as a particular agent consistently failing during information handoffs.
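
To build a similar view yourself, a rough equivalent with pandas and matplotlib follows; the predicted_agent and predicted_step column names are assumptions about the layout of results.csv:

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("results.csv")
# Count how often each agent is implicated at each step
counts = pd.crosstab(df["predicted_agent"], df["predicted_step"])

plt.imshow(counts.values, aspect="auto", cmap="viridis")
plt.xticks(range(len(counts.columns)), counts.columns)
plt.yticks(range(len(counts.index)), counts.index)
plt.xlabel("Predicted failure step")
plt.ylabel("Implicated agent")
plt.colorbar(label="Count")
plt.tight_layout()
plt.show()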

Common Mistakes

Here are pitfalls to avoid when performing automated failure attribution:

  • Ignoring Temporal Dependence: Failures often cascade. Attributing blame to the last agent before a crash may be misleading if earlier agents introduced errors. Ensure your method considers the full sequence.
  • Overfitting to Dataset Bias: The Who&When dataset may have imbalances in agent roles or failure patterns. Use cross-validation and stratified sampling to avoid biased attribution.
  • Incorrect Preprocessing: Tokenization that truncates important contextual information (e.g., dropping the first few steps) can degrade performance. Always verify that the entire episode fits within the model’s token limit or use chunking strategies.
  • Treating Agent IDs as Categorical without Semantics: In many systems, agent names encode their role (e.g., “summarizer”). Incorporate role information as additional features to improve attribution.
  • Neglecting Evaluation Metrics: Reporting only accuracy can be deceptive. When failures are rare (e.g., only 10% of episodes), a model that always predicts “no failure” achieves 90% accuracy but is useless. Use precision, recall, and F1-score per class, as in the snippet below.
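
scikit-learn's classification_report handles the per-class bookkeeping in one call; the toy labels below are purely illustrative:

from sklearn.metrics import classification_report

y_true = ["Agent_1", "Agent_2", "Agent_2", "Agent_3"]  # ground-truth culprits
y_pred = ["Agent_1", "Agent_2", "Agent_3", "Agent_3"]  # model predictions
print(classification_report(y_true, y_pred, zero_division=0))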

Summary

Automated failure attribution in LLM multi-agent systems is a critical step toward building reliable and debuggable collaborative AI. This tutorial introduced the concept, prerequisites, and a hands-on guide using the Who&When benchmark created by researchers from Penn State and Duke. By preprocessing interaction logs, implementing baseline and LLM-based attribution methods, and avoiding common pitfalls, you can systematically identify which agent caused a failure and at which step. The open-source code and dataset empower you to apply these techniques to your own multi-agent systems, accelerating debugging and optimization. As multi-agent systems become more prevalent, automated attribution will be an essential tool in every developer’s toolkit.