Official code for "ToM-SSI: Evaluating Theory of Mind in Situated Social Interactions" published at EMNLP'25

ToM-SSI: Evaluating Theory of Mind in Situated Social Interactions

Matteo Bortoletto, Constantin Ruhdorfer, Andreas Bulling

EMNLP 2025, Suzhou, China

[Paper]

Structure of this repository

├── data/                  # Textual data
├── renders/               # Images
├── src/                   # Source code
├── environment.yml        # Dependencies 
└── README.md              # Instructions

Running the code

Evaluating proprietary models requires an OpenRouter account. Once you have one, add your OpenRouter API key to your .bashrc:

export OPENROUTER_API_KEY="PASTE_YOUR_KEY_HERE"
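
Before launching an OpenRouter evaluation, it can help to confirm that the key is actually visible to the current shell. The following is a minimal sketch; the function name check_openrouter_key is a hypothetical helper and is not part of this repository.

```shell
# Hypothetical helper (not part of the repository): report whether
# OPENROUTER_API_KEY is set and non-empty in the current shell.
check_openrouter_key() {
    if [ -z "${OPENROUTER_API_KEY:-}" ]; then
        echo "missing"
        return 1
    fi
    echo "set"
}
```

If the check prints "missing" right after you edited .bashrc, open a new terminal or run `. ~/.bashrc` so that the export takes effect in your session.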

To evaluate an open-source language model:

CUDA_VISIBLE_DEVICES=0,1 nice -n 5 python evaluate_llm.py --data_path $data_path --model_name $model

To evaluate a proprietary language model:

nice -n 5 python evaluate_llm_openrouter.py --data_path $data_path --model_name $model

To evaluate an open-source vision-language model:

CUDA_VISIBLE_DEVICES=0,1 nice -n 5 python evaluate_vlm.py --text_data_path $text_data_path --image_data_path $image_data_path --model_name $model

To evaluate a proprietary vision-language model:

nice -n 5 python evaluate_vlm_openrouter.py --text_data_path $text_data_path --image_data_path $image_data_path --model_name $model

Add --save_logs to save logs and --verbose to print the prompt and the generated text. Logs are saved in logs/. The final scores are also written to a CSV file (results.csv or results_vision.csv), which serves only as a sanity check: these files can be incomplete, and the results are recomputed later by other scripts during the analyses.
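
As an illustration of how the optional flags combine with the text-only evaluation command, the sketch below assembles and prints the full command line without running it. The values of data_path and model are placeholders chosen for this example, and build_eval_cmd is a hypothetical helper; neither comes from the repository.

```shell
# Placeholder values: assumptions for illustration only.
# Replace them with your own data path and model name.
data_path="path/to/data"
model="your-org/your-model"

# Hypothetical helper: print the evaluation command with both optional flags.
# $1: data path, $2: model name
build_eval_cmd() {
    printf 'python evaluate_llm.py --data_path %s --model_name %s --save_logs --verbose\n' "$1" "$2"
}

# Print the command so it can be inspected before running it.
build_eval_cmd "$data_path" "$model"
```

Printing the command first is a cheap way to catch an unset variable before committing GPU time; once it looks right, run the printed line as shown in the commands above.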

Citation

@inproceedings{bortoletto-etal-2025-tom,
    title = "{T}o{M}-{SSI}: Evaluating Theory of Mind in Situated Social Interactions",
    author = "Bortoletto, Matteo  and
      Ruhdorfer, Constantin  and
      Bulling, Andreas",
    editor = "Christodoulopoulos, Christos  and
      Chakraborty, Tanmoy  and
      Rose, Carolyn  and
      Peng, Violet",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-main.1642/",
    doi = "10.18653/v1/2025.emnlp-main.1642",
    pages = "32264--32289",
    ISBN = "979-8-89176-332-6",
    abstract = "Most existing Theory of Mind (ToM) benchmarks for foundation models rely on variations of the Sally-Anne test, offering only a very limited perspective on ToM and neglecting the complexity of human social interactions. To address this gap, we propose ToM-SSI: a new benchmark specifically designed to test ToM capabilities in environments rich with social interactions and spatial dynamics. While current ToM benchmarks are limited to text-only or dyadic interactions, ToM-SSI is multimodal and includes group interactions of up to four agents that communicate and move in situated environments. This unique design allows us to study, for the first time, mixed cooperative-obstructive settings and reasoning about multiple agents' mental state in parallel, thus capturing a wider range of social cognition than existing benchmarks. Our evaluations reveal that the current models' performance is still severely limited, especially in these new tasks, highlighting critical gaps for future research."
}