MST-MIXER : Multi-Modal Video Dialog State Tracking in the Wild
**[Adnen Abdessaied][16], [Lei Shi][17], [Andreas Bulling][18]**
**ECCV 2024, Milan, Italy**
**[[Paper][19]]**
---------------------------
# Citation
If you find our code useful or use it in your own projects, please cite our paper:
```bibtex
@InProceedings{Abdessaied_2024_eccv,
author = {Abdessaied, Adnen and Shi, Lei and Bulling, Andreas},
title = {{Multi-Modal Video Dialog State Tracking in the Wild}},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2024}
}
```
# Table of Contents
* [Setup and Dependencies](#setup-and-dependencies)
* [Download Data](#download-data)
* [Training](#training)
* [Response Generation](#response-generation)
* [Results](#results)
* [Acknowledgements](#acknowledgements)
# Setup and Dependencies
We implemented our model using Python 3.7 and PyTorch 1.12.0 (CUDA 11.3, cuDNN 8.3.2). We recommend setting up a virtual environment using Anaconda.
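A minimal sketch of such a setup, assuming a conda installation is available (the environment name `mst_mixer` and the PyTorch wheel index URL are illustrative; adapt them to your system):

```shell
# Create and activate a fresh environment with the Python version used by the authors
conda create -n mst_mixer python=3.7 -y
conda activate mst_mixer

# Install PyTorch 1.12.0 built against CUDA 11.3 (official PyTorch wheel index)
pip install torch==1.12.0+cu113 --extra-index-url https://download.pytorch.org/whl/cu113

# Verify the installation
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```

If the repository ships a `requirements.txt` or `environment.yml`, installing from that file should be preferred over the manual commands above.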