<div align="center">
<h1> MST-MIXER <img src="misc/mixer.png" width="3%" align="bottom">: Multi-Modal Video Dialog State Tracking in the Wild </h1>

**[Adnen Abdessaied][16], [Lei Shi][17], [Andreas Bulling][18]** <br> <br>

**ECCV 2024, Milan, Italy <img src="misc/italy.png" width="3%" align="center">** <br>

**[[Paper][19]]**

---------------------------

<img src="misc/teaser.png" width="70%" align="middle"><br><br>

</div>
# Citation

If you find our code useful or use it in your own projects, please cite our paper:
```bibtex
@InProceedings{Abdessaied_2024_eccv,
    author    = {Abdessaied, Adnen and Shi, Lei and Bulling, Andreas},
    title     = {{Multi-Modal Video Dialog State Tracking in the Wild}},
    booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
    year      = {2024}
}
```
# Table of Contents

* [Setup and Dependencies](#Setup-and-Dependencies)
* [Download Data](#Download-Data)
* [Training](#Training)
* [Response Generation](#Response-Generation)
* [Results](#Results)
* [Acknowledgements](#Acknowledgements)
# Setup and Dependencies

We implemented our model using Python 3.7 and PyTorch 1.12.0 (CUDA 11.3, CuDNN 8.3.2). We recommend setting up a virtual environment using Anaconda. <br>

1. Install [git lfs][1] on your system
2. Clone our repository to download our code and a checkpoint of our best model
```shell
git lfs install
git clone this_repo.git
```
3. Create a conda environment and install the dependencies
```shell
conda create -n mst_mixer python=3.7
conda activate mst_mixer
conda install pytorch==1.12.0 torchvision==0.13.0 torchaudio==0.12.0 cudatoolkit=11.3 -c pytorch
conda install pyg -c pyg
conda install pytorch-scatter -c pyg # pytorch >= 1.8.0
conda install pytorch-sparse -c pyg # pytorch >= 1.8.0
conda install -c huggingface transformers
pip install evaluate wandb glog pyhocon attrs
```
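After installation, a quick sanity check can confirm the environment; the ```git lfs pull``` is only needed if the checkpoint came down as a small LFS pointer file rather than the actual weights:

```shell
# Should print "1.12.0 True" on a correctly configured GPU machine
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
# Fetch the real checkpoint weights if the clone only contains LFS pointers
git lfs pull
```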
# Download Data

## AVSD
1. Download the [AVSD-DSTC7][2], [AVSD-DSTC8][3] and [AVSD-DSTC10][10] data
2. Place the raw JSON files in ```raw_data/``` and the features in ```features/``` (a hypothetical layout is sketched after this list)
3. Preprocess and save the input features for faster training as indicated in ```custom_datasets/```
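The exact file names depend on the archives you download, so treat the following as a guess to be adjusted:

```shell
# Hypothetical file names -- adjust to match the downloaded archives
mkdir -p raw_data features
mv ~/Downloads/avsd_dstc7_train.json raw_data/   # raw dialog JSON
mv ~/Downloads/i3d_rgb features/                 # pre-extracted video features
```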
## NExT-QA
1. For convenience, we include the features/data in this repository.
# Training
We trained our model on 8 Nvidia Tesla V100-32GB GPUs. The default hyperparameters in ```config/mst_mixer.conf``` need to be adjusted if your setup differs from ours.
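For instance, on fewer GPUs you would typically lower the batch size. The key name below is hypothetical; check ```config/mst_mixer.conf``` for the actual HOCON keys before editing:

```shell
# Hypothetical key name -- verify it exists in the config first
sed -i 's/^batch_size = .*/batch_size = 4/' config/mst_mixer.conf
```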
## AVSD
1. Set ```task=avsd``` in ```config/mst_mixer.conf```
2. ```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main.py \
--mode train \
--tag mst_mixer_avsd \
--wandb_mode online \
--wandb_project mst_mixer_avsd
```
To deactivate [wandb][4] logging, use ```--wandb_mode disabled```.

On a setup similar to ours, training takes roughly 20 hours to complete.
## NExT-QA
1. Set ```task=nextqa``` in ```config/mst_mixer.conf```
2. ```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main.py \
--mode train \
--tag mst_mixer_nextqa \
--wandb_mode online \
--wandb_project mst_mixer_nextqa
```
# Response Generation

## AVSD-DSTC7
1. Set ```dstc=7``` in the ```.conf``` file of your trained network. With the default settings, you can find it under ```logs/unique_training_tag/code/config/mst_mixer.conf```
2. Generate the responses (the script's positional arguments are annotated after this list)
```shell
./generate_parallel_avsd.sh mst_mixer/mixer results_avsd_dstc7 generate logs/mst_mixer_avsd 7
```
3. All responses will be saved in ```output/dstc7/```
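For reference, here are the positional arguments of ```generate_parallel_avsd.sh``` as inferred from the invocations in this README; treat this as an assumption and consult the script for the authoritative meaning:

```shell
# $1: model identifier          (mst_mixer/mixer)
# $2: tag for the result files  (results_avsd_dstc7)
# $3: mode                      (generate)
# $4: checkpoint/log directory  (logs/mst_mixer_avsd)
# $5: DSTC edition              (7, 8 or 10)
./generate_parallel_avsd.sh mst_mixer/mixer results_avsd_dstc7 generate logs/mst_mixer_avsd 7
```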
## AVSD-DSTC8
1. Set ```dstc=8``` in the ```.conf``` file of your trained network. With the default settings, you can find it under ```logs/unique_training_tag/code/config/mst_mixer.conf```
2. Generate the responses
```shell
./generate_parallel_avsd.sh mst_mixer/mixer results_avsd_dstc8 generate logs/mst_mixer_avsd 8
```
3. All responses will be saved in ```output/dstc8/```
## AVSD-DSTC10
1. Set ```dstc=10``` in the ```.conf``` file of your trained network. With the default settings, you can find it under ```logs/unique_training_tag/code/config/mst_mixer.conf```
2. Generate the responses
```shell
./generate_parallel_avsd.sh mst_mixer/mixer results_avsd_dstc10 generate logs/mst_mixer_avsd 10
```
3. All responses will be saved in ```output/dstc10/```
## NExT-QA
1. Generate the responses
```shell
./generate_parallel_nextqa.sh mst_mixer/mixer results_nextqa generate logs/mst_mixer_nextqa
```
2. All responses will be saved in ```output/nextqa/```
3. Evaluate the responses using this [script][15] (a minimal sketch follows this list)
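The sketch assumes you clone the NExT-OE repository that hosts the evaluation script; consult it for the exact arguments and the expected location of the generated results:

```shell
# Hypothetical workflow -- eval_oe.py may require paths to be configured
# as described in the NExT-OE repository
git clone https://github.com/doc-doc/NExT-OE.git
cd NExT-OE
python eval_oe.py
```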
# Results

To evaluate our best model on the different benchmarks:
## AVSD-DSTC7
Executing the [eval_tool][7] of AVSD-DSTC7 on the generated responses outputs the following metrics:

| Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr |
|:--------:|:------:|:------:|:------:|:------:|:------:|:-------:|:-----:|
| Prev. SOTA | 78.2 | 65.5 | 55.2 | 46.9 | 30.8 | 61.9 | 135.2 |
| MST_MIXER | **78.7** | **66.5** | **56.3** | **47.6** | **31.3** | **62.5** | **138.8** |
## AVSD-DSTC8
1. Set ```dstc=8``` in ```ckpt/code/mst_mixer.conf```
2. Run
```shell
./generate_parallel_avsd.sh mst_mixer/mixer results_avsd_dstc8_best_model generate ckpt/avsd 8
```
3. The responses will be saved in ```output/dstc8/```
4. Executing the [eval_tool][7] of AVSD-DSTC8 on the generated responses outputs the following metrics:

| Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr |
|:--------:|:------:|:------:|:------:|:------:|:------:|:-------:|:-----:|
| Prev. SOTA | 76.4 | 64.1 | 54.3 | 46.0 | 30.1 | 61.0 | 130.4 |
| MST_MIXER | **77.5** | **66.0** | **56.1** | **47.7** | **30.6** | **62.4** | **135.4** |
## AVSD-DSTC10
Executing the [eval_tool][11] of AVSD-DSTC10 on the generated responses outputs the following metrics:

| Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr |
|:--------:|:------:|:------:|:------:|:------:|:------:|:-------:|:-----:|
| Prev. SOTA | 69.3 | 55.6 | 45.0 | 37.2 | 24.9 | 53.6 | 91.2 |
| MST_MIXER | **70.0** | **57.4** | **47.6** | **40.0** | **25.7** | **54.5** | **99.8** |
## NExT-QA
Executing the [eval script][15] of NExT-QA on the generated responses outputs the following metrics:

| Model | WUPS_C | WUPS_T | WUPS_D | WUPS |
|:--------:|:------:|:------:|:------:|:------:|
| Prev. SOTA | 17.98 | 17.95 | 50.84 | 28.40 |
| MST_MIXER | **22.12** | **22.20** | **55.64** | **29.50** |
# Acknowledgements

We thank the authors of [RLM][8] for providing their [code][9], which greatly influenced this work.
[1]: https://git-lfs.com/
[2]: https://github.com/hudaAlamri/DSTC7-Audio-Visual-Scene-Aware-Dialog-AVSD-Challenge
[3]: https://github.com/dialogtekgeek/DSTC8-AVSD_official
[4]: https://wandb.ai/site
[5]: https://drive.google.com/drive/folders/1SlZTySJAk_2tiMG5F8ivxCfOl_OWwd_Q
[7]: https://drive.google.com/file/d/1EKfPtrNBQ5ciKRl6XggImweGRP84XuPi/view?usp=sharing
[8]: https://arxiv.org/abs/2002.00163
[9]: https://github.com/ictnlp/DSTC8-AVSD
[10]: https://drive.google.com/file/d/1zvC6FuPRVRiLQCXZcYpzYUI9r1tiWls6/view
[11]: https://github.com/ankitshah009/AVSD-DSTC10_baseline
[15]: https://github.com/doc-doc/NExT-OE/blob/main/eval_oe.py
[16]: https://perceptualui.org/people/abdessaied/
[17]: https://perceptualui.org/people/shi/
[18]: https://perceptualui.org/people/bulling/
[19]: https://arxiv.org/abs/2407.02218