# V2Dial: Unification of Video and Visual Dialog via Multimodal Experts

**[Adnen Abdessaied][16],   [Anna Rohrbach][17],   [Marcus Rohrbach][20],   [Andreas Bulling][18]**

**CVPR 2025, Nashville, TN, USA**
**[[Paper][19]]**

---

# Citation
If you find our code useful or use it in your own projects, please cite our paper:
```bibtex
@InProceedings{v2dial_abdessaied,
    author    = {Abdessaied, Adnen and Rohrbach, Anna and Rohrbach, Marcus and Bulling, Andreas},
    title     = {{V2Dial: Unification of Video and Visual Dialog via Multimodal Experts}},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year      = {2025}
}
```

# Table of Contents
* [Setup and Dependencies](#setup-and-dependencies)
* [Download Data](#download-data)
* [Training](#training)
* [Response Generation](#response-generation)
* [Results](#results)
* [Acknowledgements](#acknowledgements)

# Setup and Dependencies
Create a conda environment and install the dependencies:
```shell
conda create -n v2dial python=3.9
conda activate v2dial
conda install pytorch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 pytorch-cuda=11.8 -c pytorch -c nvidia
conda install -c huggingface transformers
pip install evaluate wandb glog pyhocon
```

# Download Data
❗ We do NOT own any of the data used in this project. For legal reasons, we only provide links to where it can be downloaded.

## Champagne
- The textual data can be accessed [here](https://seungjuhan.me/champagne/)
- The video URLs can be used to download the raw videos if needed. This can be done using the following [code](https://github.com/rowanz/merlot_reserve/tree/main/data)

## WebVid-2M
- Please follow the instructions/hints from this [repo](https://github.com/m-bain/webvid) to download the dataset

## CC3M
- Please follow these [instructions](https://github.com/salesforce/LAVIS/blob/main/dataset_card/conceptual_captions.md) to download the dataset

## AVSD
- The textual data of the three versions can be downloaded from [AVSD-DSTC7][2], [AVSD-DSTC8][3], and [AVSD-DSTC10][10], respectively
- The videos can be obtained from [here](http://vuchallenge.org/charades.html)

## VisDial v1.0
- Both the textual and image data can be obtained from [here](https://visualdialog.org/data)

After the data is downloaded, set the corresponding paths in the config files in `config/`.

# Training
We trained our model on 8 NVIDIA A100 GPUs in all stages.

## Stage 1
Run
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main_stage_1.py \
    --mode train \
    --tag stage_1
```

## Stage 2
Run
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main_stage_2.py \
    --mode train \
    --tag stage_2
```

## Stage 3
Run
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main_stage_3.py \
    --mode train \
    --tag stage_3
```

# Response Generation
## AVSD-DSTC7
1. Set `dstc=7` in the `.conf` file of your trained model. With the default settings, you can find it under `logs/experiment_tag/code/config/v2_dial_stage_x.conf`
2. Generate the responses
   ```shell
   ./generate_parallel_avsd.sh v2dial/stage_x results_avsd_dstc7_v2dial_stage_x generate logs/stage_x/tag_to_be_used 7
   ```
3. All responses will be saved in `output/dstc7/`

## AVSD-DSTC8
1. Set `dstc=8` in the `.conf` file of your trained model. With the default settings, you can find it under `logs/experiment_tag/code/config/v2_dial_stage_x.conf`
2. Generate the responses
   ```shell
   ./generate_parallel_avsd.sh v2dial/stage_x results_avsd_dstc8_v2dial_stage_x generate logs/stage_x/tag_to_be_used 8
   ```
3. All responses will be saved in `output/dstc8/`

## AVSD-DSTC10
1. Set `dstc=10` in the `.conf` file of your trained model. With the default settings, you can find it under `logs/experiment_tag/code/config/v2_dial_stage_x.conf`
2. Generate the responses
   ```shell
   ./generate_parallel_avsd.sh v2dial/stage_x results_avsd_dstc10_v2dial_stage_x generate logs/stage_x/tag_to_be_used 10
   ```
3. All responses will be saved in `output/dstc10/`
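The three AVSD runs above differ only in the challenge version and the output tag, so they can be scripted in one go. The following is a minimal sketch, not part of the original pipeline: it assumes a trained stage-3 model whose `.conf` file stores the version as a line of the form `dstc = X`; adapt `CONF`, the `sed` pattern, and the experiment paths to your own setup.

```shell
# Minimal sketch: generate responses for all three AVSD versions in one loop.
# Assumption: the trained model's config contains a line of the form "dstc = X".
CONF=logs/experiment_tag/code/config/v2_dial_stage_3.conf
for v in 7 8 10; do
    sed -i "s/dstc *= *[0-9]*/dstc = ${v}/" "${CONF}"    # step 1: select the challenge version
    ./generate_parallel_avsd.sh v2dial/stage_3 \
        results_avsd_dstc${v}_v2dial_stage_3 generate \
        logs/stage_3/tag_to_be_used ${v}                 # step 2: generate the responses
done
# Responses are then written to output/dstc7/, output/dstc8/, and output/dstc10/.
```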
## VisDial
1. Generate the responses
   ```shell
   ./generate_parallel_visdial.sh v2dial/stage_x results_visdial_v2dial_stage_x generate logs/stage_x/tag_to_be_used
   ```
2. All responses will be saved in `output/visdial/`

# Results
## AVSD
To evaluate the AVSD results, run the evaluation tool of the respective challenge on the responses generated in the previous step: [eval_tool_dstc7][7] for AVSD-DSTC7, the evaluation tool of [AVSD-DSTC8][3], and [eval_tool_dstc10][11] for AVSD-DSTC10.

## VisDial
Use the script `eval_visdial.py` for evaluation.

# Acknowledgements
We thank the authors of [miniGPT4-Video][8], [VindLU][1], and [BLIP-2][4] for providing their codebases, which greatly influenced this work.

[1]: https://github.com/klauscc/VindLU
[2]: https://github.com/hudaAlamri/DSTC7-Audio-Visual-Scene-Aware-Dialog-AVSD-Challenge
[3]: https://github.com/dialogtekgeek/DSTC8-AVSD_official
[4]: https://github.com/salesforce/LAVIS/tree/main/projects/blip2
[7]: https://drive.google.com/file/d/1EKfPtrNBQ5ciKRl6XggImweGRP84XuPi/view?usp=sharing
[8]: https://github.com/Vision-CAIR/MiniGPT4-video
[10]: https://drive.google.com/file/d/1zvC6FuPRVRiLQCXZcYpzYUI9r1tiWls6/view
[11]: https://github.com/ankitshah009/AVSD-DSTC10_baseline
[16]: https://adnenabdessaied.de/
[17]: https://anna-rohrbach.net/
[18]: https://perceptualui.org/people/bulling/
[19]: https://arxiv.org/abs/2503.02063
[20]: https://rohrbach.vision/