# V2Dial: Unification of Video and Visual Dialog via Multimodal Experts

**[Adnen Abdessaied][16],   [Anna Rohrbach][17],   [Marcus Rohrbach][20],   [Andreas Bulling][18]**

**CVPR 2025, Nashville, TN, USA**
**[[Paper][19]]**

---

# Citation
If you find our code useful or use it in your own projects, please cite our paper:
```bibtex
@InProceedings{v2dial_abdessaied,
    author    = {Abdessaied, Adnen and Rohrbach, Anna and Rohrbach, Marcus and Bulling, Andreas},
    title     = {{V2Dial: Unification of Video and Visual Dialog via Multimodal Experts}},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year      = {2025}
}
```

# Table of Contents
* [Setup and Dependencies](#setup-and-dependencies)
* [Download Data](#download-data)
* [Training](#training)
* [Response Generation](#response-generation)
* [Results](#results)
* [Acknowledgements](#acknowledgements)

# Setup and Dependencies
Create a conda environment and install the dependencies:
```shell
conda create -n v2dial python=3.9
conda activate v2dial
conda install pytorch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 pytorch-cuda=11.8 -c pytorch -c nvidia
conda install -c huggingface transformers
pip install evaluate wandb glog pyhocon
```

# Download Data
❗ We do NOT own any of the data used in this project. For legal reasons, we only provide links to where it can be downloaded.

## Champagne
- The textual data can be accessed [here](https://seungjuhan.me/champagne/)
- The video URLs can be used to download the raw videos if needed. This can be done using the following [code](https://github.com/rowanz/merlot_reserve/tree/main/data)

## WebVid-2M
- Please follow the instructions/hints from this [repo](https://github.com/m-bain/webvid) to download the dataset

## CC3M
- Please follow these [instructions](https://github.com/salesforce/LAVIS/blob/main/dataset_card/conceptual_captions.md) to download the dataset

## AVSD
- The textual data of the three versions can be downloaded from [AVSD-DSTC7][2], [AVSD-DSTC8][3], and [AVSD-DSTC10][10], respectively
- The videos can be obtained from [here](http://vuchallenge.org/charades.html)

## VisDial v1.0
- Both the textual and image data can be obtained from [here](https://visualdialog.org/data)

After the data is downloaded, set the corresponding paths in the config files in `config/`.

# Training
We trained our model on 8 NVIDIA A100 GPUs in all stages.

## Stage 1
Run
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main_stage_1.py \
    --mode train \
    --tag stage_1
```

## Stage 2
Run
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main_stage_2.py \
    --mode train \
    --tag stage_2
```

## Stage 3
Run
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main_stage_3.py \
    --mode train \
    --tag stage_3
```

# Response Generation
## AVSD-DSTC7
1. Set `dstc=7` in the `.conf` file of your trained model. With the default settings, you can find it under `logs/experiment_tag/code/config/v2_dial_stage_x.conf`
2. Generate the responses
   ```shell
   ./generate_parallel_avsd.sh v2dial/stage_x results_avsd_dstc7_v2dial_stage_x generate logs/stage_x/tag_to_be_used 7
   ```
3. All responses will be saved in `output/dstc7/`

## AVSD-DSTC8
1. Set `dstc=8` in the `.conf` file of your trained model. With the default settings, you can find it under `logs/experiment_tag/code/config/v2_dial_stage_x.conf`
2. Generate the responses
   ```shell
   ./generate_parallel_avsd.sh v2dial/stage_x results_avsd_dstc8_v2dial_stage_x generate logs/stage_x/tag_to_be_used 8
   ```
3. All responses will be saved in `output/dstc8/`

## AVSD-DSTC10
1. Set `dstc=10` in the `.conf` file of your trained model. With the default settings, you can find it under `logs/experiment_tag/code/config/v2_dial_stage_x.conf`
2. Generate the responses
   ```shell
   ./generate_parallel_avsd.sh v2dial/stage_x results_avsd_dstc10_v2dial_stage_x generate logs/stage_x/tag_to_be_used 10
   ```
3. All responses will be saved in `output/dstc10/`
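The three AVSD runs above differ only in the challenge version and the output tag, so they can be scripted in one go. The following is a minimal sketch, not part of the original pipeline: it assumes a trained stage-3 model whose `.conf` file stores the version as a line of the form `dstc = X`; adapt `CONF`, the `sed` pattern, and the experiment paths to your own setup.

```shell
# Minimal sketch: generate responses for all three AVSD versions in one loop.
# Assumption: the trained model's config contains a line of the form "dstc = X".
CONF=logs/experiment_tag/code/config/v2_dial_stage_3.conf
for v in 7 8 10; do
    sed -i "s/dstc *= *[0-9]*/dstc = ${v}/" "${CONF}"    # step 1: select the challenge version
    ./generate_parallel_avsd.sh v2dial/stage_3 \
        results_avsd_dstc${v}_v2dial_stage_3 generate \
        logs/stage_3/tag_to_be_used ${v}                 # step 2: generate the responses
done
# Responses are then written to output/dstc7/, output/dstc8/, and output/dstc10/.
```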
## VisDial
1. Generate the responses
   ```shell
   ./generate_parallel_visdial.sh v2dial/stage_x results_visdial_v2dial_stage_x generate logs/stage_x/tag_to_be_used
   ```
2. All responses will be saved in `output/visdial/`

# Results
## AVSD
To evaluate the AVSD results, run the evaluation tool of the respective challenge on the responses generated in the previous step: [eval_tool_dstc7][7] for AVSD-DSTC7, the evaluation tool of [AVSD-DSTC8][3], and [eval_tool_dstc10][11] for AVSD-DSTC10.

## VisDial
Use the script `eval_visdial.py` for evaluation.

# Acknowledgements
We thank the authors of [miniGPT4-Video][8], [VindLU][1], and [BLIP-2][4] for providing their codebases, which greatly influenced this work.

[1]: https://github.com/klauscc/VindLU
[2]: https://github.com/hudaAlamri/DSTC7-Audio-Visual-Scene-Aware-Dialog-AVSD-Challenge
[3]: https://github.com/dialogtekgeek/DSTC8-AVSD_official
[4]: https://github.com/salesforce/LAVIS/tree/main/projects/blip2
[7]: https://drive.google.com/file/d/1EKfPtrNBQ5ciKRl6XggImweGRP84XuPi/view?usp=sharing
[8]: https://github.com/Vision-CAIR/MiniGPT4-video
[10]: https://drive.google.com/file/d/1zvC6FuPRVRiLQCXZcYpzYUI9r1tiWls6/view
[11]: https://github.com/ankitshah009/AVSD-DSTC10_baseline
[16]: https://adnenabdessaied.de/
[17]: https://anna-rohrbach.net/
[18]: https://perceptualui.org/people/bulling/
[19]: https://arxiv.org/abs/2503.02063
[20]: https://rohrbach.vision/