
V2Dial: Unification of Video and Visual Dialog via Multimodal Experts

Adnen Abdessaied,   Anna Rohrbach,   Marcus Rohrbach,   Andreas Bulling

CVPR 2025, Nashville, TN, USA
[Paper]




Citation

If you find our code useful or use it in your own projects, please cite our paper:

@InProceedings{v2dial_abdessaied,
    author    = {Abdessaied, Adnen and Rohrbach, Anna and Rohrbach, Marcus and Bulling, Andreas},
    title     = {{V2Dial: Unification of Video and Visual Dialog via Multimodal Experts}},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year      = {2025}
}

Table of Contents

  • Setup and Dependencies
  • Download Data
  • Training
  • Response Generation
  • Results
  • Acknowledgements

Setup and Dependencies

Create a conda environment and install the dependencies:

   conda create -n v2dial python=3.9
   conda activate v2dial
   conda install pytorch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 pytorch-cuda=11.8 -c pytorch -c nvidia
   conda install -c huggingface transformers
   pip install evaluate wandb glog pyhocon
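
To verify the installation, a quick sanity check along the following lines can be run (a minimal sketch; the expected version numbers correspond to the pinned installs above):

   # Sanity check of the environment (versions match the pinned installs
   # above; CUDA availability depends on your local driver setup).
   import torch
   import torchvision
   import transformers

   print(torch.__version__)          # expected: 2.2.0
   print(torchvision.__version__)    # expected: 0.17.0
   print(torch.cuda.is_available())  # True if the CUDA 11.8 build sees a GPU
   print(transformers.__version__)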

Download Data

We do NOT own any of the data used in this project. For legal reasons, we only provide links to where it can be downloaded.

Champagne

  • Textual data can be accessed here
  • The video URLs can be used to download the raw videos if needed. This can be done using code like the sketch below
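
A minimal sketch for fetching the raw videos (the file name video_urls.txt and the output folder videos/ are hypothetical; it assumes the video URLs were exported to a plain-text file, one URL per line, and that yt-dlp is installed, e.g. via pip install yt-dlp):

   # Fetch the raw videos listed in video_urls.txt with yt-dlp.
   import subprocess
   from pathlib import Path

   url_file = Path("video_urls.txt")  # hypothetical: one video URL per line
   out_dir = Path("videos")           # hypothetical output folder
   out_dir.mkdir(exist_ok=True)

   for url in url_file.read_text().splitlines():
       url = url.strip()
       if not url:
           continue
       # Save each video as videos/<id>.mp4; check=False so a single
       # dead link does not abort the whole download run.
       subprocess.run(
           ["yt-dlp", "-f", "mp4", "-o", str(out_dir / "%(id)s.%(ext)s"), url],
           check=False,
       )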

WebVid-2M

  • Please follow the instructions/hints from this repo to download the dataset

CC3M

AVSD

VisDial v1.0

  • Both textual and image data can be obtained from here

After the data is downloaded, you need to set up the paths correctly in the config files in config/ (a hypothetical excerpt is sketched below).
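
A hypothetical illustration in pyhocon syntax (the key names below are made up; adapt them to the keys actually present in the shipped .conf files):

   # Hypothetical excerpt from a file in config/ -- the key names are
   # illustrative only; match them to those in the existing .conf files.
   avsd_root      = /path/to/avsd
   visdial_root   = /path/to/visdial
   webvid_root    = /path/to/webvid2m
   cc3m_root      = /path/to/cc3m
   champagne_root = /path/to/champagne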

Training

We trained our model on 8 NVIDIA A100 GPUs for all stages.

Stage 1

Run

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main_stage_1.py \
   --mode train \
   --tag stage_1

Stage 2

Run

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main_stage_2.py \
   --mode train \
   --tag stage_2

Stage 3

Run

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main_stage_3.py \
   --mode train \
   --tag stage_3

Response Generation

AVSD-DSTC7

  1. Set dstc=7 in the .conf file of your trained network. In the default setting, you can find it under logs/experiment_tag/code/config/v2_dial_stage_x.conf (see the snippet after this list)
  2. Generate the responses
./generate_parallel_avsd.sh v2dial/stage_x results_avsd_dstc7_v2dial_stage_x generate logs/stage_x/tag_to_be_used 7
  3. All responses will be saved in output/dstc7/
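
The relevant line in the .conf file would then read as follows (a hypothetical excerpt; the surrounding keys differ per stage):

   dstc = 7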

AVSD-DSTC8

  1. Set dstc=8 in the .conf file of your trained network. In the default setting, you can find it under logs/experiment_tag/code/config/v2_dial_stage_x.conf
  2. Generate the responses
./generate_parallel_avsd.sh v2dial/stage_x results_avsd_dstc8_v2dial_stage_x generate logs/stage_x/tag_to_be_used 8
  3. All responses will be saved in output/dstc8/

AVSD-DSTC10

  1. Set dstc=10 in the .conf file of your trained network. In the default setting, you can find it under logs/experiment_tag/code/config/v2_dial_stage_x.conf
  2. Generate the responses
./generate_parallel_avsd.sh v2dial/stage_x results_avsd_dstc10_v2dial_stage_x generate logs/stage_x/tag_to_be_used 10
  3. All responses will be saved in output/dstc10/

VisDial

  1. Generate the responses
./generate_parallel_visdial.sh v2dial/stage_x results_visdial_v2dial_stage_x generate logs/stage_x/tag_to_be_used
  2. All responses will be saved in output/visdial/

Results

AVSD

To evaluate the AVSD results, please run the official evaluation tools of AVSD-DSTC7 (eval_tool_dstc7) and AVSD-DSTC10 (eval_tool_dstc10) on the responses generated in the previous step.

VisDial

Use the script eval_visdial.py for evaluation.

Acknowledgements

We thank the authors of miniGPT4-Video, VindLU, and BLIP-2 for providing their codebases, which greatly influenced this work.