
V2Dial: Unification of Video and Visual Dialog via Multimodal Experts

Adnen Abdessaied,   Anna Rohrbach,   Marcus Rohrbach,   Andreas Bulling

CVPR 2025, Nashville, TN, USA
[Paper]




Citation

If you find our code useful or use it in your own projects, please cite our paper:

@InProceedings{v2dial_abdessaied,
    author    = {Abdessaied, Adnen and Rohrbach, Anna and Rohrbach, Marcus and Bulling, Andreas},
    title     = {{V2Dial: Unification of Video and Visual Dialog via Multimodal Experts}},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year      = {2025}
}

Table of Contents

  • Setup and Dependencies
  • Download Data
  • Training
  • Response Generation
  • Results
  • Acknowledgements

Setup and Dependencies

Create a conda environment and install the dependencies:

   conda create -n v2dial python=3.9
   conda activate v2dial
   conda install pytorch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 pytorch-cuda=11.8 -c pytorch -c nvidia
   conda install -c huggingface transformers
   pip install evaluate wandb glog pyhocon
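
To verify the installation, a quick sanity check along the following lines can be run (a minimal sketch; the expected version numbers correspond to the pinned installs above):

   # Sanity check of the environment (versions match the pinned installs
   # above; CUDA availability depends on your local driver setup).
   import torch
   import torchvision
   import transformers

   print(torch.__version__)          # expected: 2.2.0
   print(torchvision.__version__)    # expected: 0.17.0
   print(torch.cuda.is_available())  # True if the CUDA 11.8 build sees a GPU
   print(transformers.__version__)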

Download Data

We do NOT own any of the data used in this project. For legal reasons, we only provide links to where it can be downloaded.

Champagne

  • Textual data can be accessed here
  • The video URLs can be used to download the raw videos if needed. This can be done using code like the sketch below
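
A minimal sketch for fetching the raw videos (the file name video_urls.txt and the output folder videos/ are hypothetical; it assumes the video URLs were exported to a plain-text file, one URL per line, and that yt-dlp is installed, e.g. via pip install yt-dlp):

   # Fetch the raw videos listed in video_urls.txt with yt-dlp.
   import subprocess
   from pathlib import Path

   url_file = Path("video_urls.txt")  # hypothetical: one video URL per line
   out_dir = Path("videos")           # hypothetical output folder
   out_dir.mkdir(exist_ok=True)

   for url in url_file.read_text().splitlines():
       url = url.strip()
       if not url:
           continue
       # Save each video as videos/<id>.mp4; check=False so a single
       # dead link does not abort the whole download run.
       subprocess.run(
           ["yt-dlp", "-f", "mp4", "-o", str(out_dir / "%(id)s.%(ext)s"), url],
           check=False,
       )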

WebVid-2M

  • Please follow the instructions/hints from this repo to download the dataset

CC3M

AVSD

VisDial v1.0

  • Both textual and image data can be obtained from here

After the data is downloaded, you need to set up the paths correctly in the config files in config/ (a hypothetical excerpt is sketched below).
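
A hypothetical illustration in pyhocon syntax (the key names below are made up; adapt them to the keys actually present in the shipped .conf files):

   # Hypothetical excerpt from a file in config/ -- the key names are
   # illustrative only; match them to those in the existing .conf files.
   avsd_root      = /path/to/avsd
   visdial_root   = /path/to/visdial
   webvid_root    = /path/to/webvid2m
   cc3m_root      = /path/to/cc3m
   champagne_root = /path/to/champagne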

Training

We trained our model on 8 NVIDIA A100 GPUs for all stages.

Stage 1

Run

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main_stage_1.py \
   --mode train \
   --tag stage_1

Stage 2

Run

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main_stage_2.py \
   --mode train \
   --tag stage_2

Stage 3

Run

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main_stage_3.py \
   --mode train \
   --tag stage_3

Response Generation

AVSD-DSTC7

  1. Set dstc=7 in the .conf file of your trained network. In the default setting, you can find it under logs/experiment_tag/code/config/v2_dial_stage_x.conf (see the snippet after this list)
  2. Generate the responses
./generate_parallel_avsd.sh v2dial/stage_x results_avsd_dstc7_v2dial_stage_x generate logs/stage_x/tag_to_be_used 7
  3. All responses will be saved in output/dstc7/
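
The relevant line in the .conf file would then read as follows (a hypothetical excerpt; the surrounding keys differ per stage):

   dstc = 7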

AVSD-DSTC8

  1. Set dstc=8 in the .conf file of your trained network. In the default setting, you can find it under logs/experiment_tag/code/config/v2_dial_stage_x.conf
  2. Generate the responses
./generate_parallel_avsd.sh v2dial/stage_x results_avsd_dstc8_v2dial_stage_x generate logs/stage_x/tag_to_be_used 8
  3. All responses will be saved in output/dstc8/

AVSD-DSTC10

  1. Set dstc=10 in the .conf file of your trained network. In the default setting, you can find it under logs/experiment_tag/code/config/v2_dial_stage_x.conf
  2. Generate the responses
./generate_parallel_avsd.sh v2dial/stage_x results_avsd_dstc10_v2dial_stage_x generate logs/stage_x/tag_to_be_used 10
  3. All responses will be saved in output/dstc10/

VisDial

  1. Generate the responses
./generate_parallel_visdial.sh v2dial/stage_x results_visdial_v2dial_stage_x generate logs/stage_x/tag_to_be_used
  2. All responses will be saved in output/visdial/

Results

AVSD

To evaluate the AVSD results, please run the official evaluation tools of AVSD-DSTC7 (eval_tool_dstc7) and AVSD-DSTC10 (eval_tool_dstc10) on the responses generated in the previous step.

VisDial

Use the script eval_visdial.py for evaluation.

Acknowledgements

We thank the authors of miniGPT4-Video, VindLU, and BLIP-2 for providing their codebases, which greatly influenced this work.