<div align="center">
<h1> V2Dial <img src="misc/logo.png" width="3%" align="bottom">: Unification of Video and Visual Dialog via Multimodal Experts </h1>

**[Adnen Abdessaied][16], [Anna Rohrbach][17], [Marcus Rohrbach][20], [Andreas Bulling][18]** <br> <br>

**CVPR 2025, Nashville, TN, USA <img src="misc/usa.png" width="3%" align="center">** <br>

**[[Paper][19]]**

---------------------------

<img src="misc/method.png" width="70%" align="middle"><br><br>

</div>

# Citation
If you find our code useful or use it in your own projects, please cite our paper:
```bibtex
@InProceedings{v2dial_abdessaied,
    author    = {Abdessaied, Adnen and Rohrbach, Anna and Rohrbach, Marcus and Bulling, Andreas},
    title     = {{V2Dial: Unification of Video and Visual Dialog via Multimodal Experts}},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year      = {2025}
}
```
# Table of Contents
* [Setup and Dependencies](#setup-and-dependencies)
* [Download Data](#download-data)
* [Training](#training)
* [Response Generation](#response-generation)
* [Results](#results)
* [Acknowledgements](#acknowledgements)
# Setup and Dependencies
Create a conda environment and install the dependencies:

```shell
conda create -n v2dial python=3.9
conda activate v2dial
conda install pytorch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 pytorch-cuda=11.8 -c pytorch -c nvidia
conda install -c huggingface transformers
pip install evaluate wandb glog pyhocon
```
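A quick optional sanity check confirms that the install worked and that PyTorch can see your GPUs:

```shell
# Optional sanity check: CUDA visibility and library imports.
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())"
python -c "import transformers, evaluate; print(transformers.__version__)"
```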
# Download Data
❗ We do NOT own any of the data used in this project. For legal reasons, we only provide links to where it can be downloaded.
## Champagne
- Textual data can be accessed [here](https://seungjuhan.me/champagne/)
- The video URLs can be used to download the raw videos if needed. This can be done using the following [code](https://github.com/rowanz/merlot_reserve/tree/main/data)
## WebVid-2M
- Please follow the instructions/hints from this [repo](https://github.com/m-bain/webvid) to download the dataset
## CC3M
- Please follow these [instructions](https://github.com/salesforce/LAVIS/blob/main/dataset_card/conceptual_captions.md) to download the dataset
## AVSD
- The textual data of the three versions can be downloaded from [AVSD-DSTC7][2], [AVSD-DSTC8][3] and [AVSD-DSTC10][10], respectively
- The videos can be obtained from [here](http://vuchallenge.org/charades.html)
## VisDial v1.0
- Both textual and image data can be obtained from [here](https://visualdialog.org/data)
After the data is downloaded, you need to set the data paths correctly in the config files in `config/`
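The key names vary across the config files, so a quick way to locate the path entries you need to edit (a sketch, assuming the pyhocon `.conf` files sit in `config/`) is:

```shell
# List path-like entries in the configs; key names are repo-specific,
# so treat the pattern below as a starting point.
grep -rn -iE "path|dir|root" config/*.conf
```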
# Training
We trained our model on 8 NVIDIA A100 GPUs for all stages.
## Stage 1
Run

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main_stage_1.py \
    --mode train \
    --tag stage_1
```
## Stage 2
Run

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main_stage_2.py \
    --mode train \
    --tag stage_2
```
## Stage 3
Run

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main_stage_3.py \
    --mode train \
    --tag stage_3
```
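The three stages only differ in the entry script and tag, so they can also be launched back to back. This loop is a convenience sketch; it assumes each stage's config points at the previous stage's checkpoint, so verify that before running it unattended:

```shell
# Convenience sketch: run all three training stages sequentially.
# Assumes each stage's .conf references the previous stage's checkpoint.
for s in 1 2 3; do
    CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main_stage_${s}.py \
        --mode train \
        --tag stage_${s}
done
```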
# Response Generation
## AVSD-DSTC7
1. Set `dstc=7` in the `.conf` file of your trained networks. In the default setting, you can find it under `logs/experiment_tag/code/config/v2_dial_stage_x.conf` (a scripted way to set this is sketched after this list)

2. Generate the responses

```shell
./generate_parallel_avsd.sh v2dial/stage_x results_avsd_dstc7_v2dial_stage_x generate logs/stage_x/tag_to_be_used 7
```

3. All responses will be saved in `output/dstc7/`
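Instead of editing the `.conf` file by hand, step 1 can be scripted. The one-liner below is a sketch: it assumes the key is written as `dstc = <n>` on its own line and that the config sits at the default path, so adapt both to your setup (the same command with `8` or `10` covers the other two benchmarks):

```shell
# Hypothetical helper: set dstc=7 in a trained model's config.
# Adjust the path and the key syntax to match your actual .conf file.
sed -i 's/^[[:space:]]*dstc[[:space:]]*=.*/dstc = 7/' logs/experiment_tag/code/config/v2_dial_stage_x.conf
```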
## AVSD-DSTC8
1. Set `dstc=8` in the `.conf` file of your trained networks. In the default setting, you can find it under `logs/experiment_tag/code/config/v2_dial_stage_x.conf`

2. Generate the responses

```shell
./generate_parallel_avsd.sh v2dial/stage_x results_avsd_dstc8_v2dial_stage_x generate logs/stage_x/tag_to_be_used 8
```

3. All responses will be saved in `output/dstc8/`
## AVSD-DSTC10
1. Set `dstc=10` in the `.conf` file of your trained networks. In the default setting, you can find it under `logs/experiment_tag/code/config/v2_dial_stage_x.conf`

2. Generate the responses

```shell
./generate_parallel_avsd.sh v2dial/stage_x results_avsd_dstc10_v2dial_stage_x generate logs/stage_x/tag_to_be_used 10
```

3. All responses will be saved in `output/dstc10/`
## VisDial
1. Generate the responses

```shell
./generate_parallel_visdial.sh v2dial/stage_x results_visdial_v2dial_stage_x generate logs/stage_x/tag_to_be_used
```

2. All responses will be saved in `output/visdial/`
# Results
## AVSD
To evaluate the AVSD results, run the [eval_tool_dstc7][7] of AVSD-DSTC7, the [eval_tool_dstc8][8] of AVSD-DSTC8, and the [eval_tool_dstc10][11] of AVSD-DSTC10 on the responses generated in the previous step.
## VisDial
Use the script `eval_visdial.py` for evaluation.
# Acknowledgements
We thank the authors of [miniGPT4-Video][8], [VindLU][1], and [BLIP-2][4] for providing their codebases, which greatly influenced this work.
[1]: https://github.com/klauscc/VindLU
[2]: https://github.com/hudaAlamri/DSTC7-Audio-Visual-Scene-Aware-Dialog-AVSD-Challenge
[3]: https://github.com/dialogtekgeek/DSTC8-AVSD_official
[4]: https://github.com/salesforce/LAVIS/tree/main/projects/blip2
[7]: https://drive.google.com/file/d/1EKfPtrNBQ5ciKRl6XggImweGRP84XuPi/view?usp=sharing
[8]: https://github.com/Vision-CAIR/MiniGPT4-video
[10]: https://drive.google.com/file/d/1zvC6FuPRVRiLQCXZcYpzYUI9r1tiWls6/view
[11]: https://github.com/ankitshah009/AVSD-DSTC10_baseline
[16]: https://adnenabdessaied.de/
[17]: https://anna-rohrbach.net/
[18]: https://perceptualui.org/people/bulling/
[19]: https://arxiv.org/abs/2503.02063
[20]: https://rohrbach.vision/