<div align="center">
<h1>VD-GR: Boosting Visual Dialog with Cascaded Spatial-Temporal Multi-Modal GRaphs</h1>
**[Adnen Abdessaied][5], [Lei Shi][6], [Andreas Bulling][7]** <br> <br>
**WACV'24, Hawaii, USA** <img src="misc/usa.png" width="3%" align="center"> <br>
**[[Paper][8]]**
-------------------
<img src="misc/teaser_1.png" width="100%" align="middle"> <br> <br>
</div>
# Citation
If you find our code useful or use it in your own projects, please cite our paper:
```bibtex
@inproceedings {abdessaied_vdgr,
author = {Abdessaied, Adnen and Shi, Lei and Bulling, Andreas},
title = {{VD-GR: Boosting Visual Dialog with Cascaded Spatial-Temporal Multi-Modal GRaphs}},
booktitle = {IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
year = {2024},
}
```
# Table of Contents
* [Setup and Dependencies](#Setup-and-Dependencies)
* [Download Data](#Download-Data)
* [Pre-trained Checkpoints](#Pre-trained-Checkpoints)
* [Training](#Training)
* [Results](#Results)
# Setup and Dependencies
We implemented our model using Python 3.7 and PyTorch 1.11.0 (CUDA 11.3, cuDNN 8.2.0). We recommend setting up a virtual environment using Anaconda. <br>
1. Install [git lfs][1] on your system
2. Clone our repository to download the data, checkpoints, and code
```shell
git lfs install
git clone https://git.hcics.simtech.uni-stuttgart.de/public-projects/VDGR.git
```
3. Create a conda environment and install dependencies
```shell
conda create -n vdgr python=3.7
conda activate vdgr
conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch
conda install pyg -c pyg # 2.1.0
pip install pytorch-transformers
pip install pytorch_pretrained_bert
pip install pyhocon glog wandb lmdb
```
4. If you wish to speed up training, we recommend installing [apex][2]
```shell
git clone https://github.com/NVIDIA/apex
cd apex
# if pip >= 23.1 (ref: https://pip.pypa.io/en/stable/news/#v23-1) which supports multiple `--config-settings` with the same key...
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
# otherwise
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./
cd ..
```
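Once the environment is created, a quick check along the following lines can confirm that the packages installed above are visible to Python. This is a minimal sketch and not part of the repository: the helper `missing_packages` is hypothetical, and the module names are the import names we assume correspond to the packages above (for example, `torch_geometric` for `pyg`).

```python
import importlib.util

# Import names assumed to correspond to the packages installed above.
# apex is optional, so it is checked separately.
REQUIRED = [
    "torch", "torchvision", "torchaudio", "torch_geometric",
    "pytorch_transformers", "pytorch_pretrained_bert",
    "pyhocon", "glog", "wandb", "lmdb",
]


def missing_packages(names):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    print("Missing packages:", ", ".join(missing) if missing else "none")
    if missing_packages(["apex"]):
        print("apex not found (optional, only used to speed up training)")
```

The check only locates the modules without importing them, so it does not verify that the installed CUDA build matches your driver.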
# Download Data
1. Download the extracted visual features of [VisDial][3] and set up all files we used in our work. We provide a shell script for convenience:
```shell
./setup_data.sh # Please make sure you have enough disk space
```
If everything was set up correctly, the ```data/``` folder should look like this:
```
├── history_adj_matrices
│   ├── test
│   │   ├── *.pkl
│   ├── train
│   │   ├── *.pkl
│   ├── val
│   │   ├── *.pkl
├── question_adj_matrices
│   ├── test
│   │   ├── *.pkl
│   ├── train
│   │   ├── *.pkl
│   ├── val
│   │   ├── *.pkl
├── img_adj_matrices
│   ├── *.pkl
├── parse_vocab.pkl
├── test_dense_mapping.json
├── tr_dense_mapping.json
├── val_dense_mapping.json
├── visdial_0.9_test.json
├── visdial_0.9_train.json
├── visdial_0.9_val.json
├── visdial_1.0_test.json
├── visdial_1.0_train_dense_annotations.json
├── visdial_1.0_train_dense.json
├── visdial_1.0_train.json
├── visdial_1.0_val_dense_annotations.json
├── visdial_1.0_val.json
├── visdialconv_dense_annotations.json
├── visdialconv.json
├── vispro_dense_annotations.json
└── vispro.json
```
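To compare your ```data/``` folder against the listing above, a small script like the following may help. It is a sketch rather than part of the repository: the helper `missing_entries` is hypothetical, and it assumes that ```test```, ```train```, and ```val``` are subdirectories of the history and question adjacency-matrix folders. It does not check the individual ```*.pkl``` files.

```python
from pathlib import Path

# Directories and files copied from the listing above.
EXPECTED_DIRS = [
    "history_adj_matrices/test", "history_adj_matrices/train", "history_adj_matrices/val",
    "question_adj_matrices/test", "question_adj_matrices/train", "question_adj_matrices/val",
    "img_adj_matrices",
]
EXPECTED_FILES = [
    "parse_vocab.pkl",
    "test_dense_mapping.json", "tr_dense_mapping.json", "val_dense_mapping.json",
    "visdial_0.9_test.json", "visdial_0.9_train.json", "visdial_0.9_val.json",
    "visdial_1.0_test.json", "visdial_1.0_train_dense_annotations.json",
    "visdial_1.0_train_dense.json", "visdial_1.0_train.json",
    "visdial_1.0_val_dense_annotations.json", "visdial_1.0_val.json",
    "visdialconv_dense_annotations.json", "visdialconv.json",
    "vispro_dense_annotations.json", "vispro.json",
]


def missing_entries(data_dir="data"):
    """Return the expected directories and files that are absent from data_dir."""
    root = Path(data_dir)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing


if __name__ == "__main__":
    absent = missing_entries("data")
    print("Missing entries:", absent if absent else "none")
```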
# Pre-trained Checkpoints
For convenience, we provide checkpoints of our model after the warm-up training stage in ```ckpt/``` for both VisDial v1.0 and VisDial v0.9. <br>
These checkpoints will be downloaded with the code if you use ```git lfs```.
# Training
We trained our model on 8 Nvidia Tesla V100-32GB GPUs. The default hyperparameters in ```config/vdgr.conf``` and ```config/bert_base_6layer_6conect.json``` need to be adjusted if your setup differs from ours.
## Phase 1
### Training
1. In this phase, we train our model on VisDial v1.0 via
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main.py \
--model vdgr/P1 \
--mode train \
--tag K2_v1.0 \
--wandb_mode online \
--wandb_project your_wandb_project_name
```
⚠️ On a similar setup to ours, this will take roughly 20h to complete using apex for training.
2. To train on VisDial v0.9:
* Set ```visdial_version = 0.9``` in ```config/vdgr.conf```
* Set ```start_path = ckpt/vdgr_visdial_v0.9_after_warmup_K2.ckpt``` in ```config/vdgr.conf```
* Run
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main.py \
--model vdgr/P1 \
--mode train \
--tag K2_v0.9 \
--wandb_mode online \
--wandb_project your_wandb_project_name
```
### Inference
1. For inference on VisDial v1.0 val, VisDialConv, or VisPro:
* Set ```eval_dataset``` to one of ```{visdial, visdial_conv, visdial_vispro}``` in ```logs/vdgr/P1_K2_v1.0/code/config/vdgr.conf```
* Run
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main.py \
--model vdgr/P1 \
--mode eval \
--eval_dir logs/vdgr/P1_K2_v1.0 \
--wandb_mode offline
```
2. For inference on VisDial v0.9:
* Set ```eval_dataset = visdial``` in ```logs/vdgr/P1_K2_v0.9/code/config/vdgr.conf```
* Run
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main.py \
--model vdgr/P1 \
--mode eval \
--eval_dir logs/vdgr/P1_K2_v0.9 \
--wandb_mode offline
```
⚠️ This might take some time to finish as the testing data of VisDial v0.9 is large.
3. For inference on VisDial v1.0 test:
* Run
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main.py \
--model vdgr/P1 \
--mode predict \
--eval_dir logs/vdgr/P1_K2_v1.0 \
--wandb_mode offline
```
* The output file will be saved in ```output/```
## Phase 2
In this phase, we fine-tune on dense annotations to improve the NDCG score (only supported for VisDial v1.0).
1. Run
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main.py \
--model vdgr/P2_CE \
--mode train \
--tag K2_v1.0_CE \
--wandb_mode online \
--wandb_project your_wandb_project_name
```
⚠️ This will take roughly 3-4 hours to complete using the same setup as before and [DP][4] for training.
2. For inference on VisDial v1.0:
* Run:
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main.py \
--model vdgr/P2_CE \
--mode predict \
--eval_dir logs/vdgr/P1_K2_v1.0_CE \
--wandb_mode offline
```
* The output file will be saved in ```output/```
## Phase 3
### Training
In the final phase, we train an ensemble of 8 models using ```K={1,2,3,4}``` and ```dense_loss={ce, listnet}```.
For ```K=k```:
1. Set the value of ```num_v_gnn_layers, num_q_gnn_layers, num_h_gnn_layers``` to ```k```
2. Set ```start_path = ckpt/vdgr_visdial_v1.0_after_warmup_K[k].ckpt``` in ```config/vdgr.conf``` (P1)
3. Phase 1 training:
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main.py \
--model vdgr/P1 \
--mode train \
--tag K[k]_v1.0 \
--wandb_mode online \
--wandb_project your_wandb_project_name
```
4. Set ```start_path = logs/vdgr/P1_K[k]_v1.0/epoch_best.ckpt``` in ```config/vdgr.conf``` (P2)
5. Phase 2 training:
* Fine-tune with CE:
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main.py \
--model vdgr/P2_CE \
--mode train \
--tag K[k]_v1.0_CE \
--wandb_mode online \
--wandb_project your_wandb_project_name
```
* Fine-tune with LISTNET:
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main.py \
--model vdgr/P2_LISTNET \
--mode train \
--tag K[k]_v1.0_LISTNET \
--wandb_mode online \
--wandb_project your_wandb_project_name
```
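The eight Phase 2 runs of the ensemble (four values of ```K``` times two dense losses) can be listed with a short helper such as the one below. This is a sketch, not part of the repository: `phase2_commands` is a hypothetical name, and the script only prints the commands shown above with ```[k]``` filled in. The config edits described in the steps above still have to be made for each ```k``` before the corresponding command is run.

```python
import itertools
import shlex

K_VALUES = [1, 2, 3, 4]
DENSE_LOSSES = ["CE", "LISTNET"]


def phase2_commands(wandb_project="your_wandb_project_name", gpus="0,1,2,3,4,5,6,7"):
    """Return one Phase 2 training command per (K, dense loss) pair."""
    commands = []
    for k, loss in itertools.product(K_VALUES, DENSE_LOSSES):
        args = [
            "python", "main.py",
            "--model", f"vdgr/P2_{loss}",
            "--mode", "train",
            "--tag", f"K{k}_v1.0_{loss}",
            "--wandb_mode", "online",
            "--wandb_project", wandb_project,
        ]
        # Quote each argument so the printed line can be pasted into a shell.
        quoted = " ".join(shlex.quote(a) for a in args)
        commands.append(f"CUDA_VISIBLE_DEVICES={gpus} {quoted}")
    return commands


if __name__ == "__main__":
    for command in phase2_commands():
        print(command)
```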
### Inference
1. For inference on VisDial v1.0 test:
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main.py \
--model vdgr/P2_[CE,LISTNET] \
--mode predict \
--eval_dir logs/vdgr/P2_K[1,2,3,4]_v1.0_[CE,LISTNET] \
--wandb_mode offline
```
2. Finally, merge the outputs of all models:
```shell
python ensemble.py \
--exp test \
--mode predict
```
The output file will be saved in ```output/```
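The bracket notation in the prediction command above stands for one run per combination. The following sketch expands such a pattern into concrete directory names. The helper `expand` is hypothetical and not part of the repository, and deriving the ```--model``` value from the directory suffix is our assumption about how the two bracketed lists pair up.

```python
import itertools
import re


def expand(pattern):
    """Expand every [a,b,...] group in pattern into all combinations."""
    parts = re.split(r"\[([^\]]*)\]", pattern)
    # Even indices hold literal text, odd indices hold comma-separated choices.
    choices = [p.split(",") if i % 2 else [p] for i, p in enumerate(parts)]
    return ["".join(combo) for combo in itertools.product(*choices)]


if __name__ == "__main__":
    for eval_dir in expand("logs/vdgr/P2_K[1,2,3,4]_v1.0_[CE,LISTNET]"):
        loss = eval_dir.rsplit("_", 1)[1]  # "CE" or "LISTNET"
        print(f"python main.py --model vdgr/P2_{loss} --mode predict "
              f"--eval_dir {eval_dir} --wandb_mode offline")
```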
# Results
## VisDial v0.9
| Model | MRR | R@1 | R@5 | R@10 | Mean |
|:--------:|:---:|:---:|:---:|:----:|:----:|
| Prev. SOTA | 71.99 | 59.41 | 87.92 | 94.59 | 2.87 |
| VD-GR | **74.50** | **62.10** | **90.49** | **96.37** | **2.45** |
## VisDialConv
| Model | NDCG | MRR | R@1 | R@5 | R@10 | Mean |
|:--------:|:----:|:---:|:---:|:---:|:----:|:----:|
| Prev. SOTA | 61.72 | 61.79 | 48.95 | 77.50 | 86.71 | 4.72 |
| VD-GR | **67.09** | **66.82** | **54.47** | **81.71** | **91.44** | **3.54** |
## VisPro
| Model | NDCG | MRR | R@1 | R@5 | R@10 | Mean |
|:--------:|:----:|:---:|:---:|:---:|:----:|:----:|
| Prev. SOTA | 59.30 | 62.29 | 48.35 | 80.10 | 88.87 | 4.37 |
| VD-GR | **60.35** | **69.89** | **57.21** | **85.97** | **92.68** | **3.15** |
## VisDial v1.0 Val
| Model | NDCG | MRR | R@1 | R@5 | R@10 | Mean |
|:--------:|:----:|:---:|:---:|:---:|:----:|:----:|
| Prev. SOTA | 65.47 | 69.71 | 56.79 | 85.82 | 93.64 | 3.15 |
| VD-GR | 64.32 | **69.91** | **57.01** | **86.14** | **93.74** | **3.13** |
## VisDial v1.0 Test
| Model | NDCG | MRR | R@1 | R@5 | R@10 | Mean |
|:--------:|:----:|:---:|:---:|:---:|:----:|:----:|
| Prev. SOTA | 64.91 | 68.73 | 55.73 | 85.38 | 93.53 | 3.21 |
| VD-GR | 63.49 | 68.65 | 55.33 | **85.58** | **93.85** | **3.20** |
| ♣️ Prev. SOTA | 75.92 | 56.18 | 45.32 | 68.05 | 80.98 | 5.42 |
| ♣️ VD-GR | **75.95** | **58.30** | **46.55** | **71.45** | **84.52** | **5.32** |
| ♣️♦️ Prev. SOTA | 76.17 | 56.42 | 44.75 | 70.23 | 84.52 | 5.47 |
| ♣️♦️ VD-GR | **76.43** | 56.35 | **45.18** | 68.13 | 82.18 | 5.79 |
♣️ = Finetuning on dense annotations, ♦️ = Ensemble model
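For readers unfamiliar with the columns above: MRR, R@k, and Mean are computed from the rank that the model assigns to the ground-truth answer among the candidates. Higher is better for MRR and R@k, lower is better for Mean. The sketch below uses the standard definitions and is not the repository's evaluation code. The function name `sparse_metrics` is hypothetical, and NDCG is omitted because it additionally requires the dense relevance annotations.

```python
def sparse_metrics(gt_ranks):
    """Compute MRR and R@{1,5,10} (in %) and the mean rank from 1-based ranks."""
    n = len(gt_ranks)
    # Mean reciprocal rank, reported as a percentage like in the tables.
    metrics = {"MRR": 100.0 * sum(1.0 / r for r in gt_ranks) / n}
    for k in (1, 5, 10):
        # Share of questions whose ground-truth answer is ranked in the top k.
        metrics[f"R@{k}"] = 100.0 * sum(r <= k for r in gt_ranks) / n
    metrics["Mean"] = sum(gt_ranks) / n
    return metrics


if __name__ == "__main__":
    # Toy example: ground-truth answers ranked 1st, 2nd, 4th, and 20th.
    print(sparse_metrics([1, 2, 4, 20]))
```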
[1]: https://git-lfs.com/
[2]: https://github.com/NVIDIA/apex
[3]: https://visualdialog.org/
[4]: https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html
[5]: https://www.perceptualui.org/people/abdessaied/
[6]: https://www.perceptualui.org/people/shi/
[7]: https://www.perceptualui.org/people/bulling/
[8]: TBD