This is the official code of the paper **Video Language Co-Attention with Fast-Learning Feature Fusion for VideoQA**.
If you find our code useful, please cite our paper:
# Overview
<p align="center"><img src="assets/overview_project_one.png" alt="drawing" width="600" height="400"/></p>
# Results
Our VLCN model achieves **new** state-of-the-art results on two open-ended VideoQA datasets, **MSVD-QA** and **MSRVTT-QA**.
#### MSVD-QA
| <center>Model</center> | <center>What</center> | <center>Who</center> | <center>How</center> | <center>When</center> | <center>Where</center> | <center>All</center> |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| ST-VQA | 18.10 | 50.00 | **83.80** | 72.40 | 28.60 | 31.30 |
| Co-Mem | 19.60 | 48.70 | 81.60 | 74.10 | 31.70 | 31.70 |
| HMEMA | 22.40 | 50.00 | 73.00 | 70.70 | 42.90 | 33.70 |
| SSML | - | - | - | - | - | 35.13 |
| QueST | 24.50 | **52.90** | 79.10 | 72.40 | **50.00** | 36.10 |
| HCRN | - | - | - | - | - | 36.10 |
| MA-DRNN | 24.30 | 51.60 | 82.00 | **86.30** | 26.30 | 36.20 |
| **VLCN (Ours)** | **28.42** | 51.29 | 81.08 | 74.13 | 46.43 | **38.06** |
#### MSRVTT-QA
| <center>Model</center> | <center>What</center> | <center>Who</center> | <center>How</center> | <center>When</center> | <center>Where</center> | <center>All</center> |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| ST-VQA | 24.50 | 41.20 | 78.00 | 76.50 | 34.90 | 30.90 |
| Co-Mem | 23.90 | 42.50 | 74.10 | 69.00 | **42.90** | 32.00 |
| HMEMA | 22.40 | **50.10** | 73.00 | 70.70 | 42.90 | 33.70 |
| QueST | 27.90 | 45.60 | **83.00** | 75.70 | 31.60 | 34.60 |
| SSML | - | - | - | - | - | 35.00 |
| HCRN | - | - | - | - | - | 35.60 |
| **VLCN (Ours)** | **30.69** | 44.09 | 79.82 | **78.29** | 36.80 | **36.01** |
# Requirements
- PyTorch 1.3.1<br/>
- Torchvision 0.4.2<br/>
- Python 3.6
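A minimal environment setup might look as follows. This is only a sketch assuming a plain Python 3.6 environment with pip; the exact install command may differ depending on your CUDA setup.
```bash
# Assumption: plain pip install of the versions listed above;
# adjust for your platform/CUDA version if needed.
pip install torch==1.3.1 torchvision==0.4.2
```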
# Raw data
The raw data of MSVD-QA and MSRVTT-QA are located in `data/MSVD-QA` and `data/MSRVTT-QA`, respectively.<br/>
**Videos:** The raw videos of MSVD-QA and MSRVTT-QA can be downloaded from [here](https://www.cs.utexas.edu/users/ml/clamp/videoDescription/) and [here](https://www.mediafire.com/folder/h14iarbs62e7p/shared), respectively.<br/>
**Text:** The text data can be downloaded from [here](https://github.com/xudejing/video-question-answering).<br/>
After downloading all the raw data, `data/MSVD-QA` and `data/MSRVTT-QA` should have the following structure:
<p align="center"><img src="assets/structure.png" alt="directory structure" /></p>
# Preprocessing
To sample the individual frames and clips and generate the corresponding visual features, we run the script `preporocess.py` on the raw videos with the appropriate flags. E.g. for MSVD-QA we have to execute
```bash
python core/data/preporocess.py --RAW_VID_PATH /data/MSVD-QA/videos --C3D_PATH path_to_pretrained_c3d
```
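The analogous command for MSRVTT-QA would presumably be the following; this assumes the same entry point and flags apply and that the raw videos sit under `/data/MSRVTT-QA/videos`.
```bash
# Assumption: MSRVTT-QA uses the same preprocessing script and flags as MSVD-QA.
python core/data/preporocess.py --RAW_VID_PATH /data/MSRVTT-QA/videos --C3D_PATH path_to_pretrained_c3d
```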
This will save the individual frames and clips in `data/MSVD-QA/frames` and `data/MSVD-QA/clips`, respectively, and their visual features in `data/MSVD-QA/frame_feat` and `data/MSVD-QA/clip_feat`, respectively.
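As an optional sanity check (directory names taken from the paragraph above), one can verify that the feature directories were populated:
```bash
# Optional check: the feature directories should be non-empty after preprocessing.
ls data/MSVD-QA/frame_feat | head
ls data/MSVD-QA/clip_feat | head
```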
# Config files
Before starting training, one has to update the config path file `cfgs/path_cfgs.py` with the paths of the raw data as well as the visual features.<br/>
All hyperparameters can be adjusted in `cfgs/base_cfgs.py`.
# Training
To start training, one has to specify an experiment directory `EXP_NAME` where all the results (log files, checkpoints, tensorboard files, etc.) will be saved. Furthermore, one needs to specify the `MODEL_TYPE` of the VLCN variant to be trained.
| <center>MODEL_TYPE</center> | <center>Description</center> |
| :---: | :---: |
| 1 | VLCN |
| 2 | VLCN-FLF |
| 3 | VLCN+LSTM |
| 4 | MCAN |
These parameters can be passed as command-line arguments, e.g. by executing
```bash
python run.py --EXP_NAME experiment --MODEL_TYPE 1 --DATA_PATH /data/MSVD-QA --GPU 1 --SEED 42
```
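Training progress can then be monitored with TensorBoard by pointing it at the experiment directory. The exact location of the `EXP_NAME` directory below is an assumption; check where the run script writes its outputs.
```bash
# Assumption: the experiment directory "experiment" is created under the repository root.
tensorboard --logdir experiment
```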
# Pre-trained models
Our pre-trained models are available [here](https://drive.google.com/drive/folders/172yj4iUkF1U1WOPdA5KuKOTQXkgzFEzS).
# Acknowledgements
We thank the Vision and Language Group @ MIL for their open-source [MCAN](https://github.com/MILVLG/mcan-vqa) implementation, [DavideA](https://github.com/DavideA/c3d-pytorch/blob/master/C3D_model.py) for his pretrained C3D model, and finally [ixaxaar](https://github.com/ixaxaar/pytorch-dnc) for his DNC implementation.