Initial commit

This commit is contained in:
Adnen Abdessaied 2022-03-30 10:46:35 +02:00
commit b5f3b728c3
53 changed files with 7008 additions and 0 deletions

124
README.md Normal file
View file

@ -0,0 +1,124 @@
This is the official code for the paper **Video Language Co-Attention with Fast-Learning Feature Fusion for VideoQA**.
If you find our code useful, please cite our paper.
# Overview
<p align="center"><img src="assets/overview_project_one.png" alt="drawing" width="600" height="400"/></p>
# Results
Our VLCN model achieves **new** state-of-the-art results on two open-ended VideoQA datasets **MSVD-QA** and **MSRVTT-QA**.
#### MSVD-QA
| <center>Model</center> | <center>What</center> | <center>Who</center> | <center>How</center> | <center>When</center> | <center>Where</center> | <center>All</center> |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| ST-VQA | 18.10 | 50.00 | **83.80** | 72.40 | 28.60 | 31.30 |
| Co-Mem | 19.60 | 48.70 | 81.60 | 74.10 | 31.70 | 31.70 |
| HMEMA | 22.40 | 50.00 | 73.00 | 70.70 | 42.90 | 33.70 |
| SSML | - | - | - | - | - | 35.13 |
| QueST | 24.50 | **52.90** | 79.10 | 72.40 | **50.00** | 36.10 |
| HCRN | - | - | - | - | - | 36.10 |
| MA-DRNN | 24.30 | 51.60 | 82.00 | **86.30** | 26.30 | 36.20 |
| **VLCN (Ours)** | **28.42** | 51.29 | 81.08 | 74.13 | 46.43 | **38.06** |
#### MSRVTT-QA
| <center>Model</center> | <center>What</center> | <center>Who</center> | <center>How</center> | <center>When</center> | <center>Where</center> | <center>All</center> |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| ST-VQA | 24.50 | 41.20 | 78.00 | 76.50 | 34.90 | 30.90 |
| Co-Mem | 23.90 | 42.50 | 74.10 | 69.00 | **42.90** | 32.00 |
| HMEMA | 22.40 | **50.10** | 73.00 | 70.70 | 42.90 | 33.70 |
| QueST | 27.90 | 45.60 | **83.00** | 75.70 | 31.60 | 34.60 |
| SSML | - | - | - | - | - | 35.00 |
| HCRN | - | - | - | - | - | 35.60 |
| **VLCN (Ours)** | **30.69** | 44.09 | 79.82 | **78.29** | 36.80 | **36.01** |
# Requirements
- PyTorch 1.3.1<br/>
- Torchvision 0.4.2<br/>
- Python 3.6
# Raw data
The raw data of MSVD-QA and MSRVTT-QA are located in ``data/MSVD-QA`` and ``data/MSRVTT-QA``, respectively.<br/>
**Videos:** The raw videos of MSVD-QA and MSRVTT-QA can be downloaded from [here](https://www.cs.utexas.edu/users/ml/clamp/videoDescription/) and [here](https://www.mediafire.com/folder/h14iarbs62e7p/shared), respectively.<br/>
**Text:** The text data can be downloaded from [here](https://github.com/xudejing/video-question-answering).<br/>
After downloading all the raw data, ``data/MSVD-QA`` and ``data/MSRVTT-QA`` should have the following structure:
<p align="center"><img src="assets/structure.png" alt="Directory structure of the raw data" /></p>
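
As a quick sanity check before preprocessing, the layout can be verified with a few lines of Python. This is only a sketch: `missing_entries` is a hypothetical helper, and the entry names are assumptions taken from `cfgs/path_cfgs.py` (`train_qa.json`, `val_qa.json`, `test_qa.json`) and from the `videos` folder used in the preprocessing command, not from the figure above.

```python
import os

# Assumed entry names (from cfgs/path_cfgs.py and the preprocessing command);
# adjust them if your download is organised differently.
EXPECTED_ENTRIES = ("videos", "train_qa.json", "val_qa.json", "test_qa.json")


def missing_entries(dataset_root, expected=EXPECTED_ENTRIES):
    """Return the expected files/directories that are absent from dataset_root."""
    return [name for name in expected
            if not os.path.exists(os.path.join(dataset_root, name))]
```

Calling `missing_entries('data/MSVD-QA')` returns an empty list when every assumed entry is present.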
# Preprocessing
To sample the individual frames and clips and generate the corresponding visual features, we run the script ``preporocess.py`` on the raw videos with the appropriate flags. E.g., for MSVD-QA we have to execute
```bash
python core/data/preporocess.py --RAW_VID_PATH /data/MSVD-QA/videos --C3D_PATH path_to_pretrained_c3d
```
This will save the individual frames and clips in ``data/MSVD-QA/frames`` and ``data/MSVD-QA/clips``, respectively, and their visual features in ``data/MSVD-QA/frame_feat`` and ``data/MSVD-QA/clip_feat``, respectively.
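
The sampling logic of the preprocessing script can be illustrated in isolation. The sketch below mirrors the `np.linspace` indexing and the 16-frame clip window found in the script; the function names `sample_positions` and `clip_window` are made up for this example and do not exist in the repository. By default the script samples 20 frames and 20 clips per video (`--NUM_SAMPLES`).

```python
import numpy as np


def sample_positions(total_frames, num_samples):
    """Evenly spaced frame indices that skip the very beginning and end of the video."""
    positions = np.linspace(0, total_frames, num_samples + 2)[1:num_samples + 1]
    return [int(p) for p in positions]


def clip_window(center, total_frames, clip_len=16):
    """Start/end of a clip_len-frame window around center, shifted to stay inside the video."""
    start, end = center - clip_len // 2, center + clip_len // 2
    if start < 0:
        # Shift the window right so that it starts at frame 0
        start, end = 0, end - start
    if end > total_frames:
        # Shift the window left so that it ends at the last frame
        start, end = start - (end - total_frames), total_frames
    return start, end


print(sample_positions(100, 4))  # [20, 40, 60, 80]
print(clip_window(3, 100))       # (0, 16)
```

Each sampled position yields one 224x224 frame for VGG and one 16-frame 112x112 clip for C3D.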
# Config files
Before starting training, one has to update the path config file ``cfgs/path_cfgs.py`` with the paths of the raw data as well as the visual features.<br/>
All hyperparameters can be adjusted in ``cfgs/base_cfgs.py``.
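
Some values are not set directly but derived from the base hyperparameters in `Cfgs.proc()`. The sketch below recomputes them outside the config class to show the constraints involved; `derived_sizes` is a hypothetical helper written for this README, and its defaults are the defaults of `cfgs/base_cfgs.py`.

```python
def derived_sizes(hidden_size=512, multi_head=8, batch_size=64, grad_accu_steps=1):
    """Recompute the values that Cfgs.proc() derives from the base hyperparameters."""
    assert hidden_size % multi_head == 0, "HIDDEN_SIZE must be divisible by MULTI_HEAD"
    assert batch_size % grad_accu_steps == 0, "BATCH_SIZE must be divisible by GRAD_ACCU_STEPS"
    return {
        "FF_SIZE": hidden_size * 4,                       # feed-forward width per MCA layer
        "HIDDEN_SIZE_HEAD": hidden_size // multi_head,    # width of one attention head
        "SUB_BATCH_SIZE": batch_size // grad_accu_steps,  # batch per accumulation step
    }


print(derived_sizes())
# {'FF_SIZE': 2048, 'HIDDEN_SIZE_HEAD': 64, 'SUB_BATCH_SIZE': 64}
```

Changing `HIDDEN_SIZE` or `MULTI_HEAD` to values that violate the divisibility constraints makes `proc()` fail with an assertion error.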
# Training
To start training, one has to specify an experiment directory ``EXP_NAME`` where all the results (log files, checkpoints, tensorboard files, etc.) will be saved. Furthermore, one needs to specify the ``MODEL_TYPE`` of the VLCN to be trained.
| <center>MODEL_TYPE</center> | <center>Description</center> |
| :---: | :---: |
| 1 | VLCN |
| 2 | VLCN-FLF |
| 3 | VLCN+LSTM |
| 4 | MCAN |
These parameters can be set on the command line, e.g. by executing
```bash
python run.py --EXP_NAME experiment --MODEL_TYPE 1 --DATA_PATH /data/MSVD-QA --GPU 1 --SEED 42
```
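
Command-line values override the defaults of the config class: only arguments that were actually passed (i.e. are not `None`) are copied onto the config object, which is what `parse_to_dict` and `add_args` in `cfgs/base_cfgs.py` do. The following is a minimal sketch of that mechanism. `DemoCfgs`, `non_none_args` and `apply_overrides` are stand-ins invented for this example, and the parser only imitates the flags shown above, since `run.py` may define them differently.

```python
import argparse


class DemoCfgs:
    """Stand-in for the config object; the real class is Cfgs in cfgs/base_cfgs.py."""

    def __init__(self):
        self.GPU = '0'
        self.SEED = 1234
        self.MODEL_TYPE = None


def non_none_args(args):
    """Collect the public attributes of an argparse.Namespace that were actually set."""
    return {k: v for k, v in vars(args).items()
            if not k.startswith('_') and v is not None}


def apply_overrides(cfg, overrides):
    """Copy every override onto the config object."""
    for name, value in overrides.items():
        setattr(cfg, name, value)
    return cfg


# Hypothetical parser: flag names follow the README, types are assumptions
parser = argparse.ArgumentParser()
parser.add_argument('--EXP_NAME', type=str)
parser.add_argument('--MODEL_TYPE', type=int, choices=[1, 2, 3, 4])
parser.add_argument('--GPU', type=str)
parser.add_argument('--SEED', type=int)

args = parser.parse_args(['--EXP_NAME', 'experiment', '--MODEL_TYPE', '1', '--SEED', '42'])
cfg = apply_overrides(DemoCfgs(), non_none_args(args))
print(cfg.GPU, cfg.SEED, cfg.MODEL_TYPE)  # 0 42 1
```

Since `--GPU` was not passed in this example, the default `'0'` is kept, while `SEED` and `MODEL_TYPE` take the command-line values.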
# Pre-trained models
Our pre-trained models are available [here](https://drive.google.com/drive/folders/172yj4iUkF1U1WOPdA5KuKOTQXkgzFEzS).
# Acknowledgements
We thank the Vision and Language Group @ MIL for their open-source [MCAN](https://github.com/MILVLG/mcan-vqa) implementation, [DavideA](https://github.com/DavideA/c3d-pytorch/blob/master/C3D_model.py) for his pretrained C3D model, and [ixaxaar](https://github.com/ixaxaar/pytorch-dnc) for his DNC implementation.

0
assets/.gitkeep Normal file
View file

Binary file not shown. Size: 314 KiB

BIN
assets/structure.png Normal file

Binary file not shown. Size: 18 KiB

0
cfgs/.gitkeep Normal file
View file

267
cfgs/base_cfgs.py Normal file
View file

@ -0,0 +1,267 @@
# --------------------------------------------------------
# mcan-vqa (Deep Modular Co-Attention Networks)
# Licensed under The MIT License [see LICENSE for details]
# Written by Yuhao Cui https://github.com/cuiyuhao1996
# --------------------------------------------------------
from cfgs.path_cfgs import PATH
import os, torch, random
import numpy as np
from types import MethodType
class Cfgs(PATH):
    def __init__(self, EXP_NAME, DATASET_PATH):
        super(Cfgs, self).__init__(EXP_NAME, DATASET_PATH)

        # Set devices
        # For multi-GPU training, set e.g. '0, 1, 2' instead
        self.GPU = '0'

        # Set RNG seed for CPU and GPUs
        self.SEED = random.randint(0, 99999999)

        # -------------------------
        # ---- Version Control ----
        # -------------------------

        # Define a specific name to start a new training run
        # self.VERSION = 'Anonymous_' + str(self.SEED)
        self.VERSION = str(self.SEED)

        # Resume training
        self.RESUME = False

        # Used when resuming training and when testing
        self.CKPT_VERSION = self.VERSION
        self.CKPT_EPOCH = 0

        # Absolute checkpoint path; if set, 'CKPT_VERSION' and 'CKPT_EPOCH' are overridden
        self.CKPT_PATH = None

        # Print the loss at every step
        self.VERBOSE = True

        # ------------------------------
        # ---- Data Provider Params ----
        # ------------------------------

        # {'train', 'val', 'test'}
        self.RUN_MODE = 'train'

        # Set True to evaluate offline
        self.EVAL_EVERY_EPOCH = True

        # # Define the 'train' 'val' 'test' data split
        # # (EVAL_EVERY_EPOCH triggered when set {'train': 'train'})
        # self.SPLIT = {
        #     'train': '',
        #     'val': 'val',
        #     'test': 'test',
        # }

        # # An external method to set the train split
        # self.TRAIN_SPLIT = 'train+val+vg'

        # Set True to use pretrained word embeddings
        # (GloVe: spaCy https://spacy.io/)
        self.USE_GLOVE = True

        # Word embedding matrix size
        # (token size x WORD_EMBED_SIZE)
        self.WORD_EMBED_SIZE = 300

        # Max length of question sentences
        self.MAX_TOKEN = 15

        # VGG 4096D features
        self.FRAME_FEAT_SIZE = 4096

        # C3D 4096D features
        self.CLIP_FEAT_SIZE = 4096

        # Size of the answer vocabulary
        self.NUM_ANS = 1000

        # Default training batch size: 64
        self.BATCH_SIZE = 64

        # Number of data-loading workers
        self.NUM_WORKERS = 8

        # Use pinned memory
        # (Warning: pinned memory can accelerate GPU loading but may
        # increase the CPU memory usage when NUM_WORKERS is large)
        self.PIN_MEM = True

        # Large models may not fit in memory with batch size 64.
        # Gradient accumulation splits the batch to reduce GPU memory usage
        # (Warning: BATCH_SIZE must be divisible by GRAD_ACCU_STEPS)
        self.GRAD_ACCU_STEPS = 1

        # Set 'external': use an external shuffle method to implement training shuffle
        # Set 'internal': use the default shuffle method of the PyTorch dataloader
        self.SHUFFLE_MODE = 'external'

        # ------------------------
        # ---- Network Params ----
        # ------------------------

        # Model depth
        # (encoder and decoder have the same depth)
        self.LAYER = 6

        # Model hidden size
        # (512 by default; larger values sharply increase GPU memory usage)
        self.HIDDEN_SIZE = 512

        # Number of attention heads in the MCA layers
        # (Warning: HIDDEN_SIZE must be divisible by MULTI_HEAD)
        self.MULTI_HEAD = 8

        # Dropout rate for all dropout layers
        # (dropout can prevent overfitting [Dropout: a simple way to prevent neural networks from overfitting])
        self.DROPOUT_R = 0.1

        # MLP size in flatten layers
        self.FLAT_MLP_SIZE = 512

        # Flatten the last hidden state to a vector with {n} attention glimpses
        self.FLAT_GLIMPSES = 1
        self.FLAT_OUT_SIZE = 1024

        # --------------------------
        # ---- Optimizer Params ----
        # --------------------------

        # The base learning rate
        self.LR_BASE = 0.0001

        # Learning rate decay ratio
        self.LR_DECAY_R = 0.2

        # Learning rate decay at {x, y, z...} epoch
        self.LR_DECAY_LIST = [10, 12]

        # Max training epoch
        self.MAX_EPOCH = 30

        # Gradient clipping
        # (default: -1 disables it)
        self.GRAD_NORM_CLIP = -1

        # Adam optimizer betas, eps and weight decay
        self.OPT_BETAS = (0.9, 0.98)
        self.OPT_EPS = 1e-9
        self.OPT_WEIGHT_DECAY = 1e-5

        # --------------------------
        # ---- DNC Hyper-Params ----
        # --------------------------
        self.IN_SIZE_DNC = self.HIDDEN_SIZE
        self.OUT_SIZE_DNC = self.HIDDEN_SIZE
        self.WORD_LENGTH_DNC = 512
        self.CELL_COUNT_DNC = 64
        self.MEM_HIDDEN_SIZE = self.CELL_COUNT_DNC * self.WORD_LENGTH_DNC
        self.N_READ_HEADS_DNC = 4

    def parse_to_dict(self, args):
        args_dict = {}
        for arg in dir(args):
            if not arg.startswith('_') and not isinstance(getattr(args, arg), MethodType):
                if getattr(args, arg) is not None:
                    args_dict[arg] = getattr(args, arg)

        return args_dict

    def add_args(self, args_dict):
        for arg in args_dict:
            setattr(self, arg, args_dict[arg])

    def proc(self):
        assert self.RUN_MODE in ['train', 'val', 'test']

        # ------------ Devices setup
        # os.environ['CUDA_VISIBLE_DEVICES'] = self.GPU
        self.N_GPU = len(self.GPU.split(','))
        self.DEVICES = [_ for _ in range(self.N_GPU)]
        torch.set_num_threads(2)

        # ------------ Seed setup
        # fix pytorch seed
        torch.manual_seed(self.SEED)
        if self.N_GPU < 2:
            torch.cuda.manual_seed(self.SEED)
        else:
            torch.cuda.manual_seed_all(self.SEED)
        torch.backends.cudnn.deterministic = True

        # fix numpy seed
        np.random.seed(self.SEED)

        # fix random seed
        random.seed(self.SEED)

        if self.CKPT_PATH is not None:
            print('Warning: you are now using CKPT_PATH args, '
                  'CKPT_VERSION and CKPT_EPOCH will not work')
            self.CKPT_VERSION = self.CKPT_PATH.split('/')[-1] + '_' + str(random.randint(0, 99999999))

        # ------------ Split setup
        # NOTE: the defaults of SPLIT and TRAIN_SPLIT are commented out in __init__,
        # so both must be set (e.g. via add_args) before proc() is called.
        self.SPLIT['train'] = self.TRAIN_SPLIT
        if 'val' in self.SPLIT['train'].split('+') or self.RUN_MODE not in ['train']:
            self.EVAL_EVERY_EPOCH = False

        if self.RUN_MODE not in ['test']:
            self.TEST_SAVE_PRED = False

        # ------------ Gradient accumulation setup
        assert self.BATCH_SIZE % self.GRAD_ACCU_STEPS == 0
        self.SUB_BATCH_SIZE = int(self.BATCH_SIZE / self.GRAD_ACCU_STEPS)

        # Using a small eval batch reduces GPU memory usage
        self.EVAL_BATCH_SIZE = 32

        # ------------ Networks setup
        # FeedForwardNet size in every MCA layer
        self.FF_SIZE = int(self.HIDDEN_SIZE * 4)
        self.FF_MEM_SIZE = int()

        # Hidden size of a single attention head
        assert self.HIDDEN_SIZE % self.MULTI_HEAD == 0
        self.HIDDEN_SIZE_HEAD = int(self.HIDDEN_SIZE / self.MULTI_HEAD)

    def __str__(self):
        for attr in dir(self):
            if not attr.startswith('__') and not isinstance(getattr(self, attr), MethodType):
                print('{ %-17s }->' % attr, getattr(self, attr))

        return ''

    def check_path(self):
        print('Checking dataset ...')

        if not os.path.exists(self.FRAMES):
            print(self.FRAMES + ' NOT EXIST')
            exit(-1)

        if not os.path.exists(self.CLIPS):
            print(self.CLIPS + ' NOT EXIST')
            exit(-1)

        for mode in self.QA_PATH:
            if not os.path.exists(self.QA_PATH[mode]):
                print(self.QA_PATH[mode] + ' NOT EXIST')
                exit(-1)

        print('Finished')
        print('')

6
cfgs/fusion_cfgs.yml Normal file
View file

@ -0,0 +1,6 @@
CONTROLLER_INPUT_SIZE: 512
CONTROLLER_HIDDEN_SIZE: 512
CONTROLLER_NUM_LAYERS: 2
HIDDEN_DIM_COMP: 1024
OUT_DIM_COMP: 512
COMP_NUM_LAYERS: 2

61
cfgs/path_cfgs.py Normal file
View file

@ -0,0 +1,61 @@
# --------------------------------------------------------
# mcan-vqa (Deep Modular Co-Attention Networks)
# Licensed under The MIT License [see LICENSE for details]
# Written by Yuhao Cui https://github.com/cuiyuhao1996
# --------------------------------------------------------
import os
class PATH:
    def __init__(self, EXP_NAME, DATASET_PATH):
        # Name of the experiment
        self.EXP_NAME = EXP_NAME

        # Dataset root path
        self.DATASET_PATH = DATASET_PATH

        # Root paths of the frame and clip features
        self.FRAMES = os.path.join(DATASET_PATH, 'frame_feat/')
        self.CLIPS = os.path.join(DATASET_PATH, 'clip_feat/')

    def init_path(self):
        # os.path.join keeps these paths valid whether or not DATASET_PATH ends with a slash
        self.QA_PATH = {
            'train': os.path.join(self.DATASET_PATH, 'train_qa.json'),
            'val': os.path.join(self.DATASET_PATH, 'val_qa.json'),
            'test': os.path.join(self.DATASET_PATH, 'test_qa.json'),
        }
        self.C3D_PATH = os.path.join(self.DATASET_PATH, 'c3d.pickle')

        self.RESULT_PATH = './' + self.EXP_NAME + '/results/result_test/'
        self.PRED_PATH = './' + self.EXP_NAME + '/results/pred/'
        self.CACHE_PATH = './' + self.EXP_NAME + '/results/cache/'
        self.LOG_PATH = './' + self.EXP_NAME + '/results/log/'
        self.TB_PATH = './' + self.EXP_NAME + '/results/tensorboard/'
        self.CKPTS_PATH = './' + self.EXP_NAME + '/ckpts/'

        # Create the experiment directories that do not exist yet
        for path in [self.RESULT_PATH, self.PRED_PATH, self.CACHE_PATH,
                     self.LOG_PATH, self.TB_PATH, self.CKPTS_PATH]:
            os.makedirs(path, exist_ok=True)

    def check_path(self):
        raise NotImplementedError

13
cfgs/small_model.yml Normal file
View file

@ -0,0 +1,13 @@
LAYER: 6
HIDDEN_SIZE: 512
MEM_HIDDEN_SIZE: 2048
MULTI_HEAD: 8
DROPOUT_R: 0.1
FLAT_MLP_SIZE: 512
FLAT_GLIMPSES: 1
FLAT_OUT_SIZE: 1024
LR_BASE: 0.0001
LR_DECAY_R: 0.2
GRAD_ACCU_STEPS: 1
CKPT_VERSION: 'small'
CKPT_EPOCH: 13

0
code/.gitkeep Normal file
View file

0
code/assets/.gitkeep Normal file
View file

BIN
code/assets/structure.png Normal file

Binary file not shown. Size: 18 KiB

0
code/cfgs/.gitkeep Normal file
View file

267
code/cfgs/base_cfgs.py Normal file
View file

@ -0,0 +1,267 @@
# --------------------------------------------------------
# mcan-vqa (Deep Modular Co-Attention Networks)
# Licensed under The MIT License [see LICENSE for details]
# Written by Yuhao Cui https://github.com/cuiyuhao1996
# --------------------------------------------------------
from cfgs.path_cfgs import PATH
import os, torch, random
import numpy as np
from types import MethodType
class Cfgs(PATH):
    def __init__(self, EXP_NAME, DATASET_PATH):
        super(Cfgs, self).__init__(EXP_NAME, DATASET_PATH)

        # Set devices
        # For multi-GPU training, set e.g. '0, 1, 2' instead
        self.GPU = '0'

        # Set RNG seed for CPU and GPUs
        self.SEED = random.randint(0, 99999999)

        # -------------------------
        # ---- Version Control ----
        # -------------------------

        # Define a specific name to start a new training run
        # self.VERSION = 'Anonymous_' + str(self.SEED)
        self.VERSION = str(self.SEED)

        # Resume training
        self.RESUME = False

        # Used when resuming training and when testing
        self.CKPT_VERSION = self.VERSION
        self.CKPT_EPOCH = 0

        # Absolute checkpoint path; if set, 'CKPT_VERSION' and 'CKPT_EPOCH' are overridden
        self.CKPT_PATH = None

        # Print the loss at every step
        self.VERBOSE = True

        # ------------------------------
        # ---- Data Provider Params ----
        # ------------------------------

        # {'train', 'val', 'test'}
        self.RUN_MODE = 'train'

        # Set True to evaluate offline
        self.EVAL_EVERY_EPOCH = True

        # # Define the 'train' 'val' 'test' data split
        # # (EVAL_EVERY_EPOCH triggered when set {'train': 'train'})
        # self.SPLIT = {
        #     'train': '',
        #     'val': 'val',
        #     'test': 'test',
        # }

        # # An external method to set the train split
        # self.TRAIN_SPLIT = 'train+val+vg'

        # Set True to use pretrained word embeddings
        # (GloVe: spaCy https://spacy.io/)
        self.USE_GLOVE = True

        # Word embedding matrix size
        # (token size x WORD_EMBED_SIZE)
        self.WORD_EMBED_SIZE = 300

        # Max length of question sentences
        self.MAX_TOKEN = 15

        # VGG 4096D features
        self.FRAME_FEAT_SIZE = 4096

        # C3D 4096D features
        self.CLIP_FEAT_SIZE = 4096

        # Size of the answer vocabulary
        self.NUM_ANS = 1000

        # Default training batch size: 64
        self.BATCH_SIZE = 64

        # Number of data-loading workers
        self.NUM_WORKERS = 8

        # Use pinned memory
        # (Warning: pinned memory can accelerate GPU loading but may
        # increase the CPU memory usage when NUM_WORKERS is large)
        self.PIN_MEM = True

        # Large models may not fit in memory with batch size 64.
        # Gradient accumulation splits the batch to reduce GPU memory usage
        # (Warning: BATCH_SIZE must be divisible by GRAD_ACCU_STEPS)
        self.GRAD_ACCU_STEPS = 1

        # Set 'external': use an external shuffle method to implement training shuffle
        # Set 'internal': use the default shuffle method of the PyTorch dataloader
        self.SHUFFLE_MODE = 'external'

        # ------------------------
        # ---- Network Params ----
        # ------------------------

        # Model depth
        # (encoder and decoder have the same depth)
        self.LAYER = 6

        # Model hidden size
        # (512 by default; larger values sharply increase GPU memory usage)
        self.HIDDEN_SIZE = 512

        # Number of attention heads in the MCA layers
        # (Warning: HIDDEN_SIZE must be divisible by MULTI_HEAD)
        self.MULTI_HEAD = 8

        # Dropout rate for all dropout layers
        # (dropout can prevent overfitting [Dropout: a simple way to prevent neural networks from overfitting])
        self.DROPOUT_R = 0.1

        # MLP size in flatten layers
        self.FLAT_MLP_SIZE = 512

        # Flatten the last hidden state to a vector with {n} attention glimpses
        self.FLAT_GLIMPSES = 1
        self.FLAT_OUT_SIZE = 1024

        # --------------------------
        # ---- Optimizer Params ----
        # --------------------------

        # The base learning rate
        self.LR_BASE = 0.0001

        # Learning rate decay ratio
        self.LR_DECAY_R = 0.2

        # Learning rate decay at {x, y, z...} epoch
        self.LR_DECAY_LIST = [10, 12]

        # Max training epoch
        self.MAX_EPOCH = 30

        # Gradient clipping
        # (default: -1 disables it)
        self.GRAD_NORM_CLIP = -1

        # Adam optimizer betas, eps and weight decay
        self.OPT_BETAS = (0.9, 0.98)
        self.OPT_EPS = 1e-9
        self.OPT_WEIGHT_DECAY = 1e-5

        # --------------------------
        # ---- DNC Hyper-Params ----
        # --------------------------
        self.IN_SIZE_DNC = self.HIDDEN_SIZE
        self.OUT_SIZE_DNC = self.HIDDEN_SIZE
        self.WORD_LENGTH_DNC = 512
        self.CELL_COUNT_DNC = 64
        self.MEM_HIDDEN_SIZE = self.CELL_COUNT_DNC * self.WORD_LENGTH_DNC
        self.N_READ_HEADS_DNC = 4

    def parse_to_dict(self, args):
        args_dict = {}
        for arg in dir(args):
            if not arg.startswith('_') and not isinstance(getattr(args, arg), MethodType):
                if getattr(args, arg) is not None:
                    args_dict[arg] = getattr(args, arg)

        return args_dict

    def add_args(self, args_dict):
        for arg in args_dict:
            setattr(self, arg, args_dict[arg])

    def proc(self):
        assert self.RUN_MODE in ['train', 'val', 'test']

        # ------------ Devices setup
        # os.environ['CUDA_VISIBLE_DEVICES'] = self.GPU
        self.N_GPU = len(self.GPU.split(','))
        self.DEVICES = [_ for _ in range(self.N_GPU)]
        torch.set_num_threads(2)

        # ------------ Seed setup
        # fix pytorch seed
        torch.manual_seed(self.SEED)
        if self.N_GPU < 2:
            torch.cuda.manual_seed(self.SEED)
        else:
            torch.cuda.manual_seed_all(self.SEED)
        torch.backends.cudnn.deterministic = True

        # fix numpy seed
        np.random.seed(self.SEED)

        # fix random seed
        random.seed(self.SEED)

        if self.CKPT_PATH is not None:
            print('Warning: you are now using CKPT_PATH args, '
                  'CKPT_VERSION and CKPT_EPOCH will not work')
            self.CKPT_VERSION = self.CKPT_PATH.split('/')[-1] + '_' + str(random.randint(0, 99999999))

        # ------------ Split setup
        # NOTE: the defaults of SPLIT and TRAIN_SPLIT are commented out in __init__,
        # so both must be set (e.g. via add_args) before proc() is called.
        self.SPLIT['train'] = self.TRAIN_SPLIT
        if 'val' in self.SPLIT['train'].split('+') or self.RUN_MODE not in ['train']:
            self.EVAL_EVERY_EPOCH = False

        if self.RUN_MODE not in ['test']:
            self.TEST_SAVE_PRED = False

        # ------------ Gradient accumulation setup
        assert self.BATCH_SIZE % self.GRAD_ACCU_STEPS == 0
        self.SUB_BATCH_SIZE = int(self.BATCH_SIZE / self.GRAD_ACCU_STEPS)

        # Using a small eval batch reduces GPU memory usage
        self.EVAL_BATCH_SIZE = 32

        # ------------ Networks setup
        # FeedForwardNet size in every MCA layer
        self.FF_SIZE = int(self.HIDDEN_SIZE * 4)
        self.FF_MEM_SIZE = int()

        # Hidden size of a single attention head
        assert self.HIDDEN_SIZE % self.MULTI_HEAD == 0
        self.HIDDEN_SIZE_HEAD = int(self.HIDDEN_SIZE / self.MULTI_HEAD)

    def __str__(self):
        for attr in dir(self):
            if not attr.startswith('__') and not isinstance(getattr(self, attr), MethodType):
                print('{ %-17s }->' % attr, getattr(self, attr))

        return ''

    def check_path(self):
        print('Checking dataset ...')

        if not os.path.exists(self.FRAMES):
            print(self.FRAMES + ' NOT EXIST')
            exit(-1)

        if not os.path.exists(self.CLIPS):
            print(self.CLIPS + ' NOT EXIST')
            exit(-1)

        for mode in self.QA_PATH:
            if not os.path.exists(self.QA_PATH[mode]):
                print(self.QA_PATH[mode] + ' NOT EXIST')
                exit(-1)

        print('Finished')
        print('')

View file

@ -0,0 +1,6 @@
CONTROLLER_INPUT_SIZE: 512
CONTROLLER_HIDDEN_SIZE: 512
CONTROLLER_NUM_LAYERS: 2
HIDDEN_DIM_COMP: 1024
OUT_DIM_COMP: 512
COMP_NUM_LAYERS: 2

61
code/cfgs/path_cfgs.py Normal file
View file

@ -0,0 +1,61 @@
# --------------------------------------------------------
# mcan-vqa (Deep Modular Co-Attention Networks)
# Licensed under The MIT License [see LICENSE for details]
# Written by Yuhao Cui https://github.com/cuiyuhao1996
# --------------------------------------------------------
import os
class PATH:
    def __init__(self, EXP_NAME, DATASET_PATH):
        # Name of the experiment
        self.EXP_NAME = EXP_NAME

        # Dataset root path
        self.DATASET_PATH = DATASET_PATH

        # Root paths of the frame and clip features
        self.FRAMES = os.path.join(DATASET_PATH, 'frame_feat/')
        self.CLIPS = os.path.join(DATASET_PATH, 'clip_feat/')

    def init_path(self):
        # os.path.join keeps these paths valid whether or not DATASET_PATH ends with a slash
        self.QA_PATH = {
            'train': os.path.join(self.DATASET_PATH, 'train_qa.json'),
            'val': os.path.join(self.DATASET_PATH, 'val_qa.json'),
            'test': os.path.join(self.DATASET_PATH, 'test_qa.json'),
        }
        self.C3D_PATH = os.path.join(self.DATASET_PATH, 'c3d.pickle')

        self.RESULT_PATH = './' + self.EXP_NAME + '/results/result_test/'
        self.PRED_PATH = './' + self.EXP_NAME + '/results/pred/'
        self.CACHE_PATH = './' + self.EXP_NAME + '/results/cache/'
        self.LOG_PATH = './' + self.EXP_NAME + '/results/log/'
        self.TB_PATH = './' + self.EXP_NAME + '/results/tensorboard/'
        self.CKPTS_PATH = './' + self.EXP_NAME + '/ckpts/'

        # Create the experiment directories that do not exist yet
        for path in [self.RESULT_PATH, self.PRED_PATH, self.CACHE_PATH,
                     self.LOG_PATH, self.TB_PATH, self.CKPTS_PATH]:
            os.makedirs(path, exist_ok=True)

    def check_path(self):
        raise NotImplementedError

13
code/cfgs/small_model.yml Normal file
View file

@ -0,0 +1,13 @@
LAYER: 6
HIDDEN_SIZE: 512
MEM_HIDDEN_SIZE: 2048
MULTI_HEAD: 8
DROPOUT_R: 0.1
FLAT_MLP_SIZE: 512
FLAT_GLIMPSES: 1
FLAT_OUT_SIZE: 1024
LR_BASE: 0.0001
LR_DECAY_R: 0.2
GRAD_ACCU_STEPS: 1
CKPT_VERSION: 'small'
CKPT_EPOCH: 13

0
code/core/.gitkeep Normal file
View file

0
code/core/data/.gitkeep Normal file
View file

103
code/core/data/dataset.py Normal file
View file

@ -0,0 +1,103 @@
import glob, os, json, pickle
import numpy as np
from collections import defaultdict
import torch
from torch.utils.data import Dataset
import torchvision.transforms as transforms
from core.data.utils import tokenize, ans_stat, proc_ques, qlen_to_key, ans_to_key


class VideoQA_Dataset(Dataset):
    def __init__(self, __C):
        super(VideoQA_Dataset, self).__init__()
        self.__C = __C
        self.ans_size = __C.NUM_ANS

        # Load the raw QA pairs of the current run mode
        with open(__C.QA_PATH[__C.RUN_MODE], 'r') as f:
            self.raw_data = json.load(f)
        self.data_size = len(self.raw_data)

        splits = __C.SPLIT[__C.RUN_MODE].split('+')

        frames_list = glob.glob(__C.FRAMES + '*.pt')
        clips_list = glob.glob(__C.CLIPS + '*.pt')

        # The integer video id follows a 3-character file-name prefix for MSVD-QA
        # and a 5-character prefix otherwise
        prefix_len = 3 if 'msvd' in self.__C.DATASET_PATH.lower() else 5

        def _vid_id(path):
            return int(os.path.basename(path).split('.')[0][prefix_len:])

        # Derive the id from each path separately so that the mapping does not
        # depend on glob returning both lists in the same order
        self.frames_dict = {_vid_id(p): p for p in frames_list}
        self.clips_dict = {_vid_id(p): p for p in clips_list}
        del frames_list, clips_list

        # Collect questions and answer frequencies from the train and val splits
        q_list = []
        a_list = []
        a_dict = defaultdict(lambda: 0)
        for split in ['train', 'val']:
            with open(__C.QA_PATH[split], 'r') as f:
                qa_data = json.load(f)
            for d in qa_data:
                q_list.append(d['question'])
                a_list.append(d['answer'])
                if d['answer'] not in a_dict:
                    a_dict[d['answer']] = 1
                else:
                    a_dict[d['answer']] += 1

        # Answers sorted from most to least frequent
        top_answers = sorted(a_dict, key=a_dict.get, reverse=True)

        self.qlen_bins_to_idx = {
            '1-3': 0,
            '4-8': 1,
            '9-15': 2,
        }
        self.ans_rare_to_idx = {
            '0-99': 0,
            '100-299': 1,
            '300-999': 2,
        }
        self.qtypes_to_idx = {
            'what': 0,
            'who': 1,
            'how': 2,
            'when': 3,
            'where': 4,
        }

        if __C.RUN_MODE == 'train':
            self.ans_list = top_answers[:self.ans_size]
            self.ans_to_ix, self.ix_to_ans = ans_stat(self.ans_list)
            self.token_to_ix, self.pretrained_emb = tokenize(q_list, __C.USE_GLOVE)
            self.token_size = self.token_to_ix.__len__()
            print('== Question token vocab size:', self.token_size)

        self.idx_to_qtypes = {v: k for (k, v) in self.qtypes_to_idx.items()}
        self.idx_to_qlen_bins = {v: k for (k, v) in self.qlen_bins_to_idx.items()}
        self.idx_to_ans_rare = {v: k for (k, v) in self.ans_rare_to_idx.items()}

    def __getitem__(self, idx):
        sample = self.raw_data[idx]
        ques = sample['question']
        # The question type is the first word of the question
        q_type = self.qtypes_to_idx[ques.split(' ')[0]]
        ques_idx, qlen, _ = proc_ques(ques, self.token_to_ix, self.__C.MAX_TOKEN)
        qlen_bin = self.qlen_bins_to_idx[qlen_to_key(qlen)]

        answer = sample['answer']
        # Answers outside the answer vocabulary are mapped to a random class
        answer = self.ans_to_ix.get(answer, np.random.randint(0, high=len(self.ans_list)))
        ans_rarity = self.ans_rare_to_idx[ans_to_key(answer)]

        answer_one_hot = torch.zeros(self.ans_size)
        answer_one_hot[answer] = 1.0

        vid_id = sample['video_id']
        frames = torch.load(self.frames_dict[vid_id], map_location='cpu')
        clips = torch.load(self.clips_dict[vid_id], map_location='cpu')

        return torch.from_numpy(ques_idx).long(), frames, clips, answer_one_hot, torch.tensor(answer).long(), \
            torch.tensor(q_type).long(), torch.tensor(qlen_bin).long(), torch.tensor(ans_rarity).long()

    def __len__(self):
        return self.data_size

View file

@ -0,0 +1,182 @@
import argparse
import os
import pickle
import sys

# Make the `core` directory importable so that `model.C3D` can be resolved
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

import numpy as np
import skvideo.io as skv
import torch
import torch.nn as nn
import torchvision.transforms as transforms
import tqdm
from PIL import Image
from torchvision.models import vgg19

from model.C3D import C3D


def _select_frames(path, frame_num):
    """Select representative frames of a video.
    Some frames at the beginning and the end of the video are ignored.
    Args:
        path: Path of the video.
        frame_num: Number of frames to select.
    Returns:
        frames: list of frames.
    """
    frames = list()
    video_data = skv.vread(path)
    total_frames = video_data.shape[0]
    # Ignore some frames at the beginning and the end.
    for i in np.linspace(0, total_frames, frame_num + 2)[1:frame_num + 1]:
        frame_data = video_data[int(i)]
        img = Image.fromarray(frame_data)
        img = img.resize((224, 224), Image.BILINEAR)
        frame_data = np.array(img)
        frames.append(frame_data)
    return frames


def _select_clips(path, clip_num):
    """Select clip_num clips of a video. Each clip has 16 frames.
    Args:
        path: Path of the video.
        clip_num: Number of clips to select.
    Returns:
        clips: list of clips.
    """
    clips = list()
    # video_info = skvideo.io.ffprobe(path)
    video_data = skv.vread(path)
    total_frames = video_data.shape[0]
    height = video_data.shape[1]
    width = video_data.shape[2]
    for i in np.linspace(0, total_frames, clip_num + 2)[1:clip_num + 1]:
        # Select the center frame first, then include the surrounding frames
        clip_start = int(i) - 8
        clip_end = int(i) + 8
        if clip_start < 0:
            clip_end = clip_end - clip_start
            clip_start = 0
        if clip_end > total_frames:
            clip_start = clip_start - (clip_end - total_frames)
            clip_end = total_frames
        clip = video_data[clip_start:clip_end]
        new_clip = []
        for j in range(16):
            frame_data = clip[j]
            img = Image.fromarray(frame_data)
            img = img.resize((112, 112), Image.BILINEAR)
            frame_data = np.array(img) * 1.0
            # frame_data -= self.mean[j]
            new_clip.append(frame_data)
        clips.append(new_clip)
    return clips


def preprocess_videos(video_dir, frame_num, clip_num):
    # The sampled frames and clips are stored next to the video directory
    frames_dir = os.path.join(os.path.dirname(video_dir), 'frames')
    os.makedirs(frames_dir, exist_ok=True)
    clips_dir = os.path.join(os.path.dirname(video_dir), 'clips')
    os.makedirs(clips_dir, exist_ok=True)

    for video_name in tqdm.tqdm(os.listdir(video_dir)):
        video_path = os.path.join(video_dir, video_name)
        frames = _select_frames(video_path, frame_num)
        clips = _select_clips(video_path, clip_num)

        with open(os.path.join(frames_dir, video_name.split('.')[0] + '.pkl'), "wb") as f:
            pickle.dump(frames, f, protocol=pickle.HIGHEST_PROTOCOL)

        with open(os.path.join(clips_dir, video_name.split('.')[0] + '.pkl'), "wb") as f:
            pickle.dump(clips, f, protocol=pickle.HIGHEST_PROTOCOL)


def generate_video_features(path_frames, path_clips, c3d_path):
    device = torch.device('cuda:0')
    frame_feat_dir = os.path.join(os.path.dirname(path_frames), 'frame_feat')
    os.makedirs(frame_feat_dir, exist_ok=True)

    clip_feat_dir = os.path.join(os.path.dirname(path_frames), 'clip_feat')
    os.makedirs(clip_feat_dir, exist_ok=True)

    # VGG19 without its last fully connected layer yields 4096D frame features
    cnn = vgg19(pretrained=True)
    in_features = cnn.classifier[-1].in_features
    cnn.classifier = nn.Sequential(
        *list(cnn.classifier.children())[:-1])  # remove last fc layer
    cnn.to(device).eval()

    c3d = C3D()
    c3d.load_state_dict(torch.load(c3d_path))
    c3d.to(device).eval()

    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.485, 0.456, 0.406),
                                                         (0.229, 0.224, 0.225))])

    for vid_name in tqdm.tqdm(os.listdir(path_frames)):
        frame_path = os.path.join(path_frames, vid_name)
        clip_path = os.path.join(path_clips, vid_name)

        with open(frame_path, 'rb') as f:
            frames = pickle.load(f)
        with open(clip_path, 'rb') as f:
            clips = pickle.load(f)

        frames = [transform(f) for f in frames]
        frame_feat = []
        clip_feat = []

        for frame in frames:
            with torch.no_grad():
                feat = cnn(frame.unsqueeze(0).to(device))
            frame_feat.append(feat)

        for clip in clips:
            # Reorder (f x h x w x c) to (c x f x h x w) and add a batch dimension
            clip = torch.from_numpy(np.float32(np.array(clip)))
            clip = clip.transpose(3, 0)
            clip = clip.transpose(3, 1)
            clip = clip.transpose(3, 2).unsqueeze(0).to(device)
            with torch.no_grad():
                feat = c3d(clip)
            clip_feat.append(feat)

        frame_feat = torch.cat(frame_feat, dim=0)
        clip_feat = torch.cat(clip_feat, dim=0)

        torch.save(frame_feat, os.path.join(frame_feat_dir, vid_name.split('.')[0] + '.pt'))
        torch.save(clip_feat, os.path.join(clip_feat_dir, vid_name.split('.')[0] + '.pt'))


def parse_args():
    '''
    Parse input arguments
    '''
    parser = argparse.ArgumentParser(description='Preprocessing Args')

    parser.add_argument('--RAW_VID_PATH', dest='RAW_VID_PATH',
                        help='The path to the raw videos',
                        required=True,
                        type=str)

    # The two output-directory flags are currently unused: all outputs are
    # written next to RAW_VID_PATH. They are therefore optional.
    parser.add_argument('--FRAMES_OUTPUT_DIR', dest='FRAMES_OUTPUT_DIR',
                        help='The directory where the processed frames and their features will be stored (currently unused)',
                        required=False,
                        type=str)

    parser.add_argument('--CLIPS_OUTPUT_DIR', dest='CLIPS_OUTPUT_DIR',
                        help='The directory where the processed clips and their features will be stored (currently unused)',
                        required=False,
                        type=str)

    parser.add_argument('--C3D_PATH', dest='C3D_PATH',
                        help='Pretrained C3D path',
                        required=True,
                        type=str)

    parser.add_argument('--NUM_SAMPLES', dest='NUM_SAMPLES',
                        help='The number of frames/clips to be sampled from the video',
                        default=20,
                        type=int)

    args = parser.parse_args()
    return args


if __name__ == '__main__':
    args = parse_args()
    preprocess_videos(args.RAW_VID_PATH, args.NUM_SAMPLES, args.NUM_SAMPLES)
    frames_dir = os.path.join(os.path.dirname(args.RAW_VID_PATH), 'frames')
    clips_dir = os.path.join(os.path.dirname(args.RAW_VID_PATH), 'clips')
    generate_video_features(frames_dir, clips_dir, args.C3D_PATH)

81
code/core/data/utils.py Normal file
View file

@ -0,0 +1,81 @@
import en_vectors_web_lg, random, re, json
import numpy as np


def tokenize(ques_list, use_glove):
    token_to_ix = {
        'PAD': 0,
        'UNK': 1,
    }
    spacy_tool = None
    pretrained_emb = []
    if use_glove:
        spacy_tool = en_vectors_web_lg.load()
        pretrained_emb.append(spacy_tool('PAD').vector)
        pretrained_emb.append(spacy_tool('UNK').vector)
    for ques in ques_list:
        # Lower-case, strip punctuation, and split on whitespace, '-' and '/'
        words = re.sub(
            r"([.,'!?\"()*#:;])",
            '',
            ques.lower()
        ).replace('-', ' ').replace('/', ' ').split()
        for word in words:
            if word not in token_to_ix:
                token_to_ix[word] = len(token_to_ix)
                if use_glove:
                    pretrained_emb.append(spacy_tool(word).vector)
    pretrained_emb = np.array(pretrained_emb)
    return token_to_ix, pretrained_emb
def proc_ques(ques, token_to_ix, max_token):
    ques_ix = np.zeros(max_token, np.int64)
    words = re.sub(
        r"([.,'!?\"()*#:;])",
        '',
        ques.lower()
    ).replace('-', ' ').replace('/', ' ').split()
    # q_len counts only the in-vocabulary words among the first max_token words
    q_len = 0
    for ix, word in enumerate(words):
        if word in token_to_ix:
            ques_ix[ix] = token_to_ix[word]
            q_len += 1
        else:
            ques_ix[ix] = token_to_ix['UNK']
        if ix + 1 == max_token:
            break
    return ques_ix, q_len, len(words)
def ans_stat(ans_list):
    ans_to_ix, ix_to_ans = {}, {}
    for i, ans in enumerate(ans_list):
        ans_to_ix[ans] = i
        ix_to_ans[i] = ans
    return ans_to_ix, ix_to_ans


def shuffle_list(ans_list):
    random.shuffle(ans_list)


def qlen_to_key(q_len):
    if 1 <= q_len <= 3:
        return '1-3'
    if 4 <= q_len <= 8:
        return '4-8'
    if 9 <= q_len:
        return '9-15'


def ans_to_key(ans_idx):
    if 0 <= ans_idx <= 99:
        return '0-99'
    if 100 <= ans_idx <= 299:
        return '100-299'
    if 300 <= ans_idx <= 999:
        return '300-999'
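The question pipeline in `utils.py` (normalize, build a vocabulary, map words to indices) can be tried without spaCy or NumPy. The following is a simplified sketch, not the repository's implementation: it reuses the same punctuation pattern, but `normalize`, `build_vocab` and `encode` are hypothetical names, the GloVe embeddings are omitted, and a plain list stands in for the NumPy index array.

```python
import re

# Same punctuation pattern as in utils.py; the helper names below are hypothetical.
PUNCT = re.compile(r"([.,'!?\"()*#:;])")


def normalize(ques):
    """Lower-case, strip punctuation, and split on whitespace, '-' and '/'."""
    return PUNCT.sub('', ques.lower()).replace('-', ' ').replace('/', ' ').split()


def build_vocab(ques_list):
    """Assign indices in first-seen order; 0 and 1 are reserved for PAD and UNK."""
    token_to_ix = {'PAD': 0, 'UNK': 1}
    for ques in ques_list:
        for word in normalize(ques):
            token_to_ix.setdefault(word, len(token_to_ix))
    return token_to_ix


def encode(ques, token_to_ix, max_token):
    """Map a question to a fixed-length list of indices, padded with PAD (0)."""
    ix = [token_to_ix['PAD']] * max_token
    for i, word in enumerate(normalize(ques)[:max_token]):
        ix[i] = token_to_ix.get(word, token_to_ix['UNK'])
    return ix


vocab = build_vocab(["What is the man doing?", "Who is riding a horse?"])
encoded = encode("What is the dog doing?", vocab, max_token=8)
print(encoded)
```

Words unseen when the vocabulary was built ("dog" here) map to the UNK index, and positions past the end of the question stay at the PAD index.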

523
code/core/exec.py Normal file

@ -0,0 +1,523 @@
# --------------------------------------------------------
# mcan-vqa (Deep Modular Co-Attention Networks)
# Licensed under The MIT License [see LICENSE for details]
# Written by Yuhao Cui https://github.com/cuiyuhao1996
# --------------------------------------------------------
from core.data.dataset import VideoQA_Dataset
from core.model.net import Net1, Net2, Net3, Net4
from core.model.optim import get_optim, adjust_lr
from core.metrics import get_acc
from tqdm import tqdm
from core.data.utils import shuffle_list
import os, json, torch, datetime, pickle, copy, shutil, time, math
import numpy as np
import torch.nn as nn
import torch.utils.data as Data
from tensorboardX import SummaryWriter
from torch.autograd import Variable as var
class Execution:
    def __init__(self, __C):
        self.__C = __C
        print('Loading training set ........')
        __C_train = copy.deepcopy(self.__C)
        setattr(__C_train, 'RUN_MODE', 'train')
        self.dataset = VideoQA_Dataset(__C_train)
        self.dataset_eval = None
        if self.__C.EVAL_EVERY_EPOCH:
            __C_eval = copy.deepcopy(self.__C)
            setattr(__C_eval, 'RUN_MODE', 'val')
            print('Loading validation set for per-epoch evaluation ........')
            self.dataset_eval = VideoQA_Dataset(__C_eval)
            # Share the answer and token vocabularies of the training set
            self.dataset_eval.ans_list = self.dataset.ans_list
            self.dataset_eval.ans_to_ix, self.dataset_eval.ix_to_ans = self.dataset.ans_to_ix, self.dataset.ix_to_ans
            self.dataset_eval.token_to_ix, self.dataset_eval.pretrained_emb = self.dataset.token_to_ix, self.dataset.pretrained_emb
        __C_test = copy.deepcopy(self.__C)
        setattr(__C_test, 'RUN_MODE', 'test')
        self.dataset_test = VideoQA_Dataset(__C_test)
        self.dataset_test.ans_list = self.dataset.ans_list
        self.dataset_test.ans_to_ix, self.dataset_test.ix_to_ans = self.dataset.ans_to_ix, self.dataset.ix_to_ans
        self.dataset_test.token_to_ix, self.dataset_test.pretrained_emb = self.dataset.token_to_ix, self.dataset.pretrained_emb
        self.writer = SummaryWriter(self.__C.TB_PATH)
    def train(self, dataset, dataset_eval=None):
        # Obtain needed information
        data_size = dataset.data_size
        token_size = dataset.token_size
        ans_size = dataset.ans_size
        pretrained_emb = dataset.pretrained_emb
        net = self.construct_net(self.__C.MODEL_TYPE)
        if os.path.isfile(self.__C.PRETRAINED_PATH) and self.__C.MODEL_TYPE == 11:
            print('Loading pretrained DNC-weights')
            net.load_pretrained_weights()
        net.cuda()
        net.train()
        # Define the multi-gpu training if needed
        if self.__C.N_GPU > 1:
            net = nn.DataParallel(net, device_ids=self.__C.DEVICES)
        # Define the binary cross entropy loss
        # loss_fn = torch.nn.BCELoss(size_average=False).cuda()
        loss_fn = torch.nn.BCELoss(reduction='sum').cuda()
        # Load checkpoint when resuming training
        if self.__C.RESUME:
            print(' ========== Resume training')
            if self.__C.CKPT_PATH is not None:
                print('Warning: you are now using CKPT_PATH args, '
                      'CKPT_VERSION and CKPT_EPOCH will not work')
                path = self.__C.CKPT_PATH
            else:
                path = self.__C.CKPTS_PATH + \
                       'ckpt_' + self.__C.CKPT_VERSION + \
                       '/epoch' + str(self.__C.CKPT_EPOCH) + '.pkl'
            # Load the network parameters
            print('Loading ckpt {}'.format(path))
            ckpt = torch.load(path)
            print('Finish!')
            net.load_state_dict(ckpt['state_dict'])
            # Load the optimizer parameters
            optim = get_optim(self.__C, net, data_size, ckpt['optim'], lr_base=ckpt['lr_base'])
            optim._step = int(data_size / self.__C.BATCH_SIZE * self.__C.CKPT_EPOCH)
            optim.optimizer.load_state_dict(ckpt['optimizer'])
            start_epoch = self.__C.CKPT_EPOCH
        else:
            # Start from scratch: remove any stale checkpoint directory for this version
            if ('ckpt_' + self.__C.VERSION) in os.listdir(self.__C.CKPTS_PATH):
                shutil.rmtree(self.__C.CKPTS_PATH + 'ckpt_' + self.__C.VERSION)
            os.mkdir(self.__C.CKPTS_PATH + 'ckpt_' + self.__C.VERSION)
            optim = get_optim(self.__C, net, data_size, self.__C.OPTIM)
            start_epoch = 0
        loss_sum = 0
        named_params = list(net.named_parameters())
        grad_norm = np.zeros(len(named_params))
        # Define multi-thread dataloader
        if self.__C.SHUFFLE_MODE in ['external']:
            dataloader = Data.DataLoader(
                dataset,
                batch_size=self.__C.BATCH_SIZE,
                shuffle=False,
                num_workers=self.__C.NUM_WORKERS,
                pin_memory=self.__C.PIN_MEM,
                drop_last=True
            )
        else:
            dataloader = Data.DataLoader(
                dataset,
                batch_size=self.__C.BATCH_SIZE,
                shuffle=True,
                num_workers=self.__C.NUM_WORKERS,
                pin_memory=self.__C.PIN_MEM,
                drop_last=True
            )
        # Training script
        for epoch in range(start_epoch, self.__C.MAX_EPOCH):
            # Save log information
            logfile = open(
                self.__C.LOG_PATH +
                'log_run_' + self.__C.VERSION + '.txt',
                'a+'
            )
            logfile.write(
                'nowTime: ' +
                datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S') +
                '\n'
            )
            logfile.close()
            # Learning Rate Decay
            if epoch in self.__C.LR_DECAY_LIST:
                adjust_lr(optim, self.__C.LR_DECAY_R)
            # Externally shuffle
            if self.__C.SHUFFLE_MODE == 'external':
                shuffle_list(dataset.ans_list)
            time_start = time.time()
            # Iteration
            for step, (
                    ques_ix_iter,
                    frames_feat_iter,
                    clips_feat_iter,
                    ans_iter,
                    _,
                    _,
                    _,
                    _
            ) in enumerate(dataloader):
                ques_ix_iter = ques_ix_iter.cuda()
                frames_feat_iter = frames_feat_iter.cuda()
                clips_feat_iter = clips_feat_iter.cuda()
                ans_iter = ans_iter.cuda()
                optim.zero_grad()
                # Split the batch into sub-batches for gradient accumulation
                for accu_step in range(self.__C.GRAD_ACCU_STEPS):
                    sub_frames_feat_iter = \
                        frames_feat_iter[accu_step * self.__C.SUB_BATCH_SIZE:
                                         (accu_step + 1) * self.__C.SUB_BATCH_SIZE]
                    sub_clips_feat_iter = \
                        clips_feat_iter[accu_step * self.__C.SUB_BATCH_SIZE:
                                        (accu_step + 1) * self.__C.SUB_BATCH_SIZE]
                    sub_ques_ix_iter = \
                        ques_ix_iter[accu_step * self.__C.SUB_BATCH_SIZE:
                                     (accu_step + 1) * self.__C.SUB_BATCH_SIZE]
                    sub_ans_iter = \
                        ans_iter[accu_step * self.__C.SUB_BATCH_SIZE:
                                 (accu_step + 1) * self.__C.SUB_BATCH_SIZE]
                    pred = net(
                        sub_frames_feat_iter,
                        sub_clips_feat_iter,
                        sub_ques_ix_iter
                    )
loss = loss_fn