Initial commit
commit
b5f3b728c3
124
README.md
Normal file
|
@ -0,0 +1,124 @@
|
|||
This is the official code of the paper **Video Language Co-Attention with Fast-Learning Feature Fusion for VideoQA**.
|
||||
If you find our code useful, please cite our paper:
|
||||
|
||||
# Overview
|
||||
<p align="center"><img src="assets/overview_project_one.png" alt="drawing" width="600" height="400"/></p>
|
||||
|
||||
# Results
|
||||
Our VLCN model achieves **new** state-of-the-art results on two open-ended VideoQA datasets, **MSVD-QA** and **MSRVTT-QA**.
|
||||
#### MSVD-QA
| <center>Model</center> | <center>What</center> | <center>Who</center> | <center>How</center> | <center>When</center> | <center>Where</center> | <center>All</center> |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| ST-VQA | 18.10 | 50.00 | **83.80** | 72.40 | 28.60 | 31.30 |
| Co-Mem | 19.60 | 48.70 | 81.60 | 74.10 | 31.70 | 31.70 |
| HMEMA | 22.40 | 50.00 | 73.00 | 70.70 | 42.90 | 33.70 |
| SSML | - | - | - | - | - | 35.13 |
| QueST | 24.50 | **52.90** | 79.10 | 72.40 | **50.00** | 36.10 |
| HCRN | - | - | - | - | - | 36.10 |
| MA-DRNN | 24.30 | 51.60 | 82.00 | **86.30** | 26.30 | 36.20 |
| **VLCN (Ours)** | **28.42** | 51.29 | 81.08 | 74.13 | 46.43 | **38.06** |
|
||||
|
||||
#### MSRVTT-QA
| <center>Model</center> | <center>What</center> | <center>Who</center> | <center>How</center> | <center>When</center> | <center>Where</center> | <center>All</center> |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| ST-VQA | 24.50 | 41.20 | 78.00 | 76.50 | 34.90 | 30.90 |
| Co-Mem | 23.90 | 42.50 | 74.10 | 69.00 | **42.90** | 32.00 |
| HMEMA | 22.40 | **50.10** | 73.00 | 70.70 | 42.90 | 33.70 |
| QueST | 27.90 | 45.60 | **83.00** | 75.70 | 31.60 | 34.60 |
| SSML | - | - | - | - | - | 35.00 |
| HCRN | - | - | - | - | - | 35.60 |
| **VLCN (Ours)** | **30.69** | 44.09 | 79.82 | **78.29** | 36.80 | **36.01** |
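
As a quick arithmetic check of the comparison above, the margin of VLCN over the strongest baseline can be computed from the "All" columns. The snippet is only an illustration: the dictionary and function names are ours (not part of the repository), and the numbers are copied from the two tables.

```python
# Overall ("All") accuracies copied from the two result tables.
msvd_all = {"ST-VQA": 31.30, "Co-Mem": 31.70, "HMEMA": 33.70, "SSML": 35.13,
            "QueST": 36.10, "HCRN": 36.10, "MA-DRNN": 36.20, "VLCN": 38.06}
msrvtt_all = {"ST-VQA": 30.90, "Co-Mem": 32.00, "HMEMA": 33.70, "QueST": 34.60,
              "SSML": 35.00, "HCRN": 35.60, "VLCN": 36.01}


def margin_over_best_baseline(scores, ours="VLCN"):
    """Return (best baseline name, gain of `ours` over it in accuracy points)."""
    baselines = {name: acc for name, acc in scores.items() if name != ours}
    best = max(baselines, key=baselines.get)
    return best, round(scores[ours] - baselines[best], 2)


print(margin_over_best_baseline(msvd_all))    # ('MA-DRNN', 1.86)
print(margin_over_best_baseline(msrvtt_all))  # ('HCRN', 0.41)
```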
|
||||
|
||||
# Requirements
|
||||
- PyTorch 1.3.1<br/>
- Torchvision 0.4.2<br/>
- Python 3.6
|
||||
|
||||
# Raw data
|
||||
The raw data of MSVD-QA and MSRVTT-QA are located in ``data/MSVD-QA`` and ``data/MSRVTT-QA``, respectively.<br/>
|
||||
|
||||
**Videos:** The raw videos of MSVD-QA and MSRVTT-QA can be downloaded from [⬇](https://www.cs.utexas.edu/users/ml/clamp/videoDescription/) and [⬇](https://www.mediafire.com/folder/h14iarbs62e7p/shared), respectively.<br/>
|
||||
**Text:** The text data can be downloaded from [⬇](https://github.com/xudejing/video-question-answering).<br/>
|
||||
|
||||
After downloading all the raw data, ``data/MSVD-QA`` and ``data/MSRVTT-QA`` should have the following structure:
|
||||
<p align="center"><img src="assets/structure.png" alt="Expected directory structure of the raw data" /></p>
|
||||
|
||||
# Preprocessing
|
||||
To sample the individual frames and clips and generate the corresponding visual features, we run the script ``preprocess.py`` on the raw videos with the appropriate flags. E.g., for MSVD-QA we have to execute
```bash
python core/data/preprocess.py --RAW_VID_PATH /data/MSVD-QA/videos --C3D_PATH path_to_pretrained_c3d
```
|
||||
This will save the individual frames and clips in ``data/MSVD-QA/frames`` and ``data/MSVD-QA/clips``, respectively, and their visual features in ``data/MSVD-QA/frame_feat`` and ``data/MSVD-QA/clip_feat``, respectively.
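
Before moving on, it can help to confirm that preprocessing produced the four directories listed above. The following is a minimal sketch and not part of the repository: the helper name `missing_outputs` is hypothetical, and it only checks that each directory exists and is non-empty.

```python
from pathlib import Path

# Sub-directories that the preprocessing step is described as producing.
EXPECTED_SUBDIRS = ("frames", "clips", "frame_feat", "clip_feat")


def missing_outputs(dataset_root):
    """Return the expected sub-directories that are absent or empty."""
    root = Path(dataset_root)
    missing = []
    for name in EXPECTED_SUBDIRS:
        sub = root / name
        # A directory counts as present only if it contains at least one entry.
        if not sub.is_dir() or not any(sub.iterdir()):
            missing.append(name)
    return missing


# An empty list means all four directories exist and contain files.
print(missing_outputs("data/MSVD-QA"))
```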
|
||||
|
||||
# Config files
|
||||
Before starting training, one has to update the config path file ``cfgs/path_cfgs.py`` with the paths of the raw data as well as the visual features.<br/>
All hyperparameters can be adjusted in ``cfgs/base_cfgs.py``.
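
The helper methods `parse_to_dict` and `add_args` in `cfgs/base_cfgs.py` suggest that defaults can also be overridden at run time rather than by editing the file. The toy class below sketches that pattern under this assumption. `TinyCfgs` and the example values are hypothetical, only two defaults are reproduced, and `vars()` is used here as a simplification of the attribute scan in the original helper.

```python
from types import SimpleNamespace


class TinyCfgs:
    """Toy stand-in for the Cfgs class; only two defaults are reproduced."""

    def __init__(self):
        self.BATCH_SIZE = 64
        self.LR_BASE = 0.0001

    def parse_to_dict(self, args):
        # Keep only public attributes that were actually set (not None).
        return {name: value for name, value in vars(args).items()
                if not name.startswith('_') and value is not None}

    def add_args(self, args_dict):
        # Each provided value overwrites the default of the same name.
        for name, value in args_dict.items():
            setattr(self, name, value)


cfg = TinyCfgs()
cli_args = SimpleNamespace(BATCH_SIZE=32, LR_BASE=None)  # hypothetical CLI values
cfg.add_args(cfg.parse_to_dict(cli_args))
print(cfg.BATCH_SIZE, cfg.LR_BASE)  # 32 0.0001
```

Options left at `None` keep their defaults, which is why only `BATCH_SIZE` changes in this example.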
|
||||
|
||||
# Training
|
||||
To start training, one has to specify an experiment directory ``EXP_NAME`` where all the results (log files, checkpoints, tensorboard files, etc.) will be saved. Furthermore, one needs to specify the ``MODEL_TYPE`` of the VLCN to be trained.
|
||||
| <center>MODEL_TYPE</center> | <center>Description</center> |
| :---: | :---: |
| 1 | VLCN |
| 2 | VLCN-FLF |
| 3 | VLCN+LSTM |
| 4 | MCAN |
|
||||
|
||||
These parameters can be set inline, e.g., by executing
```bash
python run.py --EXP_NAME experiment --MODEL_TYPE 1 --DATA_PATH /data/MSVD-QA --GPU 1 --SEED 42
```
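
For illustration, the flags used in the command above could be declared with `argparse` roughly as follows. This is a hypothetical sketch: the actual `run.py` is not reproduced here and may define different types, defaults, or additional options. The only details taken from the repository are the flag names, the four `MODEL_TYPE` values, and the fact that `GPU` is stored as a string in `cfgs/base_cfgs.py`.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags shown in the README command."""
    parser = argparse.ArgumentParser(description="VLCN training (illustrative sketch)")
    parser.add_argument("--EXP_NAME", type=str, required=True,
                        help="experiment directory for logs and checkpoints")
    parser.add_argument("--MODEL_TYPE", type=int, choices=(1, 2, 3, 4), required=True,
                        help="model variant, see the MODEL_TYPE table")
    parser.add_argument("--DATA_PATH", type=str, required=True,
                        help="dataset root directory")
    parser.add_argument("--GPU", type=str, default="0",
                        help="GPU id(s) as a string")
    parser.add_argument("--SEED", type=int, default=None,
                        help="random seed; None leaves the default in place")
    return parser


# Parse the same arguments as the example command, passed as a list.
args = build_parser().parse_args(
    ["--EXP_NAME", "experiment", "--MODEL_TYPE", "1",
     "--DATA_PATH", "/data/MSVD-QA", "--GPU", "1", "--SEED", "42"])
print(args.MODEL_TYPE, args.SEED)  # 1 42
```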
|
||||
# Pre-trained models
|
||||
Our pre-trained models are available here [⬇](https://drive.google.com/drive/folders/172yj4iUkF1U1WOPdA5KuKOTQXkgzFEzS)
|
||||
|
||||
# Acknowledgements
|
||||
We thank the Vision and Language Group @ MIL for their [MCAN](https://github.com/MILVLG/mcan-vqa) open-source implementation, [DavideA](https://github.com/DavideA/c3d-pytorch/blob/master/C3D_model.py) for his pretrained C3D model, and finally [ixaxaar](https://github.com/ixaxaar/pytorch-dnc) for his DNC implementation.
|
0
assets/.gitkeep
Normal file
BIN
assets/overview_project_one.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 314 KiB |
BIN
assets/structure.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 18 KiB |
0
cfgs/.gitkeep
Normal file
267
cfgs/base_cfgs.py
Normal file
|
@ -0,0 +1,267 @@
|
|||
# --------------------------------------------------------
# mcan-vqa (Deep Modular Co-Attention Networks)
# Licensed under The MIT License [see LICENSE for details]
# Written by Yuhao Cui https://github.com/cuiyuhao1996
# --------------------------------------------------------

from cfgs.path_cfgs import PATH

import os, torch, random
import numpy as np
from types import MethodType


class Cfgs(PATH):
    def __init__(self, EXP_NAME, DATASET_PATH):
        super(Cfgs, self).__init__(EXP_NAME, DATASET_PATH)

        # Set Devices
        # For multi-GPU training, set e.g. '0, 1, 2' instead
        self.GPU = '0'

        # Set RNG For CPU And GPUs
        self.SEED = random.randint(0, 99999999)

        # -------------------------
        # ---- Version Control ----
        # -------------------------

        # Define a specific name to start new training
        # self.VERSION = 'Anonymous_' + str(self.SEED)
        self.VERSION = str(self.SEED)

        # Resume training
        self.RESUME = False

        # Used in Resume training and testing
        self.CKPT_VERSION = self.VERSION
        self.CKPT_EPOCH = 0

        # Absolute checkpoint path; 'CKPT_VERSION' and 'CKPT_EPOCH' will be overridden
        self.CKPT_PATH = None

        # Print loss every step
        self.VERBOSE = True
|
||||
|
||||
|
||||
        # ------------------------------
        # ---- Data Provider Params ----
        # ------------------------------

        # {'train', 'val', 'test'}
        self.RUN_MODE = 'train'

        # Set True to evaluate offline
        self.EVAL_EVERY_EPOCH = True

        # # Define the 'train' 'val' 'test' data split
        # # (EVAL_EVERY_EPOCH triggered when set {'train': 'train'})
        # self.SPLIT = {
        #     'train': '',
        #     'val': 'val',
        #     'test': 'test',
        # }
        # # An external method to set train split
        # self.TRAIN_SPLIT = 'train+val+vg'

        # Set True to use pretrained word embedding
        # (GloVe: spaCy https://spacy.io/)
        self.USE_GLOVE = True

        # Word embedding matrix size
        # (token size x WORD_EMBED_SIZE)
        self.WORD_EMBED_SIZE = 300

        # Max length of question sentences
        self.MAX_TOKEN = 15

        # VGG 4096D features
        self.FRAME_FEAT_SIZE = 4096

        # C3D 4096D features
        self.CLIP_FEAT_SIZE = 4096

        self.NUM_ANS = 1000

        # Default training batch size: 64
        self.BATCH_SIZE = 64

        # Multi-thread I/O
        self.NUM_WORKERS = 8

        # Use pin memory
        # (Warning: pin memory can accelerate GPU loading but may
        # increase the CPU memory usage when NUM_WORKERS is large)
        self.PIN_MEM = True

        # Large models cannot be trained with batch size 64.
        # Gradient accumulation splits a batch to reduce GPU memory usage
        # (Warning: BATCH_SIZE must be divisible by GRAD_ACCU_STEPS)
        self.GRAD_ACCU_STEPS = 1

        # Set 'external': use external shuffle method to implement training shuffle
        # Set 'internal': use pytorch dataloader default shuffle method
        self.SHUFFLE_MODE = 'external'
|
||||
|
||||
|
||||
        # ------------------------
        # ---- Network Params ----
        # ------------------------

        # Model depth
        # (Encoder and Decoder have the same depth)
        self.LAYER = 6

        # Model hidden size
        # (512 as default; larger values sharply increase GPU memory usage)
        self.HIDDEN_SIZE = 512

        # Multi-head number in MCA layers
        # (Warning: HIDDEN_SIZE must be divisible by MULTI_HEAD)
        self.MULTI_HEAD = 8

        # Dropout rate for all dropout layers
        # (dropout can prevent overfitting: [Dropout: a simple way to prevent neural networks from overfitting])
        self.DROPOUT_R = 0.1

        # MLP size in flatten layers
        self.FLAT_MLP_SIZE = 512

        # Flatten the last hidden to vector with {n} attention glimpses
        self.FLAT_GLIMPSES = 1
        self.FLAT_OUT_SIZE = 1024
|
||||
|
||||
|
||||
        # --------------------------
        # ---- Optimizer Params ----
        # --------------------------

        # The base learning rate
        self.LR_BASE = 0.0001

        # Learning rate decay ratio
        self.LR_DECAY_R = 0.2

        # Learning rate decay at {x, y, z...} epoch
        self.LR_DECAY_LIST = [10, 12]

        # Max training epoch
        self.MAX_EPOCH = 30

        # Gradient clip
        # (default: -1 means not using)
        self.GRAD_NORM_CLIP = -1

        # Adam optimizer betas and eps
        self.OPT_BETAS = (0.9, 0.98)
        self.OPT_EPS = 1e-9
        self.OPT_WEIGHT_DECAY = 1e-5
        # --------------------------
        # ---- DNC Hyper-Params ----
        # --------------------------
        self.IN_SIZE_DNC = self.HIDDEN_SIZE
        self.OUT_SIZE_DNC = self.HIDDEN_SIZE
        self.WORD_LENGTH_DNC = 512
        self.CELL_COUNT_DNC = 64
        self.MEM_HIDDEN_SIZE = self.CELL_COUNT_DNC * self.WORD_LENGTH_DNC
        self.N_READ_HEADS_DNC = 4
|
||||
|
||||
    def parse_to_dict(self, args):
        args_dict = {}
        for arg in dir(args):
            if not arg.startswith('_') and not isinstance(getattr(args, arg), MethodType):
                if getattr(args, arg) is not None:
                    args_dict[arg] = getattr(args, arg)

        return args_dict


    def add_args(self, args_dict):
        for arg in args_dict:
            setattr(self, arg, args_dict[arg])
|
||||
|
||||
|
||||
    def proc(self):
        assert self.RUN_MODE in ['train', 'val', 'test']

        # ------------ Devices setup
        # os.environ['CUDA_VISIBLE_DEVICES'] = self.GPU
        self.N_GPU = len(self.GPU.split(','))
        self.DEVICES = [_ for _ in range(self.N_GPU)]
        torch.set_num_threads(2)


        # ------------ Seed setup
        # fix pytorch seed
        torch.manual_seed(self.SEED)
        if self.N_GPU < 2:
            torch.cuda.manual_seed(self.SEED)
        else:
            torch.cuda.manual_seed_all(self.SEED)
        torch.backends.cudnn.deterministic = True

        # fix numpy seed
        np.random.seed(self.SEED)

        # fix random seed
        random.seed(self.SEED)

        if self.CKPT_PATH is not None:
            print('Warning: you are now using CKPT_PATH args, '
                  'CKPT_VERSION and CKPT_EPOCH will not work')
            self.CKPT_VERSION = self.CKPT_PATH.split('/')[-1] + '_' + str(random.randint(0, 99999999))


        # ------------ Split setup
        # NOTE: SPLIT and TRAIN_SPLIT are commented out in __init__, so they
        # must be set elsewhere (e.g. via add_args) before proc() is called.
        self.SPLIT['train'] = self.TRAIN_SPLIT
        if 'val' in self.SPLIT['train'].split('+') or self.RUN_MODE not in ['train']:
            self.EVAL_EVERY_EPOCH = False

        if self.RUN_MODE not in ['test']:
            self.TEST_SAVE_PRED = False


        # ------------ Gradient accumulate setup
        assert self.BATCH_SIZE % self.GRAD_ACCU_STEPS == 0
        self.SUB_BATCH_SIZE = int(self.BATCH_SIZE / self.GRAD_ACCU_STEPS)

        # Using a small eval batch reduces GPU memory usage
        self.EVAL_BATCH_SIZE = 32


        # ------------ Networks setup
        # FeedForwardNet size in every MCA layer
        self.FF_SIZE = int(self.HIDDEN_SIZE * 4)
        self.FF_MEM_SIZE = int()  # int() evaluates to 0

        # Per-head hidden size used in the attention computation
        assert self.HIDDEN_SIZE % self.MULTI_HEAD == 0
        self.HIDDEN_SIZE_HEAD = int(self.HIDDEN_SIZE / self.MULTI_HEAD)
|
||||
|
||||
|
||||
    def __str__(self):
        for attr in dir(self):
            if not attr.startswith('__') and not isinstance(getattr(self, attr), MethodType):
                print('{ %-17s }->' % attr, getattr(self, attr))

        return ''

    def check_path(self):
        print('Checking dataset ...')


        if not os.path.exists(self.FRAMES):
            print(self.FRAMES + ' NOT EXIST')
            exit(-1)

        if not os.path.exists(self.CLIPS):
            print(self.CLIPS + ' NOT EXIST')
            exit(-1)

        for mode in self.QA_PATH:
            if not os.path.exists(self.QA_PATH[mode]):
                print(self.QA_PATH[mode] + ' NOT EXIST')
                exit(-1)

        print('Finished')
        print('')
|
6
cfgs/fusion_cfgs.yml
Normal file
|
@ -0,0 +1,6 @@
|
|||
CONTROLLER_INPUT_SIZE: 512
CONTROLLER_HIDDEN_SIZE: 512
CONTROLLER_NUM_LAYERS: 2
HIDDEN_DIM_COMP: 1024
OUT_DIM_COMP: 512
COMP_NUM_LAYERS: 2
|
61
cfgs/path_cfgs.py
Normal file
|
@ -0,0 +1,61 @@
|
|||
# --------------------------------------------------------
# mcan-vqa (Deep Modular Co-Attention Networks)
# Licensed under The MIT License [see LICENSE for details]
# Written by Yuhao Cui https://github.com/cuiyuhao1996
# --------------------------------------------------------

import os

class PATH:
    def __init__(self, EXP_NAME, DATASET_PATH):
        # name of the experiment
        self.EXP_NAME = EXP_NAME

        # Dataset root path
        self.DATASET_PATH = DATASET_PATH

        # Frame and clip feature directories
        self.FRAMES = os.path.join(DATASET_PATH, 'frame_feat/')
        self.CLIPS = os.path.join(DATASET_PATH, 'clip_feat/')
|
||||
|
||||
|
||||
    def init_path(self):
        # os.path.join works whether or not DATASET_PATH ends with a slash
        self.QA_PATH = {
            'train': os.path.join(self.DATASET_PATH, 'train_qa.json'),
            'val': os.path.join(self.DATASET_PATH, 'val_qa.json'),
            'test': os.path.join(self.DATASET_PATH, 'test_qa.json'),
        }
        self.C3D_PATH = os.path.join(self.DATASET_PATH, 'c3d.pickle')

        if self.EXP_NAME not in os.listdir('./'):
            os.mkdir('./' + self.EXP_NAME)
            os.mkdir('./' + self.EXP_NAME + '/results')
        self.RESULT_PATH = './' + self.EXP_NAME + '/results/result_test/'
        self.PRED_PATH = './' + self.EXP_NAME + '/results/pred/'
        self.CACHE_PATH = './' + self.EXP_NAME + '/results/cache/'
        self.LOG_PATH = './' + self.EXP_NAME + '/results/log/'
        self.TB_PATH = './' + self.EXP_NAME + '/results/tensorboard/'
        self.CKPTS_PATH = './' + self.EXP_NAME + '/ckpts/'

        if 'result_test' not in os.listdir('./' + self.EXP_NAME + '/results'):
            os.mkdir('./' + self.EXP_NAME + '/results/result_test/')

        if 'pred' not in os.listdir('./' + self.EXP_NAME + '/results'):
            os.mkdir('./' + self.EXP_NAME + '/results/pred/')

        if 'cache' not in os.listdir('./' + self.EXP_NAME + '/results'):
            os.mkdir('./' + self.EXP_NAME + '/results/cache')

        if 'log' not in os.listdir('./' + self.EXP_NAME + '/results'):
            os.mkdir('./' + self.EXP_NAME + '/results/log')

        if 'tensorboard' not in os.listdir('./' + self.EXP_NAME + '/results'):
            os.mkdir('./' + self.EXP_NAME + '/results/tensorboard')

        if 'ckpts' not in os.listdir('./' + self.EXP_NAME):
            os.mkdir('./' + self.EXP_NAME + '/ckpts')


    def check_path(self):
        raise NotImplementedError
|
||||
|
13
cfgs/small_model.yml
Normal file
|
@ -0,0 +1,13 @@
|
|||
LAYER: 6
HIDDEN_SIZE: 512
MEM_HIDDEN_SIZE: 2048
MULTI_HEAD: 8
DROPOUT_R: 0.1
FLAT_MLP_SIZE: 512
FLAT_GLIMPSES: 1
FLAT_OUT_SIZE: 1024
LR_BASE: 0.0001
LR_DECAY_R: 0.2
GRAD_ACCU_STEPS: 1
CKPT_VERSION: 'small'
CKPT_EPOCH: 13
|
0
code/.gitkeep
Normal file
0
code/assets/.gitkeep
Normal file
BIN
code/assets/structure.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 18 KiB |
0
code/cfgs/.gitkeep
Normal file
267
code/cfgs/base_cfgs.py
Normal file
|
@ -0,0 +1,267 @@
|
|||
# --------------------------------------------------------
# mcan-vqa (Deep Modular Co-Attention Networks)
# Licensed under The MIT License [see LICENSE for details]
# Written by Yuhao Cui https://github.com/cuiyuhao1996
# --------------------------------------------------------

from cfgs.path_cfgs import PATH

import os, torch, random
import numpy as np
from types import MethodType


class Cfgs(PATH):
    def __init__(self, EXP_NAME, DATASET_PATH):
        super(Cfgs, self).__init__(EXP_NAME, DATASET_PATH)

        # Set Devices
        # For multi-GPU training, set e.g. '0, 1, 2' instead
        self.GPU = '0'

        # Set RNG For CPU And GPUs
        self.SEED = random.randint(0, 99999999)

        # -------------------------
        # ---- Version Control ----
        # -------------------------

        # Define a specific name to start new training
        # self.VERSION = 'Anonymous_' + str(self.SEED)
        self.VERSION = str(self.SEED)

        # Resume training
        self.RESUME = False

        # Used in Resume training and testing
        self.CKPT_VERSION = self.VERSION
        self.CKPT_EPOCH = 0

        # Absolute checkpoint path; 'CKPT_VERSION' and 'CKPT_EPOCH' will be overridden
        self.CKPT_PATH = None

        # Print loss every step
        self.VERBOSE = True
|
||||
|
||||
|
||||
        # ------------------------------
        # ---- Data Provider Params ----
        # ------------------------------

        # {'train', 'val', 'test'}
        self.RUN_MODE = 'train'

        # Set True to evaluate offline
        self.EVAL_EVERY_EPOCH = True

        # # Define the 'train' 'val' 'test' data split
        # # (EVAL_EVERY_EPOCH triggered when set {'train': 'train'})
        # self.SPLIT = {
        #     'train': '',
        #     'val': 'val',
        #     'test': 'test',
        # }
        # # An external method to set train split
        # self.TRAIN_SPLIT = 'train+val+vg'

        # Set True to use pretrained word embedding
        # (GloVe: spaCy https://spacy.io/)
        self.USE_GLOVE = True

        # Word embedding matrix size
        # (token size x WORD_EMBED_SIZE)
        self.WORD_EMBED_SIZE = 300

        # Max length of question sentences
        self.MAX_TOKEN = 15

        # VGG 4096D features
        self.FRAME_FEAT_SIZE = 4096

        # C3D 4096D features
        self.CLIP_FEAT_SIZE = 4096

        self.NUM_ANS = 1000

        # Default training batch size: 64
        self.BATCH_SIZE = 64

        # Multi-thread I/O
        self.NUM_WORKERS = 8

        # Use pin memory
        # (Warning: pin memory can accelerate GPU loading but may
        # increase the CPU memory usage when NUM_WORKERS is large)
        self.PIN_MEM = True

        # Large models cannot be trained with batch size 64.
        # Gradient accumulation splits a batch to reduce GPU memory usage
        # (Warning: BATCH_SIZE must be divisible by GRAD_ACCU_STEPS)
        self.GRAD_ACCU_STEPS = 1

        # Set 'external': use external shuffle method to implement training shuffle
        # Set 'internal': use pytorch dataloader default shuffle method
        self.SHUFFLE_MODE = 'external'
|
||||
|
||||
|
||||
        # ------------------------
        # ---- Network Params ----
        # ------------------------

        # Model depth
        # (Encoder and Decoder have the same depth)
        self.LAYER = 6

        # Model hidden size
        # (512 as default; larger values sharply increase GPU memory usage)
        self.HIDDEN_SIZE = 512

        # Multi-head number in MCA layers
        # (Warning: HIDDEN_SIZE must be divisible by MULTI_HEAD)
        self.MULTI_HEAD = 8

        # Dropout rate for all dropout layers
        # (dropout can prevent overfitting: [Dropout: a simple way to prevent neural networks from overfitting])
        self.DROPOUT_R = 0.1

        # MLP size in flatten layers
        self.FLAT_MLP_SIZE = 512

        # Flatten the last hidden to vector with {n} attention glimpses
        self.FLAT_GLIMPSES = 1
        self.FLAT_OUT_SIZE = 1024
|
||||
|
||||
|
||||
        # --------------------------
        # ---- Optimizer Params ----
        # --------------------------

        # The base learning rate
        self.LR_BASE = 0.0001

        # Learning rate decay ratio
        self.LR_DECAY_R = 0.2

        # Learning rate decay at {x, y, z...} epoch
        self.LR_DECAY_LIST = [10, 12]

        # Max training epoch
        self.MAX_EPOCH = 30

        # Gradient clip
        # (default: -1 means not using)
        self.GRAD_NORM_CLIP = -1

        # Adam optimizer betas and eps
        self.OPT_BETAS = (0.9, 0.98)
        self.OPT_EPS = 1e-9
        self.OPT_WEIGHT_DECAY = 1e-5
        # --------------------------
        # ---- DNC Hyper-Params ----
        # --------------------------
        self.IN_SIZE_DNC = self.HIDDEN_SIZE
        self.OUT_SIZE_DNC = self.HIDDEN_SIZE
        self.WORD_LENGTH_DNC = 512
        self.CELL_COUNT_DNC = 64
        self.MEM_HIDDEN_SIZE = self.CELL_COUNT_DNC * self.WORD_LENGTH_DNC
        self.N_READ_HEADS_DNC = 4
|
||||
|
||||
    def parse_to_dict(self, args):
        args_dict = {}
        for arg in dir(args):
            if not arg.startswith('_') and not isinstance(getattr(args, arg), MethodType):
                if getattr(args, arg) is not None:
                    args_dict[arg] = getattr(args, arg)

        return args_dict


    def add_args(self, args_dict):
        for arg in args_dict:
            setattr(self, arg, args_dict[arg])
|
||||
|
||||
|
||||
    def proc(self):
        assert self.RUN_MODE in ['train', 'val', 'test']

        # ------------ Devices setup
        # os.environ['CUDA_VISIBLE_DEVICES'] = self.GPU
        self.N_GPU = len(self.GPU.split(','))
        self.DEVICES = [_ for _ in range(self.N_GPU)]
        torch.set_num_threads(2)


        # ------------ Seed setup
        # fix pytorch seed
        torch.manual_seed(self.SEED)
        if self.N_GPU < 2:
            torch.cuda.manual_seed(self.SEED)
        else:
            torch.cuda.manual_seed_all(self.SEED)
        torch.backends.cudnn.deterministic = True

        # fix numpy seed
        np.random.seed(self.SEED)

        # fix random seed
        random.seed(self.SEED)

        if self.CKPT_PATH is not None:
            print('Warning: you are now using CKPT_PATH args, '
                  'CKPT_VERSION and CKPT_EPOCH will not work')
            self.CKPT_VERSION = self.CKPT_PATH.split('/')[-1] + '_' + str(random.randint(0, 99999999))


        # ------------ Split setup
        # NOTE: SPLIT and TRAIN_SPLIT are commented out in __init__, so they
        # must be set elsewhere (e.g. via add_args) before proc() is called.
        self.SPLIT['train'] = self.TRAIN_SPLIT
        if 'val' in self.SPLIT['train'].split('+') or self.RUN_MODE not in ['train']:
            self.EVAL_EVERY_EPOCH = False

        if self.RUN_MODE not in ['test']:
            self.TEST_SAVE_PRED = False


        # ------------ Gradient accumulate setup
        assert self.BATCH_SIZE % self.GRAD_ACCU_STEPS == 0
        self.SUB_BATCH_SIZE = int(self.BATCH_SIZE / self.GRAD_ACCU_STEPS)

        # Using a small eval batch reduces GPU memory usage
        self.EVAL_BATCH_SIZE = 32


        # ------------ Networks setup
        # FeedForwardNet size in every MCA layer
        self.FF_SIZE = int(self.HIDDEN_SIZE * 4)
        self.FF_MEM_SIZE = int()  # int() evaluates to 0

        # Per-head hidden size used in the attention computation
        assert self.HIDDEN_SIZE % self.MULTI_HEAD == 0
        self.HIDDEN_SIZE_HEAD = int(self.HIDDEN_SIZE / self.MULTI_HEAD)
|
||||
|
||||
|
||||
    def __str__(self):
        for attr in dir(self):
            if not attr.startswith('__') and not isinstance(getattr(self, attr), MethodType):
                print('{ %-17s }->' % attr, getattr(self, attr))

        return ''

    def check_path(self):
        print('Checking dataset ...')


        if not os.path.exists(self.FRAMES):
            print(self.FRAMES + ' NOT EXIST')
            exit(-1)

        if not os.path.exists(self.CLIPS):
            print(self.CLIPS + ' NOT EXIST')
            exit(-1)

        for mode in self.QA_PATH:
            if not os.path.exists(self.QA_PATH[mode]):
                print(self.QA_PATH[mode] + ' NOT EXIST')
                exit(-1)

        print('Finished')
        print('')
|
6
code/cfgs/fusion_cfgs.yml
Normal file
|
@ -0,0 +1,6 @@
|
|||
CONTROLLER_INPUT_SIZE: 512
CONTROLLER_HIDDEN_SIZE: 512
CONTROLLER_NUM_LAYERS: 2
HIDDEN_DIM_COMP: 1024
OUT_DIM_COMP: 512
COMP_NUM_LAYERS: 2
|
61
code/cfgs/path_cfgs.py
Normal file
|
@ -0,0 +1,61 @@
|
|||
# --------------------------------------------------------
# mcan-vqa (Deep Modular Co-Attention Networks)
# Licensed under The MIT License [see LICENSE for details]
# Written by Yuhao Cui https://github.com/cuiyuhao1996
# --------------------------------------------------------

import os

class PATH:
    def __init__(self, EXP_NAME, DATASET_PATH):
        # name of the experiment
        self.EXP_NAME = EXP_NAME

        # Dataset root path
        self.DATASET_PATH = DATASET_PATH

        # Frame and clip feature directories
        self.FRAMES = os.path.join(DATASET_PATH, 'frame_feat/')
        self.CLIPS = os.path.join(DATASET_PATH, 'clip_feat/')
|
||||
|
||||
|
||||
    def init_path(self):
        # os.path.join works whether or not DATASET_PATH ends with a slash
        self.QA_PATH = {
            'train': os.path.join(self.DATASET_PATH, 'train_qa.json'),
            'val': os.path.join(self.DATASET_PATH, 'val_qa.json'),
            'test': os.path.join(self.DATASET_PATH, 'test_qa.json'),
        }
        self.C3D_PATH = os.path.join(self.DATASET_PATH, 'c3d.pickle')

        if self.EXP_NAME not in os.listdir('./'):
            os.mkdir('./' + self.EXP_NAME)
            os.mkdir('./' + self.EXP_NAME + '/results')
        self.RESULT_PATH = './' + self.EXP_NAME + '/results/result_test/'
        self.PRED_PATH = './' + self.EXP_NAME + '/results/pred/'
        self.CACHE_PATH = './' + self.EXP_NAME + '/results/cache/'
        self.LOG_PATH = './' + self.EXP_NAME + '/results/log/'
        self.TB_PATH = './' + self.EXP_NAME + '/results/tensorboard/'
        self.CKPTS_PATH = './' + self.EXP_NAME + '/ckpts/'

        if 'result_test' not in os.listdir('./' + self.EXP_NAME + '/results'):
            os.mkdir('./' + self.EXP_NAME + '/results/result_test/')

        if 'pred' not in os.listdir('./' + self.EXP_NAME + '/results'):
            os.mkdir('./' + self.EXP_NAME + '/results/pred/')

        if 'cache' not in os.listdir('./' + self.EXP_NAME + '/results'):
            os.mkdir('./' + self.EXP_NAME + '/results/cache')

        if 'log' not in os.listdir('./' + self.EXP_NAME + '/results'):
            os.mkdir('./' + self.EXP_NAME + '/results/log')

        if 'tensorboard' not in os.listdir('./' + self.EXP_NAME + '/results'):
            os.mkdir('./' + self.EXP_NAME + '/results/tensorboard')

        if 'ckpts' not in os.listdir('./' + self.EXP_NAME):
            os.mkdir('./' + self.EXP_NAME + '/ckpts')


    def check_path(self):
        raise NotImplementedError
|
||||
|
13
code/cfgs/small_model.yml
Normal file
|
@ -0,0 +1,13 @@
|
|||
LAYER: 6
HIDDEN_SIZE: 512
MEM_HIDDEN_SIZE: 2048
MULTI_HEAD: 8
DROPOUT_R: 0.1
FLAT_MLP_SIZE: 512
FLAT_GLIMPSES: 1
FLAT_OUT_SIZE: 1024
LR_BASE: 0.0001
LR_DECAY_R: 0.2
GRAD_ACCU_STEPS: 1
CKPT_VERSION: 'small'
CKPT_EPOCH: 13
|
0
code/core/.gitkeep
Normal file
0
code/core/data/.gitkeep
Normal file
103
code/core/data/dataset.py
Normal file
|
@ -0,0 +1,103 @@
|
|||
import glob, os, json, pickle
import numpy as np
from collections import defaultdict

import torch
from torch.utils.data import Dataset
import torchvision.transforms as transforms

from core.data.utils import tokenize, ans_stat, proc_ques, qlen_to_key, ans_to_key
|
||||
|
||||
|
||||
class VideoQA_Dataset(Dataset):
    def __init__(self, __C):
        super(VideoQA_Dataset, self).__init__()
        self.__C = __C
        self.ans_size = __C.NUM_ANS
        # load raw data
        with open(__C.QA_PATH[__C.RUN_MODE], 'r') as f:
            self.raw_data = json.load(f)
        self.data_size = len(self.raw_data)

        splits = __C.SPLIT[__C.RUN_MODE].split('+')

        # Sort both listings so that frames and clips line up by position;
        # glob returns files in arbitrary order.
        frames_list = sorted(glob.glob(__C.FRAMES + '*.pt'))
        clips_list = sorted(glob.glob(__C.CLIPS + '*.pt'))
        # The numeric video id follows a 3-character file-name prefix for
        # MSVD-QA and a 5-character prefix otherwise.
        if 'msvd' in __C.DATASET_PATH.lower():
            vid_ids = [int(s.split('/')[-1].split('.')[0][3:]) for s in frames_list]
        else:
            vid_ids = [int(s.split('/')[-1].split('.')[0][5:]) for s in frames_list]
        self.frames_dict = {k: v for (k,v) in zip(vid_ids, frames_list)}
        self.clips_dict = {k: v for (k,v) in zip(vid_ids, clips_list)}
        del frames_list, clips_list
|
||||
|
||||
        q_list = []
        a_list = []
        a_dict = defaultdict(lambda: 0)
        # Collect questions and count answer frequencies over train and val
        for split in ['train', 'val']:
            with open(__C.QA_PATH[split], 'r') as f:
                qa_data = json.load(f)
            for d in qa_data:
                q_list.append(d['question'])
                a_list.append(d['answer'])
                a_dict[d['answer']] += 1

        # Answers sorted from most to least frequent
        top_answers = sorted(a_dict, key=a_dict.get, reverse=True)
|
||||
        self.qlen_bins_to_idx = {
            '1-3': 0,
            '4-8': 1,
            '9-15': 2,
        }
        self.ans_rare_to_idx = {
            '0-99': 0,
            '100-299': 1,
            '300-999': 2,

        }
        self.qtypes_to_idx = {
            'what': 0,
            'who': 1,
            'how': 2,
            'when': 3,
            'where': 4,
        }
|
||||
|
||||
        if __C.RUN_MODE == 'train':
            self.ans_list = top_answers[:self.ans_size]

            self.ans_to_ix, self.ix_to_ans = ans_stat(self.ans_list)

            self.token_to_ix, self.pretrained_emb = tokenize(q_list, __C.USE_GLOVE)
            self.token_size = self.token_to_ix.__len__()
            print('== Question token vocab size:', self.token_size)

        self.idx_to_qtypes = {v: k for (k, v) in self.qtypes_to_idx.items()}
        self.idx_to_qlen_bins = {v: k for (k, v) in self.qlen_bins_to_idx.items()}
        self.idx_to_ans_rare = {v: k for (k, v) in self.ans_rare_to_idx.items()}
|
||||
|
||||
def __getitem__(self, idx):
|
||||
sample = self.raw_data[idx]
|
||||
ques = sample['question']
|
||||
q_type = self.qtypes_to_idx[ques.split(' ')[0]]
|
||||
ques_idx, qlen, _ = proc_ques(ques, self.token_to_ix, self.__C.MAX_TOKEN)
|
||||
qlen_bin = self.qlen_bins_to_idx[qlen_to_key(qlen)]
|
||||
|
||||
answer = sample['answer']
|
||||
answer = self.ans_to_ix.get(answer, np.random.randint(0, high=len(self.ans_list)))
|
||||
ans_rarity = self.ans_rare_to_idx[ans_to_key(answer)]
|
||||
|
||||
answer_one_hot = torch.zeros(self.ans_size)
|
||||
answer_one_hot[answer] = 1.0
|
||||
|
||||
vid_id = sample['video_id']
|
||||
frames = torch.load(self.frames_dict[vid_id], map_location='cpu')
|
||||
clips = torch.load(self.clips_dict[vid_id], map_location='cpu')
|
||||
|
||||
return torch.from_numpy(ques_idx).long(), frames, clips, answer_one_hot, torch.tensor(answer).long(), \
|
||||
torch.tensor(q_type).long(), torch.tensor(qlen_bin).long(), torch.tensor(ans_rarity).long()
|
||||
|
||||
def __len__(self):
|
||||
return self.data_size
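The answer vocabulary built in `__init__` above amounts to counting answers over the train and val splits and keeping the most frequent ones. A minimal sketch of that step using `collections.Counter` follows; the function name `build_answer_vocab` and the toy records are illustrative assumptions, not part of the repository.

```python
from collections import Counter


def build_answer_vocab(qa_records, num_answers):
    """Return (ans_to_ix, ix_to_ans) for the `num_answers` most frequent answers."""
    counts = Counter(record['answer'] for record in qa_records)
    # most_common sorts by descending count; ties keep first-seen order.
    top_answers = [ans for ans, _ in counts.most_common(num_answers)]
    ans_to_ix = {ans: i for i, ans in enumerate(top_answers)}
    ix_to_ans = {i: ans for ans, i in ans_to_ix.items()}
    return ans_to_ix, ix_to_ans


# Toy records standing in for the loaded QA json (hypothetical data).
toy = [{'answer': 'dog'}, {'answer': 'cat'}, {'answer': 'dog'}, {'answer': 'man'}]
ans_to_ix, ix_to_ans = build_answer_vocab(toy, num_answers=2)
```

Answers outside the kept set have no index, so the caller still has to decide how to label them.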
|
182
code/core/data/preprocess.py
Normal file
|
@ -0,0 +1,182 @@
|
|||
import os
|
||||
import sys
|
||||
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
|
||||
|
||||
import skvideo.io as skv
|
||||
import torch
|
||||
import pickle
|
||||
from PIL import Image
|
||||
import tqdm
|
||||
import numpy as np
|
||||
from model.C3D import C3D
|
||||
import json
|
||||
from torchvision.models import vgg19
|
||||
import torchvision.transforms as transforms
|
||||
import torch.nn as nn
|
||||
import argparse
|
||||
|
||||
|
||||
def _select_frames(path, frame_num):
|
||||
"""Select representative frames for video.
|
||||
Frames at the very beginning and end of the video are skipped.
|
||||
Args:
|
||||
path: Path of video.
|
||||
frame_num: Number of frames to sample.
|
||||
Returns:
|
||||
frames: list of frames.
|
||||
"""
|
||||
frames = list()
|
||||
video_data = skv.vread(path)
|
||||
total_frames = video_data.shape[0]
|
||||
# Skip the frames at the very beginning and end.
|
||||
for i in np.linspace(0, total_frames, frame_num + 2)[1:frame_num + 1]:
|
||||
frame_data = video_data[int(i)]
|
||||
img = Image.fromarray(frame_data)
|
||||
img = img.resize((224, 224), Image.BILINEAR)
|
||||
frame_data = np.array(img)
|
||||
frames.append(frame_data)
|
||||
return frames
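The index arithmetic behind `_select_frames` can be sketched without NumPy. This is a pure-Python approximation of `np.linspace(0, total_frames, frame_num + 2)[1:frame_num + 1]`; the helper name `sample_frame_indices` is an assumption made for illustration.

```python
def sample_frame_indices(total_frames, frame_num):
    """Pick `frame_num` evenly spaced indices, skipping both endpoints."""
    # Splitting [0, total_frames] into frame_num + 1 equal intervals and
    # keeping only the interior boundaries skips the first and last frames.
    return [int(total_frames * k / (frame_num + 1)) for k in range(1, frame_num + 1)]


indices = sample_frame_indices(total_frames=100, frame_num=4)
```

Because the endpoint `total_frames` is never selected, every returned index is a valid position in the video array.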
|
||||
|
||||
def _select_clips(path, clip_num):
|
||||
"""Select clip_num clips for video. Each clip has 16 frames.
|
||||
Args:
|
||||
path: Path of video.
|
||||
clip_num: Number of clips to sample.
|
||||
Returns:
|
||||
clips: list of clips.
|
||||
"""
|
||||
clips = list()
|
||||
# video_info = skvideo.io.ffprobe(path)
|
||||
video_data = skv.vread(path)
|
||||
total_frames = video_data.shape[0]
|
||||
height = video_data.shape[1]
|
||||
width = video_data.shape[2]
|
||||
for i in np.linspace(0, total_frames, clip_num + 2)[1:clip_num + 1]:
|
||||
# Select center frame first, then include surrounding frames
|
||||
clip_start = int(i) - 8
|
||||
clip_end = int(i) + 8
|
||||
if clip_start < 0:
|
||||
clip_end = clip_end - clip_start
|
||||
clip_start = 0
|
||||
if clip_end > total_frames:
|
||||
clip_start = clip_start - (clip_end - total_frames)
|
||||
clip_end = total_frames
|
||||
clip = video_data[clip_start:clip_end]
|
||||
new_clip = []
|
||||
for j in range(16):
|
||||
frame_data = clip[j]
|
||||
img = Image.fromarray(frame_data)
|
||||
img = img.resize((112, 112), Image.BILINEAR)
|
||||
frame_data = np.array(img) * 1.0
|
||||
# frame_data -= self.mean[j]
|
||||
new_clip.append(frame_data)
|
||||
clips.append(new_clip)
|
||||
return clips
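The window logic in `_select_clips` shifts a 16-frame window back inside the video when its center is too close to either end. A minimal sketch is given below; the helper name `clip_window` is hypothetical, and the final `max(start, 0)` clamp for videos shorter than one clip is an added assumption that the original code does not contain.

```python
def clip_window(center, total_frames, clip_len=16):
    """Return (start, end) of a clip_len-frame window around `center`, kept in bounds."""
    half = clip_len // 2
    start, end = center - half, center + half
    if start < 0:
        # Shift the window right so it starts at the first frame.
        end -= start
        start = 0
    if end > total_frames:
        # Shift the window left so it ends at the last frame.
        start -= end - total_frames
        end = total_frames
    # Assumption: videos shorter than clip_len are clamped instead of going negative.
    return max(start, 0), end
```

With the clamp, a video shorter than 16 frames yields a shorter window, so a caller indexing 16 frames would still need padding or a length check.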
|
||||
|
||||
def preprocess_videos(video_dir, frame_num, clip_num):
|
||||
frames_dir = os.path.join(os.path.dirname(video_dir), 'frames')
|
||||
os.makedirs(frames_dir, exist_ok=True)
|
||||
|
||||
clips_dir = os.path.join(os.path.dirname(video_dir), 'clips')
|
||||
os.makedirs(clips_dir, exist_ok=True)
|
||||
|
||||
for video_name in tqdm.tqdm(os.listdir(video_dir)):
|
||||
video_path = os.path.join(video_dir, video_name)
|
||||
frames = _select_frames(video_path, frame_num)
|
||||
clips = _select_clips(video_path, clip_num)
|
||||
|
||||
with open(os.path.join(frames_dir, video_name.split('.')[0] + '.pkl'), "wb") as f:
|
||||
pickle.dump(frames, f, protocol=pickle.HIGHEST_PROTOCOL)
|
||||
|
||||
with open(os.path.join(clips_dir, video_name.split('.')[0] + '.pkl'), "wb") as f:
|
||||
pickle.dump(clips, f, protocol=pickle.HIGHEST_PROTOCOL)
|
||||
|
||||
|
||||
def generate_video_features(path_frames, path_clips, c3d_path):
|
||||
device = torch.device('cuda:0')
|
||||
frame_feat_dir = os.path.join(os.path.dirname(path_frames), 'frame_feat')
|
||||
os.makedirs(frame_feat_dir, exist_ok=True)
|
||||
|
||||
clip_feat_dir = os.path.join(os.path.dirname(path_frames), 'clip_feat')
|
||||
os.makedirs(clip_feat_dir, exist_ok=True)
|
||||
|
||||
cnn = vgg19(pretrained=True)
|
||||
in_features = cnn.classifier[-1].in_features
|
||||
cnn.classifier = nn.Sequential(
|
||||
*list(cnn.classifier.children())[:-1]) # remove last fc layer
|
||||
cnn.to(device).eval()
|
||||
c3d = C3D()
|
||||
c3d.load_state_dict(torch.load(c3d_path))
|
||||
c3d.to(device).eval()
|
||||
transform = transforms.Compose([transforms.ToTensor(),
|
||||
transforms.Normalize((0.485, 0.456, 0.406),
|
||||
(0.229, 0.224, 0.225))])
|
||||
for vid_name in tqdm.tqdm(os.listdir(path_frames)):
|
||||
frame_path = os.path.join(path_frames, vid_name)
|
||||
clip_path = os.path.join(path_clips, vid_name)
|
||||
|
||||
frames = pickle.load(open(frame_path, 'rb'))
|
||||
clips = pickle.load(open(clip_path, 'rb'))
|
||||
|
||||
frames = [transform(f) for f in frames]
|
||||
frame_feat = []
|
||||
clip_feat = []
|
||||
|
||||
for frame in frames:
|
||||
with torch.no_grad():
|
||||
feat = cnn(frame.unsqueeze(0).to(device))
|
||||
frame_feat.append(feat)
|
||||
for clip in clips:
|
||||
# reorder clip from (f x h x w x c) to (c x f x h x w)
|
||||
clip = torch.from_numpy(np.float32(np.array(clip)))
|
||||
clip = clip.transpose(3, 0)
|
||||
clip = clip.transpose(3, 1)
|
||||
clip = clip.transpose(3, 2).unsqueeze(0).to(device)
|
||||
with torch.no_grad():
|
||||
feat = c3d(clip)
|
||||
clip_feat.append(feat)
|
||||
frame_feat = torch.cat(frame_feat, dim=0)
|
||||
clip_feat = torch.cat(clip_feat, dim=0)
|
||||
|
||||
torch.save(frame_feat, os.path.join(frame_feat_dir, vid_name.split('.')[0] + '.pt'))
|
||||
torch.save(clip_feat, os.path.join(clip_feat_dir, vid_name.split('.')[0] + '.pt'))
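The three chained `transpose` calls applied to each clip can be followed by tracking axis labels only. The sketch below uses plain tuples instead of tensors; `swap_axes` is a hypothetical helper that mimics what `tensor.transpose(i, j)` does to the shape.

```python
def swap_axes(shape, i, j):
    """Return `shape` with axes i and j exchanged, like tensor.transpose(i, j)."""
    shape = list(shape)
    shape[i], shape[j] = shape[j], shape[i]
    return tuple(shape)


shape = ('f', 'h', 'w', 'c')      # clip as stored: frames, height, width, channels
shape = swap_axes(shape, 3, 0)    # ('c', 'h', 'w', 'f')
shape = swap_axes(shape, 3, 1)    # ('c', 'f', 'w', 'h')
shape = swap_axes(shape, 3, 2)    # ('c', 'f', 'h', 'w')
```

In PyTorch a single `clip.permute(3, 0, 1, 2)` yields the same axis order.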
|
||||
|
||||
def parse_args():
|
||||
'''
|
||||
Parse input arguments
|
||||
'''
|
||||
parser = argparse.ArgumentParser(description='Preprocessing Args')
|
||||
|
||||
parser.add_argument('--RAW_VID_PATH', dest='RAW_VID_PATH',
|
||||
help='The path to the raw videos',
|
||||
required=True,
|
||||
type=str)
|
||||
|
||||
parser.add_argument('--FRAMES_OUTPUT_DIR', dest='FRAMES_OUTPUT_DIR',
|
||||
help='The directory where the processed frames and their features will be stored',
|
||||
required=True,
|
||||
type=str)
|
||||
|
||||
parser.add_argument('--CLIPS_OUTPUT_DIR', dest='CLIPS_OUTPUT_DIR',
|
||||
help='The directory where the processed clips and their features will be stored',
|
||||
required=True,
|
||||
type=str)
|
||||
|
||||
parser.add_argument('--C3D_PATH', dest='C3D_PATH',
|
||||
help='Pretrained C3D path',
|
||||
required=True,
|
||||
type=str)
|
||||
|
||||
parser.add_argument('--NUM_SAMPLES', dest='NUM_SAMPLES',
|
||||
help='The number of frames/clips to be sampled from the video',
|
||||
default=20,
|
||||
type=int)
|
||||
|
||||
args = parser.parse_args()
|
||||
return args
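When two `add_argument` calls share a `dest`, the later value overwrites the earlier one in the parsed namespace, so each option needs its own destination. A minimal sketch with a reduced option set follows; `build_parser` and the example paths are placeholders for illustration.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description='Preprocessing Args')
    # Each option gets its own dest so later options do not overwrite earlier ones.
    parser.add_argument('--FRAMES_OUTPUT_DIR', dest='FRAMES_OUTPUT_DIR', required=True, type=str)
    parser.add_argument('--CLIPS_OUTPUT_DIR', dest='CLIPS_OUTPUT_DIR', required=True, type=str)
    parser.add_argument('--NUM_SAMPLES', dest='NUM_SAMPLES', default=20, type=int)
    return parser


# The paths below are placeholders for illustration only.
args = build_parser().parse_args(
    ['--FRAMES_OUTPUT_DIR', 'out/frames', '--CLIPS_OUTPUT_DIR', 'out/clips'])
```

Passing an explicit argument list to `parse_args` makes the parser easy to exercise without touching `sys.argv`.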
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
args = parse_args()
|
||||
preprocess_videos(args.RAW_VID_PATH, args.NUM_SAMPLES, args.NUM_SAMPLES)
|
||||
frames_dir = os.path.join(os.path.dirname(args.RAW_VID_PATH), 'frames')
|
||||
clips_dir = os.path.join(os.path.dirname(args.RAW_VID_PATH), 'clips')
|
||||
generate_video_features(frames_dir, clips_dir, args.C3D_PATH)
|
81
code/core/data/utils.py
Normal file
|
@ -0,0 +1,81 @@
|
|||
import en_vectors_web_lg, random, re, json
|
||||
import numpy as np
|
||||
|
||||
def tokenize(ques_list, use_glove):
|
||||
token_to_ix = {
|
||||
'PAD': 0,
|
||||
'UNK': 1,
|
||||
}
|
||||
|
||||
spacy_tool = None
|
||||
pretrained_emb = []
|
||||
if use_glove:
|
||||
spacy_tool = en_vectors_web_lg.load()
|
||||
pretrained_emb.append(spacy_tool('PAD').vector)
|
||||
pretrained_emb.append(spacy_tool('UNK').vector)
|
||||
|
||||
for ques in ques_list:
|
||||
words = re.sub(
|
||||
r"([.,'!?\"()*#:;])",
|
||||
'',
|
||||
ques.lower()
|
||||
).replace('-', ' ').replace('/', ' ').split()
|
||||
|
||||
for word in words:
|
||||
if word not in token_to_ix:
|
||||
token_to_ix[word] = len(token_to_ix)
|
||||
if use_glove:
|
||||
pretrained_emb.append(spacy_tool(word).vector)
|
||||
|
||||
pretrained_emb = np.array(pretrained_emb)
|
||||
|
||||
return token_to_ix, pretrained_emb
|
||||
|
||||
|
||||
def proc_ques(ques, token_to_ix, max_token):
|
||||
ques_ix = np.zeros(max_token, np.int64)
|
||||
|
||||
words = re.sub(
|
||||
r"([.,'!?\"()*#:;])",
|
||||
'',
|
||||
ques.lower()
|
||||
).replace('-', ' ').replace('/', ' ').split()
|
||||
q_len = 0
|
||||
for ix, word in enumerate(words):
|
||||
if word in token_to_ix:
|
||||
ques_ix[ix] = token_to_ix[word]
|
||||
q_len += 1
|
||||
else:
|
||||
ques_ix[ix] = token_to_ix['UNK']
|
||||
|
||||
if ix + 1 == max_token:
|
||||
break
|
||||
|
||||
return ques_ix, q_len, len(words)
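The encoding performed by `proc_ques` (strip punctuation, look up each word, fall back to `UNK`, pad to `max_token`) can be sketched with plain lists. The name `encode_question` and the toy vocabulary are assumptions for illustration; unlike `proc_ques`, this sketch returns only the id list and not the length counters.

```python
import re


def encode_question(ques, token_to_ix, max_token):
    """Map a question to a fixed-length list of token ids (0 = PAD, 1 = UNK)."""
    words = re.sub(r"([.,'!?\"()*#:;])", '', ques.lower()
                   ).replace('-', ' ').replace('/', ' ').split()
    # Truncate to max_token and replace out-of-vocabulary words with UNK.
    ids = [token_to_ix.get(w, token_to_ix['UNK']) for w in words[:max_token]]
    # Right-pad with the PAD id so every question has the same length.
    return ids + [token_to_ix['PAD']] * (max_token - len(ids))


vocab = {'PAD': 0, 'UNK': 1, 'what': 2, 'is': 3, 'the': 4, 'man': 5, 'doing': 6}
encoded = encode_question('What is the man doing?', vocab, max_token=8)
```

Fixed-length id lists are what allow the data loader to stack questions into a single batch tensor.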
|
||||
|
||||
def ans_stat(ans_list):
|
||||
ans_to_ix, ix_to_ans = {}, {}
|
||||
for i, ans in enumerate(ans_list):
|
||||
ans_to_ix[ans] = i
|
||||
ix_to_ans[i] = ans
|
||||
|
||||
return ans_to_ix, ix_to_ans
|
||||
|
||||
def shuffle_list(ans_list):
|
||||
random.shuffle(ans_list)
|
||||
|
||||
def qlen_to_key(q_len):
|
||||
if 1<= q_len <=3:
|
||||
return '1-3'
|
||||
if 4<= q_len <=8:
|
||||
return '4-8'
|
||||
if 9<= q_len:
|
||||
return '9-15'
|
||||
|
||||
def ans_to_key(ans_idx):
|
||||
if 0 <= ans_idx <= 99 :
|
||||
return '0-99'
|
||||
if 100 <= ans_idx <= 299 :
|
||||
return '100-299'
|
||||
if 300 <= ans_idx <= 999 :
|
||||
return '300-999'
|
523
code/core/exec.py
Normal file
|
@ -0,0 +1,523 @@
|
|||
# --------------------------------------------------------
|
||||
# mcan-vqa (Deep Modular Co-Attention Networks)
|
||||
# Licensed under The MIT License [see LICENSE for details]
|
||||
# Written by Yuhao Cui https://github.com/cuiyuhao1996
|
||||
# --------------------------------------------------------
|
||||
|
||||
from core.data.dataset import VideoQA_Dataset
|
||||
from core.model.net import Net1, Net2, Net3, Net4
|
||||
from core.model.optim import get_optim, adjust_lr
|
||||
from core.metrics import get_acc
|
||||
from tqdm import tqdm
|
||||
from core.data.utils import shuffle_list
|
||||
|
||||
import os, json, torch, datetime, pickle, copy, shutil, time, math
|
||||
import numpy as np
|
||||
import torch.nn as nn
|
||||
import torch.utils.data as Data
|
||||
from tensorboardX import SummaryWriter
|
||||
from torch.autograd import Variable as var
|
||||
|
||||
class Execution:
|
||||
def __init__(self, __C):
|
||||
self.__C = __C
|
||||
print('Loading training set ........')
|
||||
__C_train = copy.deepcopy(self.__C)
|
||||
setattr(__C_train, 'RUN_MODE', 'train')
|
||||
self.dataset = VideoQA_Dataset(__C_train)
|
||||
|
||||
self.dataset_eval = None
|
||||
if self.__C.EVAL_EVERY_EPOCH:
|
||||
__C_eval = copy.deepcopy(self.__C)
|
||||
setattr(__C_eval, 'RUN_MODE', 'val')
|
||||
|
||||
print('Loading validation set for per-epoch evaluation ........')
|
||||
self.dataset_eval = VideoQA_Dataset(__C_eval)
|
||||
self.dataset_eval.ans_list = self.dataset.ans_list
|
||||
self.dataset_eval.ans_to_ix, self.dataset_eval.ix_to_ans = self.dataset.ans_to_ix, self.dataset.ix_to_ans
|
||||
self.dataset_eval.token_to_ix, self.dataset_eval.pretrained_emb = self.dataset.token_to_ix, self.dataset.pretrained_emb
|
||||
|
||||
__C_test = copy.deepcopy(self.__C)
|
||||
setattr(__C_test, 'RUN_MODE', 'test')
|
||||
|
||||
self.dataset_test = VideoQA_Dataset(__C_test)
|
||||
self.dataset_test.ans_list = self.dataset.ans_list
|
||||
self.dataset_test.ans_to_ix, self.dataset_test.ix_to_ans = self.dataset.ans_to_ix, self.dataset.ix_to_ans
|
||||
self.dataset_test.token_to_ix, self.dataset_test.pretrained_emb = self.dataset.token_to_ix, self.dataset.pretrained_emb
|
||||
|
||||
self.writer = SummaryWriter(self.__C.TB_PATH)
|
||||
|
||||
def train(self, dataset, dataset_eval=None):
|
||||
# Obtain needed information
|
||||
data_size = dataset.data_size
|
||||
token_size = dataset.token_size
|
||||
ans_size = dataset.ans_size
|
||||
pretrained_emb = dataset.pretrained_emb
|
||||
net = self.construct_net(self.__C.MODEL_TYPE)
|
||||
if os.path.isfile(self.__C.PRETRAINED_PATH) and self.__C.MODEL_TYPE == 11:
|
||||
print('Loading pretrained DNC-weights')
|
||||
net.load_pretrained_weights()
|
||||
net.cuda()
|
||||
net.train()
|
||||
|
||||
# Define the multi-gpu training if needed
|
||||
if self.__C.N_GPU > 1:
|
||||
net = nn.DataParallel(net, device_ids=self.__C.DEVICES)
|
||||
|
||||
# Define the binary cross entropy loss
|
||||
# loss_fn = torch.nn.BCELoss(size_average=False).cuda()
|
||||
loss_fn = torch.nn.BCELoss(reduction='sum').cuda()
|
||||
# Load checkpoint if resuming training
|
||||
if self.__C.RESUME:
|
||||
print(' ========== Resume training')
|
||||
|
||||
if self.__C.CKPT_PATH is not None:
|
||||
print('Warning: you are now using CKPT_PATH args, '
|
||||
'CKPT_VERSION and CKPT_EPOCH will not work')
|
||||
|
||||
path = self.__C.CKPT_PATH
|
||||
else:
|
||||
path = self.__C.CKPTS_PATH + \
|
||||
'ckpt_' + self.__C.CKPT_VERSION + \
|
||||
'/epoch' + str(self.__C.CKPT_EPOCH) + '.pkl'
|
||||
|
||||
# Load the network parameters
|
||||
print('Loading ckpt {}'.format(path))
|
||||
ckpt = torch.load(path)
|
||||
print('Finish!')
|
||||
net.load_state_dict(ckpt['state_dict'])
|
||||
|
||||
# Load the optimizer parameters
|
||||
optim = get_optim(self.__C, net, data_size, ckpt['optim'], lr_base=ckpt['lr_base'])
|
||||
optim._step = int(data_size / self.__C.BATCH_SIZE * self.__C.CKPT_EPOCH)
|
||||
optim.optimizer.load_state_dict(ckpt['optimizer'])
|
||||
|
||||
start_epoch = self.__C.CKPT_EPOCH
|
||||
|
||||
else:
|
||||
if ('ckpt_' + self.__C.VERSION) in os.listdir(self.__C.CKPTS_PATH):
|
||||
shutil.rmtree(self.__C.CKPTS_PATH + 'ckpt_' + self.__C.VERSION)
|
||||
|
||||
os.mkdir(self.__C.CKPTS_PATH + 'ckpt_' + self.__C.VERSION)
|
||||
|
||||
optim = get_optim(self.__C, net, data_size, self.__C.OPTIM)
|
||||
start_epoch = 0
|
||||
|
||||
loss_sum = 0
|
||||
named_params = list(net.named_parameters())
|
||||
grad_norm = np.zeros(len(named_params))
|
||||
|
||||
# Define multi-thread dataloader
|
||||
if self.__C.SHUFFLE_MODE in ['external']:
|
||||
dataloader = Data.DataLoader(
|
||||
dataset,
|
||||
batch_size=self.__C.BATCH_SIZE,
|
||||
shuffle=False,
|
||||
num_workers=self.__C.NUM_WORKERS,
|
||||
pin_memory=self.__C.PIN_MEM,
|
||||
drop_last=True
|
||||
)
|
||||
else:
|
||||
dataloader = Data.DataLoader(
|
||||
dataset,
|
||||
batch_size=self.__C.BATCH_SIZE,
|
||||
shuffle=True,
|
||||
num_workers=self.__C.NUM_WORKERS,
|
||||
pin_memory=self.__C.PIN_MEM,
|
||||
drop_last=True
|
||||
)
|
||||
|
||||
# Training script
|
||||
for epoch in range(start_epoch, self.__C.MAX_EPOCH):
|
||||
|
||||
# Save log information
|
||||
logfile = open(
|
||||
self.__C.LOG_PATH +
|
||||
'log_run_' + self.__C.VERSION + '.txt',
|
||||
'a+'
|
||||
)
|
||||
logfile.write(
|
||||
'nowTime: ' +
|
||||
datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S') +
|
||||
'\n'
|
||||
)
|
||||
logfile.close()
|
||||
|
||||
# Learning Rate Decay
|
||||
if epoch in self.__C.LR_DECAY_LIST:
|
||||
adjust_lr(optim, self.__C.LR_DECAY_R)
|
||||
|
||||
# Externally shuffle
|
||||
if self.__C.SHUFFLE_MODE == 'external':
|
||||
shuffle_list(dataset.ans_list)
|
||||
|
||||
time_start = time.time()
|
||||
# Iteration
|
||||
for step, (
|
||||
ques_ix_iter,
|
||||
frames_feat_iter,
|
||||
clips_feat_iter,
|
||||
ans_iter,
|
||||
_,
|
||||
_,
|
||||
_,
|
||||
_
|
||||
) in enumerate(dataloader):
|
||||
|
||||
ques_ix_iter = ques_ix_iter.cuda()
|
||||
frames_feat_iter = frames_feat_iter.cuda()
|
||||
clips_feat_iter = clips_feat_iter.cuda()
|
||||
ans_iter = ans_iter.cuda()
|
||||
|
||||
optim.zero_grad()
|
||||
|
||||
for accu_step in range(self.__C.GRAD_ACCU_STEPS):
|
||||
|
||||
sub_frames_feat_iter = \
|
||||
frames_feat_iter[accu_step * self.__C.SUB_BATCH_SIZE:
|
||||
(accu_step + 1) * self.__C.SUB_BATCH_SIZE]
|
||||
sub_clips_feat_iter = \
|
||||
clips_feat_iter[accu_step * self.__C.SUB_BATCH_SIZE:
|
||||
(accu_step + 1) * self.__C.SUB_BATCH_SIZE]
|
||||
sub_ques_ix_iter = \
|
||||
ques_ix_iter[accu_step * self.__C.SUB_BATCH_SIZE:
|
||||
(accu_step + 1) * self.__C.SUB_BATCH_SIZE]
|
||||
sub_ans_iter = \
|
||||
ans_iter[accu_step * self.__C.SUB_BATCH_SIZE:
|
||||
(accu_step + 1) * self.__C.SUB_BATCH_SIZE]
|
||||
|
||||
pred = net(
|
||||
sub_frames_feat_iter,
|
||||
sub_clips_feat_iter,
|
||||
sub_ques_ix_iter
|
||||
)
|
||||
|
||||
loss = loss_fn |