Initial commit
commit b5f3b728c3
53 changed files with 7008 additions and 0 deletions
124 README.md Normal file

@@ -0,0 +1,124 @@
This is the official code of the paper **Video Language Co-Attention with Fast-Learning Feature Fusion for VideoQA**.

If you find our code useful, please cite our paper:

# Overview

<p align="center"><img src="assets/overview_project_one.png" alt="drawing" width="600" height="400"/></p>

# Results

Our VLCN model achieves **new** state-of-the-art results on two open-ended VideoQA datasets, **MSVD-QA** and **MSRVTT-QA**.

#### MSVD-QA

| <center>Model</center> | <center>What</center> | <center>Who</center> | <center>How</center> | <center>When</center> | <center>Where</center> | <center>All</center> |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| ST-VQA | 18.10 | 50.00 | **83.80** | 72.40 | 28.60 | 31.30 |
| Co-Mem | 19.60 | 48.70 | 81.60 | 74.10 | 31.70 | 31.70 |
| HMEMA | 22.40 | 50.00 | 73.00 | 70.70 | 42.90 | 33.70 |
| SSML | - | - | - | - | - | 35.13 |
| QueST | 24.50 | **52.90** | 79.10 | 72.40 | **50.00** | 36.10 |
| HCRN | - | - | - | - | - | 36.10 |
| MA-DRNN | 24.30 | 51.60 | 82.00 | **86.30** | 26.30 | 36.20 |
| **VLCN (Ours)** | **28.42** | 51.29 | 81.08 | 74.13 | 46.43 | **38.06** |

#### MSRVTT-QA

| <center>Model</center> | <center>What</center> | <center>Who</center> | <center>How</center> | <center>When</center> | <center>Where</center> | <center>All</center> |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| ST-VQA | 24.50 | 41.20 | 78.00 | 76.50 | 34.90 | 30.90 |
| Co-Mem | 23.90 | 42.50 | 74.10 | 69.00 | **42.90** | 32.00 |
| HMEMA | 22.40 | **50.10** | 73.00 | 70.70 | 42.90 | 33.70 |
| QueST | 27.90 | 45.60 | **83.00** | 75.70 | 31.60 | 34.60 |
| SSML | - | - | - | - | - | 35.00 |
| HCRN | - | - | - | - | - | 35.60 |
| **VLCN (Ours)** | **30.69** | 44.09 | 79.82 | **78.29** | 36.80 | **36.01** |

# Requirements

- PyTorch 1.3.1
- Torchvision 0.4.2
- Python 3.6

# Raw data

The raw data of MSVD-QA and MSRVTT-QA are located in `data/MSVD-QA` and `data/MSRVTT-QA`, respectively.

**Videos:** The raw videos of MSVD-QA and MSRVTT-QA can be downloaded from [⬇](https://www.cs.utexas.edu/users/ml/clamp/videoDescription/) and [⬇](https://www.mediafire.com/folder/h14iarbs62e7p/shared), respectively.<br/>
**Text:** The text data can be downloaded from [⬇](https://github.com/xudejing/video-question-answering).

After downloading all the raw data, `data/MSVD-QA` and `data/MSRVTT-QA` should have the following structure:

<p align="center"><img src="assets/structure.png" alt="dataset directory structure" /></p>
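
Since `assets/structure.png` is a binary image, the expected layout can also be checked programmatically. The sketch below is an illustration, not part of the repository: the helper `missing_entries` is hypothetical, the `*_qa.json` names follow `cfgs/path_cfgs.py`, and the `videos` folder name is an assumption based on the `--RAW_VID_PATH` example in the Preprocessing section.

```python
import os

# Entries each dataset root (data/MSVD-QA or data/MSRVTT-QA) is expected to
# contain. The *_qa.json names follow cfgs/path_cfgs.py; "videos" is an
# assumption based on the --RAW_VID_PATH example.
EXPECTED = [
    "videos",
    "train_qa.json",
    "val_qa.json",
    "test_qa.json",
]

def missing_entries(dataset_root):
    """Return the expected entries that are absent from dataset_root."""
    return [e for e in EXPECTED if not os.path.exists(os.path.join(dataset_root, e))]
```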

# Preprocessing

To sample the individual frames and clips and generate the corresponding visual features, we run the script `preprocess.py` on the raw videos with the appropriate flags. E.g., for MSVD-QA we have to execute

```bash
python core/data/preprocess.py --RAW_VID_PATH /data/MSVD-QA/videos --C3D_PATH path_to_pretrained_c3d
```

This will save the individual frames and clips in `data/MSVD-QA/frames` and `data/MSVD-QA/clips`, respectively, and their visual features in `data/MSVD-QA/frame_feat` and `data/MSVD-QA/clip_feat`, respectively.
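
The sampling idea behind this step, picking a fixed number of positions spread evenly over a video, can be sketched as follows. This is a simplified illustration under stated assumptions: `sample_indices` is a hypothetical helper, not the repository's actual function, and the real `preprocess.py` additionally extracts VGG19 frame features and C3D clip features.

```python
import numpy as np

def sample_indices(num_frames_total, num_frames_keep):
    """Return num_frames_keep indices evenly spread over [0, num_frames_total).

    Hypothetical helper illustrating uniform temporal sampling; the real
    preprocess.py also runs the sampled frames/clips through VGG19 and C3D.
    """
    return np.linspace(0, num_frames_total - 1, num=num_frames_keep).astype(int).tolist()
```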

# Config files

Before starting training, one has to update the config path file `cfgs/path_cfgs.py` with the paths of the raw data as well as the visual features.<br/>
All hyperparameters can be adjusted in `cfgs/base_cfgs.py`.
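
Hyperparameters set in `cfgs/base_cfgs.py` can also be overridden at runtime through the config object's `add_args` method. A minimal sketch of that pattern follows; the real `Cfgs` class needs the repository's path setup, so a stand-in `MiniCfgs` class (an assumption introduced only for this example) is used here with the same overwrite-attributes-from-a-dict logic as `Cfgs.add_args`.

```python
# Stand-in for cfgs/base_cfgs.py's Cfgs (which requires the dataset paths);
# add_args is identical in spirit: overwrite attributes from a dict.
class MiniCfgs:
    def __init__(self):
        self.LR_BASE = 0.0001
        self.BATCH_SIZE = 64

    def add_args(self, args_dict):
        for arg in args_dict:
            setattr(self, arg, args_dict[arg])

cfg = MiniCfgs()
cfg.add_args({'LR_BASE': 0.0002, 'BATCH_SIZE': 32})
```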

# Training

To start training, one has to specify an experiment directory `EXP_NAME` where all the results (log files, checkpoints, tensorboard files, etc.) will be saved. Furthermore, one needs to specify the `MODEL_TYPE` of the VLCN to be trained.

| <center>MODEL_TYPE</center> | <center>Description</center> |
| :---: | :---: |
| 1 | VLCN |
| 2 | VLCN-FLF |
| 3 | VLCN+LSTM |
| 4 | MCAN |

These parameters can be set inline, e.g. by executing

```bash
python run.py --EXP_NAME experiment --MODEL_TYPE 1 --DATA_PATH /data/MSVD-QA --GPU 1 --SEED 42
```
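
The flags in the command above suggest a command-line interface roughly like the following sketch. This is hypothetical: `run.py` itself is not shown in this commit excerpt, so the flag names come from the example and the types are inferred.

```python
import argparse

# Hypothetical sketch of run.py's CLI; flag names taken from the example
# command above, types inferred (MODEL_TYPE maps to the table of variants).
parser = argparse.ArgumentParser()
parser.add_argument('--EXP_NAME', type=str)
parser.add_argument('--MODEL_TYPE', type=int, choices=[1, 2, 3, 4])
parser.add_argument('--DATA_PATH', type=str)
parser.add_argument('--GPU', type=str)
parser.add_argument('--SEED', type=int)

args = parser.parse_args(['--EXP_NAME', 'experiment', '--MODEL_TYPE', '1', '--SEED', '42'])
```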

# Pre-trained models

Our pre-trained models are available here [⬇](https://drive.google.com/drive/folders/172yj4iUkF1U1WOPdA5KuKOTQXkgzFEzS).

# Acknowledgements

We thank the Vision and Language Group @ MIL for their open-source [MCAN](https://github.com/MILVLG/mcan-vqa) implementation, [DavideA](https://github.com/DavideA/c3d-pytorch/blob/master/C3D_model.py) for his pretrained C3D model, and finally [ixaxaar](https://github.com/ixaxaar/pytorch-dnc) for his DNC implementation.
0 assets/.gitkeep Normal file

BIN assets/overview_project_one.png Normal file
Binary file not shown.
After Width: | Height: | Size: 314 KiB

BIN assets/structure.png Normal file
Binary file not shown.
After Width: | Height: | Size: 18 KiB

0 cfgs/.gitkeep Normal file

267 cfgs/base_cfgs.py Normal file

@@ -0,0 +1,267 @@
# --------------------------------------------------------
# mcan-vqa (Deep Modular Co-Attention Networks)
# Licensed under The MIT License [see LICENSE for details]
# Written by Yuhao Cui https://github.com/cuiyuhao1996
# --------------------------------------------------------

from cfgs.path_cfgs import PATH

import os, torch, random
import numpy as np
from types import MethodType


class Cfgs(PATH):
    def __init__(self, EXP_NAME, DATASET_PATH):
        super(Cfgs, self).__init__(EXP_NAME, DATASET_PATH)

        # Set devices
        # For multi-GPU training, set e.g. '0, 1, 2' instead
        self.GPU = '0'

        # Set RNG seed for CPU and GPUs
        self.SEED = random.randint(0, 99999999)

        # -------------------------
        # ---- Version Control ----
        # -------------------------

        # Define a specific name to start new training
        # self.VERSION = 'Anonymous_' + str(self.SEED)
        self.VERSION = str(self.SEED)

        # Resume training
        self.RESUME = False

        # Used when resuming training and when testing
        self.CKPT_VERSION = self.VERSION
        self.CKPT_EPOCH = 0

        # Absolute checkpoint path; 'CKPT_VERSION' and 'CKPT_EPOCH' will be overridden
        self.CKPT_PATH = None

        # Print loss every step
        self.VERBOSE = True

        # ------------------------------
        # ---- Data Provider Params ----
        # ------------------------------

        # {'train', 'val', 'test'}
        self.RUN_MODE = 'train'

        # Set True to evaluate offline
        self.EVAL_EVERY_EPOCH = True

        # # Define the 'train' 'val' 'test' data split
        # # (EVAL_EVERY_EPOCH triggered when set {'train': 'train'})
        # self.SPLIT = {
        #     'train': '',
        #     'val': 'val',
        #     'test': 'test',
        # }
        # # An external method to set the train split
        # self.TRAIN_SPLIT = 'train+val+vg'

        # Set True to use pretrained word embeddings
        # (GloVe: spaCy https://spacy.io/)
        self.USE_GLOVE = True

        # Word embedding matrix size
        # (token size x WORD_EMBED_SIZE)
        self.WORD_EMBED_SIZE = 300

        # Max length of question sentences
        self.MAX_TOKEN = 15

        # VGG 4096-D features
        self.FRAME_FEAT_SIZE = 4096

        # C3D 4096-D features
        self.CLIP_FEAT_SIZE = 4096

        self.NUM_ANS = 1000

        # Default training batch size: 64
        self.BATCH_SIZE = 64

        # Multi-thread I/O
        self.NUM_WORKERS = 8

        # Use pinned memory
        # (Warning: pinned memory can accelerate GPU loading but may
        # increase CPU memory usage when NUM_WORKERS is large)
        self.PIN_MEM = True

        # Large models cannot train with batch size 64
        # Gradient accumulation can split a batch to reduce GPU memory usage
        # (Warning: BATCH_SIZE should be divisible by GRAD_ACCU_STEPS)
        self.GRAD_ACCU_STEPS = 1

        # Set 'external': use an external shuffle method for training shuffle
        # Set 'internal': use the pytorch dataloader's default shuffle method
        self.SHUFFLE_MODE = 'external'

        # ------------------------
        # ---- Network Params ----
        # ------------------------

        # Model depth
        # (Encoder and Decoder have the same depth)
        self.LAYER = 6

        # Model hidden size
        # (512 by default; larger values sharply increase GPU memory usage)
        self.HIDDEN_SIZE = 512

        # Number of heads in MCA layers
        # (Warning: HIDDEN_SIZE should be divisible by MULTI_HEAD)
        self.MULTI_HEAD = 8

        # Dropout rate for all dropout layers
        # (dropout can prevent overfitting: [Dropout: a simple way to prevent neural networks from overfitting])
        self.DROPOUT_R = 0.1

        # MLP size in flatten layers
        self.FLAT_MLP_SIZE = 512

        # Flatten the last hidden state to a vector with {n} attention glimpses
        self.FLAT_GLIMPSES = 1
        self.FLAT_OUT_SIZE = 1024

        # --------------------------
        # ---- Optimizer Params ----
        # --------------------------

        # The base learning rate
        self.LR_BASE = 0.0001

        # Learning rate decay ratio
        self.LR_DECAY_R = 0.2

        # Learning rate decay at {x, y, z...} epochs
        self.LR_DECAY_LIST = [10, 12]

        # Max training epochs
        self.MAX_EPOCH = 30

        # Gradient clipping
        # (default: -1 means not used)
        self.GRAD_NORM_CLIP = -1

        # Adam optimizer betas and eps
        self.OPT_BETAS = (0.9, 0.98)
        self.OPT_EPS = 1e-9
        self.OPT_WEIGHT_DECAY = 1e-5

        # --------------------------
        # ---- DNC Hyper-Params ----
        # --------------------------
        self.IN_SIZE_DNC = self.HIDDEN_SIZE
        self.OUT_SIZE_DNC = self.HIDDEN_SIZE
        self.WORD_LENGTH_DNC = 512
        self.CELL_COUNT_DNC = 64
        self.MEM_HIDDEN_SIZE = self.CELL_COUNT_DNC * self.WORD_LENGTH_DNC
        self.N_READ_HEADS_DNC = 4

    def parse_to_dict(self, args):
        args_dict = {}
        for arg in dir(args):
            if not arg.startswith('_') and not isinstance(getattr(args, arg), MethodType):
                if getattr(args, arg) is not None:
                    args_dict[arg] = getattr(args, arg)

        return args_dict

    def add_args(self, args_dict):
        for arg in args_dict:
            setattr(self, arg, args_dict[arg])

    def proc(self):
        assert self.RUN_MODE in ['train', 'val', 'test']

        # ------------ Devices setup
        # os.environ['CUDA_VISIBLE_DEVICES'] = self.GPU
        self.N_GPU = len(self.GPU.split(','))
        self.DEVICES = [_ for _ in range(self.N_GPU)]
        torch.set_num_threads(2)

        # ------------ Seed setup
        # fix pytorch seed
        torch.manual_seed(self.SEED)
        if self.N_GPU < 2:
            torch.cuda.manual_seed(self.SEED)
        else:
            torch.cuda.manual_seed_all(self.SEED)
        torch.backends.cudnn.deterministic = True

        # fix numpy seed
        np.random.seed(self.SEED)

        # fix random seed
        random.seed(self.SEED)

        if self.CKPT_PATH is not None:
            print('Warning: you are now using the CKPT_PATH arg; '
                  'CKPT_VERSION and CKPT_EPOCH will not work')
            self.CKPT_VERSION = self.CKPT_PATH.split('/')[-1] + '_' + str(random.randint(0, 99999999))

        # ------------ Split setup
        self.SPLIT['train'] = self.TRAIN_SPLIT
        if 'val' in self.SPLIT['train'].split('+') or self.RUN_MODE not in ['train']:
            self.EVAL_EVERY_EPOCH = False

        if self.RUN_MODE not in ['test']:
            self.TEST_SAVE_PRED = False

        # ------------ Gradient accumulation setup
        assert self.BATCH_SIZE % self.GRAD_ACCU_STEPS == 0
        self.SUB_BATCH_SIZE = int(self.BATCH_SIZE / self.GRAD_ACCU_STEPS)

        # Using a small eval batch size will reduce GPU memory usage
        self.EVAL_BATCH_SIZE = 32

        # ------------ Networks setup
        # FeedForwardNet size in every MCA layer
        self.FF_SIZE = int(self.HIDDEN_SIZE * 4)
        self.FF_MEM_SIZE = int()

        # Per-head hidden size in attention computation
        assert self.HIDDEN_SIZE % self.MULTI_HEAD == 0
        self.HIDDEN_SIZE_HEAD = int(self.HIDDEN_SIZE / self.MULTI_HEAD)

    def __str__(self):
        for attr in dir(self):
            if not attr.startswith('__') and not isinstance(getattr(self, attr), MethodType):
                print('{ %-17s }->' % attr, getattr(self, attr))

        return ''

    def check_path(self):
        print('Checking dataset ...')

        if not os.path.exists(self.FRAMES):
            print(self.FRAMES + ' NOT EXIST')
            exit(-1)

        if not os.path.exists(self.CLIPS):
            print(self.CLIPS + ' NOT EXIST')
            exit(-1)

        for mode in self.QA_PATH:
            if not os.path.exists(self.QA_PATH[mode]):
                print(self.QA_PATH[mode] + ' NOT EXIST')
                exit(-1)

        print('Finished')
        print('')
6 cfgs/fusion_cfgs.yml Normal file

@@ -0,0 +1,6 @@
CONTROLLER_INPUT_SIZE: 512
CONTROLLER_HIDDEN_SIZE: 512
CONTROLLER_NUM_LAYERS: 2
HIDDEN_DIM_COMP: 1024
OUT_DIM_COMP: 512
COMP_NUM_LAYERS: 2
61 cfgs/path_cfgs.py Normal file

@@ -0,0 +1,61 @@
# --------------------------------------------------------
# mcan-vqa (Deep Modular Co-Attention Networks)
# Licensed under The MIT License [see LICENSE for details]
# Written by Yuhao Cui https://github.com/cuiyuhao1996
# --------------------------------------------------------

import os

class PATH:
    def __init__(self, EXP_NAME, DATASET_PATH):
        # name of the experiment
        self.EXP_NAME = EXP_NAME

        # Dataset root path
        self.DATASET_PATH = DATASET_PATH

        # Bottom-up features root paths
        self.FRAMES = os.path.join(DATASET_PATH, 'frame_feat/')
        self.CLIPS = os.path.join(DATASET_PATH, 'clip_feat/')

    def init_path(self):
        self.QA_PATH = {
            'train': self.DATASET_PATH + 'train_qa.json',
            'val': self.DATASET_PATH + 'val_qa.json',
            'test': self.DATASET_PATH + 'test_qa.json',
        }
        self.C3D_PATH = self.DATASET_PATH + 'c3d.pickle'

        if self.EXP_NAME not in os.listdir('./'):
            os.mkdir('./' + self.EXP_NAME)
            os.mkdir('./' + self.EXP_NAME + '/results')
        self.RESULT_PATH = './' + self.EXP_NAME + '/results/result_test/'
        self.PRED_PATH = './' + self.EXP_NAME + '/results/pred/'
        self.CACHE_PATH = './' + self.EXP_NAME + '/results/cache/'
        self.LOG_PATH = './' + self.EXP_NAME + '/results/log/'
        self.TB_PATH = './' + self.EXP_NAME + '/results/tensorboard/'
        self.CKPTS_PATH = './' + self.EXP_NAME + '/ckpts/'

        if 'result_test' not in os.listdir('./' + self.EXP_NAME + '/results'):
            os.mkdir('./' + self.EXP_NAME + '/results/result_test/')

        if 'pred' not in os.listdir('./' + self.EXP_NAME + '/results'):
            os.mkdir('./' + self.EXP_NAME + '/results/pred/')

        if 'cache' not in os.listdir('./' + self.EXP_NAME + '/results'):
            os.mkdir('./' + self.EXP_NAME + '/results/cache')

        if 'log' not in os.listdir('./' + self.EXP_NAME + '/results'):
            os.mkdir('./' + self.EXP_NAME + '/results/log')

        if 'tensorboard' not in os.listdir('./' + self.EXP_NAME + '/results'):
            os.mkdir('./' + self.EXP_NAME + '/results/tensorboard')

        if 'ckpts' not in os.listdir('./' + self.EXP_NAME):
            os.mkdir('./' + self.EXP_NAME + '/ckpts')

    def check_path(self):
        raise NotImplementedError
13 cfgs/small_model.yml Normal file

@@ -0,0 +1,13 @@
LAYER: 6
HIDDEN_SIZE: 512
MEM_HIDDEN_SIZE: 2048
MULTI_HEAD: 8
DROPOUT_R: 0.1
FLAT_MLP_SIZE: 512
FLAT_GLIMPSES: 1
FLAT_OUT_SIZE: 1024
LR_BASE: 0.0001
LR_DECAY_R: 0.2
GRAD_ACCU_STEPS: 1
CKPT_VERSION: 'small'
CKPT_EPOCH: 13
0 code/.gitkeep Normal file

0 code/assets/.gitkeep Normal file

BIN code/assets/structure.png Normal file
Binary file not shown.
After Width: | Height: | Size: 18 KiB

0 code/cfgs/.gitkeep Normal file

267 code/cfgs/base_cfgs.py Normal file

@@ -0,0 +1,267 @@
# --------------------------------------------------------
# mcan-vqa (Deep Modular Co-Attention Networks)
# Licensed under The MIT License [see LICENSE for details]
# Written by Yuhao Cui https://github.com/cuiyuhao1996
# --------------------------------------------------------

from cfgs.path_cfgs import PATH

import os, torch, random
import numpy as np
from types import MethodType


class Cfgs(PATH):
    def __init__(self, EXP_NAME, DATASET_PATH):
        super(Cfgs, self).__init__(EXP_NAME, DATASET_PATH)

        # Set devices
        # For multi-GPU training, set e.g. '0, 1, 2' instead
        self.GPU = '0'

        # Set RNG seed for CPU and GPUs
        self.SEED = random.randint(0, 99999999)

        # -------------------------
        # ---- Version Control ----
        # -------------------------

        # Define a specific name to start new training
        # self.VERSION = 'Anonymous_' + str(self.SEED)
        self.VERSION = str(self.SEED)

        # Resume training
        self.RESUME = False

        # Used when resuming training and when testing
        self.CKPT_VERSION = self.VERSION
        self.CKPT_EPOCH = 0

        # Absolute checkpoint path; 'CKPT_VERSION' and 'CKPT_EPOCH' will be overridden
        self.CKPT_PATH = None

        # Print loss every step
        self.VERBOSE = True

        # ------------------------------
        # ---- Data Provider Params ----
        # ------------------------------

        # {'train', 'val', 'test'}
        self.RUN_MODE = 'train'

        # Set True to evaluate offline
        self.EVAL_EVERY_EPOCH = True

        # # Define the 'train' 'val' 'test' data split
        # # (EVAL_EVERY_EPOCH triggered when set {'train': 'train'})
        # self.SPLIT = {
        #     'train': '',
        #     'val': 'val',
        #     'test': 'test',
        # }
        # # An external method to set the train split
        # self.TRAIN_SPLIT = 'train+val+vg'

        # Set True to use pretrained word embeddings
        # (GloVe: spaCy https://spacy.io/)
        self.USE_GLOVE = True

        # Word embedding matrix size
        # (token size x WORD_EMBED_SIZE)
        self.WORD_EMBED_SIZE = 300

        # Max length of question sentences
        self.MAX_TOKEN = 15

        # VGG 4096-D features
        self.FRAME_FEAT_SIZE = 4096

        # C3D 4096-D features
        self.CLIP_FEAT_SIZE = 4096

        self.NUM_ANS = 1000

        # Default training batch size: 64
        self.BATCH_SIZE = 64

        # Multi-thread I/O
        self.NUM_WORKERS = 8

        # Use pinned memory
        # (Warning: pinned memory can accelerate GPU loading but may
        # increase CPU memory usage when NUM_WORKERS is large)
        self.PIN_MEM = True

        # Large models cannot train with batch size 64
        # Gradient accumulation can split a batch to reduce GPU memory usage
        # (Warning: BATCH_SIZE should be divisible by GRAD_ACCU_STEPS)
        self.GRAD_ACCU_STEPS = 1

        # Set 'external': use an external shuffle method for training shuffle
        # Set 'internal': use the pytorch dataloader's default shuffle method
        self.SHUFFLE_MODE = 'external'

        # ------------------------
        # ---- Network Params ----
        # ------------------------

        # Model depth
        # (Encoder and Decoder have the same depth)
        self.LAYER = 6

        # Model hidden size
        # (512 by default; larger values sharply increase GPU memory usage)
        self.HIDDEN_SIZE = 512

        # Number of heads in MCA layers
        # (Warning: HIDDEN_SIZE should be divisible by MULTI_HEAD)
        self.MULTI_HEAD = 8

        # Dropout rate for all dropout layers
        # (dropout can prevent overfitting: [Dropout: a simple way to prevent neural networks from overfitting])
        self.DROPOUT_R = 0.1

        # MLP size in flatten layers
        self.FLAT_MLP_SIZE = 512

        # Flatten the last hidden state to a vector with {n} attention glimpses
        self.FLAT_GLIMPSES = 1
        self.FLAT_OUT_SIZE = 1024

        # --------------------------
        # ---- Optimizer Params ----
        # --------------------------

        # The base learning rate
        self.LR_BASE = 0.0001

        # Learning rate decay ratio
        self.LR_DECAY_R = 0.2

        # Learning rate decay at {x, y, z...} epochs
        self.LR_DECAY_LIST = [10, 12]

        # Max training epochs
        self.MAX_EPOCH = 30

        # Gradient clipping
        # (default: -1 means not used)
        self.GRAD_NORM_CLIP = -1

        # Adam optimizer betas and eps
        self.OPT_BETAS = (0.9, 0.98)
        self.OPT_EPS = 1e-9
        self.OPT_WEIGHT_DECAY = 1e-5

        # --------------------------
        # ---- DNC Hyper-Params ----
        # --------------------------
        self.IN_SIZE_DNC = self.HIDDEN_SIZE
        self.OUT_SIZE_DNC = self.HIDDEN_SIZE
        self.WORD_LENGTH_DNC = 512
        self.CELL_COUNT_DNC = 64
        self.MEM_HIDDEN_SIZE = self.CELL_COUNT_DNC * self.WORD_LENGTH_DNC
        self.N_READ_HEADS_DNC = 4

    def parse_to_dict(self, args):
        args_dict = {}
        for arg in dir(args):
            if not arg.startswith('_') and not isinstance(getattr(args, arg), MethodType):
                if getattr(args, arg) is not None:
                    args_dict[arg] = getattr(args, arg)

        return args_dict

    def add_args(self, args_dict):
        for arg in args_dict:
            setattr(self, arg, args_dict[arg])

    def proc(self):
        assert self.RUN_MODE in ['train', 'val', 'test']

        # ------------ Devices setup
        # os.environ['CUDA_VISIBLE_DEVICES'] = self.GPU
        self.N_GPU = len(self.GPU.split(','))
        self.DEVICES = [_ for _ in range(self.N_GPU)]
        torch.set_num_threads(2)

        # ------------ Seed setup
        # fix pytorch seed
        torch.manual_seed(self.SEED)
        if self.N_GPU < 2:
            torch.cuda.manual_seed(self.SEED)
        else:
            torch.cuda.manual_seed_all(self.SEED)
        torch.backends.cudnn.deterministic = True

        # fix numpy seed
        np.random.seed(self.SEED)

        # fix random seed
        random.seed(self.SEED)

        if self.CKPT_PATH is not None:
            print('Warning: you are now using the CKPT_PATH arg; '
                  'CKPT_VERSION and CKPT_EPOCH will not work')
            self.CKPT_VERSION = self.CKPT_PATH.split('/')[-1] + '_' + str(random.randint(0, 99999999))

        # ------------ Split setup
        self.SPLIT['train'] = self.TRAIN_SPLIT
        if 'val' in self.SPLIT['train'].split('+') or self.RUN_MODE not in ['train']:
            self.EVAL_EVERY_EPOCH = False

        if self.RUN_MODE not in ['test']:
            self.TEST_SAVE_PRED = False

        # ------------ Gradient accumulation setup
        assert self.BATCH_SIZE % self.GRAD_ACCU_STEPS == 0
        self.SUB_BATCH_SIZE = int(self.BATCH_SIZE / self.GRAD_ACCU_STEPS)

        # Using a small eval batch size will reduce GPU memory usage
        self.EVAL_BATCH_SIZE = 32

        # ------------ Networks setup
        # FeedForwardNet size in every MCA layer
        self.FF_SIZE = int(self.HIDDEN_SIZE * 4)
        self.FF_MEM_SIZE = int()

        # Per-head hidden size in attention computation
        assert self.HIDDEN_SIZE % self.MULTI_HEAD == 0
        self.HIDDEN_SIZE_HEAD = int(self.HIDDEN_SIZE / self.MULTI_HEAD)

    def __str__(self):
        for attr in dir(self):
            if not attr.startswith('__') and not isinstance(getattr(self, attr), MethodType):
                print('{ %-17s }->' % attr, getattr(self, attr))

        return ''

    def check_path(self):
        print('Checking dataset ...')

        if not os.path.exists(self.FRAMES):
            print(self.FRAMES + ' NOT EXIST')
            exit(-1)

        if not os.path.exists(self.CLIPS):
            print(self.CLIPS + ' NOT EXIST')
            exit(-1)

        for mode in self.QA_PATH:
            if not os.path.exists(self.QA_PATH[mode]):
                print(self.QA_PATH[mode] + ' NOT EXIST')
                exit(-1)

        print('Finished')
        print('')
6 code/cfgs/fusion_cfgs.yml Normal file

@@ -0,0 +1,6 @@
CONTROLLER_INPUT_SIZE: 512
CONTROLLER_HIDDEN_SIZE: 512
CONTROLLER_NUM_LAYERS: 2
HIDDEN_DIM_COMP: 1024
OUT_DIM_COMP: 512
COMP_NUM_LAYERS: 2
61 code/cfgs/path_cfgs.py Normal file

@@ -0,0 +1,61 @@
# --------------------------------------------------------
# mcan-vqa (Deep Modular Co-Attention Networks)
# Licensed under The MIT License [see LICENSE for details]
# Written by Yuhao Cui https://github.com/cuiyuhao1996
# --------------------------------------------------------

import os

class PATH:
    def __init__(self, EXP_NAME, DATASET_PATH):
        # name of the experiment
        self.EXP_NAME = EXP_NAME

        # Dataset root path
        self.DATASET_PATH = DATASET_PATH

        # Bottom-up features root paths
        self.FRAMES = os.path.join(DATASET_PATH, 'frame_feat/')
        self.CLIPS = os.path.join(DATASET_PATH, 'clip_feat/')

    def init_path(self):
        self.QA_PATH = {
            'train': self.DATASET_PATH + 'train_qa.json',
            'val': self.DATASET_PATH + 'val_qa.json',
            'test': self.DATASET_PATH + 'test_qa.json',
        }
        self.C3D_PATH = self.DATASET_PATH + 'c3d.pickle'

        if self.EXP_NAME not in os.listdir('./'):
            os.mkdir('./' + self.EXP_NAME)
            os.mkdir('./' + self.EXP_NAME + '/results')
        self.RESULT_PATH = './' + self.EXP_NAME + '/results/result_test/'
        self.PRED_PATH = './' + self.EXP_NAME + '/results/pred/'
        self.CACHE_PATH = './' + self.EXP_NAME + '/results/cache/'
        self.LOG_PATH = './' + self.EXP_NAME + '/results/log/'
        self.TB_PATH = './' + self.EXP_NAME + '/results/tensorboard/'
        self.CKPTS_PATH = './' + self.EXP_NAME + '/ckpts/'

        if 'result_test' not in os.listdir('./' + self.EXP_NAME + '/results'):
            os.mkdir('./' + self.EXP_NAME + '/results/result_test/')

        if 'pred' not in os.listdir('./' + self.EXP_NAME + '/results'):
            os.mkdir('./' + self.EXP_NAME + '/results/pred/')

        if 'cache' not in os.listdir('./' + self.EXP_NAME + '/results'):
            os.mkdir('./' + self.EXP_NAME + '/results/cache')

        if 'log' not in os.listdir('./' + self.EXP_NAME + '/results'):
            os.mkdir('./' + self.EXP_NAME + '/results/log')

        if 'tensorboard' not in os.listdir('./' + self.EXP_NAME + '/results'):
            os.mkdir('./' + self.EXP_NAME + '/results/tensorboard')

        if 'ckpts' not in os.listdir('./' + self.EXP_NAME):
            os.mkdir('./' + self.EXP_NAME + '/ckpts')

    def check_path(self):
        raise NotImplementedError
13 code/cfgs/small_model.yml Normal file

@@ -0,0 +1,13 @@
LAYER: 6
HIDDEN_SIZE: 512
MEM_HIDDEN_SIZE: 2048
MULTI_HEAD: 8
DROPOUT_R: 0.1
FLAT_MLP_SIZE: 512
FLAT_GLIMPSES: 1
FLAT_OUT_SIZE: 1024
LR_BASE: 0.0001
LR_DECAY_R: 0.2
GRAD_ACCU_STEPS: 1
CKPT_VERSION: 'small'
CKPT_EPOCH: 13
0 code/core/.gitkeep Normal file

0 code/core/data/.gitkeep Normal file

103 code/core/data/dataset.py Normal file

@@ -0,0 +1,103 @@
import glob, os, json, pickle
|
||||
import numpy as np
|
||||
from collections import defaultdict
|
||||
|
||||
import torch
|
||||
from torch.utils.data import Dataset
|
||||
import torchvision.transforms as transforms
|
||||
|
||||
from core.data.utils import tokenize, ans_stat, proc_ques, qlen_to_key, ans_to_key
|
||||
|
||||
|
class VideoQA_Dataset(Dataset):
    def __init__(self, __C):
        super(VideoQA_Dataset, self).__init__()
        self.__C = __C
        self.ans_size = __C.NUM_ANS

        # load raw data
        with open(__C.QA_PATH[__C.RUN_MODE], 'r') as f:
            self.raw_data = json.load(f)
        self.data_size = len(self.raw_data)

        splits = __C.SPLIT[__C.RUN_MODE].split('+')

        frames_list = glob.glob(__C.FRAMES + '*.pt')
        clips_list = glob.glob(__C.CLIPS + '*.pt')
        # Video ids are embedded in the file names ('vidXXX.pt' for MSVD,
        # 'videoXXX.pt' for MSRVTT).
        if 'msvd' in self.__C.DATASET_PATH.lower():
            vid_ids = [int(s.split('/')[-1].split('.')[0][3:]) for s in frames_list]
        else:
            vid_ids = [int(s.split('/')[-1].split('.')[0][5:]) for s in frames_list]
        self.frames_dict = {k: v for (k, v) in zip(vid_ids, frames_list)}
        self.clips_dict = {k: v for (k, v) in zip(vid_ids, clips_list)}
        del frames_list, clips_list

        # Collect all questions and count answer frequencies over the
        # train and val splits.
        q_list = []
        a_list = []
        a_dict = defaultdict(lambda: 0)
        for split in ['train', 'val']:
            with open(__C.QA_PATH[split], 'r') as f:
                qa_data = json.load(f)
            for d in qa_data:
                q_list.append(d['question'])
                a_list.append(d['answer'])
                a_dict[d['answer']] += 1

        top_answers = sorted(a_dict, key=a_dict.get, reverse=True)
        self.qlen_bins_to_idx = {
            '1-3': 0,
            '4-8': 1,
            '9-15': 2,
        }
        self.ans_rare_to_idx = {
            '0-99': 0,
            '100-299': 1,
            '300-999': 2,
        }
        self.qtypes_to_idx = {
            'what': 0,
            'who': 1,
            'how': 2,
            'when': 3,
            'where': 4,
        }

        if __C.RUN_MODE == 'train':
            self.ans_list = top_answers[:self.ans_size]

            self.ans_to_ix, self.ix_to_ans = ans_stat(self.ans_list)

            self.token_to_ix, self.pretrained_emb = tokenize(q_list, __C.USE_GLOVE)
            self.token_size = len(self.token_to_ix)
            print('== Question token vocab size:', self.token_size)

        self.idx_to_qtypes = {v: k for (k, v) in self.qtypes_to_idx.items()}
        self.idx_to_qlen_bins = {v: k for (k, v) in self.qlen_bins_to_idx.items()}
        self.idx_to_ans_rare = {v: k for (k, v) in self.ans_rare_to_idx.items()}

    def __getitem__(self, idx):
        sample = self.raw_data[idx]
        ques = sample['question']
        q_type = self.qtypes_to_idx[ques.split(' ')[0]]
        ques_idx, qlen, _ = proc_ques(ques, self.token_to_ix, self.__C.MAX_TOKEN)
        qlen_bin = self.qlen_bins_to_idx[qlen_to_key(qlen)]

        answer = sample['answer']
        # Out-of-vocabulary answers are mapped to a random in-vocabulary index.
        answer = self.ans_to_ix.get(answer, np.random.randint(0, high=len(self.ans_list)))
        ans_rarity = self.ans_rare_to_idx[ans_to_key(answer)]

        answer_one_hot = torch.zeros(self.ans_size)
        answer_one_hot[answer] = 1.0

        vid_id = sample['video_id']
        frames = torch.load(open(self.frames_dict[vid_id], 'rb')).cpu()
        clips = torch.load(open(self.clips_dict[vid_id], 'rb')).cpu()

        return torch.from_numpy(ques_idx).long(), frames, clips, answer_one_hot, \
            torch.tensor(answer).long(), torch.tensor(q_type).long(), \
            torch.tensor(qlen_bin).long(), torch.tensor(ans_rarity).long()

    def __len__(self):
        return self.data_size
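`__getitem__` above turns the answer string into both a class index and a one-hot target, mapping out-of-vocabulary answers to a random in-vocabulary index. A minimal sketch of that encoding with NumPy and a toy answer vocabulary (the names here are illustrative, not from the repo):

```python
import numpy as np

def encode_answer(answer, ans_to_ix, ans_size, rng=np.random):
    # Map the answer string to its class index; unknown answers fall back
    # to a random in-vocabulary index, as in VideoQA_Dataset.__getitem__.
    idx = ans_to_ix.get(answer, rng.randint(0, len(ans_to_ix)))
    one_hot = np.zeros(ans_size, dtype=np.float32)
    one_hot[idx] = 1.0
    return idx, one_hot

idx, target = encode_answer('dog', {'cat': 0, 'dog': 1, 'ball': 2}, ans_size=3)
# idx == 1 and target == [0., 1., 0.]
```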
182
code/core/data/preprocess.py
Normal file

@@ -0,0 +1,182 @@
import os
import sys
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

import skvideo.io as skv
import torch
import pickle
from PIL import Image
import tqdm
import numpy as np
from model.C3D import C3D
import json
from torchvision.models import vgg19
import torchvision.transforms as transforms
import torch.nn as nn
import argparse


def _select_frames(path, frame_num):
    """Select representative frames for a video.

    Frames at the very beginning and end of the video are ignored.

    Args:
        path: Path of the video.
        frame_num: Number of frames to sample.
    Returns:
        frames: List of frames as (224, 224, 3) arrays.
    """
    frames = list()
    video_data = skv.vread(path)
    total_frames = video_data.shape[0]
    # Sample frame_num indices evenly spaced over the interior of the video.
    for i in np.linspace(0, total_frames, frame_num + 2)[1:frame_num + 1]:
        frame_data = video_data[int(i)]
        img = Image.fromarray(frame_data)
        img = img.resize((224, 224), Image.BILINEAR)
        frame_data = np.array(img)
        frames.append(frame_data)
    return frames
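The sampling in `_select_frames` takes `frame_num + 2` evenly spaced points over `[0, total_frames]` and keeps only the interior ones, so the first and last frames are never picked. A quick sketch of just that index selection:

```python
import numpy as np

def interior_indices(total_frames, frame_num):
    # frame_num + 2 evenly spaced points; drop the two endpoints.
    return [int(i) for i in np.linspace(0, total_frames, frame_num + 2)[1:frame_num + 1]]

print(interior_indices(100, 3))  # indices strictly inside (0, 100)
```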
def _select_clips(path, clip_num):
    """Select clip_num clips for a video. Each clip has 16 frames.

    Args:
        path: Path of the video.
        clip_num: Number of clips to sample.
    Returns:
        clips: List of clips, each a list of 16 (112, 112, 3) frames.
    """
    clips = list()
    # video_info = skvideo.io.ffprobe(path)
    video_data = skv.vread(path)
    total_frames = video_data.shape[0]
    height = video_data.shape[1]
    width = video_data.shape[2]
    for i in np.linspace(0, total_frames, clip_num + 2)[1:clip_num + 1]:
        # Select the center frame first, then include the surrounding frames.
        clip_start = int(i) - 8
        clip_end = int(i) + 8
        # Shift the 16-frame window so it stays inside the video.
        if clip_start < 0:
            clip_end = clip_end - clip_start
            clip_start = 0
        if clip_end > total_frames:
            clip_start = clip_start - (clip_end - total_frames)
            clip_end = total_frames
        clip = video_data[clip_start:clip_end]
        new_clip = []
        for j in range(16):
            frame_data = clip[j]
            img = Image.fromarray(frame_data)
            img = img.resize((112, 112), Image.BILINEAR)
            frame_data = np.array(img) * 1.0
            # frame_data -= self.mean[j]
            new_clip.append(frame_data)
        clips.append(new_clip)
    return clips
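The window handling in `_select_clips` keeps every 16-frame clip inside the video bounds by shifting the window rather than shrinking it. Isolated as a pure function (illustrative, not part of the repo):

```python
def clamp_window(center, total_frames, size=16):
    """Shift a size-frame window centered at `center` so it fits in [0, total_frames]."""
    start, end = center - size // 2, center + size // 2
    if start < 0:
        end -= start  # shift right by the overshoot
        start = 0
    if end > total_frames:
        start -= end - total_frames  # shift left by the overshoot
        end = total_frames
    return start, end

print(clamp_window(3, 100))   # window shifted right: (0, 16)
print(clamp_window(98, 100))  # window shifted left: (84, 100)
```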
def preprocess_videos(video_dir, frame_num, clip_num):
    frames_dir = os.path.join(os.path.dirname(video_dir), 'frames')
    os.makedirs(frames_dir, exist_ok=True)

    clips_dir = os.path.join(os.path.dirname(video_dir), 'clips')
    os.makedirs(clips_dir, exist_ok=True)

    for video_name in tqdm.tqdm(os.listdir(video_dir)):
        video_path = os.path.join(video_dir, video_name)
        frames = _select_frames(video_path, frame_num)
        clips = _select_clips(video_path, clip_num)

        with open(os.path.join(frames_dir, video_name.split('.')[0] + '.pkl'), "wb") as f:
            pickle.dump(frames, f, protocol=pickle.HIGHEST_PROTOCOL)

        with open(os.path.join(clips_dir, video_name.split('.')[0] + '.pkl'), "wb") as f:
            pickle.dump(clips, f, protocol=pickle.HIGHEST_PROTOCOL)


def generate_video_features(path_frames, path_clips, c3d_path):
    device = torch.device('cuda:0')
    frame_feat_dir = os.path.join(os.path.dirname(path_frames), 'frame_feat')
    os.makedirs(frame_feat_dir, exist_ok=True)

    clip_feat_dir = os.path.join(os.path.dirname(path_frames), 'clip_feat')
    os.makedirs(clip_feat_dir, exist_ok=True)

    # Frame features come from a VGG-19 with its last fc layer removed;
    # clip features come from a pretrained C3D.
    cnn = vgg19(pretrained=True)
    cnn.classifier = nn.Sequential(
        *list(cnn.classifier.children())[:-1])  # remove last fc layer
    cnn.to(device).eval()
    c3d = C3D()
    c3d.load_state_dict(torch.load(c3d_path))
    c3d.to(device).eval()
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.485, 0.456, 0.406),
                                                         (0.229, 0.224, 0.225))])
    for vid_name in tqdm.tqdm(os.listdir(path_frames)):
        frame_path = os.path.join(path_frames, vid_name)
        clip_path = os.path.join(path_clips, vid_name)

        frames = pickle.load(open(frame_path, 'rb'))
        clips = pickle.load(open(clip_path, 'rb'))

        frames = [transform(f) for f in frames]
        frame_feat = []
        clip_feat = []

        for frame in frames:
            with torch.no_grad():
                feat = cnn(frame.unsqueeze(0).to(device))
            frame_feat.append(feat)
        for clip in clips:
            # reorder clip from (f x h x w x c) to (c x f x h x w)
            clip = torch.from_numpy(np.float32(np.array(clip)))
            clip = clip.transpose(3, 0)
            clip = clip.transpose(3, 1)
            clip = clip.transpose(3, 2).unsqueeze(0).to(device)
            with torch.no_grad():
                feat = c3d(clip)
            clip_feat.append(feat)
        frame_feat = torch.cat(frame_feat, dim=0)
        clip_feat = torch.cat(clip_feat, dim=0)

        torch.save(frame_feat, os.path.join(frame_feat_dir, vid_name.split('.')[0] + '.pt'))
        torch.save(clip_feat, os.path.join(clip_feat_dir, vid_name.split('.')[0] + '.pt'))


def parse_args():
    '''
    Parse input arguments
    '''
    parser = argparse.ArgumentParser(description='Preprocessing Args')

    parser.add_argument('--RAW_VID_PATH', dest='RAW_VID_PATH',
                        help='The path to the raw videos',
                        required=True,
                        type=str)

    parser.add_argument('--FRAMES_OUTPUT_DIR', dest='FRAMES_OUTPUT_DIR',
                        help='The directory where the processed frames and their features will be stored',
                        required=True,
                        type=str)

    parser.add_argument('--CLIPS_OUTPUT_DIR', dest='CLIPS_OUTPUT_DIR',
                        help='The directory where the processed clips and their features will be stored',
                        required=True,
                        type=str)

    parser.add_argument('--C3D_PATH', dest='C3D_PATH',
                        help='Pretrained C3D path',
                        required=True,
                        type=str)

    parser.add_argument('--NUM_SAMPLES', dest='NUM_SAMPLES',
                        help='The number of frames/clips to be sampled from the video',
                        default=20,
                        type=int)

    args = parser.parse_args()
    return args


if __name__ == '__main__':
    args = parse_args()
    preprocess_videos(args.RAW_VID_PATH, args.NUM_SAMPLES, args.NUM_SAMPLES)
    frames_dir = os.path.join(os.path.dirname(args.RAW_VID_PATH), 'frames')
    clips_dir = os.path.join(os.path.dirname(args.RAW_VID_PATH), 'clips')
    generate_video_features(frames_dir, clips_dir, args.C3D_PATH)
81
code/core/data/utils.py
Normal file

@@ -0,0 +1,81 @@
import en_vectors_web_lg, random, re, json
import numpy as np


def tokenize(ques_list, use_glove):
    token_to_ix = {
        'PAD': 0,
        'UNK': 1,
    }

    spacy_tool = None
    pretrained_emb = []
    if use_glove:
        spacy_tool = en_vectors_web_lg.load()
        pretrained_emb.append(spacy_tool('PAD').vector)
        pretrained_emb.append(spacy_tool('UNK').vector)

    for ques in ques_list:
        words = re.sub(
            r"([.,'!?\"()*#:;])",
            '',
            ques.lower()
        ).replace('-', ' ').replace('/', ' ').split()

        for word in words:
            if word not in token_to_ix:
                token_to_ix[word] = len(token_to_ix)
                if use_glove:
                    pretrained_emb.append(spacy_tool(word).vector)

    pretrained_emb = np.array(pretrained_emb)

    return token_to_ix, pretrained_emb


def proc_ques(ques, token_to_ix, max_token):
    ques_ix = np.zeros(max_token, np.int64)

    words = re.sub(
        r"([.,'!?\"()*#:;])",
        '',
        ques.lower()
    ).replace('-', ' ').replace('/', ' ').split()
    q_len = 0
    for ix, word in enumerate(words):
        if word in token_to_ix:
            ques_ix[ix] = token_to_ix[word]
            q_len += 1
        else:
            ques_ix[ix] = token_to_ix['UNK']

        if ix + 1 == max_token:
            break

    return ques_ix, q_len, len(words)


def ans_stat(ans_list):
    ans_to_ix, ix_to_ans = {}, {}
    for i, ans in enumerate(ans_list):
        ans_to_ix[ans] = i
        ix_to_ans[i] = ans

    return ans_to_ix, ix_to_ans


def shuffle_list(ans_list):
    random.shuffle(ans_list)


def qlen_to_key(q_len):
    if 1 <= q_len <= 3:
        return '1-3'
    if 4 <= q_len <= 8:
        return '4-8'
    if 9 <= q_len:
        return '9-15'


def ans_to_key(ans_idx):
    if 0 <= ans_idx <= 99:
        return '0-99'
    if 100 <= ans_idx <= 299:
        return '100-299'
    if 300 <= ans_idx <= 999:
        return '300-999'
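`proc_ques` lowercases the question, strips punctuation, and maps each word to its vocabulary index, padding with zeros up to `max_token` and falling back to `UNK` for unseen words. A self-contained sketch of that encoding with a toy vocabulary (illustrative, not the repo's actual vocab):

```python
import re
import numpy as np

def encode_question(ques, token_to_ix, max_token):
    # Same normalization as proc_ques: strip punctuation, lowercase, split.
    words = re.sub(r"([.,'!?\"()*#:;])", '', ques.lower()).split()
    ques_ix = np.zeros(max_token, np.int64)
    for ix, word in enumerate(words[:max_token]):
        ques_ix[ix] = token_to_ix.get(word, token_to_ix['UNK'])
    return ques_ix

vocab = {'PAD': 0, 'UNK': 1, 'what': 2, 'is': 3, 'the': 4, 'man': 5, 'doing': 6}
print(encode_question('What is the man doing?', vocab, 8))
# -> [2 3 4 5 6 0 0 0]
```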
523
code/core/exec.py
Normal file

@@ -0,0 +1,523 @@
# --------------------------------------------------------
# mcan-vqa (Deep Modular Co-Attention Networks)
# Licensed under The MIT License [see LICENSE for details]
# Written by Yuhao Cui https://github.com/cuiyuhao1996
# --------------------------------------------------------

from core.data.dataset import VideoQA_Dataset
from core.model.net import Net1, Net2, Net3, Net4
from core.model.optim import get_optim, adjust_lr
from core.metrics import get_acc
from tqdm import tqdm
from core.data.utils import shuffle_list

import os, json, torch, datetime, pickle, copy, shutil, time, math
import numpy as np
import torch.nn as nn
import torch.utils.data as Data
from tensorboardX import SummaryWriter
from torch.autograd import Variable as var


class Execution:
    def __init__(self, __C):
        self.__C = __C
        print('Loading training set ........')
        __C_train = copy.deepcopy(self.__C)
        setattr(__C_train, 'RUN_MODE', 'train')
        self.dataset = VideoQA_Dataset(__C_train)

        self.dataset_eval = None
        if self.__C.EVAL_EVERY_EPOCH:
            __C_eval = copy.deepcopy(self.__C)
            setattr(__C_eval, 'RUN_MODE', 'val')

            print('Loading validation set for per-epoch evaluation ........')
            self.dataset_eval = VideoQA_Dataset(__C_eval)
            # Share the train split's answer vocabulary and token embeddings.
            self.dataset_eval.ans_list = self.dataset.ans_list
            self.dataset_eval.ans_to_ix, self.dataset_eval.ix_to_ans = self.dataset.ans_to_ix, self.dataset.ix_to_ans
            self.dataset_eval.token_to_ix, self.dataset_eval.pretrained_emb = self.dataset.token_to_ix, self.dataset.pretrained_emb

        __C_test = copy.deepcopy(self.__C)
        setattr(__C_test, 'RUN_MODE', 'test')

        self.dataset_test = VideoQA_Dataset(__C_test)
        self.dataset_test.ans_list = self.dataset.ans_list
        self.dataset_test.ans_to_ix, self.dataset_test.ix_to_ans = self.dataset.ans_to_ix, self.dataset.ix_to_ans
        self.dataset_test.token_to_ix, self.dataset_test.pretrained_emb = self.dataset.token_to_ix, self.dataset.pretrained_emb

        self.writer = SummaryWriter(self.__C.TB_PATH)
    def train(self, dataset, dataset_eval=None):
        # Obtain needed information
        data_size = dataset.data_size
        token_size = dataset.token_size
        ans_size = dataset.ans_size
        pretrained_emb = dataset.pretrained_emb
        net = self.construct_net(self.__C.MODEL_TYPE)
        if os.path.isfile(self.__C.PRETRAINED_PATH) and self.__C.MODEL_TYPE == 11:
            print('Loading pretrained DNC weights')
            net.load_pretrained_weights()
        net.cuda()
        net.train()

        # Define the multi-GPU training if needed
        if self.__C.N_GPU > 1:
            net = nn.DataParallel(net, device_ids=self.__C.DEVICES)

        # Define the binary cross-entropy loss
        # loss_fn = torch.nn.BCELoss(size_average=False).cuda()
        loss_fn = torch.nn.BCELoss(reduction='sum').cuda()
        # Load checkpoint if resuming training
        if self.__C.RESUME:
            print(' ========== Resume training')

            if self.__C.CKPT_PATH is not None:
                print('Warning: you are now using CKPT_PATH args, '
                      'CKPT_VERSION and CKPT_EPOCH will not work')

                path = self.__C.CKPT_PATH
            else:
                path = self.__C.CKPTS_PATH + \
                       'ckpt_' + self.__C.CKPT_VERSION + \
                       '/epoch' + str(self.__C.CKPT_EPOCH) + '.pkl'

            # Load the network parameters
            print('Loading ckpt {}'.format(path))
            ckpt = torch.load(path)
            print('Finish!')
            net.load_state_dict(ckpt['state_dict'])

            # Load the optimizer parameters
            optim = get_optim(self.__C, net, data_size, ckpt['optim'], lr_base=ckpt['lr_base'])
            optim._step = int(data_size / self.__C.BATCH_SIZE * self.__C.CKPT_EPOCH)
            optim.optimizer.load_state_dict(ckpt['optimizer'])

            start_epoch = self.__C.CKPT_EPOCH

        else:
            if ('ckpt_' + self.__C.VERSION) in os.listdir(self.__C.CKPTS_PATH):
                shutil.rmtree(self.__C.CKPTS_PATH + 'ckpt_' + self.__C.VERSION)

            os.mkdir(self.__C.CKPTS_PATH + 'ckpt_' + self.__C.VERSION)

            optim = get_optim(self.__C, net, data_size, self.__C.OPTIM)
            start_epoch = 0

        loss_sum = 0
        named_params = list(net.named_parameters())
        grad_norm = np.zeros(len(named_params))

        # Define multi-thread dataloader
        if self.__C.SHUFFLE_MODE in ['external']:
            dataloader = Data.DataLoader(
                dataset,
                batch_size=self.__C.BATCH_SIZE,
                shuffle=False,
                num_workers=self.__C.NUM_WORKERS,
                pin_memory=self.__C.PIN_MEM,
                drop_last=True
            )
        else:
            dataloader = Data.DataLoader(
                dataset,
                batch_size=self.__C.BATCH_SIZE,
                shuffle=True,
                num_workers=self.__C.NUM_WORKERS,
                pin_memory=self.__C.PIN_MEM,
                drop_last=True
            )

        # Training loop
        for epoch in range(start_epoch, self.__C.MAX_EPOCH):

            # Save log information
            logfile = open(
                self.__C.LOG_PATH +
                'log_run_' + self.__C.VERSION + '.txt',
                'a+'
            )
            logfile.write(
                'nowTime: ' +
                datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S') +
                '\n'
            )
            logfile.close()

            # Learning rate decay
            if epoch in self.__C.LR_DECAY_LIST:
                adjust_lr(optim, self.__C.LR_DECAY_R)

            # External shuffling
            if self.__C.SHUFFLE_MODE == 'external':
                shuffle_list(dataset.ans_list)

            time_start = time.time()
            # Iteration
            for step, (
                    ques_ix_iter,
                    frames_feat_iter,
                    clips_feat_iter,
                    ans_iter,
                    _,
                    _,
                    _,
                    _
            ) in enumerate(dataloader):

                ques_ix_iter = ques_ix_iter.cuda()
                frames_feat_iter = frames_feat_iter.cuda()
                clips_feat_iter = clips_feat_iter.cuda()
                ans_iter = ans_iter.cuda()

                optim.zero_grad()

                for accu_step in range(self.__C.GRAD_ACCU_STEPS):

                    sub_frames_feat_iter = \
                        frames_feat_iter[accu_step * self.__C.SUB_BATCH_SIZE:
                                         (accu_step + 1) * self.__C.SUB_BATCH_SIZE]
                    sub_clips_feat_iter = \
                        clips_feat_iter[accu_step * self.__C.SUB_BATCH_SIZE:
                                        (accu_step + 1) * self.__C.SUB_BATCH_SIZE]
                    sub_ques_ix_iter = \
                        ques_ix_iter[accu_step * self.__C.SUB_BATCH_SIZE:
                                     (accu_step + 1) * self.__C.SUB_BATCH_SIZE]
                    sub_ans_iter = \
                        ans_iter[accu_step * self.__C.SUB_BATCH_SIZE:
                                 (accu_step + 1) * self.__C.SUB_BATCH_SIZE]

                    pred = net(
                        sub_frames_feat_iter,
                        sub_clips_feat_iter,
                        sub_ques_ix_iter
                    )

                    loss = loss_fn(pred, sub_ans_iter)

                    # Only a mean-reduced loss needs to be divided by grad_accu_steps.
                    # Removing this line wouldn't change our results because of the
                    # Adam optimizer's normalization, but it would be necessary with
                    # an SGD optimizer.
                    # loss /= self.__C.GRAD_ACCU_STEPS
                    loss.backward()
                    loss_sum += loss.cpu().data.numpy()  # accumulate for the epoch log

                    if self.__C.VERBOSE:
                        if dataset_eval is not None:
                            mode_str = self.__C.SPLIT['train'] + '->' + self.__C.SPLIT['val']
                        else:
                            mode_str = self.__C.SPLIT['train'] + '->' + self.__C.SPLIT['test']

                        # logging
                        self.writer.add_scalar(
                            'train/loss',
                            loss.cpu().data.numpy() / self.__C.SUB_BATCH_SIZE,
                            global_step=step + epoch * math.ceil(data_size / self.__C.BATCH_SIZE))

                        self.writer.add_scalar(
                            'train/lr',
                            optim._rate,
                            global_step=step + epoch * math.ceil(data_size / self.__C.BATCH_SIZE))

                        print("\r[exp_name %s][version %s][epoch %2d][step %4d/%4d][%s] loss: %.4f, lr: %.2e" % (
                            self.__C.EXP_NAME,
                            self.__C.VERSION,
                            epoch + 1,
                            step,
                            int(data_size / self.__C.BATCH_SIZE),
                            mode_str,
                            loss.cpu().data.numpy() / self.__C.SUB_BATCH_SIZE,
                            optim._rate,
                        ), end=' ')

                # Gradient norm clipping
                if self.__C.GRAD_NORM_CLIP > 0:
                    nn.utils.clip_grad_norm_(
                        net.parameters(),
                        self.__C.GRAD_NORM_CLIP
                    )

                # Save the gradient information
                for name in range(len(named_params)):
                    norm_v = torch.norm(named_params[name][1].grad).cpu().data.numpy() \
                        if named_params[name][1].grad is not None else 0
                    grad_norm[name] += norm_v * self.__C.GRAD_ACCU_STEPS

                optim.step()

            time_end = time.time()
            print('Finished in {}s'.format(int(time_end - time_start)))

            epoch_finish = epoch + 1

            # Save checkpoint
            state = {
                'state_dict': net.state_dict(),
                'optimizer': optim.optimizer.state_dict(),
                'lr_base': optim.lr_base,
                'optim': optim.lr_base,
            }

            torch.save(
                state,
                self.__C.CKPTS_PATH +
                'ckpt_' + self.__C.VERSION +
                '/epoch' + str(epoch_finish) +
                '.pkl'
            )

            # Logging
            logfile = open(
                self.__C.LOG_PATH +
                'log_run_' + self.__C.VERSION + '.txt',
                'a+'
            )
            logfile.write(
                'epoch = ' + str(epoch_finish) +
                '  loss = ' + str(loss_sum / data_size) +
                '\n' +
                'lr = ' + str(optim._rate) +
                '\n\n'
            )
            logfile.close()

            # Eval after every epoch
            if dataset_eval is not None:
                self.eval(
                    net,
                    dataset_eval,
                    self.writer,
                    epoch,
                    valid=True,
                )

            loss_sum = 0
            grad_norm = np.zeros(len(named_params))
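The inner `accu_step` loop in `train` splits each batch into `GRAD_ACCU_STEPS` sub-batches of `SUB_BATCH_SIZE` and calls `backward()` once per sub-batch, so gradients accumulate before the single `optim.step()`. The slicing itself can be sketched in isolation (toy sizes, not the repo's config):

```python
import numpy as np

def sub_batches(batch, grad_accu_steps, sub_batch_size):
    # Yield consecutive sub-batches exactly as the accu_step slicing does.
    for accu_step in range(grad_accu_steps):
        yield batch[accu_step * sub_batch_size:(accu_step + 1) * sub_batch_size]

batch = np.arange(8)  # stands in for a batch of 8 samples
chunks = [b.tolist() for b in sub_batches(batch, grad_accu_steps=2, sub_batch_size=4)]
print(chunks)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```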
    # Evaluation
    def eval(self, net, dataset, writer, epoch, valid=False):

        ans_ix_list = []
        pred_list = []
        q_type_list = []
        q_bin_list = []
        ans_rarity_list = []

        ans_qtype_dict = {'what': [], 'who': [], 'how': [], 'when': [], 'where': []}
        pred_qtype_dict = {'what': [], 'who': [], 'how': [], 'when': [], 'where': []}

        ans_qlen_bin_dict = {'1-3': [], '4-8': [], '9-15': []}
        pred_qlen_bin_dict = {'1-3': [], '4-8': [], '9-15': []}

        ans_ans_rarity_dict = {'0-99': [], '100-299': [], '300-999': []}
        pred_ans_rarity_dict = {'0-99': [], '100-299': [], '300-999': []}

        data_size = dataset.data_size

        net.eval()

        if self.__C.N_GPU > 1:
            net = nn.DataParallel(net, device_ids=self.__C.DEVICES)

        dataloader = Data.DataLoader(
            dataset,
            batch_size=self.__C.EVAL_BATCH_SIZE,
            shuffle=False,
            num_workers=self.__C.NUM_WORKERS,
            pin_memory=True
        )

        for step, (
                ques_ix_iter,
                frames_feat_iter,
                clips_feat_iter,
                _,
                ans_iter,
                q_type,
                qlen_bin,
                ans_rarity
        ) in enumerate(dataloader):
            print("\rEvaluation: [step %4d/%4d]" % (
                step,
                int(data_size / self.__C.EVAL_BATCH_SIZE),
            ), end=' ')
            ques_ix_iter = ques_ix_iter.cuda()
            frames_feat_iter = frames_feat_iter.cuda()
            clips_feat_iter = clips_feat_iter.cuda()
            with torch.no_grad():

                pred = net(
                    frames_feat_iter,
                    clips_feat_iter,
                    ques_ix_iter
                )

                pred_np = pred.cpu().data.numpy()
                pred_argmax = np.argmax(pred_np, axis=1)
                pred_list.extend(pred_argmax)
                ans_ix_list.extend(ans_iter.tolist())
                q_type_list.extend(q_type.tolist())
                q_bin_list.extend(qlen_bin.tolist())
                ans_rarity_list.extend(ans_rarity.tolist())

        print('')

        assert len(pred_list) == len(ans_ix_list) == len(q_type_list) == len(q_bin_list) == len(ans_rarity_list)
        # Map predicted and ground-truth indices back to answer strings.
        pred_list = [dataset.ix_to_ans[pred] for pred in pred_list]
        ans_ix_list = [dataset.ix_to_ans[ans] for ans in ans_ix_list]

        # Run validation script
        scores_per_qtype = {
            'what': {},
            'who': {},
            'how': {},
            'when': {},
            'where': {},
        }
        scores_per_qlen_bin = {
            '1-3': {},
            '4-8': {},
            '9-15': {},
        }
        scores_ans_rarity_dict = {
            '0-99': {},
            '100-299': {},
            '300-999': {}
        }

        if valid:
            # Group answers and predictions by question type.
            for pred, ans, q_type in zip(pred_list, ans_ix_list, q_type_list):
                pred_qtype_dict[dataset.idx_to_qtypes[q_type]].append(pred)
                ans_qtype_dict[dataset.idx_to_qtypes[q_type]].append(ans)

            print('----------------- Computing scores -----------------')
            acc = get_acc(ans_ix_list, pred_list)
            print('----------------- Overall -----------------')
            print('acc: {}'.format(acc))
            writer.add_scalar('acc/overall', acc, global_step=epoch)

            for q_type in scores_per_qtype:
                print('----------------- Computing "{}" q-type scores -----------------'.format(q_type))
                # acc, wups_0, wups_1 = get_scores(
                #     ans_ix_dict[q_type], pred_ix_dict[q_type])
                acc = get_acc(ans_qtype_dict[q_type], pred_qtype_dict[q_type])
                print('acc: {}'.format(acc))
                writer.add_scalar(
                    'acc/{}'.format(q_type), acc, global_step=epoch)
        else:
            for pred, ans, q_type, qlen_bin, a_rarity in zip(
                    pred_list, ans_ix_list, q_type_list, q_bin_list, ans_rarity_list):

                pred_qtype_dict[dataset.idx_to_qtypes[q_type]].append(pred)
                ans_qtype_dict[dataset.idx_to_qtypes[q_type]].append(ans)

                pred_qlen_bin_dict[dataset.idx_to_qlen_bins[qlen_bin]].append(pred)
                ans_qlen_bin_dict[dataset.idx_to_qlen_bins[qlen_bin]].append(ans)

                pred_ans_rarity_dict[dataset.idx_to_ans_rare[a_rarity]].append(pred)
                ans_ans_rarity_dict[dataset.idx_to_ans_rare[a_rarity]].append(ans)

            print('----------------- Computing overall scores -----------------')
            acc = get_acc(ans_ix_list, pred_list)

            print('----------------- Overall -----------------')
            print('acc: {}'.format(acc))

            print('----------------- Computing q-type scores -----------------')
            for q_type in scores_per_qtype:
                acc = get_acc(ans_qtype_dict[q_type], pred_qtype_dict[q_type])
                print(' {} '.format(q_type))
                print('acc: {}'.format(acc))

            print('----------------- Computing qlen-bins scores -----------------')
            for qlen_bin in scores_per_qlen_bin:
                acc = get_acc(ans_qlen_bin_dict[qlen_bin], pred_qlen_bin_dict[qlen_bin])
                print(' {} '.format(qlen_bin))
                print('acc: {}'.format(acc))

            print('----------------- Computing ans-rarity scores -----------------')
            for a_rarity in scores_ans_rarity_dict:
                acc = get_acc(ans_ans_rarity_dict[a_rarity], pred_ans_rarity_dict[a_rarity])
                print(' {} '.format(a_rarity))
                print('acc: {}'.format(acc))
        net.train()
    def construct_net(self, model_type):
        if model_type == 1:
            net = Net1(
                self.__C,
                self.dataset.pretrained_emb,
                self.dataset.token_size,
                self.dataset.ans_size
            )
        elif model_type == 2:
            net = Net2(
                self.__C,
                self.dataset.pretrained_emb,
                self.dataset.token_size,
                self.dataset.ans_size
            )
        elif model_type == 3:
            net = Net3(
                self.__C,
                self.dataset.pretrained_emb,
                self.dataset.token_size,
                self.dataset.ans_size
            )
        elif model_type == 4:
            net = Net4(
                self.__C,
                self.dataset.pretrained_emb,
                self.dataset.token_size,
                self.dataset.ans_size
            )
        else:
            raise ValueError('Net{} is not supported'.format(model_type))
        return net

    def run(self, run_mode, epoch=None):
        self.set_seed(self.__C.SEED)
        if run_mode == 'train':
            self.empty_log(self.__C.VERSION)
            self.train(self.dataset, self.dataset_eval)

        elif run_mode == 'val':
            # eval() needs a network and the validation set; load the
            # requested checkpoint as in the 'test' branch.
            net = self.construct_net(self.__C.MODEL_TYPE)
            assert epoch is not None
            path = self.__C.CKPTS_PATH + \
                   'ckpt_' + self.__C.VERSION + \
                   '/epoch' + str(epoch) + '.pkl'
            print('Loading ckpt {}'.format(path))
            state_dict = torch.load(path)['state_dict']
            net.load_state_dict(state_dict)
            net.cuda()
            self.eval(net, self.dataset_eval, self.writer, 0, valid=True)

        elif run_mode == 'test':
            net = self.construct_net(self.__C.MODEL_TYPE)
            assert epoch is not None
            path = self.__C.CKPTS_PATH + \
                   'ckpt_' + self.__C.VERSION + \
                   '/epoch' + str(epoch) + '.pkl'
            print('Loading ckpt {}'.format(path))
            state_dict = torch.load(path)['state_dict']
            net.load_state_dict(state_dict)
            net.cuda()
            self.eval(net, self.dataset_test, self.writer, 0)

        else:
            exit(-1)

    def set_seed(self, seed):
        """Sets the seed for reproducibility.
        Args:
            seed (int): The seed used
        """
        torch.manual_seed(seed)
        torch.cuda.manual_seed(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
        np.random.seed(seed)
        print('\nSeed set to {}...\n'.format(seed))

    def empty_log(self, version):
        print('Initializing log file ........')
        if (os.path.exists(self.__C.LOG_PATH + 'log_run_' + version + '.txt')):
            os.remove(self.__C.LOG_PATH + 'log_run_' + version + '.txt')
        print('Finished!')
        print('')
211
code/core/metrics.py
Normal file

@@ -0,0 +1,211 @@
"""
Author: Mateusz Malinowski
Email: mmalinow@mpi-inf.mpg.de

The script assumes there are two files
- first file with ground truth answers
- second file with predicted answers
both answers are line-aligned

The script also assumes that answer items are comma separated.
For instance, chair,table,window

It is also a set measure, so not exactly the same as accuracy
even if dirac measure is used since {book,book}=={book}, also {book,chair}={chair,book}

Logs:
05.09.2015 - white spaces surrounding words are stripped away so that {book, chair}={book,chair}
"""

import sys

# import enchant

from numpy import prod
from nltk.corpus import wordnet as wn
from tqdm import tqdm


def file2list(filepath):
    with open(filepath, 'r') as f:
        lines = [k for k in
                 [k.strip() for k in f.readlines()]
                 if len(k) > 0]

    return lines


def list2file(filepath, mylist):
    mylist = '\n'.join(mylist)
    with open(filepath, 'w') as f:
        f.writelines(mylist)


def items2list(x):
    """
    x - string of comma-separated answer items
    """
    return [l.strip() for l in x.split(',')]


def fuzzy_set_membership_measure(x, A, m):
    """
    Set membership measure.
    x: element
    A: set of elements
    m: point-wise element-to-element measure m(a,b) ~ similarity(a,b)

    This function implements a fuzzy set membership measure:
    m(x \in A) = max_{a \in A} m(x,a)
    """
    return 0 if A == [] else max(map(lambda a: m(x, a), A))


def score_it(A, T, m):
    """
    A: list of A items
    T: list of T items
    m: set membership measure
    m(a \in A) gives a membership quality of a into A

    This function implements a fuzzy accuracy score:
    score(A,T) = min{prod_{a \in A} m(a \in T), prod_{t \in T} m(t \in A)}
    where A and T are set representations of the answers
    and m is a measure
    """
    if A == [] and T == []:
        return 1

    # print A,T

    score_left = 0 if A == [] else prod(list(map(lambda a: m(a, T), A)))
    score_right = 0 if T == [] else prod(list(map(lambda t: m(t, A), T)))
    return min(score_left, score_right)


# implementations of different measure functions
def dirac_measure(a, b):
    """
    Returns 1 iff a==b and 0 otherwise.
    """
    if a == [] or b == []:
        return 0.0
    return float(a == b)
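With the dirac measure, `score_it` reduces to exact set agreement between comma-separated answers: every ground-truth item must appear among the predictions and vice versa, regardless of order. A self-contained sketch of that behaviour (re-implemented here for illustration, mirroring `score_it` + `dirac_measure`):

```python
from math import prod

def membership(x, A):
    # Fuzzy membership with the dirac measure: 1.0 iff x appears in A.
    return 0 if A == [] else max(float(x == a) for a in A)

def set_score(A, T):
    # min of "every a in A is in T" and "every t in T is in A".
    if A == [] and T == []:
        return 1
    left = 0 if A == [] else prod(membership(a, T) for a in A)
    right = 0 if T == [] else prod(membership(t, A) for t in T)
    return min(left, right)

print(set_score(['book', 'chair'], ['chair', 'book']))  # 1.0: order is ignored
print(set_score(['book'], ['book', 'chair']))           # 0.0: 'chair' missing from A
```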
def wup_measure(a, b, similarity_threshold=0.925):
    """
    Returns the Wu-Palmer similarity score.
    More specifically, it computes:
        max_{x \in interp(a)} max_{y \in interp(b)} wup(x,y)
    where interp is an 'interpretation field'
    """
    def get_semantic_field(a):
        weight = 1.0
        semantic_field = wn.synsets(a, pos=wn.NOUN)
        return (semantic_field, weight)

    def get_stem_word(a):
        """
        Sometimes the answer has the form word\d+:wordid.
        If so we return the word and downweight
        """
        weight = 1.0
        return (a, weight)

    global_weight = 1.0

    (a, global_weight_a) = get_stem_word(a)
    (b, global_weight_b) = get_stem_word(b)
    global_weight = min(global_weight_a, global_weight_b)

    if a == b:
        # they are the same
        return 1.0 * global_weight

    if a == [] or b == []:
        return 0

    interp_a, weight_a = get_semantic_field(a)
    interp_b, weight_b = get_semantic_field(b)

    if interp_a == [] or interp_b == []:
        return 0

    # we take the most optimistic interpretation
    global_max = 0.0
    for x in interp_a:
        for y in interp_b:
            # wup_similarity may return None when the synsets share no ancestor
            local_score = x.wup_similarity(y)
            if local_score is not None and local_score > global_max:
                global_max = local_score

    # we need to use the semantic fields and therefore we downweight
    # unless the score is high, which indicates both are synonyms
    if global_max < similarity_threshold:
        interp_weight = 0.1
    else:
        interp_weight = 1.0

    final_score = global_max * weight_a * weight_b * interp_weight * global_weight
    return final_score
###


def get_scores(input_gt, input_pred, threshold_0=0.0, threshold_1=0.9):
    element_membership_acc = dirac_measure
    element_membership_wups_0 = lambda x, y: wup_measure(x, y, threshold_0)
    element_membership_wups_1 = lambda x, y: wup_measure(x, y, threshold_1)

    set_membership_acc = \
        lambda x, A: fuzzy_set_membership_measure(x, A, element_membership_acc)
    set_membership_wups_0 = \
        lambda x, A: fuzzy_set_membership_measure(x, A, element_membership_wups_0)
    set_membership_wups_1 = \
        lambda x, A: fuzzy_set_membership_measure(x, A, element_membership_wups_1)

    score_list_acc = []
    score_list_wups_0 = []
    score_list_wups_1 = []
    pbar = tqdm(zip(input_gt, input_pred))
    pbar.set_description('Computing Acc')

    for (ta, pa) in pbar:
        score_list_acc.append(score_it(items2list(ta), items2list(pa), set_membership_acc))

    # final_score = sum(map(lambda x: float(x)/float(len(score_list)), score_list))
    final_score_acc = float(sum(score_list_acc)) / float(len(score_list_acc))
    final_score_acc *= 100.0

    pbar = tqdm(zip(input_gt, input_pred))
    pbar.set_description('Computing Wups_0.0')
    for (ta, pa) in pbar:
        score_list_wups_0.append(score_it(items2list(ta), items2list(pa), set_membership_wups_0))
    # final_score = sum(map(lambda x: float(x)/float(len(score_list)), score_list))
    final_score_wups_0 = float(sum(score_list_wups_0)) / float(len(score_list_wups_0))
    final_score_wups_0 *= 100.0

    pbar = tqdm(zip(input_gt, input_pred))
|
||||
pbar.set_description('Computing Wups_0.9')
|
||||
for (ta,pa) in pbar:
|
||||
score_list_wups_1.append(score_it(items2list(ta),items2list(pa),set_membership_wups_1))
|
||||
#final_score=sum(map(lambda x:float(x)/float(len(score_list)),score_list))
|
||||
final_score_wups_1=float(sum(score_list_wups_1))/float(len(score_list_wups_1))
|
||||
final_score_wups_1 *= 100.0
|
||||
|
||||
# filtering to obtain the results
|
||||
#print 'full score:', score_list
|
||||
# print('accuracy = {0:.2f} | WUPS@{1} = {2:.2f} | WUPS@{3} = {4:.2f}'.format(
|
||||
# final_score_acc, threshold_0, final_score_wups_0, threshold_1, final_score_wups_1))
|
||||
return final_score_acc, final_score_wups_0, final_score_wups_1
|
||||
|
||||
def get_acc(gts, preds):
|
||||
sum_correct = 0
|
||||
assert len(gts) == len(preds)
|
||||
for gt, pred in zip(gts, preds):
|
||||
if gt == pred:
|
||||
sum_correct += 1
|
||||
acc = 100.0 * float(sum_correct/ len(gts))
|
||||
return acc
|
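The scoring above is the WUPS-style set comparison: each answer set is matched against the other in both directions, membership scores are multiplied, and the minimum of the two directions is taken so that neither over- nor under-predicting helps. A minimal self-contained sketch, using a hypothetical `wups_score` with an exact-match membership measure (a stand-in for `score_it` with `dirac_measure`; `exact_match` is an illustrative helper, not part of the repo):

```python
from functools import reduce

def exact_match(a, T):
    # membership of token a in set T under exact string equality
    return max((float(a == t) for t in T), default=0.0)

def wups_score(A, T, m=exact_match):
    # two empty sets trivially agree
    if A == [] and T == []:
        return 1
    prod = lambda xs: reduce(lambda x, y: x * y, xs, 1.0)
    # every predicted token must be covered by the ground truth and vice versa
    score_left = 0 if A == [] else prod(m(a, T) for a in A)
    score_right = 0 if T == [] else prod(m(t, A) for t in T)
    return min(score_left, score_right)
```

With the exact-match measure this reduces to set equality: one wrong or extra token drives the product, and hence the score, to zero.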
0
code/core/model/.gitkeep
Normal file
80
code/core/model/C3D.py
Normal file
@@ -0,0 +1,80 @@
"""
|
||||
from https://github.com/DavideA/c3d-pytorch/blob/master/C3D_model.py
|
||||
"""
|
||||
|
||||
|
||||
import torch.nn as nn
|
||||
|
||||
|
||||
class C3D(nn.Module):
|
||||
"""
|
||||
The C3D network as described in [1].
|
||||
"""
|
||||
|
||||
def __init__(self):
|
||||
super(C3D, self).__init__()
|
||||
|
||||
self.conv1 = nn.Conv3d(3, 64, kernel_size=(3, 3, 3), padding=(1, 1, 1))
|
||||
self.pool1 = nn.MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2))
|
||||
|
||||
self.conv2 = nn.Conv3d(64, 128, kernel_size=(3, 3, 3), padding=(1, 1, 1))
|
||||
self.pool2 = nn.MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2))
|
||||
|
||||
self.conv3a = nn.Conv3d(128, 256, kernel_size=(3, 3, 3), padding=(1, 1, 1))
|
||||
self.conv3b = nn.Conv3d(256, 256, kernel_size=(3, 3, 3), padding=(1, 1, 1))
|
||||
self.pool3 = nn.MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2))
|
||||
|
||||
self.conv4a = nn.Conv3d(256, 512, kernel_size=(3, 3, 3), padding=(1, 1, 1))
|
||||
self.conv4b = nn.Conv3d(512, 512, kernel_size=(3, 3, 3), padding=(1, 1, 1))
|
||||
self.pool4 = nn.MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2))
|
||||
|
||||
self.conv5a = nn.Conv3d(512, 512, kernel_size=(3, 3, 3), padding=(1, 1, 1))
|
||||
self.conv5b = nn.Conv3d(512, 512, kernel_size=(3, 3, 3), padding=(1, 1, 1))
|
||||
self.pool5 = nn.MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2), padding=(0, 1, 1))
|
||||
|
||||
self.fc6 = nn.Linear(8192, 4096)
|
||||
self.fc7 = nn.Linear(4096, 4096)
|
||||
self.fc8 = nn.Linear(4096, 487)
|
||||
|
||||
self.dropout = nn.Dropout(p=0.5)
|
||||
|
||||
self.relu = nn.ReLU()
|
||||
self.softmax = nn.Softmax()
|
||||
|
||||
def forward(self, x):
|
||||
|
||||
h = self.relu(self.conv1(x))
|
||||
h = self.pool1(h)
|
||||
|
||||
h = self.relu(self.conv2(h))
|
||||
h = self.pool2(h)
|
||||
|
||||
h = self.relu(self.conv3a(h))
|
||||
h = self.relu(self.conv3b(h))
|
||||
h = self.pool3(h)
|
||||
|
||||
h = self.relu(self.conv4a(h))
|
||||
h = self.relu(self.conv4b(h))
|
||||
h = self.pool4(h)
|
||||
|
||||
h = self.relu(self.conv5a(h))
|
||||
h = self.relu(self.conv5b(h))
|
||||
h = self.pool5(h)
|
||||
|
||||
h = h.view(-1, 8192)
|
||||
h = self.relu(self.fc6(h))
|
||||
h = self.dropout(h)
|
||||
h = self.relu(self.fc7(h))
|
||||
# h = self.dropout(h)
|
||||
|
||||
# logits = self.fc8(h)
|
||||
# probs = self.softmax(logits)
|
||||
|
||||
return h
|
||||
|
||||
"""
|
||||
References
|
||||
----------
|
||||
[1] Tran, Du, et al. "Learning spatiotemporal features with 3d convolutional networks."
|
||||
Proceedings of the IEEE international conference on computer vision. 2015.
|
||||
"""
|
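The `h.view(-1, 8192)` flatten and the `nn.Linear(8192, 4096)` in `fc6` only line up for a specific input size. A quick sketch of the pooling arithmetic, assuming the standard 16-frame 112x112 clip input from the C3D paper (`pool3d_out` is an illustrative helper mirroring MaxPool3d's floor-mode size formula):

```python
def pool3d_out(shape, kernel, stride, padding=(0, 0, 0)):
    # floor-mode output size of MaxPool3d, per (frames, height, width) dimension
    return tuple((s + 2 * p - k) // st + 1
                 for s, k, st, p in zip(shape, kernel, stride, padding))

shape = (16, 112, 112)  # (frames, height, width) of one input clip
shape = pool3d_out(shape, (1, 2, 2), (1, 2, 2))             # pool1 -> (16, 56, 56)
shape = pool3d_out(shape, (2, 2, 2), (2, 2, 2))             # pool2 -> (8, 28, 28)
shape = pool3d_out(shape, (2, 2, 2), (2, 2, 2))             # pool3 -> (4, 14, 14)
shape = pool3d_out(shape, (2, 2, 2), (2, 2, 2))             # pool4 -> (2, 7, 7)
shape = pool3d_out(shape, (2, 2, 2), (2, 2, 2), (0, 1, 1))  # pool5 -> (1, 4, 4)

# 512 channels after conv5b, flattened over the remaining volume
flat = 512 * shape[0] * shape[1] * shape[2]
print(flat)  # 8192, matching fc6's input size
```

The asymmetric `padding=(0, 1, 1)` on `pool5` is what turns the 7x7 spatial map into 4x4 rather than 3x3, which is exactly what makes the flattened size come out to 8192.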
323
code/core/model/dnc.py
Normal file
@@ -0,0 +1,323 @@
"""
|
||||
PyTorch DNC implementation from
|
||||
-->
|
||||
https://github.com/ixaxaar/pytorch-dnc
|
||||
<--
|
||||
"""
|
||||
# -*- coding: utf-8 -*-
|
||||
|
||||
|
||||
import torch.nn as nn
|
||||
import torch as T
|
||||
from torch.autograd import Variable as var
|
||||
import numpy as np
|
||||
|
||||
from torch.nn.utils.rnn import pad_packed_sequence as pad
|
||||
from torch.nn.utils.rnn import pack_padded_sequence as pack
|
||||
from torch.nn.utils.rnn import PackedSequence
|
||||
|
||||
from .util import *
|
||||
from .memory import *
|
||||
|
||||
from torch.nn.init import orthogonal_, xavier_uniform_
|
||||
|
||||
|
||||
class DNC(nn.Module):
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
input_size,
|
||||
hidden_size,
|
||||
rnn_type='lstm',
|
||||
num_layers=1,
|
||||
num_hidden_layers=2,
|
||||
bias=True,
|
||||
batch_first=True,
|
||||
dropout=0,
|
||||
bidirectional=False,
|
||||
nr_cells=5,
|
||||
read_heads=2,
|
||||
cell_size=10,
|
||||
nonlinearity='tanh',
|
||||
gpu_id=-1,
|
||||
independent_linears=False,
|
||||
share_memory=True,
|
||||
debug=False,
|
||||
clip=20
|
||||
):
|
||||
super(DNC, self).__init__()
|
||||
# todo: separate weights and RNNs for the interface and output vectors
|
||||
|
||||
self.input_size = input_size
|
||||
self.hidden_size = hidden_size
|
||||
self.rnn_type = rnn_type
|
||||
self.num_layers = num_layers
|
||||
self.num_hidden_layers = num_hidden_layers
|
||||
self.bias = bias
|
||||
self.batch_first = batch_first
|
||||
self.dropout = dropout
|
||||
self.bidirectional = bidirectional
|
||||
self.nr_cells = nr_cells
|
||||
self.read_heads = read_heads
|
||||
self.cell_size = cell_size
|
||||
self.nonlinearity = nonlinearity
|
||||
self.gpu_id = gpu_id
|
||||
self.independent_linears = independent_linears
|
||||
self.share_memory = share_memory
|
||||
self.debug = debug
|
||||
self.clip = clip
|
||||
|
||||
self.w = self.cell_size
|
||||
self.r = self.read_heads
|
||||
|
||||
self.read_vectors_size = self.r * self.w
|
||||
self.output_size = self.hidden_size
|
||||
|
||||
self.nn_input_size = self.input_size + self.read_vectors_size
|
||||
self.nn_output_size = self.output_size + self.read_vectors_size
|
||||
|
||||
self.rnns = []
|
||||
self.memories = []
|
||||
|
||||
for layer in range(self.num_layers):
|
||||
if self.rnn_type.lower() == 'rnn':
|
||||
self.rnns.append(nn.RNN((self.nn_input_size if layer == 0 else self.nn_output_size), self.output_size,
|
||||
bias=self.bias, nonlinearity=self.nonlinearity, batch_first=True, dropout=self.dropout, num_layers=self.num_hidden_layers))
|
||||
elif self.rnn_type.lower() == 'gru':
|
||||
self.rnns.append(nn.GRU((self.nn_input_size if layer == 0 else self.nn_output_size),
|
||||
self.output_size, bias=self.bias, batch_first=True, dropout=self.dropout, num_layers=self.num_hidden_layers))
|
||||
if self.rnn_type.lower() == 'lstm':
|
||||
self.rnns.append(nn.LSTM((self.nn_input_size if layer == 0 else self.nn_output_size),
|
||||
self.output_size, bias=self.bias, batch_first=True, dropout=self.dropout, num_layers=self.num_hidden_layers))
|
||||
setattr(self, self.rnn_type.lower() + '_layer_' + str(layer), self.rnns[layer])
|
||||
|
||||
# memories for each layer
|
||||
if not self.share_memory:
|
||||
self.memories.append(
|
||||
Memory(
|
||||
input_size=self.output_size,
|
||||
mem_size=self.nr_cells,
|
||||
cell_size=self.w,
|
||||
read_heads=self.r,
|
||||
gpu_id=self.gpu_id,
|
||||
independent_linears=self.independent_linears
|
||||
)
|
||||
)
|
||||
setattr(self, 'rnn_layer_memory_' + str(layer), self.memories[layer])
|
||||
|
||||
# only one memory shared by all layers
|
||||
if self.share_memory:
|
||||
self.memories.append(
|
||||
Memory(
|
||||
input_size=self.output_size,
|
||||
mem_size=self.nr_cells,
|
||||
cell_size=self.w,
|
||||
read_heads=self.r,
|
||||
gpu_id=self.gpu_id,
|
||||
independent_linears=self.independent_linears
|
||||
)
|
||||
)
|
||||
setattr(self, 'rnn_layer_memory_shared', self.memories[0])
|
||||
|
||||
# final output layer
|
||||
self.output = nn.Linear(self.nn_output_size, self.output_size)
|
||||
orthogonal_(self.output.weight)
|
||||
|
||||
if self.gpu_id != -1:
|
||||
[x.cuda(self.gpu_id) for x in self.rnns]
|
||||
[x.cuda(self.gpu_id) for x in self.memories]
|
||||
self.output.cuda()
|
||||
|
||||
def _init_hidden(self, hx, batch_size, reset_experience):
|
||||
# create empty hidden states if not provided
|
||||
if hx is None:
|
||||
hx = (None, None, None)
|
||||
(chx, mhx, last_read) = hx
|
||||
|
||||
# initialize hidden state of the controller RNN
|
||||
if chx is None:
|
||||
h = cuda(T.zeros(self.num_hidden_layers, batch_size, self.output_size), gpu_id=self.gpu_id)
|
||||
xavier_uniform_(h)
|
||||
|
||||
chx = [ (h, h) if self.rnn_type.lower() == 'lstm' else h for x in range(self.num_layers)]
|
||||
|
||||
# Last read vectors
|
||||
if last_read is None:
|
||||
last_read = cuda(T.zeros(batch_size, self.w * self.r), gpu_id=self.gpu_id)
|
||||
|
||||
# memory states
|
||||
if mhx is None:
|
||||
if self.share_memory:
|
||||
mhx = self.memories[0].reset(batch_size, erase=reset_experience)
|
||||
else:
|
||||
mhx = [m.reset(batch_size, erase=reset_experience) for m in self.memories]
|
||||
else:
|
||||
if self.share_memory:
|
||||
mhx = self.memories[0].reset(batch_size, mhx, erase=reset_experience)
|
||||
else:
|
||||
mhx = [m.reset(batch_size, h, erase=reset_experience) for m, h in zip(self.memories, mhx)]
|
||||
|
||||
return chx, mhx, last_read
|
||||
|
||||
def _debug(self, mhx, debug_obj):
|
||||
if not debug_obj:
|
||||
debug_obj = {
|
||||
'memory': [],
|
||||
'link_matrix': [],
|
||||
'precedence': [],
|
||||
'read_weights': [],
|
||||
'write_weights': [],
|
||||
'usage_vector': [],
|
||||
}
|
||||
|
||||
debug_obj['memory'].append(mhx['memory'][0].data.cpu().numpy())
|
||||
debug_obj['link_matrix'].append(mhx['link_matrix'][0][0].data.cpu().numpy())
|
||||
debug_obj['precedence'].append(mhx['precedence'][0].data.cpu().numpy())
|
||||
debug_obj['read_weights'].append(mhx['read_weights'][0].data.cpu().numpy())
|
||||
debug_obj['write_weights'].append(mhx['write_weights'][0].data.cpu().numpy())
|
||||
debug_obj['usage_vector'].append(mhx['usage_vector'][0].unsqueeze(0).data.cpu().numpy())
|
||||
return debug_obj
|
||||
|
||||
def _layer_forward(self, input, layer, hx=(None, None), pass_through_memory=True):
|
||||
(chx, mhx) = hx
|
||||
|
||||
# pass through the controller layer
|
||||
input, chx = self.rnns[layer](input.unsqueeze(1), chx)
|
||||
input = input.squeeze(1)
|
||||
|
||||
# clip the controller output
|
||||
if self.clip != 0:
|
||||
output = T.clamp(input, -self.clip, self.clip)
|
||||
else:
|
||||
output = input
|
||||
|
||||
# the interface vector
|
||||
ξ = output
|
||||
|
||||
# pass through memory
|
||||
if pass_through_memory:
|
||||
if self.share_memory:
|
||||
read_vecs, mhx = self.memories[0](ξ, mhx)
|
||||
else:
|
||||
read_vecs, mhx = self.memories[layer](ξ, mhx)
|
||||
# the read vectors
|
||||
read_vectors = read_vecs.view(-1, self.w * self.r)
|
||||
else:
|
||||
read_vectors = None
|
||||
|
||||
return output, (chx, mhx, read_vectors)
|
||||
|
||||
def forward(self, input, hx=(None, None, None), reset_experience=False, pass_through_memory=True):
|
||||
# handle packed data
|
||||
is_packed = type(input) is PackedSequence
|
||||
if is_packed:
|
||||
input, lengths = pad(input)
|
||||
max_length = lengths[0]
|
||||
else:
|
||||
max_length = input.size(1) if self.batch_first else input.size(0)
|
||||
lengths = [input.size(1)] * max_length if self.batch_first else [input.size(0)] * max_length
|
||||
|
||||
batch_size = input.size(0) if self.batch_first else input.size(1)
|
||||
|
||||
if not self.batch_first:
|
||||
input = input.transpose(0, 1)
|
||||
# make the data time-first
|
||||
|
||||
controller_hidden, mem_hidden, last_read = self._init_hidden(hx, batch_size, reset_experience)
|
||||
|
||||
# concat input with last read (or padding) vectors
|
||||
inputs = [T.cat([input[:, x, :], last_read], 1) for x in range(max_length)]
|
||||
|
||||
# batched forward pass per element / word / etc
|
||||
if self.debug:
|
||||
viz = None
|
||||
|
||||
outs = [None] * max_length
|
||||
read_vectors = None
|
||||
rv = [None] * max_length
|
||||
# pass through time
|
||||
for time in range(max_length):
|
||||
# pass thorugh layers
|
||||
for layer in range(self.num_layers):
|
||||
# this layer's hidden states
|
||||
chx = controller_hidden[layer]
|
||||
m = mem_hidden if self.share_memory else mem_hidden[layer]
|
||||
# pass through controller
|
||||
outs[time], (chx, m, read_vectors) = \
|
||||
self._layer_forward(inputs[time], layer, (chx, m), pass_through_memory)
|
||||
|
||||
# debug memory
|
||||
if self.debug:
|
||||
viz = self._debug(m, viz)
|
||||
|
||||
# store the memory back (per layer or shared)
|
||||
if self.share_memory:
|
||||
mem_hidden = m
|
||||
else:
|
||||
mem_hidden[layer] = m
|
||||
controller_hidden[layer] = chx
|
||||
|
||||
if read_vectors is not None:
|
||||
# the controller output + read vectors go into next layer
|
||||
outs[time] = T.cat([outs[time], read_vectors], 1)
|
||||
if layer == self.num_layers - 1:
|
||||
rv[time] = read_vectors.reshape(batch_size, self.r, self.w)
|
||||
else:
|
||||
outs[time] = T.cat([outs[time], last_read], 1)
|
||||
inputs[time] = outs[time]
|
||||
|
||||
if self.debug:
|
||||
viz = {k: np.array(v) for k, v in viz.items()}
|
||||
viz = {k: v.reshape(v.shape[0], v.shape[1] * v.shape[2]) for k, v in viz.items()}
|
||||
|
||||
# pass through final output layer
|
||||
inputs = [self.output(i) for i in inputs]
|
||||
outputs = T.stack(inputs, 1 if self.batch_first else 0)
|
||||
|
||||
if is_packed:
|
||||
outputs = pack(output, lengths)
|
||||
|
||||
if self.debug:
|
||||
return outputs, (controller_hidden, mem_hidden, read_vectors), rv, viz
|
||||
else:
|
||||
return outputs, (controller_hidden, mem_hidden, read_vectors), rv
|
||||
|
||||
def __repr__(self):
|
||||
s = "\n----------------------------------------\n"
|
||||
s += '{name}({input_size}, {hidden_size}'
|
||||
if self.rnn_type != 'lstm':
|
||||
s += ', rnn_type={rnn_type}'
|
||||
if self.num_layers != 1:
|
||||
s += ', num_layers={num_layers}'
|
||||
if self.num_hidden_layers != 2:
|
||||
s += ', num_hidden_layers={num_hidden_layers}'
|
||||
if self.bias != True:
|
||||
s += ', bias={bias}'
|
||||
if self.batch_first != True:
|
||||
s += ', batch_first={batch_first}'
|
||||
if self.dropout != 0:
|
||||
s += ', dropout={dropout}'
|
||||
if self.bidirectional != False:
|
||||
s += ', bidirectional={bidirectional}'
|
||||
if self.nr_cells != 5:
|
||||
s += ', nr_cells={nr_cells}'
|
||||
if self.read_heads != 2:
|
||||
s += ', read_heads={read_heads}'
|
||||
if self.cell_size != 10:
|
||||
s += ', cell_size={cell_size}'
|
||||
if self.nonlinearity != 'tanh':
|
||||
s += ', nonlinearity={nonlinearity}'
|
||||
if self.gpu_id != -1:
|
||||
s += ', gpu_id={gpu_id}'
|
||||
if self.independent_linears != False:
|
||||
s += ', independent_linears={independent_linears}'
|
||||
if self.share_memory != True:
|
||||
s += ', share_memory={share_memory}'
|
||||
if self.debug != False:
|
||||
s += ', debug={debug}'
|
||||
if self.clip != 20:
|
||||
s += ', clip={clip}'
|
||||
|
||||
s += ")\n" + super(DNC, self).__repr__() + \
|
||||
"\n----------------------------------------\n"
|
||||
return s.format(name=self.__class__.__name__, **self.__dict__)
|
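The core of the DNC forward pass above is the per-timestep recurrence: the controller sees the current input concatenated with the previous timestep's read vectors, and its output is then concatenated with the fresh read vectors before the final linear layer. A torch-free toy sketch of just that data flow (the `dnc_step` helper and the negate/echo stand-ins are illustrative, not part of the repo):

```python
def dnc_step(x_t, last_read, controller, memory_read):
    # one DNC timestep: controller input = [x_t ; last_read],
    # then the controller output is used to read from memory
    controller_out = controller(x_t + last_read)   # here: plain list concatenation
    read_vectors = memory_read(controller_out)
    # output fed onward = [controller_out ; read_vectors]
    return controller_out + read_vectors, read_vectors

# toy stand-ins: a "controller" that negates, and a "memory" echoing 2 values
controller = lambda v: [-u for u in v]
memory_read = lambda v: v[:2]

last_read = [0.0, 0.0]                 # padding read vectors at t = 0
outputs = []
for x_t in ([1.0, 2.0], [3.0, 4.0]):
    out, last_read = dnc_step(x_t, last_read, controller, memory_read)
    outputs.append(out)
```

This mirrors the `inputs = [T.cat([input[:, x, :], last_read], 1) ...]` construction and the `outs[time] = T.cat([outs[time], read_vectors], 1)` update in `forward`, which is why the controller's RNN takes `input_size + read_heads * cell_size` features.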
208
code/core/model/mca.py
Normal file
@@ -0,0 +1,208 @@
# --------------------------------------------------------
# mcan-vqa (Deep Modular Co-Attention Networks)
# Licensed under The MIT License [see LICENSE for details]
# Written by Yuhao Cui https://github.com/cuiyuhao1996
# --------------------------------------------------------

from core.model.net_utils import FC, MLP, LayerNorm
from core.model.dnc_improved import DNC, SharedMemDNC
from core.model.dnc_improved import FeedforwardController
import torch.nn as nn
import torch.nn.functional as F
import torch, math
import time


# ------------------------------
# ---- Multi-Head Attention ----
# ------------------------------

class MHAtt(nn.Module):
    def __init__(self, __C):
        super(MHAtt, self).__init__()
        self.__C = __C

        self.linear_v = nn.Linear(__C.HIDDEN_SIZE, __C.HIDDEN_SIZE)
        self.linear_k = nn.Linear(__C.HIDDEN_SIZE, __C.HIDDEN_SIZE)
        self.linear_q = nn.Linear(__C.HIDDEN_SIZE, __C.HIDDEN_SIZE)
        self.linear_merge = nn.Linear(__C.HIDDEN_SIZE, __C.HIDDEN_SIZE)

        self.dropout = nn.Dropout(__C.DROPOUT_R)

    def forward(self, v, k, q, mask):
        n_batches = q.size(0)

        v = self.linear_v(v).view(
            n_batches,
            -1,
            self.__C.MULTI_HEAD,
            self.__C.HIDDEN_SIZE_HEAD
        ).transpose(1, 2)

        k = self.linear_k(k).view(
            n_batches,
            -1,
            self.__C.MULTI_HEAD,
            self.__C.HIDDEN_SIZE_HEAD
        ).transpose(1, 2)

        q = self.linear_q(q).view(
            n_batches,
            -1,
            self.__C.MULTI_HEAD,
            self.__C.HIDDEN_SIZE_HEAD
        ).transpose(1, 2)

        atted = self.att(v, k, q, mask)
        atted = atted.transpose(1, 2).contiguous().view(
            n_batches,
            -1,
            self.__C.HIDDEN_SIZE
        )

        atted = self.linear_merge(atted)

        return atted

    def att(self, value, key, query, mask):
        d_k = query.size(-1)

        scores = torch.matmul(
            query, key.transpose(-2, -1)
        ) / math.sqrt(d_k)

        if mask is not None:
            scores = scores.masked_fill(mask, -1e9)

        att_map = F.softmax(scores, dim=-1)
        att_map = self.dropout(att_map)

        return torch.matmul(att_map, value)


# ---------------------------
# ---- Feed Forward Nets ----
# ---------------------------

class FFN(nn.Module):
    def __init__(self, __C):
        super(FFN, self).__init__()

        self.mlp = MLP(
            in_size=__C.HIDDEN_SIZE,
            mid_size=__C.FF_SIZE,
            out_size=__C.HIDDEN_SIZE,
            dropout_r=__C.DROPOUT_R,
            use_relu=True
        )

    def forward(self, x):
        return self.mlp(x)


# ------------------------
# ---- Self Attention ----
# ------------------------

class SA(nn.Module):
    def __init__(self, __C):
        super(SA, self).__init__()
        self.mhatt = MHAtt(__C)
        self.ffn = FFN(__C)

        self.dropout1 = nn.Dropout(__C.DROPOUT_R)
        self.norm1 = LayerNorm(__C.HIDDEN_SIZE)

        self.dropout2 = nn.Dropout(__C.DROPOUT_R)
        self.norm2 = LayerNorm(__C.HIDDEN_SIZE)

    def forward(self, x, x_mask):
        x = self.norm1(x + self.dropout1(
            self.mhatt(x, x, x, x_mask)
        ))

        x = self.norm2(x + self.dropout2(
            self.ffn(x)
        ))

        return x


# -------------------------------
# ---- Self Guided Attention ----
# -------------------------------

class SGA(nn.Module):
    def __init__(self, __C):
        super(SGA, self).__init__()

        self.mhatt1 = MHAtt(__C)
        self.mhatt2 = MHAtt(__C)
        self.ffn = FFN(__C)

        self.dropout1 = nn.Dropout(__C.DROPOUT_R)
        self.norm1 = LayerNorm(__C.HIDDEN_SIZE)

        self.dropout2 = nn.Dropout(__C.DROPOUT_R)
        self.norm2 = LayerNorm(__C.HIDDEN_SIZE)

        self.dropout3 = nn.Dropout(__C.DROPOUT_R)
        self.norm3 = LayerNorm(__C.HIDDEN_SIZE)

    def forward(self, x, y, x_mask, y_mask):
        x = self.norm1(x + self.dropout1(
            self.mhatt1(x, x, x, x_mask)
        ))

        x = self.norm2(x + self.dropout2(
            self.mhatt2(y, y, x, y_mask)
        ))

        x = self.norm3(x + self.dropout3(
            self.ffn(x)
        ))

        return x


# ------------------------------------------------
# ---- MCA Layers Cascaded by Encoder-Decoder ----
# ------------------------------------------------

class MCA_ED(nn.Module):
    def __init__(self, __C):
        super(MCA_ED, self).__init__()

        self.enc_list = nn.ModuleList([SA(__C) for _ in range(__C.LAYER)])
        self.dec_list = nn.ModuleList([SGA(__C) for _ in range(__C.LAYER)])

    def forward(self, x, y, x_mask, y_mask):
        # get hidden vectors
        for enc in self.enc_list:
            x = enc(x, x_mask)

        for dec in self.dec_list:
            y = dec(y, x, y_mask, x_mask)
        return x, y


class VLC(nn.Module):
    def __init__(self, __C):
        super(VLC, self).__init__()

        self.enc_list = nn.ModuleList([SA(__C) for _ in range(__C.LAYER)])
        self.dec_lang_frames_list = nn.ModuleList([SGA(__C) for _ in range(__C.LAYER)])
        self.dec_lang_clips_list = nn.ModuleList([SGA(__C) for _ in range(__C.LAYER)])

    def forward(self, x, y, z, x_mask, y_mask, z_mask):
        # get hidden vectors
        for enc in self.enc_list:
            x = enc(x, x_mask)

        for dec in self.dec_lang_frames_list:
            y = dec(y, x, y_mask, x_mask)

        for dec in self.dec_lang_clips_list:
            z = dec(z, x, z_mask, x_mask)
        return x, y, z
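`MHAtt.att` is standard scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V. A pure-Python single-query sketch of the same computation (the `softmax` and `attention` helpers are illustrative, not from the repo):

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of scores
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(query, keys, values):
    # scores = q . k / sqrt(d_k), then a softmax-weighted sum of the values
    d_k = len(query)
    scores = [sum(qi * ki for qi, ki in zip(query, k)) / math.sqrt(d_k)
              for k in keys]
    weights = softmax(scores)
    d_v = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]

# with one key matching the query far better, its value dominates the output
out = attention([1.0, 0.0], [[10.0, 0.0], [0.0, 10.0]], [[1.0], [0.0]])
```

The `masked_fill(mask, -1e9)` in the module plays the role of deleting padded keys: a score of -1e9 contributes essentially zero weight after the softmax.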
314
code/core/model/memory.py
Normal file
@@ -0,0 +1,314 @@
"""
|
||||
PyTorch DNC implementation from
|
||||
-->
|
||||
https://github.com/ixaxaar/pytorch-dnc
|
||||
<--
|
||||
"""
|
||||
# -*- coding: utf-8 -*-
|
||||
|
||||
import torch.nn as nn
|
||||
import torch as T
|
||||
from torch.autograd import Variable as var
|
||||
import torch.nn.functional as F
|
||||
import numpy as np
|
||||
|
||||
from core.model.util import *
|
||||
|
||||
|
||||
class Memory(nn.Module):
|
||||
|
||||
def __init__(self, input_size, mem_size=512, cell_size=32, read_heads=4, gpu_id=-1, independent_linears=True):
|
||||
super(Memory, self).__init__()
|
||||
|
||||
self.input_size = input_size
|
||||
self.mem_size = mem_size
|
||||
self.cell_size = cell_size
|
||||
self.read_heads = read_heads
|
||||
self.gpu_id = gpu_id
|
||||
self.independent_linears = independent_linears
|
||||
|
||||
m = self.mem_size
|
||||
w = self.cell_size
|
||||
r = self.read_heads
|
||||
|
||||
if self.independent_linears:
|
||||
self.read_keys_transform = nn.Linear(self.input_size, w * r)
|
||||
self.read_strengths_transform = nn.Linear(self.input_size, r)
|
||||
self.write_key_transform = nn.Linear(self.input_size, w)
|
||||
self.write_strength_transform = nn.Linear(self.input_size, 1)
|
||||
self.erase_vector_transform = nn.Linear(self.input_size, w)
|
||||
self.write_vector_transform = nn.Linear(self.input_size, w)
|
||||
self.free_gates_transform = nn.Linear(self.input_size, r)
|
||||
self.allocation_gate_transform = nn.Linear(self.input_size, 1)
|
||||
self.write_gate_transform = nn.Linear(self.input_size, 1)
|
||||
self.read_modes_transform = nn.Linear(self.input_size, 3 * r)
|
||||
else:
|
||||
self.interface_size = (w * r) + (3 * w) + (5 * r) + 3
|
||||
self.interface_weights = nn.Linear(
|
||||
self.input_size, self.interface_size)
|
||||
|
||||
self.I = cuda(1 - T.eye(m).unsqueeze(0),
|
||||
gpu_id=self.gpu_id) # (1 * n * n)
|
||||
|
||||
def reset(self, batch_size=1, hidden=None, erase=True):
|
||||
m = self.mem_size
|
||||
w = self.cell_size
|
||||
r = self.read_heads
|
||||
b = batch_size
|
||||
|
||||
if hidden is None:
|
||||
return {
|
||||
'memory': cuda(T.zeros(b, m, w).fill_(0), gpu_id=self.gpu_id),
|
||||
'link_matrix': cuda(T.zeros(b, 1, m, m), gpu_id=self.gpu_id),
|
||||
'precedence': cuda(T.zeros(b, 1, m), gpu_id=self.gpu_id),
|
||||
'read_weights': cuda(T.zeros(b, r, m).fill_(0), gpu_id=self.gpu_id),
|
||||
'write_weights': cuda(T.zeros(b, 1, m).fill_(0), gpu_id=self.gpu_id),
|
||||
'usage_vector': cuda(T.zeros(b, m), gpu_id=self.gpu_id),
|
||||
# 'free_gates': cuda(T.zeros(b, r), gpu_id=self.gpu_id),
|
||||
# 'alloc_gates': cuda(T.zeros(b, 1), gpu_id=self.gpu_id),
|
||||
# 'write_gates': cuda(T.zeros(b, 1), gpu_id=self.gpu_id),
|
||||
# 'read_modes': cuda(T.zeros(b, r, 3), gpu_id=self.gpu_id)
|
||||
}
|
||||
else:
|
||||
hidden['memory'] = hidden['memory'].clone()
|
||||
hidden['link_matrix'] = hidden['link_matrix'].clone()
|
||||
hidden['precedence'] = hidden['precedence'].clone()
|
||||
hidden['read_weights'] = hidden['read_weights'].clone()
|
||||
hidden['write_weights'] = hidden['write_weights'].clone()
|
||||
hidden['usage_vector'] = hidden['usage_vector'].clone()
|
||||
# hidden['free_gates'] = hidden['free_gates'].clone()
|
||||
# hidden['alloc_gates'] = hidden['alloc_gates'].clone()
|
||||
# hidden['write_gates'] = hidden['write_gates'].clone()
|
||||
# hidden['read_modes'] = hidden['read_modes'].clone()
|
||||
|
||||
if erase:
|
||||
hidden['memory'].data.fill_(0)
|
||||
hidden['link_matrix'].data.zero_()
|
||||
hidden['precedence'].data.zero_()
|
||||
hidden['read_weights'].data.fill_(0)
|
||||
hidden['write_weights'].data.fill_(0)
|
||||
hidden['usage_vector'].data.zero_()
|
||||
# hidden['free_gates'].data.fill_()
|
||||
# hidden['alloc_gates'].data.fill_()
|
||||
# hidden['write_gates'].data.fill_()
|
||||
# hidden['read_modes'].data.fill_()
|
||||
|
||||
return hidden
|
||||
|
||||
def get_usage_vector(self, usage, free_gates, read_weights, write_weights):
|
||||
# write_weights = write_weights.detach() # detach from the computation graph
|
||||
# if read_weights.size(0) > free_gates.size(0):
|
||||
# read_weights = read_weights[:free_gates.size(0), :, :]
|
||||
# if usage.size(0) > free_gates.size(0):
|
||||
# usage = usage[:free_gates.size(0), :]
|
||||
# if write_weights.size(0) > free_gates.size(0):
|
||||
# write_weights = write_weights[:free_gates.size(0), :, :]
|
||||
usage = usage + (1 - usage) * (1 - T.prod(1 - write_weights, 1))
|
||||
ψ = T.prod(1 - free_gates.unsqueeze(2) * read_weights, 1)
|
||||
return usage * ψ
|
||||
|
||||
def allocate(self, usage, write_gate):
|
||||
# ensure values are not too small prior to cumprod.
|
||||
usage = δ + (1 - δ) * usage
|
||||
batch_size = usage.size(0)
|
||||
# free list
|
||||
sorted_usage, φ = T.topk(usage, self.mem_size, dim=1, largest=False)
|
||||
|
||||
# cumprod with exclusive=True
|
||||
# https://discuss.pytorch.org/t/cumprod-exclusive-true-equivalences/2614/8
|
||||
v = var(sorted_usage.data.new(batch_size, 1).fill_(1))
|
||||
cat_sorted_usage = T.cat((v, sorted_usage), 1)
|
||||
prod_sorted_usage = T.cumprod(cat_sorted_usage, 1)[:, :-1]
|
||||
|
||||
sorted_allocation_weights = (1 - sorted_usage) * prod_sorted_usage.squeeze()
|
||||
|
||||
# construct the reverse sorting index https://stackoverflow.com/questions/2483696/undo-or-reverse-argsort-python
|
||||
_, φ_rev = T.topk(φ, k=self.mem_size, dim=1, largest=False)
|
||||
allocation_weights = sorted_allocation_weights.gather(1, φ_rev.long())
|
||||
|
||||
return allocation_weights.unsqueeze(1), usage
|
||||
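The `allocate` method above builds allocation weights by sorting usages ascending and taking an exclusive cumulative product, so that the least-used cell receives the most write weight. A plain-Python sketch of that trick (the `exclusive_cumprod` helper is illustrative; torch implements it via the concat-with-ones-then-drop-last construction linked in the code):

```python
def exclusive_cumprod(xs):
    # shift-by-one cumulative product: out[i] = prod(xs[:i]), with out[0] = 1
    out, running = [], 1.0
    for x in xs:
        out.append(running)
        running *= x
    return out

sorted_usage = [0.1, 0.5, 0.9]  # cell usages, already sorted ascending
alloc = [(1 - u) * p
         for u, p in zip(sorted_usage, exclusive_cumprod(sorted_usage))]
# the least-used cell gets by far the largest allocation weight
```

In the real method the result is then scattered back to the original cell order with the reverse-argsort `gather`, so allocation weights line up with the unsorted memory.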
    def write_weighting(self, memory, write_content_weights, allocation_weights, write_gate, allocation_gate):
        ag = allocation_gate.unsqueeze(-1)
        wg = write_gate.unsqueeze(-1)

        return wg * (ag * allocation_weights + (1 - ag) * write_content_weights)

    def get_link_matrix(self, link_matrix, write_weights, precedence):
        precedence = precedence.unsqueeze(2)
        write_weights_i = write_weights.unsqueeze(3)
        write_weights_j = write_weights.unsqueeze(2)

        prev_scale = 1 - write_weights_i - write_weights_j
        new_link_matrix = write_weights_i * precedence

        link_matrix = prev_scale * link_matrix + new_link_matrix
        # trick to zero out the diagonal elements
        return self.I.expand_as(link_matrix) * link_matrix

    def update_precedence(self, precedence, write_weights):
        return (1 - T.sum(write_weights, 2, keepdim=True)) * precedence + write_weights

    def write(self, write_key, write_vector, erase_vector, free_gates, read_strengths, write_strength, write_gate, allocation_gate, hidden):
        # get current usage
        hidden['usage_vector'] = self.get_usage_vector(
            hidden['usage_vector'],
            free_gates,
            hidden['read_weights'],
            hidden['write_weights']
        )

        # lookup memory with write_key and write_strength
        write_content_weights = self.content_weightings(
            hidden['memory'], write_key, write_strength)

        # get memory allocation
        alloc, _ = self.allocate(
            hidden['usage_vector'],
            allocation_gate * write_gate
        )

        # get write weightings
        hidden['write_weights'] = self.write_weighting(
            hidden['memory'],
            write_content_weights,
            alloc,
            write_gate,
            allocation_gate
        )

        weighted_resets = hidden['write_weights'].unsqueeze(
            3) * erase_vector.unsqueeze(2)
        reset_gate = T.prod(1 - weighted_resets, 1)
        # update memory
        hidden['memory'] = hidden['memory'] * reset_gate

        hidden['memory'] = hidden['memory'] + \
            T.bmm(hidden['write_weights'].transpose(1, 2), write_vector)

        # update the link matrix
        hidden['link_matrix'] = self.get_link_matrix(
            hidden['link_matrix'],
            hidden['write_weights'],
            hidden['precedence']
        )
        hidden['precedence'] = self.update_precedence(
            hidden['precedence'], hidden['write_weights'])

        return hidden

    def content_weightings(self, memory, keys, strengths):
        d = θ(memory, keys)
        return σ(d * strengths.unsqueeze(2), 2)

    def directional_weightings(self, link_matrix, read_weights):
        rw = read_weights.unsqueeze(1)

        f = T.matmul(link_matrix, rw.transpose(2, 3)).transpose(2, 3)
        b = T.matmul(rw, link_matrix)
        return f.transpose(1, 2), b.transpose(1, 2)

    def read_weightings(self, memory, content_weights, link_matrix, read_modes, read_weights):
        forward_weight, backward_weight = self.directional_weightings(
            link_matrix, read_weights)

        content_mode = read_modes[:, :, 2].contiguous(
        ).unsqueeze(2) * content_weights
        backward_mode = T.sum(
            read_modes[:, :, 0:1].contiguous().unsqueeze(3) * backward_weight, 2)
        forward_mode = T.sum(
            read_modes[:, :, 1:2].contiguous().unsqueeze(3) * forward_weight, 2)

        return backward_mode + content_mode + forward_mode

    def read_vectors(self, memory, read_weights):
        return T.bmm(read_weights, memory)

    def read(self, read_keys, read_strengths, read_modes, hidden):
        content_weights = self.content_weightings(
            hidden['memory'], read_keys, read_strengths)

        hidden['read_weights'] = self.read_weightings(
            hidden['memory'],
            content_weights,
            hidden['link_matrix'],
            read_modes,
            hidden['read_weights']
        )
        read_vectors = self.read_vectors(
            hidden['memory'], hidden['read_weights'])
        return read_vectors, hidden

    def forward(self, ξ, hidden):
        m = self.mem_size
        w = self.cell_size
        r = self.read_heads
        b = ξ.size()[0]

        if self.independent_linears:
            # r read keys (b * r * w)
            read_keys = self.read_keys_transform(ξ).view(b, r, w)
            # r read strengths (b * r)
            read_strengths = F.softplus(
                self.read_strengths_transform(ξ).view(b, r))
            # write key (b * 1 * w)
            write_key = self.write_key_transform(ξ).view(b, 1, w)
            # write strength (b * 1)
            write_strength = F.softplus(
                self.write_strength_transform(ξ).view(b, 1))
            # erase vector (b * 1 * w)
            erase_vector = T.sigmoid(
self.erase_vector_transform(ξ).view(b, 1, w))
|
||||
# write vector (b * 1 * w)
|
||||
write_vector = self.write_vector_transform(ξ).view(b, 1, w)
|
||||
# r free gates (b * r)
|
||||
free_gates = T.sigmoid(self.free_gates_transform(ξ).view(b, r))
|
||||
# allocation gate (b * 1)
|
||||
allocation_gate = T.sigmoid(
|
||||
self.allocation_gate_transform(ξ).view(b, 1))
|
||||
# write gate (b * 1)
|
||||
write_gate = T.sigmoid(self.write_gate_transform(ξ).view(b, 1))
|
||||
# read modes (b * r * 3)
|
||||
read_modes = σ(self.read_modes_transform(ξ).view(b, r, 3), -1)
|
||||
else:
|
||||
ξ = self.interface_weights(ξ)
|
||||
# r read keys (b * w * r)
|
||||
read_keys = ξ[:, :r * w].contiguous().view(b, r, w)
|
||||
# r read strengths (b * r)
|
||||
read_strengths = F.softplus(
|
||||
ξ[:, r * w:r * w + r].contiguous().view(b, r))
|
||||
# write key (b * w * 1)
|
||||
write_key = ξ[:, r * w + r:r * w + r + w].contiguous().view(b, 1, w)
|
||||
# write strength (b * 1)
|
||||
write_strength = F.softplus(
|
||||
ξ[:, r * w + r + w].contiguous().view(b, 1))
|
||||
# erase vector (b * w)
|
||||
erase_vector = T.sigmoid(
|
||||
ξ[:, r * w + r + w + 1: r * w + r + 2 * w + 1].contiguous().view(b, 1, w))
|
||||
# write vector (b * w)
|
||||
write_vector = ξ[:, r * w + r + 2 * w + 1: r * w + r + 3 * w + 1].contiguous().view(b, 1, w)
|
||||
# r free gates (b * r)
|
||||
free_gates = T.sigmoid(
|
||||
ξ[:, r * w + r + 3 * w + 1: r * w + 2 * r + 3 * w + 1].contiguous().view(b, r))
|
||||
# allocation gate (b * 1)
|
||||
allocation_gate = T.sigmoid(
|
||||
ξ[:, r * w + 2 * r + 3 * w + 1].contiguous().unsqueeze(1).view(b, 1))
|
||||
# write gate (b * 1)
|
||||
write_gate = T.sigmoid(
|
||||
ξ[:, r * w + 2 * r + 3 * w + 2].contiguous()).unsqueeze(1).view(b, 1)
|
||||
# read modes (b * 3*r)
|
||||
read_modes = σ(ξ[:, r * w + 2 * r + 3 * w + 3: r *
|
||||
w + 5 * r + 3 * w + 3].contiguous().view(b, r, 3), -1)
|
||||
|
||||
hidden = self.write(write_key, write_vector, erase_vector, free_gates,
|
||||
read_strengths, write_strength, write_gate, allocation_gate, hidden)
|
||||
hidden["free_gates"] = free_gates.clone().detach()
|
||||
hidden["allocation_gate"] = allocation_gate.clone().detach()
|
||||
hidden["write_gate"] = write_gate.clone().detach()
|
||||
hidden["read_modes"] = read_modes.clone().detach()
|
||||
|
||||
return self.read(read_keys, read_strengths, read_modes, hidden)
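The slicing in the `else` branch above partitions one flat interface vector into its named fields; the bounds only line up if the field widths sum to `r*w + 5*r + 3*w + 3`. A minimal pure-Python sketch (no torch; `interface_layout` is a hypothetical helper, not part of the repo) that recomputes the offsets and checks the total:

```python
def interface_layout(r, w):
    # field widths taken from the slice bounds in the DNC forward() above
    fields = [
        ("read_keys", r * w),
        ("read_strengths", r),
        ("write_key", w),
        ("write_strength", 1),
        ("erase_vector", w),
        ("write_vector", w),
        ("free_gates", r),
        ("allocation_gate", 1),
        ("write_gate", 1),
        ("read_modes", 3 * r),
    ]
    offsets, start = {}, 0
    for name, width in fields:
        offsets[name] = (start, start + width)
        start += width
    return offsets, start

offsets, total = interface_layout(r=4, w=16)
# the last slice in the code ends at r*w + 5*r + 3*w + 3
assert total == 4 * 16 + 5 * 4 + 3 * 16 + 3
```

With `r=4, w=16` the read modes occupy indices 123..135, matching the `r*w + 2*r + 3*w + 3 : r*w + 5*r + 3*w + 3` slice.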
501  code/core/model/net.py  Normal file
@@ -0,0 +1,501 @@
# --------------------------------------------------------
# mcan-vqa (Deep Modular Co-Attention Networks)
# Licensed under The MIT License [see LICENSE for details]
# Written by Yuhao Cui https://github.com/cuiyuhao1996
# --------------------------------------------------------

from core.model.net_utils import FC, MLP, LayerNorm
from core.model.mca import SA, MCA_ED, VLC
from core.model.dnc import DNC

import torch.nn as nn
import torch.nn.functional as F
import torch

# ------------------------------
# ---- Flatten the sequence ----
# ------------------------------

class AttFlat(nn.Module):
    def __init__(self, __C):
        super(AttFlat, self).__init__()
        self.__C = __C

        self.mlp = MLP(
            in_size=__C.HIDDEN_SIZE,
            mid_size=__C.FLAT_MLP_SIZE,
            out_size=__C.FLAT_GLIMPSES,
            dropout_r=__C.DROPOUT_R,
            use_relu=True
        )

        self.linear_merge = nn.Linear(
            __C.HIDDEN_SIZE * __C.FLAT_GLIMPSES,
            __C.FLAT_OUT_SIZE
        )

    def forward(self, x, x_mask):
        att = self.mlp(x)
        att = att.masked_fill(
            x_mask.squeeze(1).squeeze(1).unsqueeze(2),
            -1e9
        )
        att = F.softmax(att, dim=1)

        att_list = []
        for i in range(self.__C.FLAT_GLIMPSES):
            att_list.append(
                torch.sum(att[:, :, i: i + 1] * x, dim=1)
            )

        x_atted = torch.cat(att_list, dim=1)
        x_atted = self.linear_merge(x_atted)

        return x_atted


class AttFlatMem(AttFlat):
    def __init__(self, __C):
        super(AttFlatMem, self).__init__(__C)
        self.__C = __C

    def forward(self, x_mem, x, x_mask):
        # attention scores come from the memory stream, values from x
        att = self.mlp(x_mem)
        att = att.masked_fill(
            x_mask.squeeze(1).squeeze(1).unsqueeze(2),
            float('-inf')
        )
        att = F.softmax(att, dim=1)

        att_list = []
        for i in range(self.__C.FLAT_GLIMPSES):
            att_list.append(
                torch.sum(att[:, :, i: i + 1] * x, dim=1)
            )

        x_atted = torch.cat(att_list, dim=1)
        x_atted = self.linear_merge(x_atted)

        return x_atted


# -------------------------
# ---- Main MCAN Model ----
# -------------------------

class Net1(nn.Module):
    def __init__(self, __C, pretrained_emb, token_size, answer_size):
        super(Net1, self).__init__()
        print('Training with Network type 1: VLCN')
        self.pretrained_path = __C.PRETRAINED_PATH
        self.embedding = nn.Embedding(
            num_embeddings=token_size,
            embedding_dim=__C.WORD_EMBED_SIZE
        )

        # Loading the GloVe embedding weights
        if __C.USE_GLOVE:
            self.embedding.weight.data.copy_(torch.from_numpy(pretrained_emb))

        self.lstm = nn.LSTM(
            input_size=__C.WORD_EMBED_SIZE,
            hidden_size=__C.HIDDEN_SIZE,
            num_layers=1,
            batch_first=True
        )

        self.frame_feat_linear = nn.Linear(
            __C.FRAME_FEAT_SIZE,
            __C.HIDDEN_SIZE
        )

        self.clip_feat_linear = nn.Linear(
            __C.CLIP_FEAT_SIZE,
            __C.HIDDEN_SIZE
        )
        self.backbone = VLC(__C)

        self.attflat_lang = AttFlat(__C)
        self.attflat_frame = AttFlat(__C)
        self.attflat_clip = AttFlat(__C)

        self.dnc = DNC(
            __C.FLAT_OUT_SIZE,
            __C.FLAT_OUT_SIZE,
            rnn_type='lstm',
            num_layers=2,
            num_hidden_layers=2,
            bias=True,
            batch_first=True,
            dropout=0,
            bidirectional=True,
            nr_cells=__C.CELL_COUNT_DNC,
            read_heads=__C.N_READ_HEADS_DNC,
            cell_size=__C.WORD_LENGTH_DNC,
            nonlinearity='tanh',
            gpu_id=0,
            independent_linears=False,
            share_memory=False,
            debug=False,
            clip=20,
        )

        self.proj_norm = LayerNorm(__C.FLAT_OUT_SIZE)

        self.proj_norm_dnc = LayerNorm(__C.FLAT_OUT_SIZE + __C.N_READ_HEADS_DNC * __C.WORD_LENGTH_DNC)
        self.linear_dnc = FC(__C.FLAT_OUT_SIZE + __C.N_READ_HEADS_DNC * __C.WORD_LENGTH_DNC, __C.FLAT_OUT_SIZE, dropout_r=0.2)
        self.proj = nn.Linear(__C.FLAT_OUT_SIZE, answer_size)

    def forward(self, frame_feat, clip_feat, ques_ix):
        # Make masks
        lang_feat_mask = self.make_mask(ques_ix.unsqueeze(2))
        frame_feat_mask = self.make_mask(frame_feat)
        clip_feat_mask = self.make_mask(clip_feat)

        # Pre-process Language Feature
        lang_feat = self.embedding(ques_ix)
        lang_feat, _ = self.lstm(lang_feat)

        # Pre-process Video Features
        frame_feat = self.frame_feat_linear(frame_feat)
        clip_feat = self.clip_feat_linear(clip_feat)

        # Backbone Framework
        lang_feat, frame_feat, clip_feat = self.backbone(
            lang_feat,
            frame_feat,
            clip_feat,
            lang_feat_mask,
            frame_feat_mask,
            clip_feat_mask
        )

        lang_feat = self.attflat_lang(
            lang_feat,
            lang_feat_mask
        )

        frame_feat = self.attflat_frame(
            frame_feat,
            frame_feat_mask
        )

        clip_feat = self.attflat_clip(
            clip_feat,
            clip_feat_mask
        )

        # additive fusion branch
        proj_feat_0 = lang_feat + frame_feat + clip_feat
        proj_feat_0 = self.proj_norm(proj_feat_0)

        # fast-learning DNC fusion branch
        proj_feat_1 = torch.stack([lang_feat, frame_feat, clip_feat], dim=1)
        proj_feat_1, (_, _, rv), _ = self.dnc(proj_feat_1, (None, None, None), reset_experience=True, pass_through_memory=True)
        proj_feat_1 = proj_feat_1.sum(1)
        proj_feat_1 = torch.cat([proj_feat_1, rv], dim=-1)
        proj_feat_1 = self.proj_norm_dnc(proj_feat_1)
        proj_feat_1 = self.linear_dnc(proj_feat_1)

        proj_feat = torch.sigmoid(self.proj(proj_feat_0 + proj_feat_1))

        return proj_feat

    def load_pretrained_weights(self):
        pretrained_msvd = torch.load(self.pretrained_path)['state_dict']
        for n_pretrained, p_pretrained in pretrained_msvd.items():
            if 'dnc' in n_pretrained:
                self.state_dict()[n_pretrained].copy_(p_pretrained)
        print('Pre-trained dnc-weights successfully loaded!')

    # Masking
    def make_mask(self, feature):
        return (torch.sum(
            torch.abs(feature),
            dim=-1
        ) == 0).unsqueeze(1).unsqueeze(2)


class Net2(nn.Module):
    def __init__(self, __C, pretrained_emb, token_size, answer_size):
        super(Net2, self).__init__()
        print('Training with Network type 2: VLCN-FLF')
        self.embedding = nn.Embedding(
            num_embeddings=token_size,
            embedding_dim=__C.WORD_EMBED_SIZE
        )
        # Loading the GloVe embedding weights
        if __C.USE_GLOVE:
            self.embedding.weight.data.copy_(torch.from_numpy(pretrained_emb))

        self.lstm = nn.LSTM(
            input_size=__C.WORD_EMBED_SIZE,
            hidden_size=__C.HIDDEN_SIZE,
            num_layers=1,
            batch_first=True
        )

        self.frame_feat_linear = nn.Linear(
            __C.FRAME_FEAT_SIZE,
            __C.HIDDEN_SIZE
        )

        self.clip_feat_linear = nn.Linear(
            __C.CLIP_FEAT_SIZE,
            __C.HIDDEN_SIZE
        )
        self.backbone = VLC(__C)

        self.attflat_lang = AttFlat(__C)
        self.attflat_frame = AttFlat(__C)
        self.attflat_clip = AttFlat(__C)

        self.proj_norm = LayerNorm(__C.FLAT_OUT_SIZE)
        self.proj = nn.Linear(__C.FLAT_OUT_SIZE, answer_size)

    def forward(self, frame_feat, clip_feat, ques_ix):
        # Make masks
        lang_feat_mask = self.make_mask(ques_ix.unsqueeze(2))
        frame_feat_mask = self.make_mask(frame_feat)
        clip_feat_mask = self.make_mask(clip_feat)

        # Pre-process Language Feature
        lang_feat = self.embedding(ques_ix)
        lang_feat, _ = self.lstm(lang_feat)

        # Pre-process Video Features
        frame_feat = self.frame_feat_linear(frame_feat)
        clip_feat = self.clip_feat_linear(clip_feat)

        # Backbone Framework
        lang_feat, frame_feat, clip_feat = self.backbone(
            lang_feat,
            frame_feat,
            clip_feat,
            lang_feat_mask,
            frame_feat_mask,
            clip_feat_mask
        )

        lang_feat = self.attflat_lang(
            lang_feat,
            lang_feat_mask
        )

        frame_feat = self.attflat_frame(
            frame_feat,
            frame_feat_mask
        )

        clip_feat = self.attflat_clip(
            clip_feat,
            clip_feat_mask
        )

        proj_feat = lang_feat + frame_feat + clip_feat
        proj_feat = self.proj_norm(proj_feat)
        proj_feat = torch.sigmoid(self.proj(proj_feat))

        return proj_feat

    # Masking
    def make_mask(self, feature):
        return (torch.sum(
            torch.abs(feature),
            dim=-1
        ) == 0).unsqueeze(1).unsqueeze(2)


class Net3(nn.Module):
    def __init__(self, __C, pretrained_emb, token_size, answer_size):
        super(Net3, self).__init__()
        print('Training with Network type 3: VLCN+LSTM')

        self.embedding = nn.Embedding(
            num_embeddings=token_size,
            embedding_dim=__C.WORD_EMBED_SIZE
        )

        # Loading the GloVe embedding weights
        if __C.USE_GLOVE:
            self.embedding.weight.data.copy_(torch.from_numpy(pretrained_emb))

        self.lstm = nn.LSTM(
            input_size=__C.WORD_EMBED_SIZE,
            hidden_size=__C.HIDDEN_SIZE,
            num_layers=1,
            batch_first=True
        )

        self.frame_feat_linear = nn.Linear(
            __C.FRAME_FEAT_SIZE,
            __C.HIDDEN_SIZE
        )

        self.clip_feat_linear = nn.Linear(
            __C.CLIP_FEAT_SIZE,
            __C.HIDDEN_SIZE
        )
        self.backbone = VLC(__C)

        self.attflat_lang = AttFlat(__C)
        self.attflat_frame = AttFlat(__C)
        self.attflat_clip = AttFlat(__C)

        self.lstm_fusion = nn.LSTM(
            input_size=__C.FLAT_OUT_SIZE,
            hidden_size=__C.FLAT_OUT_SIZE,
            num_layers=2,
            batch_first=True,
            bidirectional=True
        )

        self.proj_norm = LayerNorm(__C.FLAT_OUT_SIZE)
        self.proj_feat_1 = nn.Linear(__C.FLAT_OUT_SIZE * 2, __C.FLAT_OUT_SIZE)

        self.proj_norm_lstm = LayerNorm(__C.FLAT_OUT_SIZE)
        self.proj = nn.Linear(__C.FLAT_OUT_SIZE, answer_size)

    def forward(self, frame_feat, clip_feat, ques_ix):
        # Make masks
        lang_feat_mask = self.make_mask(ques_ix.unsqueeze(2))
        frame_feat_mask = self.make_mask(frame_feat)
        clip_feat_mask = self.make_mask(clip_feat)

        # Pre-process Language Feature
        lang_feat = self.embedding(ques_ix)
        lang_feat, _ = self.lstm(lang_feat)

        # Pre-process Video Features
        frame_feat = self.frame_feat_linear(frame_feat)
        clip_feat = self.clip_feat_linear(clip_feat)

        # Backbone Framework
        lang_feat, frame_feat, clip_feat = self.backbone(
            lang_feat,
            frame_feat,
            clip_feat,
            lang_feat_mask,
            frame_feat_mask,
            clip_feat_mask
        )

        lang_feat = self.attflat_lang(
            lang_feat,
            lang_feat_mask
        )

        frame_feat = self.attflat_frame(
            frame_feat,
            frame_feat_mask
        )

        clip_feat = self.attflat_clip(
            clip_feat,
            clip_feat_mask
        )

        # additive fusion branch
        proj_feat_0 = lang_feat + frame_feat + clip_feat
        proj_feat_0 = self.proj_norm(proj_feat_0)

        # LSTM fusion branch (ablation of the DNC in Net1)
        proj_feat_1 = torch.stack([lang_feat, frame_feat, clip_feat], dim=1)
        proj_feat_1, _ = self.lstm_fusion(proj_feat_1)
        proj_feat_1 = proj_feat_1.sum(1)
        proj_feat_1 = self.proj_feat_1(proj_feat_1)
        proj_feat_1 = self.proj_norm_lstm(proj_feat_1)

        proj_feat = torch.sigmoid(self.proj(proj_feat_0 + proj_feat_1))

        return proj_feat

    # Masking
    def make_mask(self, feature):
        return (torch.sum(
            torch.abs(feature),
            dim=-1
        ) == 0).unsqueeze(1).unsqueeze(2)


class Net4(nn.Module):
    def __init__(self, __C, pretrained_emb, token_size, answer_size):
        super(Net4, self).__init__()
        print('Training with Network type 4: MCAN')
        self.embedding = nn.Embedding(
            num_embeddings=token_size,
            embedding_dim=__C.WORD_EMBED_SIZE
        )

        # Loading the GloVe embedding weights
        if __C.USE_GLOVE:
            self.embedding.weight.data.copy_(torch.from_numpy(pretrained_emb))

        self.lstm = nn.LSTM(
            input_size=__C.WORD_EMBED_SIZE,
            hidden_size=__C.HIDDEN_SIZE,
            num_layers=1,
            batch_first=True
        )

        self.frame_feat_linear = nn.Linear(
            __C.FRAME_FEAT_SIZE,
            __C.HIDDEN_SIZE
        )

        self.clip_feat_linear = nn.Linear(
            __C.CLIP_FEAT_SIZE,
            __C.HIDDEN_SIZE
        )
        self.backbone = MCA_ED(__C)

        self.attflat_lang = AttFlat(__C)
        self.attflat_vid = AttFlat(__C)

        self.proj_norm = LayerNorm(__C.FLAT_OUT_SIZE)
        self.proj = nn.Linear(__C.FLAT_OUT_SIZE, answer_size)

    def forward(self, frame_feat, clip_feat, ques_ix):
        # Make masks
        lang_feat_mask = self.make_mask(ques_ix.unsqueeze(2))
        frame_feat_mask = self.make_mask(frame_feat)
        clip_feat_mask = self.make_mask(clip_feat)

        # Pre-process Language Feature
        lang_feat = self.embedding(ques_ix)
        lang_feat, _ = self.lstm(lang_feat)

        # Pre-process Video Features
        frame_feat = self.frame_feat_linear(frame_feat)
        clip_feat = self.clip_feat_linear(clip_feat)

        # concat frame and clip features
        vid_feat = torch.cat([frame_feat, clip_feat], dim=1)
        vid_feat_mask = torch.cat([frame_feat_mask, clip_feat_mask], dim=-1)

        # Backbone Framework
        lang_feat, vid_feat = self.backbone(
            lang_feat,
            vid_feat,
            lang_feat_mask,
            vid_feat_mask,
        )

        lang_feat = self.attflat_lang(
            lang_feat,
            lang_feat_mask
        )

        vid_feat = self.attflat_vid(
            vid_feat,
            vid_feat_mask
        )

        proj_feat = lang_feat + vid_feat
        proj_feat = self.proj_norm(proj_feat)
        proj_feat = torch.sigmoid(self.proj(proj_feat))

        return proj_feat

    # Masking
    def make_mask(self, feature):
        return (torch.sum(
            torch.abs(feature),
            dim=-1
        ) == 0).unsqueeze(1).unsqueeze(2)
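Every `Net*` variant builds its padding masks the same way: a position is masked when its feature vector is all zeros. A minimal pure-Python sketch of that rule (no torch; `make_mask_rows` is a hypothetical stand-in for the batched `make_mask` above):

```python
def make_mask_rows(features):
    # True marks an all-zero (padded) row, mirroring
    # (torch.sum(torch.abs(feature), dim=-1) == 0) in make_mask above
    return [sum(abs(v) for v in row) == 0 for row in features]

feats = [[0.5, -0.2], [0.0, 0.0], [1.0, 0.0]]
print(make_mask_rows(feats))  # [False, True, False]
```

In the model the resulting boolean mask is expanded with `.unsqueeze(1).unsqueeze(2)` so it broadcasts over attention heads and query positions.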
62  code/core/model/net_utils.py  Normal file
@@ -0,0 +1,62 @@
# --------------------------------------------------------
# mcan-vqa (Deep Modular Co-Attention Networks)
# Licensed under The MIT License [see LICENSE for details]
# Written by Yuhao Cui https://github.com/cuiyuhao1996
# --------------------------------------------------------

import torch.nn as nn
import torch


class FC(nn.Module):
    def __init__(self, in_size, out_size, dropout_r=0., use_relu=True):
        super(FC, self).__init__()
        self.dropout_r = dropout_r
        self.use_relu = use_relu

        self.linear = nn.Linear(in_size, out_size)

        if use_relu:
            self.relu = nn.ReLU(inplace=True)

        if dropout_r > 0:
            self.dropout = nn.Dropout(dropout_r)

    def forward(self, x):
        x = self.linear(x)

        if self.use_relu:
            x = self.relu(x)

        if self.dropout_r > 0:
            x = self.dropout(x)

        return x


class MLP(nn.Module):
    def __init__(self, in_size, mid_size, out_size, dropout_r=0., use_relu=True):
        super(MLP, self).__init__()

        self.fc = FC(in_size, mid_size, dropout_r=dropout_r, use_relu=use_relu)
        self.linear = nn.Linear(mid_size, out_size)

    def forward(self, x):
        return self.linear(self.fc(x))


class LayerNorm(nn.Module):
    def __init__(self, size, eps=1e-6):
        super(LayerNorm, self).__init__()
        self.eps = eps

        self.a_2 = nn.Parameter(torch.ones(size))
        self.b_2 = nn.Parameter(torch.zeros(size))

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)

        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2
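The `LayerNorm` module above normalizes each feature vector to zero mean and unit standard deviation before applying the learned gain `a_2` and bias `b_2`. A pure-Python sketch of one vector at the initial parameters (gain 1, bias 0; note `torch.Tensor.std` defaults to the unbiased n-1 estimator, which this mirrors):

```python
import math

def layer_norm(x, eps=1e-6):
    # gain a_2 = 1 and bias b_2 = 0, i.e. the module's initial parameters
    mean = sum(x) / len(x)
    # unbiased (n-1) standard deviation, matching torch.Tensor.std's default
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / (len(x) - 1))
    return [(v - mean) / (std + eps) for v in x]

out = layer_norm([1.0, 2.0, 3.0])
assert abs(sum(out)) < 1e-9  # normalized vector has zero mean
```

For `[1.0, 2.0, 3.0]` the mean is 2 and the unbiased std is 1, so the output is approximately `[-1, 0, 1]` (shrunk slightly by `eps`).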
98  code/core/model/optim.py  Normal file
@@ -0,0 +1,98 @@
# --------------------------------------------------------
# mcan-vqa (Deep Modular Co-Attention Networks)
# Licensed under The MIT License [see LICENSE for details]
# Written by Yuhao Cui https://github.com/cuiyuhao1996
# --------------------------------------------------------

import torch.optim as Optim


class WarmupOptimizer(object):
    def __init__(self, lr_base, optimizer, data_size, batch_size):
        self.optimizer = optimizer
        self._step = 0
        self.lr_base = lr_base
        self._rate = 0
        self.data_size = data_size
        self.batch_size = batch_size

    def step(self):
        self._step += 1

        rate = self.rate()
        for p in self.optimizer.param_groups:
            p['lr'] = rate
        self._rate = rate

        self.optimizer.step()

    def zero_grad(self):
        self.optimizer.zero_grad()

    def rate(self, step=None):
        # linear warmup: 1/4, 2/4, 3/4 of lr_base over the first
        # three epochs, then the full base learning rate
        if step is None:
            step = self._step

        if step <= int(self.data_size / self.batch_size * 1):
            r = self.lr_base * 1 / 4.
        elif step <= int(self.data_size / self.batch_size * 2):
            r = self.lr_base * 2 / 4.
        elif step <= int(self.data_size / self.batch_size * 3):
            r = self.lr_base * 3 / 4.
        else:
            r = self.lr_base

        return r


def get_optim(__C, model, data_size, optimizer, lr_base=None):
    if lr_base is None:
        lr_base = __C.LR_BASE

    if optimizer == 'adam':
        optim = Optim.Adam(
            filter(lambda p: p.requires_grad, model.parameters()),
            lr=0,
            betas=__C.OPT_BETAS,
            eps=__C.OPT_EPS,
        )
    elif optimizer == 'rmsprop':
        optim = Optim.RMSprop(
            filter(lambda p: p.requires_grad, model.parameters()),
            lr=0,
            eps=__C.OPT_EPS,
            weight_decay=__C.OPT_WEIGHT_DECAY
        )
    else:
        raise ValueError('{} optimizer is not supported'.format(optimizer))

    return WarmupOptimizer(
        lr_base,
        optim,
        data_size,
        __C.BATCH_SIZE
    )


def adjust_lr(optim, decay_r):
    optim.lr_base *= decay_r


def adjust_lr_dnc(optim, decay_r):
    optim.lr_dnc_base *= decay_r
163  code/core/model/utils.py  Normal file
@@ -0,0 +1,163 @@
"""
|
||||
PyTorch DNC implementation from
|
||||
-->
|
||||
https://github.com/ixaxaar/pytorch-dnc
|
||||
<--
|
||||
"""
|
||||
|
||||
import torch.nn as nn
|
||||
import torch as T
|
||||
import torch.nn.functional as F
|
||||
import numpy as np
|
||||
import torch
|
||||
from torch.autograd import Variable
|
||||
import re
|
||||
import string
|
||||
|
||||
|
||||
def recursiveTrace(obj):
|
||||
print(type(obj))
|
||||
if hasattr(obj, 'grad_fn'):
|
||||
print(obj.grad_fn)
|
||||
recursiveTrace(obj.grad_fn)
|
||||
elif hasattr(obj, 'saved_variables'):
|
||||
print(obj.requires_grad, len(obj.saved_tensors), len(obj.saved_variables))
|
||||
[print(v) for v in obj.saved_variables]
|
||||
[recursiveTrace(v.grad_fn) for v in obj.saved_variables]
|
||||
|
||||
|
||||
def cuda(x, grad=False, gpu_id=-1):
|
||||
x = x.float() if T.is_tensor(x) else x
|
||||
if gpu_id == -1:
|
||||
t = T.FloatTensor(x)
|
||||
t.requires_grad=grad
|
||||
return t
|
||||
else:
|
||||
t = T.FloatTensor(x.pin_memory()).cuda(gpu_id)
|
||||
t.requires_grad=grad
|
||||
return t
|
||||
|
||||
|
||||
def cudavec(x, grad=False, gpu_id=-1):
|
||||
if gpu_id == -1:
|
||||
t = T.Tensor(T.from_numpy(x))
|
||||
t.requires_grad = grad
|
||||
return t
|
||||
else:
|
||||
t = T.Tensor(T.from_numpy(x).pin_memory()).cuda(gpu_id)
|
||||
t.requires_grad = grad
|
||||
return t
|
||||
|
||||
|
||||
def cudalong(x, grad=False, gpu_id=-1):
|
||||
if gpu_id == -1:
|
||||
t = T.LongTensor(T.from_numpy(x.astype(np.long)))
|
||||
t.requires_grad = grad
|
||||
return t
|
||||
else:
|
||||
t = T.LongTensor(T.from_numpy(x.astype(np.long)).pin_memory()).cuda(gpu_id)
|
||||
t.requires_grad = grad
|
||||
return t
|
||||
|
||||
|
||||
def θ(a, b, normBy=2):
|
||||
"""Batchwise Cosine similarity
|
||||
Cosine similarity
|
||||
Arguments:
|
||||
a {Tensor} -- A 3D Tensor (b * m * w)
|
||||
b {Tensor} -- A 3D Tensor (b * r * w)
|
||||
Returns:
|
||||
Tensor -- Batchwise cosine similarity (b * r * m)
|
||||
"""
|
||||
dot = T.bmm(a, b.transpose(1,2))
|
||||
a_norm = T.norm(a, normBy, dim=2).unsqueeze(2)
|
||||
b_norm = T.norm(b, normBy, dim=2).unsqueeze(1)
|
||||
cos = dot / (a_norm * b_norm + δ)
|
||||
return cos.transpose(1,2).contiguous()
|
||||
|
||||
|
||||
def σ(input, axis=1):
|
||||
"""Softmax on an axis
|
||||
Softmax on an axis
|
||||
Arguments:
|
||||
input {Tensor} -- input Tensor
|
||||
Keyword Arguments:
|
||||
axis {number} -- axis on which to take softmax on (default: {1})
|
||||
Returns:
|
||||
Tensor -- Softmax output Tensor
|
||||
"""
|
||||
input_size = input.size()
|
||||
|
||||
trans_input = input.transpose(axis, len(input_size) - 1)
|
||||
trans_size = trans_input.size()
|
||||
|
||||
input_2d = trans_input.contiguous().view(-1, trans_size[-1])
|
||||
soft_max_2d = F.softmax(input_2d, -1)
|
||||
soft_max_nd = soft_max_2d.view(*trans_size)
|
||||
return soft_max_nd.transpose(axis, len(input_size) - 1)
|
||||
|
||||
δ = 1e-6
|
||||
|
||||
|
||||
def register_nan_checks(model):
|
||||
def check_grad(module, grad_input, grad_output):
|
||||
# print(module) you can add this to see that the hook is called
|
||||
# print('hook called for ' + str(type(module)))
|
||||
if any(np.all(np.isnan(gi.data.cpu().numpy())) for gi in grad_input if gi is not None):
|
||||
print('NaN gradient in grad_input ' + type(module).__name__)
|
||||
|
||||
model.apply(lambda module: module.register_backward_hook(check_grad))
|
||||
|
||||
|
||||
def apply_dict(dic):
|
||||
for k, v in dic.items():
|
||||
apply_var(v, k)
|
||||
if isinstance(v, nn.Module):
|
||||
key_list = [a for a in dir(v) if not a.startswith('__')]
|
||||
for key in key_list:
|
||||
apply_var(getattr(v, key), key)
|
||||
for pk, pv in v._parameters.items():
|
||||
apply_var(pv, pk)
|
||||
|
||||
|
||||
def apply_var(v, k):
|
||||
if isinstance(v, Variable) and v.requires_grad:
|
||||
v.register_hook(check_nan_gradient(k))
|
||||
|
||||
|
||||
def check_nan_gradient(name=''):
|
||||
def f(tensor):
|
||||
if np.isnan(T.mean(tensor).data.cpu().numpy()):
|
||||
print('\nnan gradient of {} :'.format(name))
|
||||
# print(tensor)
|
||||
# assert 0, 'nan gradient'
|
||||
return tensor
|
||||
return f
|
||||
|
||||
def ptr(tensor):
|
||||
if T.is_tensor(tensor):
|
||||
return tensor.storage().data_ptr()
|
||||
elif hasattr(tensor, 'data'):
|
||||
return tensor.clone().data.storage().data_ptr()
|
||||
else:
|
||||
return tensor
|
||||
|
||||
# TODO: EWW change this shit
|
||||
def ensure_gpu(tensor, gpu_id):
|
||||
if "cuda" in str(type(tensor)) and gpu_id != -1:
|
||||
return tensor.cuda(gpu_id)
|
||||
elif "cuda" in str(type(tensor)):
|
||||
return tensor.cpu()
|
||||
elif "Tensor" in str(type(tensor)) and gpu_id != -1:
|
||||
return tensor.cuda(gpu_id)
|
||||
elif "Tensor" in str(type(tensor)):
|
||||
return tensor
|
||||
elif type(tensor) is np.ndarray:
|
||||
return cudavec(tensor, gpu_id=gpu_id).data
|
||||
else:
|
||||
return tensor
|
||||
|
||||
|
||||
def print_gradient(x, name):
|
||||
s = "Gradient of " + name + " ----------------------------------"
|
||||
x.register_hook(lambda y: print(s, y.squeeze()))
|
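The batched `θ` above is an ordinary cosine similarity with a small `δ` added to the denominator to guard against zero norms. A pure-Python sketch for a single (row, key) pair (`cosine` is a hypothetical standalone helper, not part of the repo):

```python
import math

def cosine(a, b, eps=1e-6):
    # one (row, key) pair of the batched θ above, with the same δ guard
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + eps)

assert abs(cosine([1.0, 0.0], [1.0, 0.0]) - 1.0) < 1e-5  # identical vectors
assert abs(cosine([1.0, 0.0], [0.0, 1.0])) < 1e-5        # orthogonal vectors
```

In the DNC, these similarities are multiplied by the key strengths and passed through `σ` to produce content-based addressing weights over memory rows.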
48  code/requirements.txt  Normal file
@@ -0,0 +1,48 @@
absl-py==0.12.0
blis==0.7.4
cachetools==4.2.1
catalogue==1.0.0
certifi==2020.12.5
chardet==4.0.0
click==7.1.2
cycler==0.10.0
cymem==2.0.5
google-auth==1.28.0
google-auth-oauthlib==0.4.3
grpcio==1.36.1
idna==2.10
importlib-metadata==3.7.3
joblib==1.0.1
Markdown==3.3.4
mkl-fft==1.3.0
mkl-random==1.1.1
mkl-service==2.3.0
murmurhash==1.0.5
nltk==3.6.2
oauthlib==3.1.0
olefile==0.46
plac==1.1.3
positional-encodings==3.0.0
preshed==3.0.5
protobuf==3.15.6
pyasn1==0.4.8
pyasn1-modules==0.2.8
PyYAML==5.4.1
regex==2021.4.4
requests==2.25.1
requests-oauthlib==1.3.0
rsa==4.7.2
scikit-video==1.1.11
scipy==1.5.4
spacy==2.3.5
srsly==1.0.5
tensorboard==2.4.1
tensorboard-plugin-wit==1.8.0
tensorboardX==2.1
thinc==7.4.5
tqdm==4.59.0
typing-extensions==3.7.4.3
urllib3==1.26.4
wasabi==0.8.2
Werkzeug==1.0.1
zipp==3.4.1
198  code/run.py  Normal file
@@ -0,0 +1,198 @@
# --------------------------------------------------------
# mcan-vqa (Deep Modular Co-Attention Networks)
# Licensed under The MIT License [see LICENSE for details]
# Written by Yuhao Cui https://github.com/cuiyuhao1996
# --------------------------------------------------------

from cfgs.base_cfgs import Cfgs
from core.exec import Execution
import argparse, yaml, os


def str2bool(v):
    if isinstance(v, bool):
        return v
    if v.lower() in ('yes', 'true', 't', 'y', '1'):
        return True
    elif v.lower() in ('no', 'false', 'f', 'n', '0'):
        return False
    else:
        raise argparse.ArgumentTypeError('Boolean value expected.')


def parse_args():
    '''
    Parse input arguments
    '''
    parser = argparse.ArgumentParser(description='VLCN Args')

    parser.add_argument('--RUN', dest='RUN_MODE',
                        default='train',
                        choices=['train', 'val', 'test'],
                        help='{train, val, test}',
                        type=str)

    parser.add_argument('--MODEL', dest='MODEL',
                        choices=['small', 'large'],
                        help='{small, large}',
                        default='small', type=str)

    parser.add_argument('--OPTIM', dest='OPTIM',
                        choices=['adam', 'rmsprop'],
                        help='The optimizer',
                        default='rmsprop', type=str)

    parser.add_argument('--SPLIT', dest='TRAIN_SPLIT',
                        choices=['train', 'train+val'],
                        help="set the training split, "
                             "e.g. 'train', 'train+val'; "
                             "using 'train' triggers the "
                             "eval after every epoch",
                        default='train',
                        type=str)

    parser.add_argument('--EVAL_EE', dest='EVAL_EVERY_EPOCH',
                        default=True,
                        help='set True to evaluate the '
                             'val split when an epoch finishes '
                             "(only works when training with "
                             "the 'train' split)",
                        type=str2bool)  # type=bool would parse 'False' as True

    parser.add_argument('--SAVE_PRED', dest='TEST_SAVE_PRED',
                        help='set True to save the '
                             'prediction vectors '
                             '(only works in testing)',
                        default=False,
                        type=str2bool)

    parser.add_argument('--BS', dest='BATCH_SIZE',
                        help='batch size during training',
                        default=64,
                        type=int)

    parser.add_argument('--MAX_EPOCH', dest='MAX_EPOCH',
                        default=30,
                        help='max training epoch',
                        type=int)

    parser.add_argument('--PRELOAD', dest='PRELOAD',
                        help='pre-load the features into memory '
                             'to increase the I/O speed',
                        default=False,
                        type=str2bool)

    parser.add_argument('--GPU', dest='GPU',
                        help="gpu select, e.g. '0, 1, 2'",
                        default='0',
                        type=str)

    parser.add_argument('--SEED', dest='SEED',
                        help='fix random seed',
                        default=42,
                        type=int)

    parser.add_argument('--VERSION', dest='VERSION',
                        help='version control',
                        default='1.0.0',
                        type=str)

    parser.add_argument('--RESUME', dest='RESUME',
                        default=False,
                        help='resume training',
                        type=str2bool)

    parser.add_argument('--CKPT_V', dest='CKPT_VERSION',
                        help='checkpoint version',
                        type=str)

    parser.add_argument('--CKPT_E', dest='CKPT_EPOCH',
                        help='checkpoint epoch',
                        type=int)

    parser.add_argument('--CKPT_PATH', dest='CKPT_PATH',
                        help='load checkpoint path; we '
                             'recommend using '
                             'CKPT_VERSION and CKPT_EPOCH '
                             'instead',
                        type=str)

    parser.add_argument('--ACCU', dest='GRAD_ACCU_STEPS',
                        help='reduce gpu memory usage',
                        type=int)

    parser.add_argument('--NW', dest='NUM_WORKERS',
                        help='multithreaded loading',
                        default=0,
                        type=int)

    parser.add_argument('--PINM', dest='PIN_MEM',
                        help='use pin memory',
                        type=str2bool)

    parser.add_argument('--VERB', dest='VERBOSE',
                        help='verbose print',
                        type=str2bool)

    parser.add_argument('--DATA_PATH', dest='DATASET_PATH',
                        default='/projects/abdessaied/data/MSRVTT-QA/',
                        help='Dataset root path',
                        type=str)

    parser.add_argument('--EXP_NAME', dest='EXP_NAME',
                        help='The name of the experiment',
                        default="test",
                        type=str)

    parser.add_argument('--DEBUG', dest='DEBUG',
                        help='Triggers debug mode: small fractions of the data are loaded',
                        default='0',
                        type=str2bool)

    parser.add_argument('--ENABLE_TIME_MONITORING', dest='ENABLE_TIME_MONITORING',
                        help='Triggers time monitoring when training',
                        default='0',
                        type=str2bool)

    parser.add_argument('--MODEL_TYPE', dest='MODEL_TYPE',
                        help='The model type to be used\n 1: VLCN \n 2: VLCN-FLF \n 3: VLCN+LSTM \n 4: MCAN',
                        default=1,
                        type=int)

    parser.add_argument('--PRETRAINED_PATH', dest='PRETRAINED_PATH',
                        help='Pretrained weights on msvd',
                        default='-',
                        type=str)

    parser.add_argument('--TEST_EPOCH', dest='TEST_EPOCH',
                        help='the epoch of the checkpoint used for testing',
                        default=7,
                        type=int)

    args = parser.parse_args()
    return args


if __name__ == '__main__':
    args = parse_args()
    os.chdir(os.path.dirname(os.path.abspath(__file__)))
    __C = Cfgs(args.EXP_NAME, args.DATASET_PATH)
    args_dict = __C.parse_to_dict(args)

    cfg_file = "cfgs/{}_model.yml".format(args.MODEL)
    with open(cfg_file, 'r') as f:
        yaml_dict = yaml.safe_load(f)  # safe_load: yaml.load without a Loader is unsafe and deprecated

    args_dict = {**yaml_dict, **args_dict}

    __C.add_args(args_dict)
    __C.proc()

    print('Hyper Parameters:')
    print(__C)

    __C.check_path()
    os.environ['CUDA_VISIBLE_DEVICES'] = __C.GPU

    execution = Execution(__C)
    execution.run(__C.RUN_MODE)

    # execution.run('test', epoch=__C.TEST_EPOCH)
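A minimal, self-contained sketch of why `run.py` defines `str2bool` instead of passing `type=bool` to argparse: `bool('False')` is `True`, so boolean flags given as strings need explicit parsing. The parser below is hypothetical; only the helper mirrors the file above.

```python
import argparse

def str2bool(v):
    # Mirror of the helper in code/run.py: argparse's type=bool would
    # call bool('False') -> True, so string flags need explicit parsing.
    if isinstance(v, bool):
        return v
    if v.lower() in ('yes', 'true', 't', 'y', '1'):
        return True
    elif v.lower() in ('no', 'false', 'f', 'n', '0'):
        return False
    else:
        raise argparse.ArgumentTypeError('Boolean value expected.')

parser = argparse.ArgumentParser()
parser.add_argument('--RESUME', type=str2bool, default=False)
args = parser.parse_args(['--RESUME', 'False'])
print(args.RESUME)  # False, whereas type=bool would have produced True
```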
0
core/.gitkeep
Normal file

0
core/data/.gitkeep
Normal file

103
core/data/dataset.py
Normal file

@@ -0,0 +1,103 @@
import glob, os, json, pickle
import numpy as np
from collections import defaultdict

import torch
from torch.utils.data import Dataset
import torchvision.transforms as transforms

from core.data.utils import tokenize, ans_stat, proc_ques, qlen_to_key, ans_to_key


class VideoQA_Dataset(Dataset):
    def __init__(self, __C):
        super(VideoQA_Dataset, self).__init__()
        self.__C = __C
        self.ans_size = __C.NUM_ANS
        # load raw data
        with open(__C.QA_PATH[__C.RUN_MODE], 'r') as f:
            self.raw_data = json.load(f)
        self.data_size = len(self.raw_data)

        splits = __C.SPLIT[__C.RUN_MODE].split('+')

        frames_list = glob.glob(__C.FRAMES + '*.pt')
        clips_list = glob.glob(__C.CLIPS + '*.pt')
        if 'msvd' in self.__C.DATASET_PATH.lower():
            vid_ids = [int(s.split('/')[-1].split('.')[0][3:]) for s in frames_list]
        else:
            vid_ids = [int(s.split('/')[-1].split('.')[0][5:]) for s in frames_list]
        self.frames_dict = {k: v for (k, v) in zip(vid_ids, frames_list)}
        self.clips_dict = {k: v for (k, v) in zip(vid_ids, clips_list)}
        del frames_list, clips_list

        q_list = []
        a_list = []
        a_dict = defaultdict(int)
        for split in ['train', 'val']:
            with open(__C.QA_PATH[split], 'r') as f:
                qa_data = json.load(f)
            for d in qa_data:
                q_list.append(d['question'])
                a_list.append(d['answer'])  # append, not assign
                a_dict[d['answer']] += 1

        top_answers = sorted(a_dict, key=a_dict.get, reverse=True)
        self.qlen_bins_to_idx = {
            '1-3': 0,
            '4-8': 1,
            '9-15': 2,
        }
        self.ans_rare_to_idx = {
            '0-99': 0,
            '100-299': 1,
            '300-999': 2,
        }
        self.qtypes_to_idx = {
            'what': 0,
            'who': 1,
            'how': 2,
            'when': 3,
            'where': 4,
        }

        if __C.RUN_MODE == 'train':
            self.ans_list = top_answers[:self.ans_size]

            self.ans_to_ix, self.ix_to_ans = ans_stat(self.ans_list)

            self.token_to_ix, self.pretrained_emb = tokenize(q_list, __C.USE_GLOVE)
            self.token_size = self.token_to_ix.__len__()
            print('== Question token vocab size:', self.token_size)

        self.idx_to_qtypes = {v: k for (k, v) in self.qtypes_to_idx.items()}
        self.idx_to_qlen_bins = {v: k for (k, v) in self.qlen_bins_to_idx.items()}
        self.idx_to_ans_rare = {v: k for (k, v) in self.ans_rare_to_idx.items()}

    def __getitem__(self, idx):
        sample = self.raw_data[idx]
        ques = sample['question']
        q_type = self.qtypes_to_idx[ques.split(' ')[0]]
        ques_idx, qlen, _ = proc_ques(ques, self.token_to_ix, self.__C.MAX_TOKEN)
        qlen_bin = self.qlen_bins_to_idx[qlen_to_key(qlen)]

        answer = sample['answer']
        answer = self.ans_to_ix.get(answer, np.random.randint(0, high=len(self.ans_list)))
        ans_rarity = self.ans_rare_to_idx[ans_to_key(answer)]

        answer_one_hot = torch.zeros(self.ans_size)
        answer_one_hot[answer] = 1.0

        vid_id = sample['video_id']
        frames = torch.load(open(self.frames_dict[vid_id], 'rb')).cpu()
        clips = torch.load(open(self.clips_dict[vid_id], 'rb')).cpu()

        return torch.from_numpy(ques_idx).long(), frames, clips, answer_one_hot, torch.tensor(answer).long(), \
            torch.tensor(q_type).long(), torch.tensor(qlen_bin).long(), torch.tensor(ans_rarity).long()

    def __len__(self):
        return self.data_size
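The answer-vocabulary construction above (count answers over train+val, keep the `NUM_ANS` most frequent) can be sketched in isolation with `collections.Counter`. The toy records and `num_ans` value below are illustrative stand-ins for the JSON files read from `__C.QA_PATH`:

```python
from collections import Counter

# Toy QA records standing in for the train/val JSON files; the real
# loader keeps the __C.NUM_ANS most frequent answers as the label space.
qa_data = [
    {'question': 'what is the man doing', 'answer': 'cooking'},
    {'question': 'who is running', 'answer': 'dog'},
    {'question': 'what is shown', 'answer': 'cooking'},
    {'question': 'what flies by', 'answer': 'bird'},
]

counts = Counter(d['answer'] for d in qa_data)
num_ans = 2
ans_list = [a for a, _ in counts.most_common(num_ans)]
ans_to_ix = {ans: ix for ix, ans in enumerate(ans_list)}
ix_to_ans = {ix: ans for ix, ans in enumerate(ans_list)}
print(ans_list[0])  # 'cooking' — the most frequent answer gets index 0
```

Out-of-vocabulary answers at training time are then handled by `ans_to_ix.get(...)` with a random fallback index, as in `__getitem__` above.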
182
core/data/preprocess.py
Normal file

@@ -0,0 +1,182 @@
import os
import sys
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

import skvideo.io as skv
import torch
import pickle
from PIL import Image
import tqdm
import numpy as np
from model.C3D import C3D
import json
from torchvision.models import vgg19
import torchvision.transforms as transforms
import torch.nn as nn
import argparse


def _select_frames(path, frame_num):
    """Select representative frames for a video.

    Some frames at the beginning and end of the video are ignored.

    Args:
        path: Path of video.
        frame_num: Number of frames to sample.
    Returns:
        frames: list of frames.
    """
    frames = list()
    video_data = skv.vread(path)
    total_frames = video_data.shape[0]
    # Ignore some frames at the beginning and end.
    for i in np.linspace(0, total_frames, frame_num + 2)[1:frame_num + 1]:
        frame_data = video_data[int(i)]
        img = Image.fromarray(frame_data)
        img = img.resize((224, 224), Image.BILINEAR)
        frame_data = np.array(img)
        frames.append(frame_data)
    return frames


def _select_clips(path, clip_num):
    """Select clip_num clips from a video. Each clip has 16 frames.

    Args:
        path: Path of video.
        clip_num: Number of clips to sample.
    Returns:
        clips: list of clips.
    """
    clips = list()
    video_data = skv.vread(path)
    total_frames = video_data.shape[0]
    height = video_data.shape[1]
    width = video_data.shape[2]
    for i in np.linspace(0, total_frames, clip_num + 2)[1:clip_num + 1]:
        # Select the center frame first, then include the surrounding frames
        clip_start = int(i) - 8
        clip_end = int(i) + 8
        if clip_start < 0:
            clip_end = clip_end - clip_start
            clip_start = 0
        if clip_end > total_frames:
            clip_start = clip_start - (clip_end - total_frames)
            clip_end = total_frames
        clip = video_data[clip_start:clip_end]
        new_clip = []
        for j in range(16):
            frame_data = clip[j]
            img = Image.fromarray(frame_data)
            img = img.resize((112, 112), Image.BILINEAR)
            frame_data = np.array(img) * 1.0
            new_clip.append(frame_data)
        clips.append(new_clip)
    return clips


def preprocess_videos(video_dir, frame_num, clip_num):
    frames_dir = os.path.join(os.path.dirname(video_dir), 'frames')
    os.mkdir(frames_dir)

    clips_dir = os.path.join(os.path.dirname(video_dir), 'clips')
    os.mkdir(clips_dir)

    for video_name in tqdm.tqdm(os.listdir(video_dir)):
        video_path = os.path.join(video_dir, video_name)
        frames = _select_frames(video_path, frame_num)
        clips = _select_clips(video_path, clip_num)

        with open(os.path.join(frames_dir, video_name.split('.')[0] + '.pkl'), "wb") as f:
            pickle.dump(frames, f, protocol=pickle.HIGHEST_PROTOCOL)

        with open(os.path.join(clips_dir, video_name.split('.')[0] + '.pkl'), "wb") as f:
            pickle.dump(clips, f, protocol=pickle.HIGHEST_PROTOCOL)


def generate_video_features(path_frames, path_clips, c3d_path):
    device = torch.device('cuda:0')
    frame_feat_dir = os.path.join(os.path.dirname(path_frames), 'frame_feat')
    os.makedirs(frame_feat_dir, exist_ok=True)

    clip_feat_dir = os.path.join(os.path.dirname(path_frames), 'clip_feat')
    os.makedirs(clip_feat_dir, exist_ok=True)

    cnn = vgg19(pretrained=True)
    in_features = cnn.classifier[-1].in_features
    cnn.classifier = nn.Sequential(
        *list(cnn.classifier.children())[:-1])  # remove the last fc layer
    cnn.to(device).eval()
    c3d = C3D()
    c3d.load_state_dict(torch.load(c3d_path))
    c3d.to(device).eval()
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.485, 0.456, 0.406),
                                                         (0.229, 0.224, 0.225))])
    for vid_name in tqdm.tqdm(os.listdir(path_frames)):
        frame_path = os.path.join(path_frames, vid_name)
        clip_path = os.path.join(path_clips, vid_name)

        frames = pickle.load(open(frame_path, 'rb'))
        clips = pickle.load(open(clip_path, 'rb'))

        frames = [transform(f) for f in frames]
        frame_feat = []
        clip_feat = []

        for frame in frames:
            with torch.no_grad():
                feat = cnn(frame.unsqueeze(0).to(device))
            frame_feat.append(feat)
        for clip in clips:
            # clip has shape (f x h x w x c); C3D expects (c x f x h x w)
            clip = torch.from_numpy(np.float32(np.array(clip)))
            clip = clip.transpose(3, 0)
            clip = clip.transpose(3, 1)
            clip = clip.transpose(3, 2).unsqueeze(0).to(device)
            with torch.no_grad():
                feat = c3d(clip)
            clip_feat.append(feat)
        frame_feat = torch.cat(frame_feat, dim=0)
        clip_feat = torch.cat(clip_feat, dim=0)

        torch.save(frame_feat, os.path.join(frame_feat_dir, vid_name.split('.')[0] + '.pt'))
        torch.save(clip_feat, os.path.join(clip_feat_dir, vid_name.split('.')[0] + '.pt'))


def parse_args():
    '''
    Parse input arguments
    '''
    parser = argparse.ArgumentParser(description='Preprocessing Args')

    parser.add_argument('--RAW_VID_PATH', dest='RAW_VID_PATH',
                        help='The path to the raw videos',
                        required=True,
                        type=str)

    parser.add_argument('--FRAMES_OUTPUT_DIR', dest='FRAMES_OUTPUT_DIR',
                        help='The directory where the processed frames and their features will be stored',
                        required=True,
                        type=str)

    parser.add_argument('--CLIPS_OUTPUT_DIR', dest='CLIPS_OUTPUT_DIR',
                        help='The directory where the processed clips and their features will be stored',
                        required=True,
                        type=str)

    parser.add_argument('--C3D_PATH', dest='C3D_PATH',
                        help='Pretrained C3D path',
                        required=True,
                        type=str)

    parser.add_argument('--NUM_SAMPLES', dest='NUM_SAMPLES',
                        help='The number of frames/clips to be sampled from the video',
                        default=20,
                        type=int)

    args = parser.parse_args()
    return args


if __name__ == '__main__':
    args = parse_args()
    preprocess_videos(args.RAW_VID_PATH, args.NUM_SAMPLES, args.NUM_SAMPLES)
    frames_dir = os.path.join(os.path.dirname(args.RAW_VID_PATH), 'frames')
    clips_dir = os.path.join(os.path.dirname(args.RAW_VID_PATH), 'clips')
    generate_video_features(frames_dir, clips_dir, args.C3D_PATH)
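The boundary handling in `_select_clips` (shift the 16-frame window back inside the video when it overruns either end) is the subtle part of that function; here it is as a hypothetical standalone helper, `clip_window`, with the same arithmetic:

```python
def clip_window(center, total_frames, clip_len=16):
    # Standalone version of the boundary handling in _select_clips:
    # take clip_len frames around `center`, shifting the window back
    # inside [0, total_frames] when it overruns either end.
    clip_start = int(center) - clip_len // 2
    clip_end = int(center) + clip_len // 2
    if clip_start < 0:
        clip_end = clip_end - clip_start
        clip_start = 0
    if clip_end > total_frames:
        clip_start = clip_start - (clip_end - total_frames)
        clip_end = total_frames
    return clip_start, clip_end

print(clip_window(3, 100))   # (0, 16): shifted right away from the start
print(clip_window(98, 100))  # (84, 100): shifted left away from the end
print(clip_window(50, 100))  # (42, 58): centered window
```

Every returned window has exactly `clip_len` frames as long as the video has at least that many, which is why the inner loop over `range(16)` in `_select_clips` is safe.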
81
core/data/utils.py
Normal file

@@ -0,0 +1,81 @@
import en_vectors_web_lg, random, re, json
import numpy as np


def tokenize(ques_list, use_glove):
    token_to_ix = {
        'PAD': 0,
        'UNK': 1,
    }

    spacy_tool = None
    pretrained_emb = []
    if use_glove:
        spacy_tool = en_vectors_web_lg.load()
        pretrained_emb.append(spacy_tool('PAD').vector)
        pretrained_emb.append(spacy_tool('UNK').vector)

    for ques in ques_list:
        words = re.sub(
            r"([.,'!?\"()*#:;])",
            '',
            ques.lower()
        ).replace('-', ' ').replace('/', ' ').split()

        for word in words:
            if word not in token_to_ix:
                token_to_ix[word] = len(token_to_ix)
                if use_glove:
                    pretrained_emb.append(spacy_tool(word).vector)

    pretrained_emb = np.array(pretrained_emb)

    return token_to_ix, pretrained_emb


def proc_ques(ques, token_to_ix, max_token):
    ques_ix = np.zeros(max_token, np.int64)

    words = re.sub(
        r"([.,'!?\"()*#:;])",
        '',
        ques.lower()
    ).replace('-', ' ').replace('/', ' ').split()
    q_len = 0
    for ix, word in enumerate(words):
        if word in token_to_ix:
            ques_ix[ix] = token_to_ix[word]
            q_len += 1
        else:
            ques_ix[ix] = token_to_ix['UNK']

        if ix + 1 == max_token:
            break

    return ques_ix, q_len, len(words)


def ans_stat(ans_list):
    ans_to_ix, ix_to_ans = {}, {}
    for i, ans in enumerate(ans_list):
        ans_to_ix[ans] = i
        ix_to_ans[i] = ans

    return ans_to_ix, ix_to_ans


def shuffle_list(ans_list):
    random.shuffle(ans_list)


def qlen_to_key(q_len):
    if 1 <= q_len <= 3:
        return '1-3'
    if 4 <= q_len <= 8:
        return '4-8'
    if 9 <= q_len:
        return '9-15'


def ans_to_key(ans_idx):
    if 0 <= ans_idx <= 99:
        return '0-99'
    if 100 <= ans_idx <= 299:
        return '100-299'
    if 300 <= ans_idx <= 999:
        return '300-999'
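The question-processing logic in `proc_ques` (strip punctuation, lower-case, map words to indices with `UNK` fallback, zero-pad to `max_token`) can be exercised in isolation; the function body below mirrors `core/data/utils.py`, while the toy vocabulary is an illustrative stand-in for the one built by `tokenize()`:

```python
import re
import numpy as np

def proc_ques(ques, token_to_ix, max_token):
    # Same logic as core/data/utils.py: strip punctuation, lower-case,
    # map words to indices (UNK for out-of-vocab), zero-pad to max_token.
    ques_ix = np.zeros(max_token, np.int64)
    words = re.sub(r"([.,'!?\"()*#:;])", '', ques.lower()
                   ).replace('-', ' ').replace('/', ' ').split()
    q_len = 0
    for ix, word in enumerate(words):
        if word in token_to_ix:
            ques_ix[ix] = token_to_ix[word]
            q_len += 1
        else:
            ques_ix[ix] = token_to_ix['UNK']
        if ix + 1 == max_token:
            break
    return ques_ix, q_len, len(words)

# Toy vocabulary; the real one is built by tokenize() over all questions.
vocab = {'PAD': 0, 'UNK': 1, 'what': 2, 'is': 3, 'the': 4, 'man': 5, 'doing': 6}
ques_ix, q_len, n_words = proc_ques("What is the man doing?", vocab, max_token=8)
print(ques_ix.tolist())  # [2, 3, 4, 5, 6, 0, 0, 0]
print(q_len, n_words)    # 5 5
```

Note that `q_len` counts only in-vocabulary words, which is what feeds `qlen_to_key` for the question-length bins.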
523
core/exec.py
Normal file

@@ -0,0 +1,523 @@
# --------------------------------------------------------
# mcan-vqa (Deep Modular Co-Attention Networks)
# Licensed under The MIT License [see LICENSE for details]
# Written by Yuhao Cui https://github.com/cuiyuhao1996
# --------------------------------------------------------

from core.data.dataset import VideoQA_Dataset
from core.model.net import Net1, Net2, Net3, Net4
from core.model.optim import get_optim, adjust_lr
from core.metrics import get_acc
from tqdm import tqdm
from core.data.utils import shuffle_list

import os, json, torch, datetime, pickle, copy, shutil, time, math
import numpy as np
import torch.nn as nn
import torch.utils.data as Data
from tensorboardX import SummaryWriter
from torch.autograd import Variable as var


class Execution:
    def __init__(self, __C):
        self.__C = __C
        print('Loading training set ........')
        __C_train = copy.deepcopy(self.__C)
        setattr(__C_train, 'RUN_MODE', 'train')
        self.dataset = VideoQA_Dataset(__C_train)

        self.dataset_eval = None
        if self.__C.EVAL_EVERY_EPOCH:
            __C_eval = copy.deepcopy(self.__C)
            setattr(__C_eval, 'RUN_MODE', 'val')

            print('Loading validation set for per-epoch evaluation ........')
            self.dataset_eval = VideoQA_Dataset(__C_eval)
            self.dataset_eval.ans_list = self.dataset.ans_list
            self.dataset_eval.ans_to_ix, self.dataset_eval.ix_to_ans = self.dataset.ans_to_ix, self.dataset.ix_to_ans
            self.dataset_eval.token_to_ix, self.dataset_eval.pretrained_emb = self.dataset.token_to_ix, self.dataset.pretrained_emb

        __C_test = copy.deepcopy(self.__C)
        setattr(__C_test, 'RUN_MODE', 'test')

        self.dataset_test = VideoQA_Dataset(__C_test)
        self.dataset_test.ans_list = self.dataset.ans_list
        self.dataset_test.ans_to_ix, self.dataset_test.ix_to_ans = self.dataset.ans_to_ix, self.dataset.ix_to_ans
        self.dataset_test.token_to_ix, self.dataset_test.pretrained_emb = self.dataset.token_to_ix, self.dataset.pretrained_emb

        self.writer = SummaryWriter(self.__C.TB_PATH)

    def train(self, dataset, dataset_eval=None):
        # Obtain the needed information
        data_size = dataset.data_size
        token_size = dataset.token_size
        ans_size = dataset.ans_size
        pretrained_emb = dataset.pretrained_emb
        net = self.construct_net(self.__C.MODEL_TYPE)
        if os.path.isfile(self.__C.PRETRAINED_PATH) and self.__C.MODEL_TYPE == 1:
            print('Loading pretrained DNC-weights')
            net.load_pretrained_weights()
        net.cuda()
        net.train()

        # Define the multi-gpu training if needed
        if self.__C.N_GPU > 1:
            net = nn.DataParallel(net, device_ids=self.__C.DEVICES)

        # Define the binary cross entropy loss
        loss_fn = torch.nn.BCELoss(reduction='sum').cuda()

        # Load checkpoint if resuming training
        if self.__C.RESUME:
            print(' ========== Resume training')

            if self.__C.CKPT_PATH is not None:
                print('Warning: you are now using the CKPT_PATH args, '
                      'CKPT_VERSION and CKPT_EPOCH will not work')
                path = self.__C.CKPT_PATH
            else:
                path = self.__C.CKPTS_PATH + \
                       'ckpt_' + self.__C.CKPT_VERSION + \
                       '/epoch' + str(self.__C.CKPT_EPOCH) + '.pkl'

            # Load the network parameters
            print('Loading ckpt {}'.format(path))
            ckpt = torch.load(path)
            print('Finish!')
            net.load_state_dict(ckpt['state_dict'])

            # Load the optimizer parameters
            optim = get_optim(self.__C, net, data_size, ckpt['optim'], lr_base=ckpt['lr_base'])
            optim._step = int(data_size / self.__C.BATCH_SIZE * self.__C.CKPT_EPOCH)
            optim.optimizer.load_state_dict(ckpt['optimizer'])

            start_epoch = self.__C.CKPT_EPOCH

        else:
            if ('ckpt_' + self.__C.VERSION) in os.listdir(self.__C.CKPTS_PATH):
                shutil.rmtree(self.__C.CKPTS_PATH + 'ckpt_' + self.__C.VERSION)

            os.mkdir(self.__C.CKPTS_PATH + 'ckpt_' + self.__C.VERSION)

            optim = get_optim(self.__C, net, data_size, self.__C.OPTIM)
            start_epoch = 0

        loss_sum = 0
        named_params = list(net.named_parameters())
        grad_norm = np.zeros(len(named_params))

        # Define the multi-thread dataloader
        if self.__C.SHUFFLE_MODE in ['external']:
            dataloader = Data.DataLoader(
                dataset,
                batch_size=self.__C.BATCH_SIZE,
                shuffle=False,
                num_workers=self.__C.NUM_WORKERS,
                pin_memory=self.__C.PIN_MEM,
                drop_last=True
            )
        else:
            dataloader = Data.DataLoader(
                dataset,
                batch_size=self.__C.BATCH_SIZE,
                shuffle=True,
                num_workers=self.__C.NUM_WORKERS,
                pin_memory=self.__C.PIN_MEM,
                drop_last=True
            )

        # Training script
        for epoch in range(start_epoch, self.__C.MAX_EPOCH):

            # Save log information
            logfile = open(
                self.__C.LOG_PATH +
                'log_run_' + self.__C.VERSION + '.txt',
                'a+'
            )
            logfile.write(
                'nowTime: ' +
                datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S') +
                '\n'
            )
            logfile.close()

            # Learning Rate Decay
            if epoch in self.__C.LR_DECAY_LIST:
                adjust_lr(optim, self.__C.LR_DECAY_R)

            # Externally shuffle
            if self.__C.SHUFFLE_MODE == 'external':
                shuffle_list(dataset.ans_list)

            time_start = time.time()
            # Iteration
            for step, (
                    ques_ix_iter,
                    frames_feat_iter,
                    clips_feat_iter,
                    ans_iter,
                    _,
                    _,
                    _,
                    _
            ) in enumerate(dataloader):

                ques_ix_iter = ques_ix_iter.cuda()
                frames_feat_iter = frames_feat_iter.cuda()
                clips_feat_iter = clips_feat_iter.cuda()
                ans_iter = ans_iter.cuda()

                optim.zero_grad()

                for accu_step in range(self.__C.GRAD_ACCU_STEPS):

                    sub_frames_feat_iter = \
                        frames_feat_iter[accu_step * self.__C.SUB_BATCH_SIZE:
                                         (accu_step + 1) * self.__C.SUB_BATCH_SIZE]
                    sub_clips_feat_iter = \
                        clips_feat_iter[accu_step * self.__C.SUB_BATCH_SIZE:
                                        (accu_step + 1) * self.__C.SUB_BATCH_SIZE]
                    sub_ques_ix_iter = \
                        ques_ix_iter[accu_step * self.__C.SUB_BATCH_SIZE:
                                     (accu_step + 1) * self.__C.SUB_BATCH_SIZE]
                    sub_ans_iter = \
                        ans_iter[accu_step * self.__C.SUB_BATCH_SIZE:
                                 (accu_step + 1) * self.__C.SUB_BATCH_SIZE]

                    pred = net(
                        sub_frames_feat_iter,
                        sub_clips_feat_iter,
                        sub_ques_ix_iter
                    )

                    loss = loss_fn(pred, sub_ans_iter)

                    # Only a mean-reduced loss needs to be divided by grad_accu_steps.
                    # Removing this line wouldn't change our results because of the
                    # normalization in the Adam optimizer, but it would be necessary
                    # with the SGD optimizer.
                    # loss /= self.__C.GRAD_ACCU_STEPS
                    loss.backward()

                    if self.__C.VERBOSE:
                        if dataset_eval is not None:
                            mode_str = self.__C.SPLIT['train'] + '->' + self.__C.SPLIT['val']
                        else:
                            mode_str = self.__C.SPLIT['train'] + '->' + self.__C.SPLIT['test']

                        # logging
                        self.writer.add_scalar(
                            'train/loss',
                            loss.cpu().data.numpy() / self.__C.SUB_BATCH_SIZE,
                            global_step=step + epoch * math.ceil(data_size / self.__C.BATCH_SIZE))

                        self.writer.add_scalar(
                            'train/lr',
                            optim._rate,
                            global_step=step + epoch * math.ceil(data_size / self.__C.BATCH_SIZE))

                        print("\r[exp_name %s][version %s][epoch %2d][step %4d/%4d][%s] loss: %.4f, lr: %.2e" % (
                            self.__C.EXP_NAME,
                            self.__C.VERSION,
                            epoch + 1,
                            step,
                            int(data_size / self.__C.BATCH_SIZE),
                            mode_str,
                            loss.cpu().data.numpy() / self.__C.SUB_BATCH_SIZE,
                            optim._rate,
                        ), end=' ')

                # Gradient norm clipping
                if self.__C.GRAD_NORM_CLIP > 0:
                    nn.utils.clip_grad_norm_(
                        net.parameters(),
                        self.__C.GRAD_NORM_CLIP
                    )

                # Save the gradient information
                for name in range(len(named_params)):
                    norm_v = torch.norm(named_params[name][1].grad).cpu().data.numpy() \
                        if named_params[name][1].grad is not None else 0
                    grad_norm[name] += norm_v * self.__C.GRAD_ACCU_STEPS

                optim.step()

            time_end = time.time()
            print('Finished in {}s'.format(int(time_end - time_start)))

            epoch_finish = epoch + 1

            # Save checkpoint
            state = {
                'state_dict': net.state_dict(),
                'optimizer': optim.optimizer.state_dict(),
                'lr_base': optim.lr_base,
                'optim': optim.lr_base,
            }

            torch.save(
                state,
                self.__C.CKPTS_PATH +
                'ckpt_' + self.__C.VERSION +
                '/epoch' + str(epoch_finish) +
                '.pkl'
            )

            # Logging
            logfile = open(
                self.__C.LOG_PATH +
                'log_run_' + self.__C.VERSION + '.txt',
                'a+'
            )
            logfile.write(
                'epoch = ' + str(epoch_finish) +
                '  loss = ' + str(loss_sum / data_size) +
                '\n' +
                'lr = ' + str(optim._rate) +
                '\n\n'
            )
            logfile.close()

            # Eval after every epoch
            if dataset_eval is not None:
                self.eval(
                    net,
                    dataset_eval,
                    self.writer,
                    epoch,
                    valid=True,
                )

            loss_sum = 0
            grad_norm = np.zeros(len(named_params))

    # Evaluation
    def eval(self, net, dataset, writer, epoch, valid=False):

        ans_ix_list = []
        pred_list = []
        q_type_list = []
        q_bin_list = []
        ans_rarity_list = []

        ans_qtype_dict = {'what': [], 'who': [], 'how': [], 'when': [], 'where': []}
        pred_qtype_dict = {'what': [], 'who': [], 'how': [], 'when': [], 'where': []}

        ans_qlen_bin_dict = {'1-3': [], '4-8': [], '9-15': []}
        pred_qlen_bin_dict = {'1-3': [], '4-8': [], '9-15': []}

        ans_ans_rarity_dict = {'0-99': [], '100-299': [], '300-999': []}
        pred_ans_rarity_dict = {'0-99': [], '100-299': [], '300-999': []}

        data_size = dataset.data_size

        net.eval()

        if self.__C.N_GPU > 1:
            net = nn.DataParallel(net, device_ids=self.__C.DEVICES)

        dataloader = Data.DataLoader(
            dataset,
            batch_size=self.__C.EVAL_BATCH_SIZE,
            shuffle=False,
            num_workers=self.__C.NUM_WORKERS,
            pin_memory=True
        )

        for step, (
                ques_ix_iter,
                frames_feat_iter,
                clips_feat_iter,
                _,
                ans_iter,
                q_type,
                qlen_bin,
                ans_rarity
        ) in enumerate(dataloader):
            print("\rEvaluation: [step %4d/%4d]" % (
                step,
                int(data_size / self.__C.EVAL_BATCH_SIZE),
            ), end=' ')
            ques_ix_iter = ques_ix_iter.cuda()
            frames_feat_iter = frames_feat_iter.cuda()
            clips_feat_iter = clips_feat_iter.cuda()
            with torch.no_grad():
                pred = net(
                    frames_feat_iter,
                    clips_feat_iter,
                    ques_ix_iter
                )

            pred_np = pred.cpu().data.numpy()
            pred_argmax = np.argmax(pred_np, axis=1)
            pred_list.extend(pred_argmax)
            ans_ix_list.extend(ans_iter.tolist())
            q_type_list.extend(q_type.tolist())
            q_bin_list.extend(qlen_bin.tolist())
            ans_rarity_list.extend(ans_rarity.tolist())

        print('')

        assert len(pred_list) == len(ans_ix_list) == len(q_type_list) == len(q_bin_list) == len(ans_rarity_list)
        pred_list = [dataset.ix_to_ans[pred] for pred in pred_list]
        ans_ix_list = [dataset.ix_to_ans[ans] for ans in ans_ix_list]

        # Run the validation script
        scores_per_qtype = {
            'what': {},
            'who': {},
            'how': {},
            'when': {},
            'where': {},
        }
        scores_per_qlen_bin = {
            '1-3': {},
            '4-8': {},
            '9-15': {},
        }
        scores_ans_rarity_dict = {
            '0-99': {},
            '100-299': {},
            '300-999': {}
        }

        if valid:
            for pred, ans, q_type in zip(pred_list, ans_ix_list, q_type_list):
                pred_qtype_dict[dataset.idx_to_qtypes[q_type]].append(pred)
                ans_qtype_dict[dataset.idx_to_qtypes[q_type]].append(ans)

            print('----------------- Computing scores -----------------')
            acc = get_acc(ans_ix_list, pred_list)
            print('----------------- Overall -----------------')
            print('acc: {}'.format(acc))
            writer.add_scalar('acc/overall', acc, global_step=epoch)

            for q_type in scores_per_qtype:
                print('----------------- Computing "{}" q-type scores -----------------'.format(q_type))
                acc = get_acc(ans_qtype_dict[q_type], pred_qtype_dict[q_type])
                print('acc: {}'.format(acc))
                writer.add_scalar(
                    'acc/{}'.format(q_type), acc, global_step=epoch)
        else:
            for pred, ans, q_type, qlen_bin, a_rarity in zip(
                    pred_list, ans_ix_list, q_type_list, q_bin_list, ans_rarity_list):

                pred_qtype_dict[dataset.idx_to_qtypes[q_type]].append(pred)
                ans_qtype_dict[dataset.idx_to_qtypes[q_type]].append(ans)

                pred_qlen_bin_dict[dataset.idx_to_qlen_bins[qlen_bin]].append(pred)
                ans_qlen_bin_dict[dataset.idx_to_qlen_bins[qlen_bin]].append(ans)

                pred_ans_rarity_dict[dataset.idx_to_ans_rare[a_rarity]].append(pred)
                ans_ans_rarity_dict[dataset.idx_to_ans_rare[a_rarity]].append(ans)

            print('----------------- Computing overall scores -----------------')
            acc = get_acc(ans_ix_list, pred_list)

            print('----------------- Overall -----------------')
            print('acc: {}'.format(acc))

            print('----------------- Computing q-type scores -----------------')
            for q_type in scores_per_qtype:
                acc = get_acc(ans_qtype_dict[q_type], pred_qtype_dict[q_type])
                print(' {} '.format(q_type))
                print('acc: {}'.format(acc))

            print('----------------- Computing qlen-bins scores -----------------')
            for qlen_bin in scores_per_qlen_bin:
                acc = get_acc(ans_qlen_bin_dict[qlen_bin], pred_qlen_bin_dict[qlen_bin])
                print(' {} '.format(qlen_bin))
                print('acc: {}'.format(acc))

            print('----------------- Computing ans-rarity scores -----------------')
            for a_rarity in scores_ans_rarity_dict:
                acc = get_acc(ans_ans_rarity_dict[a_rarity], pred_ans_rarity_dict[a_rarity])
                print(' {} '.format(a_rarity))
                print('acc: {}'.format(acc))
        net.train()

    def construct_net(self, model_type):
        if model_type == 1:
            net = Net1(
                self.__C,
                self.dataset.pretrained_emb,
                self.dataset.token_size,
                self.dataset.ans_size
            )
        elif model_type == 2:
            net = Net2(
                self.__C,
                self.dataset.pretrained_emb,
                self.dataset.token_size,
                self.dataset.ans_size
            )
        elif model_type == 3:
            net = Net3(
                self.__C,
                self.dataset.pretrained_emb,
                self.dataset.token_size,
                self.dataset.ans_size
            )
        elif model_type == 4:
            net = Net4(
                self.__C,
                self.dataset.pretrained_emb,
                self.dataset.token_size,
                self.dataset.ans_size
            )
        else:
            raise ValueError('Net{} is not supported'.format(model_type))
        return net

    def run(self, run_mode, epoch=None):
        self.set_seed(self.__C.SEED)
        if run_mode == 'train':
            self.empty_log(self.__C.VERSION)
            self.train(self.dataset, self.dataset_eval)

        elif run_mode == 'val':
            self.eval(self.dataset, valid=True)

        elif run_mode == 'test':
            net = self.construct_net(self.__C.MODEL_TYPE)
assert epoch is not None
|
||||
path = self.__C.CKPTS_PATH + \
|
||||
'ckpt_' + self.__C.VERSION + \
|
||||
'/epoch' + str(epoch) + '.pkl'
|
||||
print('Loading ckpt {}'.format(path))
|
||||
state_dict = torch.load(path)['state_dict']
|
||||
net.load_state_dict(state_dict)
|
||||
net.cuda()
|
||||
self.eval(net, self.dataset_test, self.writer, 0)
|
||||
|
||||
else:
|
||||
exit(-1)
|
||||
|
||||
def set_seed(self, seed):
|
||||
"""Sets the seed for reproducibility.
|
||||
Args:
|
||||
seed (int): The seed used
|
||||
"""
|
||||
torch.manual_seed(seed)
|
||||
torch.cuda.manual_seed(seed)
|
||||
torch.backends.cudnn.deterministic = True
|
||||
torch.backends.cudnn.benchmark = False
|
||||
np.random.seed(seed)
|
||||
print('\nSeed set to {}...\n'.format(seed))
|
||||
|
||||
def empty_log(self, version):
|
||||
print('Initializing log file ........')
|
||||
if (os.path.exists(self.__C.LOG_PATH + 'log_run_' + version + '.txt')):
|
||||
os.remove(self.__C.LOG_PATH + 'log_run_' + version + '.txt')
|
||||
print('Finished!')
|
||||
print('')
|
211  core/metrics.py  Normal file
@@ -0,0 +1,211 @@
"""
Author: Mateusz Malinowski
Email: mmalinow@mpi-inf.mpg.de

The script assumes there are two files
- first file with ground truth answers
- second file with predicted answers
both answers are line-aligned

The script also assumes that answer items are comma-separated.
For instance, chair,table,window

It is also a set measure, so not exactly the same as accuracy
even if the Dirac measure is used, since {book,book}=={book} and {book,chair}=={chair,book}

Logs:
05.09.2015 - white spaces surrounding words are stripped away so that {book, chair}=={book,chair}
"""

from numpy import prod
from nltk.corpus import wordnet as wn
from tqdm import tqdm


def file2list(filepath):
    with open(filepath, 'r') as f:
        lines = [k for k in
                 [k.strip() for k in f.readlines()]
                 if len(k) > 0]
    return lines


def list2file(filepath, mylist):
    mylist = '\n'.join(mylist)
    with open(filepath, 'w') as f:
        f.writelines(mylist)


def items2list(x):
    """
    x - string of comma-separated answer items
    """
    return [l.strip() for l in x.split(',')]


def fuzzy_set_membership_measure(x, A, m):
    """
    Set membership measure.
    x: element
    A: set of elements
    m: point-wise element-to-element measure m(a,b) ~ similarity(a,b)

    This function implements a fuzzy set membership measure:
    m(x \in A) = max_{a \in A} m(x,a)
    """
    return 0 if A == [] else max(map(lambda a: m(x, a), A))


def score_it(A, T, m):
    """
    A: list of A items
    T: list of T items
    m: set membership measure
    m(a \in A) gives a membership quality of a into A

    This function implements a fuzzy accuracy score:
    score(A,T) = min{prod_{a \in A} m(a \in T), prod_{t \in T} m(t \in A)}
    where A and T are set representations of the answers
    and m is a measure
    """
    if A == [] and T == []:
        return 1

    score_left = 0 if A == [] else prod(list(map(lambda a: m(a, T), A)))
    score_right = 0 if T == [] else prod(list(map(lambda t: m(t, A), T)))
    return min(score_left, score_right)


# implementations of different measure functions
def dirac_measure(a, b):
    """
    Returns 1 iff a == b and 0 otherwise.
    """
    if a == [] or b == []:
        return 0.0
    return float(a == b)


def wup_measure(a, b, similarity_threshold=0.925):
    """
    Returns the Wu-Palmer similarity score.
    More specifically, it computes:
    max_{x \in interp(a)} max_{y \in interp(b)} wup(x,y)
    where interp is an 'interpretation field'
    """
    def get_semantic_field(a):
        weight = 1.0
        semantic_field = wn.synsets(a, pos=wn.NOUN)
        return (semantic_field, weight)

    def get_stem_word(a):
        """
        Sometimes the answer has the form word\d+:wordid.
        If so, we return the word and downweight.
        """
        weight = 1.0
        return (a, weight)

    global_weight = 1.0

    (a, global_weight_a) = get_stem_word(a)
    (b, global_weight_b) = get_stem_word(b)
    global_weight = min(global_weight_a, global_weight_b)

    if a == b:
        # they are the same
        return 1.0 * global_weight

    if a == [] or b == []:
        return 0

    interp_a, weight_a = get_semantic_field(a)
    interp_b, weight_b = get_semantic_field(b)

    if interp_a == [] or interp_b == []:
        return 0

    # we take the most optimistic interpretation
    global_max = 0.0
    for x in interp_a:
        for y in interp_b:
            local_score = x.wup_similarity(y)
            if local_score > global_max:
                global_max = local_score

    # we need to use the semantic fields and therefore we downweight
    # unless the score is high, which indicates both are synonyms
    if global_max < similarity_threshold:
        interp_weight = 0.1
    else:
        interp_weight = 1.0

    final_score = global_max * weight_a * weight_b * interp_weight * global_weight
    return final_score


def get_scores(input_gt, input_pred, threshold_0=0.0, threshold_1=0.9):
    element_membership_acc = dirac_measure
    element_membership_wups_0 = lambda x, y: wup_measure(x, y, threshold_0)
    element_membership_wups_1 = lambda x, y: wup_measure(x, y, threshold_1)

    set_membership_acc = \
        lambda x, A: fuzzy_set_membership_measure(x, A, element_membership_acc)
    set_membership_wups_0 = \
        lambda x, A: fuzzy_set_membership_measure(x, A, element_membership_wups_0)
    set_membership_wups_1 = \
        lambda x, A: fuzzy_set_membership_measure(x, A, element_membership_wups_1)

    score_list_acc = []
    score_list_wups_0 = []
    score_list_wups_1 = []

    pbar = tqdm(zip(input_gt, input_pred))
    pbar.set_description('Computing Acc')
    for (ta, pa) in pbar:
        score_list_acc.append(score_it(items2list(ta), items2list(pa), set_membership_acc))
    final_score_acc = float(sum(score_list_acc)) / float(len(score_list_acc))
    final_score_acc *= 100.0

    pbar = tqdm(zip(input_gt, input_pred))
    pbar.set_description('Computing Wups_0.0')
    for (ta, pa) in pbar:
        score_list_wups_0.append(score_it(items2list(ta), items2list(pa), set_membership_wups_0))
    final_score_wups_0 = float(sum(score_list_wups_0)) / float(len(score_list_wups_0))
    final_score_wups_0 *= 100.0

    pbar = tqdm(zip(input_gt, input_pred))
    pbar.set_description('Computing Wups_0.9')
    for (ta, pa) in pbar:
        score_list_wups_1.append(score_it(items2list(ta), items2list(pa), set_membership_wups_1))
    final_score_wups_1 = float(sum(score_list_wups_1)) / float(len(score_list_wups_1))
    final_score_wups_1 *= 100.0

    return final_score_acc, final_score_wups_0, final_score_wups_1


def get_acc(gts, preds):
    assert len(gts) == len(preds)
    sum_correct = 0
    for gt, pred in zip(gts, preds):
        if gt == pred:
            sum_correct += 1
    acc = 100.0 * float(sum_correct) / float(len(gts))
    return acc
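The fuzzy set score in `score_it` above can be exercised standalone. This sketch re-implements it with the Dirac measure only (`math.prod` stands in for numpy's `prod`; the helper names are ours), illustrating why it is a set measure: order and duplicates do not change the score.

```python
import math

def dirac(a, b):
    # 1.0 iff the two answer items are identical strings
    return float(a == b)

def membership(x, A):
    # fuzzy set membership: best match of x against any element of A
    return 0 if not A else max(dirac(x, a) for a in A)

def score_it(A, T):
    # fuzzy accuracy: min of the two direction-wise products
    if not A and not T:
        return 1
    left = 0 if not A else math.prod(membership(a, T) for a in A)
    right = 0 if not T else math.prod(membership(t, A) for t in T)
    return min(left, right)

# a set measure: order and duplicates do not matter
s1 = score_it(['book', 'chair'], ['chair', 'book'])  # full credit
s2 = score_it(['book', 'book'], ['book'])            # full credit
s3 = score_it(['book', 'chair'], ['book'])           # no credit: 'chair' unmatched
```

With a WUPS measure instead of `dirac`, partial credit replaces the hard 0/1 memberships, which is exactly what `get_scores` does at thresholds 0.0 and 0.9.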
0  core/model/.gitkeep  Normal file
80  core/model/C3D.py  Normal file
@@ -0,0 +1,80 @@
"""
from https://github.com/DavideA/c3d-pytorch/blob/master/C3D_model.py
"""

import torch.nn as nn


class C3D(nn.Module):
    """
    The C3D network as described in [1].
    """

    def __init__(self):
        super(C3D, self).__init__()

        self.conv1 = nn.Conv3d(3, 64, kernel_size=(3, 3, 3), padding=(1, 1, 1))
        self.pool1 = nn.MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2))

        self.conv2 = nn.Conv3d(64, 128, kernel_size=(3, 3, 3), padding=(1, 1, 1))
        self.pool2 = nn.MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2))

        self.conv3a = nn.Conv3d(128, 256, kernel_size=(3, 3, 3), padding=(1, 1, 1))
        self.conv3b = nn.Conv3d(256, 256, kernel_size=(3, 3, 3), padding=(1, 1, 1))
        self.pool3 = nn.MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2))

        self.conv4a = nn.Conv3d(256, 512, kernel_size=(3, 3, 3), padding=(1, 1, 1))
        self.conv4b = nn.Conv3d(512, 512, kernel_size=(3, 3, 3), padding=(1, 1, 1))
        self.pool4 = nn.MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2))

        self.conv5a = nn.Conv3d(512, 512, kernel_size=(3, 3, 3), padding=(1, 1, 1))
        self.conv5b = nn.Conv3d(512, 512, kernel_size=(3, 3, 3), padding=(1, 1, 1))
        self.pool5 = nn.MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2), padding=(0, 1, 1))

        self.fc6 = nn.Linear(8192, 4096)
        self.fc7 = nn.Linear(4096, 4096)
        self.fc8 = nn.Linear(4096, 487)

        self.dropout = nn.Dropout(p=0.5)

        self.relu = nn.ReLU()
        self.softmax = nn.Softmax()

    def forward(self, x):
        h = self.relu(self.conv1(x))
        h = self.pool1(h)

        h = self.relu(self.conv2(h))
        h = self.pool2(h)

        h = self.relu(self.conv3a(h))
        h = self.relu(self.conv3b(h))
        h = self.pool3(h)

        h = self.relu(self.conv4a(h))
        h = self.relu(self.conv4b(h))
        h = self.pool4(h)

        h = self.relu(self.conv5a(h))
        h = self.relu(self.conv5b(h))
        h = self.pool5(h)

        h = h.view(-1, 8192)
        h = self.relu(self.fc6(h))
        h = self.dropout(h)
        h = self.relu(self.fc7(h))
        # h = self.dropout(h)

        # logits = self.fc8(h)
        # probs = self.softmax(logits)

        return h


"""
References
----------
[1] Tran, Du, et al. "Learning spatiotemporal features with 3d convolutional networks."
    Proceedings of the IEEE international conference on computer vision. 2015.
"""
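The hard-coded `h.view(-1, 8192)` in `C3D.forward` assumes the canonical 16-frame 112x112 input. The pooling arithmetic behind that number can be checked with plain integer math (a sketch only, independent of the model code; conv layers keep shapes since kernel 3, padding 1):

```python
def pool3d(shape, kernel, stride, padding=(0, 0, 0)):
    # floor((dim + 2*pad - kernel) / stride) + 1, per dimension
    return tuple(
        (d + 2 * p - k) // s + 1
        for d, k, s, p in zip(shape, kernel, stride, padding)
    )

# a 16-frame, 112x112 clip: (depth, height, width)
shape = (16, 112, 112)
shape = pool3d(shape, (1, 2, 2), (1, 2, 2))             # pool1 -> (16, 56, 56)
shape = pool3d(shape, (2, 2, 2), (2, 2, 2))             # pool2 -> (8, 28, 28)
shape = pool3d(shape, (2, 2, 2), (2, 2, 2))             # pool3 -> (4, 14, 14)
shape = pool3d(shape, (2, 2, 2), (2, 2, 2))             # pool4 -> (2, 7, 7)
shape = pool3d(shape, (2, 2, 2), (2, 2, 2), (0, 1, 1))  # pool5 -> (1, 4, 4)

flat = 512 * shape[0] * shape[1] * shape[2]  # 512 channels after conv5b
print(flat)  # 8192, matching h.view(-1, 8192)
```

Any other spatial or temporal input size would break the `view` call, which is why the (0, 1, 1) padding on `pool5` matters: it turns the 7x7 maps into 4x4.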
323  core/model/dnc.py  Normal file
@@ -0,0 +1,323 @@
"""
PyTorch DNC implementation from
--> https://github.com/ixaxaar/pytorch-dnc <--
"""
# -*- coding: utf-8 -*-

import torch.nn as nn
import torch as T
from torch.autograd import Variable as var
import numpy as np

from torch.nn.utils.rnn import pad_packed_sequence as pad
from torch.nn.utils.rnn import pack_padded_sequence as pack
from torch.nn.utils.rnn import PackedSequence

from .util import *
from .memory import *

from torch.nn.init import orthogonal_, xavier_uniform_


class DNC(nn.Module):

    def __init__(
        self,
        input_size,
        hidden_size,
        rnn_type='lstm',
        num_layers=1,
        num_hidden_layers=2,
        bias=True,
        batch_first=True,
        dropout=0,
        bidirectional=False,
        nr_cells=5,
        read_heads=2,
        cell_size=10,
        nonlinearity='tanh',
        gpu_id=-1,
        independent_linears=False,
        share_memory=True,
        debug=False,
        clip=20
    ):
        super(DNC, self).__init__()
        # todo: separate weights and RNNs for the interface and output vectors

        self.input_size = input_size
        self.hidden_size = hidden_size
        self.rnn_type = rnn_type
        self.num_layers = num_layers
        self.num_hidden_layers = num_hidden_layers
        self.bias = bias
        self.batch_first = batch_first
        self.dropout = dropout
        self.bidirectional = bidirectional
        self.nr_cells = nr_cells
        self.read_heads = read_heads
        self.cell_size = cell_size
        self.nonlinearity = nonlinearity
        self.gpu_id = gpu_id
        self.independent_linears = independent_linears
        self.share_memory = share_memory
        self.debug = debug
        self.clip = clip

        self.w = self.cell_size
        self.r = self.read_heads

        self.read_vectors_size = self.r * self.w
        self.output_size = self.hidden_size

        self.nn_input_size = self.input_size + self.read_vectors_size
        self.nn_output_size = self.output_size + self.read_vectors_size

        self.rnns = []
        self.memories = []

        for layer in range(self.num_layers):
            if self.rnn_type.lower() == 'rnn':
                self.rnns.append(nn.RNN((self.nn_input_size if layer == 0 else self.nn_output_size), self.output_size,
                                        bias=self.bias, nonlinearity=self.nonlinearity, batch_first=True,
                                        dropout=self.dropout, num_layers=self.num_hidden_layers))
            elif self.rnn_type.lower() == 'gru':
                self.rnns.append(nn.GRU((self.nn_input_size if layer == 0 else self.nn_output_size),
                                        self.output_size, bias=self.bias, batch_first=True,
                                        dropout=self.dropout, num_layers=self.num_hidden_layers))
            elif self.rnn_type.lower() == 'lstm':
                self.rnns.append(nn.LSTM((self.nn_input_size if layer == 0 else self.nn_output_size),
                                         self.output_size, bias=self.bias, batch_first=True,
                                         dropout=self.dropout, num_layers=self.num_hidden_layers))
            setattr(self, self.rnn_type.lower() + '_layer_' + str(layer), self.rnns[layer])

            # memories for each layer
            if not self.share_memory:
                self.memories.append(
                    Memory(
                        input_size=self.output_size,
                        mem_size=self.nr_cells,
                        cell_size=self.w,
                        read_heads=self.r,
                        gpu_id=self.gpu_id,
                        independent_linears=self.independent_linears
                    )
                )
                setattr(self, 'rnn_layer_memory_' + str(layer), self.memories[layer])

        # only one memory shared by all layers
        if self.share_memory:
            self.memories.append(
                Memory(
                    input_size=self.output_size,
                    mem_size=self.nr_cells,
                    cell_size=self.w,
                    read_heads=self.r,
                    gpu_id=self.gpu_id,
                    independent_linears=self.independent_linears
                )
            )
            setattr(self, 'rnn_layer_memory_shared', self.memories[0])

        # final output layer
        self.output = nn.Linear(self.nn_output_size, self.output_size)
        orthogonal_(self.output.weight)

        if self.gpu_id != -1:
            [x.cuda(self.gpu_id) for x in self.rnns]
            [x.cuda(self.gpu_id) for x in self.memories]
            self.output.cuda()

    def _init_hidden(self, hx, batch_size, reset_experience):
        # create empty hidden states if not provided
        if hx is None:
            hx = (None, None, None)
        (chx, mhx, last_read) = hx

        # initialize hidden state of the controller RNN
        if chx is None:
            h = cuda(T.zeros(self.num_hidden_layers, batch_size, self.output_size), gpu_id=self.gpu_id)
            xavier_uniform_(h)

            chx = [(h, h) if self.rnn_type.lower() == 'lstm' else h for x in range(self.num_layers)]

        # last read vectors
        if last_read is None:
            last_read = cuda(T.zeros(batch_size, self.w * self.r), gpu_id=self.gpu_id)

        # memory states
        if mhx is None:
            if self.share_memory:
                mhx = self.memories[0].reset(batch_size, erase=reset_experience)
            else:
                mhx = [m.reset(batch_size, erase=reset_experience) for m in self.memories]
        else:
            if self.share_memory:
                mhx = self.memories[0].reset(batch_size, mhx, erase=reset_experience)
            else:
                mhx = [m.reset(batch_size, h, erase=reset_experience) for m, h in zip(self.memories, mhx)]

        return chx, mhx, last_read

    def _debug(self, mhx, debug_obj):
        if not debug_obj:
            debug_obj = {
                'memory': [],
                'link_matrix': [],
                'precedence': [],
                'read_weights': [],
                'write_weights': [],
                'usage_vector': [],
            }

        debug_obj['memory'].append(mhx['memory'][0].data.cpu().numpy())
        debug_obj['link_matrix'].append(mhx['link_matrix'][0][0].data.cpu().numpy())
        debug_obj['precedence'].append(mhx['precedence'][0].data.cpu().numpy())
        debug_obj['read_weights'].append(mhx['read_weights'][0].data.cpu().numpy())
        debug_obj['write_weights'].append(mhx['write_weights'][0].data.cpu().numpy())
        debug_obj['usage_vector'].append(mhx['usage_vector'][0].unsqueeze(0).data.cpu().numpy())
        return debug_obj

    def _layer_forward(self, input, layer, hx=(None, None), pass_through_memory=True):
        (chx, mhx) = hx

        # pass through the controller layer
        input, chx = self.rnns[layer](input.unsqueeze(1), chx)
        input = input.squeeze(1)

        # clip the controller output
        if self.clip != 0:
            output = T.clamp(input, -self.clip, self.clip)
        else:
            output = input

        # the interface vector
        ξ = output

        # pass through memory
        if pass_through_memory:
            if self.share_memory:
                read_vecs, mhx = self.memories[0](ξ, mhx)
            else:
                read_vecs, mhx = self.memories[layer](ξ, mhx)
            # the read vectors
            read_vectors = read_vecs.view(-1, self.w * self.r)
        else:
            read_vectors = None

        return output, (chx, mhx, read_vectors)

    def forward(self, input, hx=(None, None, None), reset_experience=False, pass_through_memory=True):
        # handle packed data
        is_packed = type(input) is PackedSequence
        if is_packed:
            input, lengths = pad(input)
            max_length = lengths[0]
        else:
            max_length = input.size(1) if self.batch_first else input.size(0)
            lengths = [input.size(1)] * max_length if self.batch_first else [input.size(0)] * max_length

        batch_size = input.size(0) if self.batch_first else input.size(1)

        if not self.batch_first:
            # make the data time-first
            input = input.transpose(0, 1)

        controller_hidden, mem_hidden, last_read = self._init_hidden(hx, batch_size, reset_experience)

        # concat input with last read (or padding) vectors
        inputs = [T.cat([input[:, x, :], last_read], 1) for x in range(max_length)]

        # batched forward pass per element / word / etc
        if self.debug:
            viz = None

        outs = [None] * max_length
        read_vectors = None
        rv = [None] * max_length

        # pass through time
        for time in range(max_length):
            # pass through layers
            for layer in range(self.num_layers):
                # this layer's hidden states
                chx = controller_hidden[layer]
                m = mem_hidden if self.share_memory else mem_hidden[layer]
                # pass through controller
                outs[time], (chx, m, read_vectors) = \
                    self._layer_forward(inputs[time], layer, (chx, m), pass_through_memory)

                # debug memory
                if self.debug:
                    viz = self._debug(m, viz)

                # store the memory back (per layer or shared)
                if self.share_memory:
                    mem_hidden = m
                else:
                    mem_hidden[layer] = m
                controller_hidden[layer] = chx

                if read_vectors is not None:
                    # the controller output + read vectors go into the next layer
                    outs[time] = T.cat([outs[time], read_vectors], 1)
                    if layer == self.num_layers - 1:
                        rv[time] = read_vectors.reshape(batch_size, self.r, self.w)
                else:
                    outs[time] = T.cat([outs[time], last_read], 1)
                inputs[time] = outs[time]

        if self.debug:
            viz = {k: np.array(v) for k, v in viz.items()}
            viz = {k: v.reshape(v.shape[0], v.shape[1] * v.shape[2]) for k, v in viz.items()}

        # pass through the final output layer
        inputs = [self.output(i) for i in inputs]
        outputs = T.stack(inputs, 1 if self.batch_first else 0)

        if is_packed:
            outputs = pack(outputs, lengths)

        if self.debug:
            return outputs, (controller_hidden, mem_hidden, read_vectors), rv, viz
        else:
            return outputs, (controller_hidden, mem_hidden, read_vectors), rv

    def __repr__(self):
        s = "\n----------------------------------------\n"
        s += '{name}({input_size}, {hidden_size}'
        if self.rnn_type != 'lstm':
            s += ', rnn_type={rnn_type}'
        if self.num_layers != 1:
            s += ', num_layers={num_layers}'
        if self.num_hidden_layers != 2:
            s += ', num_hidden_layers={num_hidden_layers}'
        if self.bias != True:
            s += ', bias={bias}'
        if self.batch_first != True:
            s += ', batch_first={batch_first}'
        if self.dropout != 0:
            s += ', dropout={dropout}'
        if self.bidirectional != False:
            s += ', bidirectional={bidirectional}'
        if self.nr_cells != 5:
            s += ', nr_cells={nr_cells}'
        if self.read_heads != 2:
            s += ', read_heads={read_heads}'
        if self.cell_size != 10:
            s += ', cell_size={cell_size}'
        if self.nonlinearity != 'tanh':
            s += ', nonlinearity={nonlinearity}'
        if self.gpu_id != -1:
            s += ', gpu_id={gpu_id}'
        if self.independent_linears != False:
            s += ', independent_linears={independent_linears}'
        if self.share_memory != True:
            s += ', share_memory={share_memory}'
        if self.debug != False:
            s += ', debug={debug}'
        if self.clip != 20:
            s += ', clip={clip}'

        s += ")\n" + super(DNC, self).__repr__() + \
            "\n----------------------------------------\n"
        return s.format(name=self.__class__.__name__, **self.__dict__)
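The controller in the DNC above always consumes its external input concatenated with the previous step's read vectors, so its layer widths follow directly from `read_heads` and `cell_size`. A quick arithmetic sketch (the 512 hidden size is a hypothetical example, not a value from this repository):

```python
# sizes for one DNC layer, mirroring the constructor arithmetic above
input_size, hidden_size = 512, 512
read_heads, cell_size = 2, 10                 # r and w in the module defaults
read_vectors_size = read_heads * cell_size    # r * w flattened read vectors

# controller input: external input ++ last read vectors
nn_input_size = input_size + read_vectors_size
# controller output fed onward: hidden state ++ new read vectors
nn_output_size = hidden_size + read_vectors_size

print(nn_input_size, nn_output_size)
```

This is why the final `nn.Linear(self.nn_output_size, self.output_size)` is needed: it projects the widened output back down to `hidden_size`.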
208  core/model/mca.py  Normal file
@@ -0,0 +1,208 @@
# --------------------------------------------------------
# mcan-vqa (Deep Modular Co-Attention Networks)
# Licensed under The MIT License [see LICENSE for details]
# Written by Yuhao Cui https://github.com/cuiyuhao1996
# --------------------------------------------------------

from core.model.net_utils import FC, MLP, LayerNorm
from core.model.dnc_improved import DNC, SharedMemDNC
from core.model.dnc_improved import FeedforwardController
import torch.nn as nn
import torch.nn.functional as F
import torch, math
import time


# ------------------------------
# ---- Multi-Head Attention ----
# ------------------------------

class MHAtt(nn.Module):
    def __init__(self, __C):
        super(MHAtt, self).__init__()
        self.__C = __C

        self.linear_v = nn.Linear(__C.HIDDEN_SIZE, __C.HIDDEN_SIZE)
        self.linear_k = nn.Linear(__C.HIDDEN_SIZE, __C.HIDDEN_SIZE)
        self.linear_q = nn.Linear(__C.HIDDEN_SIZE, __C.HIDDEN_SIZE)
        self.linear_merge = nn.Linear(__C.HIDDEN_SIZE, __C.HIDDEN_SIZE)

        self.dropout = nn.Dropout(__C.DROPOUT_R)

    def forward(self, v, k, q, mask):
        n_batches = q.size(0)

        v = self.linear_v(v).view(
            n_batches,
            -1,
            self.__C.MULTI_HEAD,
            self.__C.HIDDEN_SIZE_HEAD
        ).transpose(1, 2)

        k = self.linear_k(k).view(
            n_batches,
            -1,
            self.__C.MULTI_HEAD,
            self.__C.HIDDEN_SIZE_HEAD
        ).transpose(1, 2)

        q = self.linear_q(q).view(
            n_batches,
            -1,
            self.__C.MULTI_HEAD,
            self.__C.HIDDEN_SIZE_HEAD
        ).transpose(1, 2)

        atted = self.att(v, k, q, mask)
        atted = atted.transpose(1, 2).contiguous().view(
            n_batches,
            -1,
            self.__C.HIDDEN_SIZE
        )

        atted = self.linear_merge(atted)

        return atted

    def att(self, value, key, query, mask):
        d_k = query.size(-1)

        scores = torch.matmul(
            query, key.transpose(-2, -1)
        ) / math.sqrt(d_k)

        if mask is not None:
            scores = scores.masked_fill(mask, -1e9)

        att_map = F.softmax(scores, dim=-1)
        att_map = self.dropout(att_map)

        return torch.matmul(att_map, value)


# ---------------------------
# ---- Feed Forward Nets ----
# ---------------------------

class FFN(nn.Module):
    def __init__(self, __C):
        super(FFN, self).__init__()

        self.mlp = MLP(
            in_size=__C.HIDDEN_SIZE,
            mid_size=__C.FF_SIZE,
            out_size=__C.HIDDEN_SIZE,
            dropout_r=__C.DROPOUT_R,
            use_relu=True
        )

    def forward(self, x):
        return self.mlp(x)


# ------------------------
# ---- Self Attention ----
# ------------------------

class SA(nn.Module):
    def __init__(self, __C):
        super(SA, self).__init__()
        self.mhatt = MHAtt(__C)
        self.ffn = FFN(__C)

        self.dropout1 = nn.Dropout(__C.DROPOUT_R)
        self.norm1 = LayerNorm(__C.HIDDEN_SIZE)

        self.dropout2 = nn.Dropout(__C.DROPOUT_R)
        self.norm2 = LayerNorm(__C.HIDDEN_SIZE)

    def forward(self, x, x_mask):
        x = self.norm1(x + self.dropout1(
            self.mhatt(x, x, x, x_mask)
        ))

        x = self.norm2(x + self.dropout2(
            self.ffn(x)
        ))

        return x


# -------------------------------
# ---- Self Guided Attention ----
# -------------------------------

class SGA(nn.Module):
    def __init__(self, __C):
        super(SGA, self).__init__()

        self.mhatt1 = MHAtt(__C)
        self.mhatt2 = MHAtt(__C)
        self.ffn = FFN(__C)

        self.dropout1 = nn.Dropout(__C.DROPOUT_R)
        self.norm1 = LayerNorm(__C.HIDDEN_SIZE)

        self.dropout2 = nn.Dropout(__C.DROPOUT_R)
        self.norm2 = LayerNorm(__C.HIDDEN_SIZE)

        self.dropout3 = nn.Dropout(__C.DROPOUT_R)
        self.norm3 = LayerNorm(__C.HIDDEN_SIZE)

    def forward(self, x, y, x_mask, y_mask):
        x = self.norm1(x + self.dropout1(
            self.mhatt1(x, x, x, x_mask)
        ))

        x = self.norm2(x + self.dropout2(
            self.mhatt2(y, y, x, y_mask)
        ))

        x = self.norm3(x + self.dropout3(
            self.ffn(x)
        ))

        return x


# ------------------------------------------------
# ---- MCA Layers Cascaded by Encoder-Decoder ----
# ------------------------------------------------

class MCA_ED(nn.Module):
    def __init__(self, __C):
        super(MCA_ED, self).__init__()

        self.enc_list = nn.ModuleList([SA(__C) for _ in range(__C.LAYER)])
        self.dec_list = nn.ModuleList([SGA(__C) for _ in range(__C.LAYER)])

    def forward(self, x, y, x_mask, y_mask):
        # get hidden vectors
        for enc in self.enc_list:
            x = enc(x, x_mask)

        for dec in self.dec_list:
            y = dec(y, x, y_mask, x_mask)

        return x, y


class VLC(nn.Module):
    def __init__(self, __C):
        super(VLC, self).__init__()

        self.enc_list = nn.ModuleList([SA(__C) for _ in range(__C.LAYER)])
        self.dec_lang_frames_list = nn.ModuleList([SGA(__C) for _ in range(__C.LAYER)])
        self.dec_lang_clips_list = nn.ModuleList([SGA(__C) for _ in range(__C.LAYER)])

    def forward(self, x, y, z, x_mask, y_mask, z_mask):
        # get hidden vectors
        for enc in self.enc_list:
            x = enc(x, x_mask)

        for dec in self.dec_lang_frames_list:
            y = dec(y, x, y_mask, x_mask)

        for dec in self.dec_lang_clips_list:
            z = dec(z, x, z_mask, x_mask)

        return x, y, z
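`MHAtt.att` above is standard scaled dot-product attention. A dependency-free single-head sketch (no mask or dropout; toy vectors of our own choosing) shows the same mechanics: dot products scaled by sqrt(d_k), a row-wise softmax, then a weighted sum of the values.

```python
import math

def scaled_dot_product_attention(query, key, value):
    """Plain-Python sketch of MHAtt.att for one head, no mask/dropout."""
    d_k = len(query[0])
    # scores[i][j] = q_i . k_j / sqrt(d_k)
    scores = [[sum(qc * kc for qc, kc in zip(q, k)) / math.sqrt(d_k)
               for k in key] for q in query]
    # numerically stable row-wise softmax
    weights = []
    for row in scores:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        z = sum(exps)
        weights.append([e / z for e in exps])
    # each output row is a convex combination of the value rows
    return [[sum(w * v[c] for w, v in zip(row, value))
             for c in range(len(value[0]))] for row in weights]

q = [[1.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0]]
out = scaled_dot_product_attention(q, k, v)
```

Since the query aligns with the first key, the first value row receives the larger weight; the masking in `MHAtt.att` simply pushes masked scores to -1e9 so their softmax weight vanishes.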
314  core/model/memory.py  Normal file
@@ -0,0 +1,314 @@
|
|||
"""
|
||||
PyTorch DNC implementation from
|
||||
-->
|
||||
https://github.com/ixaxaar/pytorch-dnc
|
||||
<--
|
||||
"""
|
||||
# -*- coding: utf-8 -*-
|
||||
|
||||
import torch.nn as nn
|
||||
import torch as T
|
||||
from torch.autograd import Variable as var
|
||||
import torch.nn.functional as F
|
||||
import numpy as np
|
||||
|
||||
from core.model.util import *
|
||||
|
||||
|
||||
class Memory(nn.Module):
|
||||
|
||||
def __init__(self, input_size, mem_size=512, cell_size=32, read_heads=4, gpu_id=-1, independent_linears=True):
|
||||
super(Memory, self).__init__()
|
||||
|
||||
self.input_size = input_size
|
||||
self.mem_size = mem_size
|
||||
self.cell_size = cell_size
|
||||
self.read_heads = read_heads
|
||||
self.gpu_id = gpu_id
|
||||
self.independent_linears = independent_linears
|
||||
|
||||
m = self.mem_size
|
||||
w = self.cell_size
|
||||
r = self.read_heads
|
||||
|
||||
if self.independent_linears:
|
||||
self.read_keys_transform = nn.Linear(self.input_size, w * r)
|
||||
self.read_strengths_transform = nn.Linear(self.input_size, r)
|
||||
self.write_key_transform = nn.Linear(self.input_size, w)
|
||||
self.write_strength_transform = nn.Linear(self.input_size, 1)
|
||||
self.erase_vector_transform = nn.Linear(self.input_size, w)
|
||||
self.write_vector_transform = nn.Linear(self.input_size, w)
|
||||
self.free_gates_transform = nn.Linear(self.input_size, r)
|
||||
self.allocation_gate_transform = nn.Linear(self.input_size, 1)
|
||||
self.write_gate_transform = nn.Linear(self.input_size, 1)
|
||||
self.read_modes_transform = nn.Linear(self.input_size, 3 * r)
|
||||
else:
|
||||
self.interface_size = (w * r) + (3 * w) + (5 * r) + 3
|
||||
self.interface_weights = nn.Linear(
|
||||
self.input_size, self.interface_size)
|
||||
|
||||
self.I = cuda(1 - T.eye(m).unsqueeze(0),
|
||||
gpu_id=self.gpu_id) # (1 * n * n)
|
||||
|
||||
    def reset(self, batch_size=1, hidden=None, erase=True):
        m = self.mem_size
        w = self.cell_size
        r = self.read_heads
        b = batch_size

        if hidden is None:
            return {
                'memory': cuda(T.zeros(b, m, w).fill_(0), gpu_id=self.gpu_id),
                'link_matrix': cuda(T.zeros(b, 1, m, m), gpu_id=self.gpu_id),
                'precedence': cuda(T.zeros(b, 1, m), gpu_id=self.gpu_id),
                'read_weights': cuda(T.zeros(b, r, m).fill_(0), gpu_id=self.gpu_id),
                'write_weights': cuda(T.zeros(b, 1, m).fill_(0), gpu_id=self.gpu_id),
                'usage_vector': cuda(T.zeros(b, m), gpu_id=self.gpu_id),
                # 'free_gates': cuda(T.zeros(b, r), gpu_id=self.gpu_id),
                # 'alloc_gates': cuda(T.zeros(b, 1), gpu_id=self.gpu_id),
                # 'write_gates': cuda(T.zeros(b, 1), gpu_id=self.gpu_id),
                # 'read_modes': cuda(T.zeros(b, r, 3), gpu_id=self.gpu_id)
            }
        else:
            hidden['memory'] = hidden['memory'].clone()
            hidden['link_matrix'] = hidden['link_matrix'].clone()
            hidden['precedence'] = hidden['precedence'].clone()
            hidden['read_weights'] = hidden['read_weights'].clone()
            hidden['write_weights'] = hidden['write_weights'].clone()
            hidden['usage_vector'] = hidden['usage_vector'].clone()
            # hidden['free_gates'] = hidden['free_gates'].clone()
            # hidden['alloc_gates'] = hidden['alloc_gates'].clone()
            # hidden['write_gates'] = hidden['write_gates'].clone()
            # hidden['read_modes'] = hidden['read_modes'].clone()

            if erase:
                hidden['memory'].data.fill_(0)
                hidden['link_matrix'].data.zero_()
                hidden['precedence'].data.zero_()
                hidden['read_weights'].data.fill_(0)
                hidden['write_weights'].data.fill_(0)
                hidden['usage_vector'].data.zero_()
                # hidden['free_gates'].data.fill_()
                # hidden['alloc_gates'].data.fill_()
                # hidden['write_gates'].data.fill_()
                # hidden['read_modes'].data.fill_()

        return hidden
    def get_usage_vector(self, usage, free_gates, read_weights, write_weights):
        # write_weights = write_weights.detach()  # detach from the computation graph
        # if read_weights.size(0) > free_gates.size(0):
        #     read_weights = read_weights[:free_gates.size(0), :, :]
        # if usage.size(0) > free_gates.size(0):
        #     usage = usage[:free_gates.size(0), :]
        # if write_weights.size(0) > free_gates.size(0):
        #     write_weights = write_weights[:free_gates.size(0), :, :]
        usage = usage + (1 - usage) * (1 - T.prod(1 - write_weights, 1))
        ψ = T.prod(1 - free_gates.unsqueeze(2) * read_weights, 1)
        return usage * ψ

    def allocate(self, usage, write_gate):
        # ensure values are not too small prior to cumprod.
        usage = δ + (1 - δ) * usage
        batch_size = usage.size(0)
        # free list
        sorted_usage, φ = T.topk(usage, self.mem_size, dim=1, largest=False)

        # cumprod with exclusive=True
        # https://discuss.pytorch.org/t/cumprod-exclusive-true-equivalences/2614/8
        v = var(sorted_usage.data.new(batch_size, 1).fill_(1))
        cat_sorted_usage = T.cat((v, sorted_usage), 1)
        prod_sorted_usage = T.cumprod(cat_sorted_usage, 1)[:, :-1]

        sorted_allocation_weights = (1 - sorted_usage) * prod_sorted_usage.squeeze()

        # construct the reverse sorting index
        # https://stackoverflow.com/questions/2483696/undo-or-reverse-argsort-python
        _, φ_rev = T.topk(φ, k=self.mem_size, dim=1, largest=False)
        allocation_weights = sorted_allocation_weights.gather(1, φ_rev.long())

        return allocation_weights.unsqueeze(1), usage

    def write_weighting(self, memory, write_content_weights, allocation_weights, write_gate, allocation_gate):
        ag = allocation_gate.unsqueeze(-1)
        wg = write_gate.unsqueeze(-1)

        return wg * (ag * allocation_weights + (1 - ag) * write_content_weights)

    def get_link_matrix(self, link_matrix, write_weights, precedence):
        precedence = precedence.unsqueeze(2)
        write_weights_i = write_weights.unsqueeze(3)
        write_weights_j = write_weights.unsqueeze(2)

        prev_scale = 1 - write_weights_i - write_weights_j
        new_link_matrix = write_weights_i * precedence

        link_matrix = prev_scale * link_matrix + new_link_matrix
        # trick to delete diag elems
        return self.I.expand_as(link_matrix) * link_matrix

    def update_precedence(self, precedence, write_weights):
        return (1 - T.sum(write_weights, 2, keepdim=True)) * precedence + write_weights
    def write(self, write_key, write_vector, erase_vector, free_gates, read_strengths, write_strength, write_gate, allocation_gate, hidden):
        # get current usage
        hidden['usage_vector'] = self.get_usage_vector(
            hidden['usage_vector'],
            free_gates,
            hidden['read_weights'],
            hidden['write_weights']
        )

        # lookup memory with write_key and write_strength
        write_content_weights = self.content_weightings(
            hidden['memory'], write_key, write_strength)

        # get memory allocation
        alloc, _ = self.allocate(
            hidden['usage_vector'],
            allocation_gate * write_gate
        )

        # get write weightings
        hidden['write_weights'] = self.write_weighting(
            hidden['memory'],
            write_content_weights,
            alloc,
            write_gate,
            allocation_gate
        )

        weighted_resets = hidden['write_weights'].unsqueeze(
            3) * erase_vector.unsqueeze(2)
        reset_gate = T.prod(1 - weighted_resets, 1)
        # Update memory
        hidden['memory'] = hidden['memory'] * reset_gate

        hidden['memory'] = hidden['memory'] + \
            T.bmm(hidden['write_weights'].transpose(1, 2), write_vector)

        # update link_matrix
        hidden['link_matrix'] = self.get_link_matrix(
            hidden['link_matrix'],
            hidden['write_weights'],
            hidden['precedence']
        )
        hidden['precedence'] = self.update_precedence(
            hidden['precedence'], hidden['write_weights'])

        return hidden

    def content_weightings(self, memory, keys, strengths):
        # if memory.size(0) > keys.size(0):
        #     memory = memory[:keys.size(0), :, :]
        d = θ(memory, keys)
        return σ(d * strengths.unsqueeze(2), 2)

    def directional_weightings(self, link_matrix, read_weights):
        rw = read_weights.unsqueeze(1)

        f = T.matmul(link_matrix, rw.transpose(2, 3)).transpose(2, 3)
        b = T.matmul(rw, link_matrix)
        return f.transpose(1, 2), b.transpose(1, 2)

    def read_weightings(self, memory, content_weights, link_matrix, read_modes, read_weights):
        forward_weight, backward_weight = self.directional_weightings(
            link_matrix, read_weights)

        content_mode = read_modes[:, :, 2].contiguous(
        ).unsqueeze(2) * content_weights
        backward_mode = T.sum(
            read_modes[:, :, 0:1].contiguous().unsqueeze(3) * backward_weight, 2)
        forward_mode = T.sum(
            read_modes[:, :, 1:2].contiguous().unsqueeze(3) * forward_weight, 2)

        return backward_mode + content_mode + forward_mode

    def read_vectors(self, memory, read_weights):
        return T.bmm(read_weights, memory)

    def read(self, read_keys, read_strengths, read_modes, hidden):
        content_weights = self.content_weightings(
            hidden['memory'], read_keys, read_strengths)

        hidden['read_weights'] = self.read_weightings(
            hidden['memory'],
            content_weights,
            hidden['link_matrix'],
            read_modes,
            hidden['read_weights']
        )
        read_vectors = self.read_vectors(
            hidden['memory'], hidden['read_weights'])
        return read_vectors, hidden
    def forward(self, ξ, hidden):

        # ξ = ξ.detach()
        m = self.mem_size
        w = self.cell_size
        r = self.read_heads
        b = ξ.size()[0]

        if self.independent_linears:
            # r read keys (b * r * w)
            read_keys = self.read_keys_transform(ξ).view(b, r, w)
            # r read strengths (b * r)
            read_strengths = F.softplus(
                self.read_strengths_transform(ξ).view(b, r))
            # write key (b * 1 * w)
            write_key = self.write_key_transform(ξ).view(b, 1, w)
            # write strength (b * 1)
            write_strength = F.softplus(
                self.write_strength_transform(ξ).view(b, 1))
            # erase vector (b * 1 * w)
            erase_vector = T.sigmoid(
                self.erase_vector_transform(ξ).view(b, 1, w))
            # write vector (b * 1 * w)
            write_vector = self.write_vector_transform(ξ).view(b, 1, w)
            # r free gates (b * r)
            free_gates = T.sigmoid(self.free_gates_transform(ξ).view(b, r))
            # allocation gate (b * 1)
            allocation_gate = T.sigmoid(
                self.allocation_gate_transform(ξ).view(b, 1))
            # write gate (b * 1)
            write_gate = T.sigmoid(self.write_gate_transform(ξ).view(b, 1))
            # read modes (b * r * 3)
            read_modes = σ(self.read_modes_transform(ξ).view(b, r, 3), -1)
        else:
            ξ = self.interface_weights(ξ)
            # r read keys (b * w * r)
            read_keys = ξ[:, :r * w].contiguous().view(b, r, w)
            # r read strengths (b * r)
            read_strengths = F.softplus(
                ξ[:, r * w:r * w + r].contiguous().view(b, r))
            # write key (b * w * 1)
            write_key = ξ[:, r * w + r:r * w + r + w].contiguous().view(b, 1, w)
            # write strength (b * 1)
            write_strength = F.softplus(
                ξ[:, r * w + r + w].contiguous().view(b, 1))
            # erase vector (b * w)
            erase_vector = T.sigmoid(
                ξ[:, r * w + r + w + 1: r * w + r + 2 * w + 1].contiguous().view(b, 1, w))
            # write vector (b * w)
            write_vector = ξ[:, r * w + r + 2 * w + 1: r * w + r + 3 * w + 1].contiguous().view(b, 1, w)
            # r free gates (b * r)
            free_gates = T.sigmoid(
                ξ[:, r * w + r + 3 * w + 1: r * w + 2 * r + 3 * w + 1].contiguous().view(b, r))
            # allocation gate (b * 1)
            allocation_gate = T.sigmoid(
                ξ[:, r * w + 2 * r + 3 * w + 1].contiguous().unsqueeze(1).view(b, 1))
            # write gate (b * 1)
            write_gate = T.sigmoid(
                ξ[:, r * w + 2 * r + 3 * w + 2].contiguous()).unsqueeze(1).view(b, 1)
            # read modes (b * 3*r)
            read_modes = σ(ξ[:, r * w + 2 * r + 3 * w + 3: r *
                             w + 5 * r + 3 * w + 3].contiguous().view(b, r, 3), -1)

        hidden = self.write(write_key, write_vector, erase_vector, free_gates,
                            read_strengths, write_strength, write_gate, allocation_gate, hidden)
        hidden["free_gates"] = free_gates.clone().detach()
        hidden["allocation_gate"] = allocation_gate.clone().detach()
        hidden["write_gate"] = write_gate.clone().detach()
        hidden["read_modes"] = read_modes.clone().detach()

        return self.read(read_keys, read_strengths, read_modes, hidden)
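The exclusive-cumprod trick in `Memory.allocate` can be illustrated without torch. Below is a plain-Python restatement for a single batch element (`allocation_weights` is a hypothetical helper name, and it omits the `δ`-smoothing the real method applies to `usage` first): the least-used slot receives a weight near 1, and each further slot is discounted by the product of the usages of the slots ahead of it.

```python
def allocation_weights(usage):
    # Sort slots by usage (ascending): least-used first, as in Memory.allocate.
    order = sorted(range(len(usage)), key=lambda i: usage[i])
    sorted_usage = [usage[i] for i in order]
    # Exclusive cumulative product of the sorted usages (the cat-ones trick):
    # weight_j = (1 - u_j) * prod_{k < j} u_k
    weights_sorted, running = [], 1.0
    for u in sorted_usage:
        weights_sorted.append((1.0 - u) * running)
        running *= u
    # Undo the sort so each weight lines up with its original slot index.
    weights = [0.0] * len(usage)
    for pos, slot in enumerate(order):
        weights[slot] = weights_sorted[pos]
    return weights
```

With `usage = [0.9, 0.1, 0.5]` the slot with usage 0.1 gets weight 0.9, the 0.5 slot gets 0.05, and the nearly full 0.9 slot gets 0.005, mirroring the `sorted_allocation_weights` / `gather` round trip above.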
501  core/model/net.py  Normal file
@@ -0,0 +1,501 @@
# --------------------------------------------------------
# mcan-vqa (Deep Modular Co-Attention Networks)
# Licensed under The MIT License [see LICENSE for details]
# Written by Yuhao Cui https://github.com/cuiyuhao1996
# --------------------------------------------------------

from core.model.net_utils import FC, MLP, LayerNorm
from core.model.mca import SA, MCA_ED, VLC
from core.model.dnc import DNC

import torch.nn as nn
import torch.nn.functional as F
import torch

# ------------------------------
# ---- Flatten the sequence ----
# ------------------------------

class AttFlat(nn.Module):
    def __init__(self, __C):
        super(AttFlat, self).__init__()
        self.__C = __C

        self.mlp = MLP(
            in_size=__C.HIDDEN_SIZE,
            mid_size=__C.FLAT_MLP_SIZE,
            out_size=__C.FLAT_GLIMPSES,
            dropout_r=__C.DROPOUT_R,
            use_relu=True
        )

        self.linear_merge = nn.Linear(
            __C.HIDDEN_SIZE * __C.FLAT_GLIMPSES,
            __C.FLAT_OUT_SIZE
        )

    def forward(self, x, x_mask):
        att = self.mlp(x)
        att = att.masked_fill(
            x_mask.squeeze(1).squeeze(1).unsqueeze(2),
            -1e9
        )
        att = F.softmax(att, dim=1)

        att_list = []
        for i in range(self.__C.FLAT_GLIMPSES):
            att_list.append(
                torch.sum(att[:, :, i: i + 1] * x, dim=1)
            )

        x_atted = torch.cat(att_list, dim=1)
        x_atted = self.linear_merge(x_atted)

        return x_atted


class AttFlatMem(AttFlat):
    def __init__(self, __C):
        super(AttFlatMem, self).__init__(__C)
        self.__C = __C

    def forward(self, x_mem, x, x_mask):
        att = self.mlp(x_mem)
        att = att.masked_fill(
            x_mask.squeeze(1).squeeze(1).unsqueeze(2),
            float('-inf')
        )
        att = F.softmax(att, dim=1)
        att_list = []
        for i in range(self.__C.FLAT_GLIMPSES):
            att_list.append(
                torch.sum(att[:, :, i: i + 1] * x, dim=1)
            )
        x_atted = torch.cat(att_list, dim=1)
        x_atted = self.linear_merge(x_atted)

        return x_atted
# -------------------------
# ---- Main MCAN Model ----
# -------------------------

class Net1(nn.Module):
    def __init__(self, __C, pretrained_emb, token_size, answer_size):
        super(Net1, self).__init__()
        print('Training with Network type 1: VLCN')
        self.pretrained_path = __C.PRETRAINED_PATH
        self.embedding = nn.Embedding(
            num_embeddings=token_size,
            embedding_dim=__C.WORD_EMBED_SIZE
        )

        # Loading the GloVe embedding weights
        if __C.USE_GLOVE:
            self.embedding.weight.data.copy_(torch.from_numpy(pretrained_emb))

        self.lstm = nn.LSTM(
            input_size=__C.WORD_EMBED_SIZE,
            hidden_size=__C.HIDDEN_SIZE,
            num_layers=1,
            batch_first=True
        )

        self.frame_feat_linear = nn.Linear(
            __C.FRAME_FEAT_SIZE,
            __C.HIDDEN_SIZE
        )

        self.clip_feat_linear = nn.Linear(
            __C.CLIP_FEAT_SIZE,
            __C.HIDDEN_SIZE
        )
        self.backbone = VLC(__C)

        self.attflat_lang = AttFlat(__C)
        self.attflat_frame = AttFlat(__C)
        self.attflat_clip = AttFlat(__C)

        self.dnc = DNC(
            __C.FLAT_OUT_SIZE,
            __C.FLAT_OUT_SIZE,
            rnn_type='lstm',
            num_layers=2,
            num_hidden_layers=2,
            bias=True,
            batch_first=True,
            dropout=0,
            bidirectional=True,
            nr_cells=__C.CELL_COUNT_DNC,
            read_heads=__C.N_READ_HEADS_DNC,
            cell_size=__C.WORD_LENGTH_DNC,
            nonlinearity='tanh',
            gpu_id=0,
            independent_linears=False,
            share_memory=False,
            debug=False,
            clip=20,
        )

        self.proj_norm = LayerNorm(__C.FLAT_OUT_SIZE)

        self.proj_norm_dnc = LayerNorm(__C.FLAT_OUT_SIZE + __C.N_READ_HEADS_DNC * __C.WORD_LENGTH_DNC)
        self.linear_dnc = FC(__C.FLAT_OUT_SIZE + __C.N_READ_HEADS_DNC * __C.WORD_LENGTH_DNC, __C.FLAT_OUT_SIZE, dropout_r=0.2)
        self.proj = nn.Linear(__C.FLAT_OUT_SIZE, answer_size)

    def forward(self, frame_feat, clip_feat, ques_ix):

        # Make mask
        lang_feat_mask = self.make_mask(ques_ix.unsqueeze(2))
        frame_feat_mask = self.make_mask(frame_feat)
        clip_feat_mask = self.make_mask(clip_feat)

        # Pre-process Language Feature
        lang_feat = self.embedding(ques_ix)
        lang_feat, _ = self.lstm(lang_feat)

        # Pre-process Video Feature
        frame_feat = self.frame_feat_linear(frame_feat)
        clip_feat = self.clip_feat_linear(clip_feat)

        # Backbone Framework
        lang_feat, frame_feat, clip_feat = self.backbone(
            lang_feat,
            frame_feat,
            clip_feat,
            lang_feat_mask,
            frame_feat_mask,
            clip_feat_mask
        )

        lang_feat = self.attflat_lang(
            lang_feat,
            lang_feat_mask
        )

        frame_feat = self.attflat_frame(
            frame_feat,
            frame_feat_mask
        )

        clip_feat = self.attflat_clip(
            clip_feat,
            clip_feat_mask
        )
        proj_feat_0 = lang_feat + frame_feat + clip_feat
        proj_feat_0 = self.proj_norm(proj_feat_0)

        proj_feat_1 = torch.stack([lang_feat, frame_feat, clip_feat], dim=1)
        proj_feat_1, (_, _, rv), _ = self.dnc(proj_feat_1, (None, None, None), reset_experience=True, pass_through_memory=True)
        proj_feat_1 = proj_feat_1.sum(1)
        proj_feat_1 = torch.cat([proj_feat_1, rv], dim=-1)
        proj_feat_1 = self.proj_norm_dnc(proj_feat_1)
        proj_feat_1 = self.linear_dnc(proj_feat_1)
        # proj_feat_1 = self.proj_norm(proj_feat_1)

        proj_feat = torch.sigmoid(self.proj(proj_feat_0 + proj_feat_1))

        return proj_feat

    def load_pretrained_weights(self):
        pretrained_msvd = torch.load(self.pretrained_path)['state_dict']
        for n_pretrained, p_pretrained in pretrained_msvd.items():
            if 'dnc' in n_pretrained:
                self.state_dict()[n_pretrained].copy_(p_pretrained)
        print('Pre-trained dnc-weights successfully loaded!')

    # Masking
    def make_mask(self, feature):
        return (torch.sum(
            torch.abs(feature),
            dim=-1
        ) == 0).unsqueeze(1).unsqueeze(2)
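The `make_mask` helper used by every network variant flags padded positions, i.e. time steps whose feature vector is all zeros. A minimal list-based restatement of the same test (without the two trailing `unsqueeze` calls that only shape the mask for attention broadcasting):

```python
def make_mask(features):
    # A position is masked (True) iff its feature vector is all zeros,
    # matching the (sum(|feature|, -1) == 0) test in make_mask above.
    return [sum(abs(v) for v in row) == 0 for row in features]
```

For example, `make_mask([[0.0, 0.0], [1.0, -2.0], [0.0, 0.0]])` yields `[True, False, True]`: the first and last positions are padding.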
class Net2(nn.Module):
    def __init__(self, __C, pretrained_emb, token_size, answer_size):
        super(Net2, self).__init__()
        print('Training with Network type 2: VLCN-FLF')
        self.embedding = nn.Embedding(
            num_embeddings=token_size,
            embedding_dim=__C.WORD_EMBED_SIZE
        )
        # Loading the GloVe embedding weights
        if __C.USE_GLOVE:
            self.embedding.weight.data.copy_(torch.from_numpy(pretrained_emb))

        self.lstm = nn.LSTM(
            input_size=__C.WORD_EMBED_SIZE,
            hidden_size=__C.HIDDEN_SIZE,
            num_layers=1,
            batch_first=True
        )

        self.frame_feat_linear = nn.Linear(
            __C.FRAME_FEAT_SIZE,
            __C.HIDDEN_SIZE
        )

        self.clip_feat_linear = nn.Linear(
            __C.CLIP_FEAT_SIZE,
            __C.HIDDEN_SIZE
        )
        self.backbone = VLC(__C)

        self.attflat_lang = AttFlat(__C)
        self.attflat_frame = AttFlat(__C)
        self.attflat_clip = AttFlat(__C)

        self.proj_norm = LayerNorm(__C.FLAT_OUT_SIZE)
        self.proj = nn.Linear(__C.FLAT_OUT_SIZE, answer_size)

    def forward(self, frame_feat, clip_feat, ques_ix):

        # Make mask
        lang_feat_mask = self.make_mask(ques_ix.unsqueeze(2))
        frame_feat_mask = self.make_mask(frame_feat)
        clip_feat_mask = self.make_mask(clip_feat)

        # Pre-process Language Feature
        lang_feat = self.embedding(ques_ix)
        lang_feat, _ = self.lstm(lang_feat)

        # Pre-process Video Feature
        frame_feat = self.frame_feat_linear(frame_feat)
        clip_feat = self.clip_feat_linear(clip_feat)

        # Backbone Framework
        lang_feat, frame_feat, clip_feat = self.backbone(
            lang_feat,
            frame_feat,
            clip_feat,
            lang_feat_mask,
            frame_feat_mask,
            clip_feat_mask
        )

        lang_feat = self.attflat_lang(
            lang_feat,
            lang_feat_mask
        )

        frame_feat = self.attflat_frame(
            frame_feat,
            frame_feat_mask
        )

        clip_feat = self.attflat_clip(
            clip_feat,
            clip_feat_mask
        )
        proj_feat = lang_feat + frame_feat + clip_feat
        proj_feat = self.proj_norm(proj_feat)
        proj_feat = torch.sigmoid(self.proj(proj_feat))

        return proj_feat

    # Masking
    def make_mask(self, feature):
        return (torch.sum(
            torch.abs(feature),
            dim=-1
        ) == 0).unsqueeze(1).unsqueeze(2)
class Net3(nn.Module):
    def __init__(self, __C, pretrained_emb, token_size, answer_size):
        super(Net3, self).__init__()
        print('Training with Network type 3: VLCN+LSTM')

        self.embedding = nn.Embedding(
            num_embeddings=token_size,
            embedding_dim=__C.WORD_EMBED_SIZE
        )

        # Loading the GloVe embedding weights
        if __C.USE_GLOVE:
            self.embedding.weight.data.copy_(torch.from_numpy(pretrained_emb))

        self.lstm = nn.LSTM(
            input_size=__C.WORD_EMBED_SIZE,
            hidden_size=__C.HIDDEN_SIZE,
            num_layers=1,
            batch_first=True
        )

        self.frame_feat_linear = nn.Linear(
            __C.FRAME_FEAT_SIZE,
            __C.HIDDEN_SIZE
        )

        self.clip_feat_linear = nn.Linear(
            __C.CLIP_FEAT_SIZE,
            __C.HIDDEN_SIZE
        )
        self.backbone = VLC(__C)

        self.attflat_lang = AttFlat(__C)
        self.attflat_frame = AttFlat(__C)
        self.attflat_clip = AttFlat(__C)

        self.lstm_fusion = nn.LSTM(
            input_size=__C.FLAT_OUT_SIZE,
            hidden_size=__C.FLAT_OUT_SIZE,
            num_layers=2,
            batch_first=True,
            bidirectional=True
        )

        self.proj_norm = LayerNorm(__C.FLAT_OUT_SIZE)
        self.proj_feat_1 = nn.Linear(__C.FLAT_OUT_SIZE * 2, __C.FLAT_OUT_SIZE)

        self.proj_norm_lstm = LayerNorm(__C.FLAT_OUT_SIZE)
        self.proj = nn.Linear(__C.FLAT_OUT_SIZE, answer_size)

    def forward(self, frame_feat, clip_feat, ques_ix):

        # Make mask
        lang_feat_mask = self.make_mask(ques_ix.unsqueeze(2))
        frame_feat_mask = self.make_mask(frame_feat)
        clip_feat_mask = self.make_mask(clip_feat)

        # Pre-process Language Feature
        lang_feat = self.embedding(ques_ix)
        lang_feat, _ = self.lstm(lang_feat)

        # Pre-process Video Feature
        frame_feat = self.frame_feat_linear(frame_feat)
        clip_feat = self.clip_feat_linear(clip_feat)

        # Backbone Framework
        lang_feat, frame_feat, clip_feat = self.backbone(
            lang_feat,
            frame_feat,
            clip_feat,
            lang_feat_mask,
            frame_feat_mask,
            clip_feat_mask
        )

        lang_feat = self.attflat_lang(
            lang_feat,
            lang_feat_mask
        )

        frame_feat = self.attflat_frame(
            frame_feat,
            frame_feat_mask
        )

        clip_feat = self.attflat_clip(
            clip_feat,
            clip_feat_mask
        )
        proj_feat_0 = lang_feat + frame_feat + clip_feat
        proj_feat_0 = self.proj_norm(proj_feat_0)

        proj_feat_1 = torch.stack([lang_feat, frame_feat, clip_feat], dim=1)
        proj_feat_1, _ = self.lstm_fusion(proj_feat_1)
        proj_feat_1 = proj_feat_1.sum(1)
        proj_feat_1 = self.proj_feat_1(proj_feat_1)
        proj_feat_1 = self.proj_norm_lstm(proj_feat_1)

        proj_feat = torch.sigmoid(self.proj(proj_feat_0 + proj_feat_1))

        return proj_feat

    # Masking
    def make_mask(self, feature):
        return (torch.sum(
            torch.abs(feature),
            dim=-1
        ) == 0).unsqueeze(1).unsqueeze(2)
class Net4(nn.Module):
    def __init__(self, __C, pretrained_emb, token_size, answer_size):
        super(Net4, self).__init__()
        print('Training with Network type 4: MCAN')
        self.embedding = nn.Embedding(
            num_embeddings=token_size,
            embedding_dim=__C.WORD_EMBED_SIZE
        )

        # Loading the GloVe embedding weights
        if __C.USE_GLOVE:
            self.embedding.weight.data.copy_(torch.from_numpy(pretrained_emb))

        self.lstm = nn.LSTM(
            input_size=__C.WORD_EMBED_SIZE,
            hidden_size=__C.HIDDEN_SIZE,
            num_layers=1,
            batch_first=True
        )

        self.frame_feat_linear = nn.Linear(
            __C.FRAME_FEAT_SIZE,
            __C.HIDDEN_SIZE
        )

        self.clip_feat_linear = nn.Linear(
            __C.CLIP_FEAT_SIZE,
            __C.HIDDEN_SIZE
        )
        self.backbone = MCA_ED(__C)

        self.attflat_lang = AttFlat(__C)
        self.attflat_vid = AttFlat(__C)

        self.proj_norm = LayerNorm(__C.FLAT_OUT_SIZE)
        self.proj = nn.Linear(__C.FLAT_OUT_SIZE, answer_size)

    def forward(self, frame_feat, clip_feat, ques_ix):

        # Make mask
        lang_feat_mask = self.make_mask(ques_ix.unsqueeze(2))
        frame_feat_mask = self.make_mask(frame_feat)
        clip_feat_mask = self.make_mask(clip_feat)

        # Pre-process Language Feature
        lang_feat = self.embedding(ques_ix)
        lang_feat, _ = self.lstm(lang_feat)

        # Pre-process Video Feature
        frame_feat = self.frame_feat_linear(frame_feat)
        clip_feat = self.clip_feat_linear(clip_feat)

        # concat frame and clip features
        vid_feat = torch.cat([frame_feat, clip_feat], dim=1)
        vid_feat_mask = torch.cat([frame_feat_mask, clip_feat_mask], dim=-1)
        # Backbone Framework
        lang_feat, vid_feat = self.backbone(
            lang_feat,
            vid_feat,
            lang_feat_mask,
            vid_feat_mask,
        )

        lang_feat = self.attflat_lang(
            lang_feat,
            lang_feat_mask
        )

        vid_feat = self.attflat_vid(
            vid_feat,
            vid_feat_mask
        )

        proj_feat = lang_feat + vid_feat
        proj_feat = self.proj_norm(proj_feat)
        proj_feat = torch.sigmoid(self.proj(proj_feat))

        return proj_feat

    # Masking
    def make_mask(self, feature):
        return (torch.sum(
            torch.abs(feature),
            dim=-1
        ) == 0).unsqueeze(1).unsqueeze(2)
62  core/model/net_utils.py  Normal file
@@ -0,0 +1,62 @@
# --------------------------------------------------------
# mcan-vqa (Deep Modular Co-Attention Networks)
# Licensed under The MIT License [see LICENSE for details]
# Written by Yuhao Cui https://github.com/cuiyuhao1996
# --------------------------------------------------------

import torch.nn as nn
import os
import torch


class FC(nn.Module):
    def __init__(self, in_size, out_size, dropout_r=0., use_relu=True):
        super(FC, self).__init__()
        self.dropout_r = dropout_r
        self.use_relu = use_relu

        self.linear = nn.Linear(in_size, out_size)

        if use_relu:
            self.relu = nn.ReLU(inplace=True)

        if dropout_r > 0:
            self.dropout = nn.Dropout(dropout_r)

    def forward(self, x):
        x = self.linear(x)

        if self.use_relu:
            x = self.relu(x)

        if self.dropout_r > 0:
            x = self.dropout(x)

        return x


class MLP(nn.Module):
    def __init__(self, in_size, mid_size, out_size, dropout_r=0., use_relu=True):
        super(MLP, self).__init__()

        self.fc = FC(in_size, mid_size, dropout_r=dropout_r, use_relu=use_relu)
        self.linear = nn.Linear(mid_size, out_size)

    def forward(self, x):
        return self.linear(self.fc(x))


class LayerNorm(nn.Module):
    def __init__(self, size, eps=1e-6):
        super(LayerNorm, self).__init__()
        self.eps = eps

        self.a_2 = nn.Parameter(torch.ones(size))
        self.b_2 = nn.Parameter(torch.zeros(size))

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)

        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2
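For reference, `LayerNorm.forward` normalizes over the last dimension with learned scale `a_2` and shift `b_2`. A plain-Python sketch of the same arithmetic for one feature vector (`layer_norm` is a hypothetical name; it uses the sample standard deviation, matching `x.std`'s default unbiased estimator):

```python
from statistics import mean, stdev

def layer_norm(x, a=1.0, b=0.0, eps=1e-6):
    # Subtract the mean, divide by std + eps, then scale and shift,
    # mirroring a_2 * (x - mean) / (std + eps) + b_2 above.
    m, s = mean(x), stdev(x)
    return [a * (v - m) / (s + eps) + b for v in x]
```

Note that adding `eps` to the std (rather than to the variance inside the square root, as `torch.nn.LayerNorm` does) is a design choice this codebase inherits from the MCAN implementation.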
98  core/model/optim.py  Normal file
@@ -0,0 +1,98 @@
# --------------------------------------------------------
# mcan-vqa (Deep Modular Co-Attention Networks)
# Licensed under The MIT License [see LICENSE for details]
# Written by Yuhao Cui https://github.com/cuiyuhao1996
# --------------------------------------------------------

import torch
import torch.optim as Optim


class WarmupOptimizer(object):
    def __init__(self, lr_base, optimizer, data_size, batch_size):
        self.optimizer = optimizer
        self._step = 0
        self.lr_base = lr_base
        self._rate = 0
        self.data_size = data_size
        self.batch_size = batch_size

    def step(self):
        self._step += 1

        rate = self.rate()
        for p in self.optimizer.param_groups:
            p['lr'] = rate
        self._rate = rate

        self.optimizer.step()

    def zero_grad(self):
        self.optimizer.zero_grad()

    def rate(self, step=None):
        if step is None:
            step = self._step

        if step <= int(self.data_size / self.batch_size * 1):
            r = self.lr_base * 1/4.
        elif step <= int(self.data_size / self.batch_size * 2):
            r = self.lr_base * 2/4.
        elif step <= int(self.data_size / self.batch_size * 3):
            r = self.lr_base * 3/4.
        else:
            r = self.lr_base

        return r


def get_optim(__C, model, data_size, optimizer, lr_base=None):
    if lr_base is None:
        lr_base = __C.LR_BASE

    # modules = model._modules
    # params_list = []
    # for m in modules:
    #     if 'dnc' in m:
    #         params_list.append({
    #             'params': filter(lambda p: p.requires_grad, modules[m].parameters()),
    #             'lr': __C.LR_DNC_BASE,
    #             'flag': True
    #         })
    #     else:
    #         params_list.append({
    #             'params': filter(lambda p: p.requires_grad, modules[m].parameters()),
    #         })
    if optimizer == 'adam':
        optim = Optim.Adam(
            filter(lambda p: p.requires_grad, model.parameters()),
            lr=0,
            betas=__C.OPT_BETAS,
            eps=__C.OPT_EPS,
        )
    elif optimizer == 'rmsprop':
        optim = Optim.RMSprop(
            filter(lambda p: p.requires_grad, model.parameters()),
            lr=0,
            eps=__C.OPT_EPS,
            weight_decay=__C.OPT_WEIGHT_DECAY
        )
    else:
        raise ValueError('{} optimizer is not supported'.format(optimizer))
    return WarmupOptimizer(
        lr_base,
        optim,
        data_size,
        __C.BATCH_SIZE
    )


def adjust_lr(optim, decay_r):
    optim.lr_base *= decay_r


def adjust_lr_dnc(optim, decay_r):
    optim.lr_dnc_base *= decay_r
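`WarmupOptimizer.rate` ramps the learning rate in quarters of `lr_base` over the first three epochs (each `data_size / batch_size` steps) and then holds it constant. A standalone restatement of the schedule with the same arithmetic (`warmup_rate` is a hypothetical helper name):

```python
def warmup_rate(step, lr_base, data_size, batch_size):
    # One "epoch" is data_size / batch_size optimizer steps; the rate climbs
    # lr_base/4 -> lr_base/2 -> 3*lr_base/4 across the first three epochs.
    if step <= int(data_size / batch_size * 1):
        return lr_base * 1 / 4.
    elif step <= int(data_size / batch_size * 2):
        return lr_base * 2 / 4.
    elif step <= int(data_size / batch_size * 3):
        return lr_base * 3 / 4.
    return lr_base
```

For example, with `lr_base=0.004`, `data_size=100`, `batch_size=10`, steps 1-10 run at 0.001, 11-20 at 0.002, 21-30 at 0.003, and everything after at 0.004.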
163  core/model/utils.py  Normal file
@@ -0,0 +1,163 @@
"""
|
||||
PyTorch DNC implementation from
|
||||
-->
|
||||
https://github.com/ixaxaar/pytorch-dnc
|
||||
<--
|
||||
"""
|
||||
|
||||
import torch.nn as nn
|
||||
import torch as T
|
||||
import torch.nn.functional as F
|
||||
import numpy as np
|
||||
import torch
|
||||
from torch.autograd import Variable
|
||||
import re
|
||||
import string
|
||||
|
||||
|
||||
def recursiveTrace(obj):
|
||||
print(type(obj))
|
||||
if hasattr(obj, 'grad_fn'):
|
||||
print(obj.grad_fn)
|
||||
recursiveTrace(obj.grad_fn)
|
||||
elif hasattr(obj, 'saved_variables'):
|
||||
print(obj.requires_grad, len(obj.saved_tensors), len(obj.saved_variables))
|
||||
[print(v) for v in obj.saved_variables]
|
||||
[recursiveTrace(v.grad_fn) for v in obj.saved_variables]
|
||||
|
||||
|
||||
def cuda(x, grad=False, gpu_id=-1):
|
||||
x = x.float() if T.is_tensor(x) else x
|
||||
if gpu_id == -1:
|
||||
t = T.FloatTensor(x)
|
||||
t.requires_grad=grad
|
||||
return t
|
||||
else:
|
||||
t = T.FloatTensor(x.pin_memory()).cuda(gpu_id)
|
||||
t.requires_grad=grad
|
||||
return t
|
||||
|
||||
|
||||
def cudavec(x, grad=False, gpu_id=-1):
|
||||
if gpu_id == -1:
|
||||
t = T.Tensor(T.from_numpy(x))
|
||||
t.requires_grad = grad
|
||||
return t
|
||||
else:
|
||||
t = T.Tensor(T.from_numpy(x).pin_memory()).cuda(gpu_id)
|
||||
t.requires_grad = grad
|
||||
return t
|
||||
|
||||
|
||||
def cudalong(x, grad=False, gpu_id=-1):
|
||||
if gpu_id == -1:
|
||||
t = T.LongTensor(T.from_numpy(x.astype(np.long)))
|
||||
t.requires_grad = grad
|
||||
return t
|
||||
else:
|
||||
t = T.LongTensor(T.from_numpy(x.astype(np.long)).pin_memory()).cuda(gpu_id)
|
||||
t.requires_grad = grad
|
||||
return t
|
||||
|
||||
|
||||
def θ(a, b, normBy=2):
|
||||
"""Batchwise Cosine similarity
|
||||
Cosine similarity
|
||||
Arguments:
|
||||
a {Tensor} -- A 3D Tensor (b * m * w)
|
||||
b {Tensor} -- A 3D Tensor (b * r * w)
|
||||
Returns:
|
||||
Tensor -- Batchwise cosine similarity (b * r * m)
|
||||
"""
|
||||
dot = T.bmm(a, b.transpose(1,2))
|
||||
a_norm = T.norm(a, normBy, dim=2).unsqueeze(2)
|
||||
b_norm = T.norm(b, normBy, dim=2).unsqueeze(1)
|
||||
cos = dot / (a_norm * b_norm + δ)
|
||||
return cos.transpose(1,2).contiguous()
|
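The shape bookkeeping in θ is easy to get wrong, so here is an equivalent NumPy sketch that can be checked without a CUDA build: NumPy stands in for torch (an assumption, the repo itself uses `torch.bmm`), with `a` of shape `(b, m, w)`, `b` of shape `(b, r, w)`, and a `(b, r, m)` result, matching θ's docstring.

```python
import numpy as np

def batch_cosine(a, b, eps=1e-6):
    """NumPy mirror of θ: batchwise cosine similarity (b*m*w, b*r*w) -> b*r*m."""
    dot = a @ b.transpose(0, 2, 1)                          # (b, m, r)
    a_norm = np.linalg.norm(a, axis=2, keepdims=True)       # (b, m, 1)
    b_norm = np.linalg.norm(b, axis=2, keepdims=True)       # (b, r, 1)
    cos = dot / (a_norm * b_norm.transpose(0, 2, 1) + eps)  # (b, m, r)
    return cos.transpose(0, 2, 1)                           # (b, r, m)

a = np.array([[[1.0, 0.0], [0.0, 1.0]]])   # (1, 2, 2): two memory rows
b = np.array([[[1.0, 0.0]]])               # (1, 1, 2): one read key
sim = batch_cosine(a, b)
# sim[0, 0] ≈ [1, 0]: the first row is parallel to the key, the second
# is orthogonal, so content-based addressing would favour row 0.
```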
def σ(input, axis=1):
    """Softmax along an axis.

    Arguments:
        input {Tensor} -- input Tensor

    Keyword Arguments:
        axis {number} -- axis to take the softmax over (default: {1})

    Returns:
        Tensor -- Softmax output Tensor
    """
    input_size = input.size()

    trans_input = input.transpose(axis, len(input_size) - 1)
    trans_size = trans_input.size()

    input_2d = trans_input.contiguous().view(-1, trans_size[-1])
    soft_max_2d = F.softmax(input_2d, -1)
    soft_max_nd = soft_max_2d.view(*trans_size)
    return soft_max_nd.transpose(axis, len(input_size) - 1)
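σ's transpose/reshape dance normalises along an arbitrary axis by moving it last, softmaxing, and moving it back. A NumPy version of the same idea (an illustrative stand-in; the repo's σ works on torch tensors):

```python
import numpy as np

def softmax_axis(x, axis=1):
    """Softmax along `axis` by moving it last, normalising, moving it back."""
    x = np.moveaxis(x, axis, -1)
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # subtract max for stability
    s = e / e.sum(axis=-1, keepdims=True)
    return np.moveaxis(s, -1, axis)

x = np.zeros((2, 3, 4))
p = softmax_axis(x, axis=1)
# Uniform input -> uniform distribution along axis 1: every entry is 1/3,
# and the probabilities along axis 1 sum to 1.
```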
δ = 1e-6


def register_nan_checks(model):
    def check_grad(module, grad_input, grad_output):
        # Uncomment to confirm the hook fires:
        # print('hook called for ' + str(type(module)))
        if any(np.all(np.isnan(gi.data.cpu().numpy())) for gi in grad_input if gi is not None):
            print('NaN gradient in grad_input ' + type(module).__name__)

    model.apply(lambda module: module.register_backward_hook(check_grad))


def apply_dict(dic):
    for k, v in dic.items():
        apply_var(v, k)
        if isinstance(v, nn.Module):
            key_list = [a for a in dir(v) if not a.startswith('__')]
            for key in key_list:
                apply_var(getattr(v, key), key)
            for pk, pv in v._parameters.items():
                apply_var(pv, pk)


def apply_var(v, k):
    if isinstance(v, Variable) and v.requires_grad:
        v.register_hook(check_nan_gradient(k))


def check_nan_gradient(name=''):
    def f(tensor):
        if np.isnan(T.mean(tensor).data.cpu().numpy()):
            print('\nnan gradient of {} :'.format(name))
            # print(tensor)
            # assert 0, 'nan gradient'
            return tensor
    return f


def ptr(tensor):
    if T.is_tensor(tensor):
        return tensor.storage().data_ptr()
    elif hasattr(tensor, 'data'):
        return tensor.clone().data.storage().data_ptr()
    else:
        return tensor


# TODO: clean this up
def ensure_gpu(tensor, gpu_id):
    if "cuda" in str(type(tensor)) and gpu_id != -1:
        return tensor.cuda(gpu_id)
    elif "cuda" in str(type(tensor)):
        return tensor.cpu()
    elif "Tensor" in str(type(tensor)) and gpu_id != -1:
        return tensor.cuda(gpu_id)
    elif "Tensor" in str(type(tensor)):
        return tensor
    elif type(tensor) is np.ndarray:
        return cudavec(tensor, gpu_id=gpu_id).data
    else:
        return tensor


def print_gradient(x, name):
    s = "Gradient of " + name + " ----------------------------------"
    x.register_hook(lambda y: print(s, y.squeeze()))
48
requirements.txt
Normal file
@ -0,0 +1,48 @@
absl-py==0.12.0
blis==0.7.4
cachetools==4.2.1
catalogue==1.0.0
certifi==2020.12.5
chardet==4.0.0
click==7.1.2
cycler==0.10.0
cymem==2.0.5
google-auth==1.28.0
google-auth-oauthlib==0.4.3
grpcio==1.36.1
idna==2.10
importlib-metadata==3.7.3
joblib==1.0.1
Markdown==3.3.4
mkl-fft==1.3.0
mkl-random==1.1.1
mkl-service==2.3.0
murmurhash==1.0.5
nltk==3.6.2
oauthlib==3.1.0
olefile==0.46
plac==1.1.3
positional-encodings==3.0.0
preshed==3.0.5
protobuf==3.15.6
pyasn1==0.4.8
pyasn1-modules==0.2.8
PyYAML==5.4.1
regex==2021.4.4
requests==2.25.1
requests-oauthlib==1.3.0
rsa==4.7.2
scikit-video==1.1.11
scipy==1.5.4
spacy==2.3.5
srsly==1.0.5
tensorboard==2.4.1
tensorboard-plugin-wit==1.8.0
tensorboardX==2.1
thinc==7.4.5
tqdm==4.59.0
typing-extensions==3.7.4.3
urllib3==1.26.4
wasabi==0.8.2
Werkzeug==1.0.1
zipp==3.4.1
198
run.py
Normal file
@ -0,0 +1,198 @@
# --------------------------------------------------------
# mcan-vqa (Deep Modular Co-Attention Networks)
# Licensed under The MIT License [see LICENSE for details]
# Written by Yuhao Cui https://github.com/cuiyuhao1996
# --------------------------------------------------------

from cfgs.base_cfgs import Cfgs
from core.exec import Execution
import argparse, yaml, os


def str2bool(v):
    if isinstance(v, bool):
        return v
    if v.lower() in ('yes', 'true', 't', 'y', '1'):
        return True
    elif v.lower() in ('no', 'false', 'f', 'n', '0'):
        return False
    else:
        raise argparse.ArgumentTypeError('Boolean value expected.')
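str2bool exists because argparse's plain `type=bool` is a footgun: `bool(v)` is True for any non-empty string, including `'False'`. A self-contained check (the parser below is illustrative, not the repo's full argument set):

```python
import argparse

def str2bool(v):
    # Same logic as the repo's str2bool: map textual flags to real booleans.
    if isinstance(v, bool):
        return v
    if v.lower() in ('yes', 'true', 't', 'y', '1'):
        return True
    elif v.lower() in ('no', 'false', 'f', 'n', '0'):
        return False
    else:
        raise argparse.ArgumentTypeError('Boolean value expected.')

parser = argparse.ArgumentParser()
parser.add_argument('--RESUME', type=str2bool, default=False)
args = parser.parse_args(['--RESUME', 'false'])
# str2bool('false') -> False, whereas bool('false') would be True.
```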
def parse_args():
    '''
    Parse input arguments
    '''
    parser = argparse.ArgumentParser(description='VLCN Args')

    parser.add_argument('--RUN', dest='RUN_MODE',
                        default='train',
                        choices=['train', 'val', 'test'],
                        help='{train, val, test}',
                        type=str)  # , required=True)

    parser.add_argument('--MODEL', dest='MODEL',
                        choices=['small', 'large'],
                        help='{small, large}',
                        default='small', type=str)

    parser.add_argument('--OPTIM', dest='OPTIM',
                        choices=['adam', 'rmsprop'],
                        help='The optimizer',
                        default='rmsprop', type=str)

    parser.add_argument('--SPLIT', dest='TRAIN_SPLIT',
                        choices=['train', 'train+val'],
                        help="set training split, "
                             "e.g. 'train', 'train+val'; "
                             "setting 'train' triggers the "
                             "eval after every epoch",
                        default='train',
                        type=str)

    # Boolean flags route through str2bool: argparse's plain `type=bool`
    # would turn any non-empty string (including 'False') into True.
    parser.add_argument('--EVAL_EE', dest='EVAL_EVERY_EPOCH',
                        default=True,
                        help='set True to evaluate the '
                             'val split when an epoch finishes '
                             "(only works when training with "
                             "the 'train' split)",
                        type=str2bool)

    parser.add_argument('--SAVE_PRED', dest='TEST_SAVE_PRED',
                        help='set True to save the '
                             'prediction vectors '
                             '(only works in testing)',
                        default=False,
                        type=str2bool)

    parser.add_argument('--BS', dest='BATCH_SIZE',
                        help='batch size during training',
                        default=64,
                        type=int)

    parser.add_argument('--MAX_EPOCH', dest='MAX_EPOCH',
                        default=30,
                        help='max training epoch',
                        type=int)

    parser.add_argument('--PRELOAD', dest='PRELOAD',
                        help='pre-load the features into memory '
                             'to increase the I/O speed',
                        default=False,
                        type=str2bool)

    parser.add_argument('--GPU', dest='GPU',
                        help="gpu select, e.g. '0, 1, 2'",
                        default='0',
                        type=str)

    parser.add_argument('--SEED', dest='SEED',
                        help='fix random seed',
                        default=42,
                        type=int)

    parser.add_argument('--VERSION', dest='VERSION',
                        help='version control',
                        default='1.0.0',
                        type=str)

    parser.add_argument('--RESUME', dest='RESUME',
                        default=False,
                        help='resume training',
                        type=str2bool)

    parser.add_argument('--CKPT_V', dest='CKPT_VERSION',
                        help='checkpoint version',
                        type=str)

    parser.add_argument('--CKPT_E', dest='CKPT_EPOCH',
                        help='checkpoint epoch',
                        type=int)

    parser.add_argument('--CKPT_PATH', dest='CKPT_PATH',
                        help='load checkpoint path; we '
                             'recommend using '
                             'CKPT_VERSION and CKPT_EPOCH '
                             'instead',
                        type=str)

    parser.add_argument('--ACCU', dest='GRAD_ACCU_STEPS',
                        help='reduce gpu memory usage',
                        type=int)

    parser.add_argument('--NW', dest='NUM_WORKERS',
                        help='multithreaded loading',
                        default=0,
                        type=int)

    parser.add_argument('--PINM', dest='PIN_MEM',
                        help='use pin memory',
                        type=str2bool)

    parser.add_argument('--VERB', dest='VERBOSE',
                        help='verbose print',
                        type=str2bool)

    parser.add_argument('--DATA_PATH', dest='DATASET_PATH',
                        default='/projects/abdessaied/data/MSRVTT-QA/',
                        help='Dataset root path',
                        type=str)

    parser.add_argument('--EXP_NAME', dest='EXP_NAME',
                        help='The name of the experiment',
                        default="test",
                        type=str)

    parser.add_argument('--DEBUG', dest='DEBUG',
                        help='Triggers debug mode: small fractions of the data are loaded',
                        default='0',
                        type=str2bool)

    parser.add_argument('--ENABLE_TIME_MONITORING', dest='ENABLE_TIME_MONITORING',
                        help='Triggers time monitoring when training',
                        default='0',
                        type=str2bool)

    parser.add_argument('--MODEL_TYPE', dest='MODEL_TYPE',
                        help='The model type to be used\n 1: VLCN \n 2: VLCN-FLF \n 3: VLCN+LSTM \n 4: MCAN',
                        default=1,
                        type=int)

    parser.add_argument('--PRETRAINED_PATH', dest='PRETRAINED_PATH',
                        help='Pretrained weights on msvd',
                        default='-',
                        type=str)

    parser.add_argument('--TEST_EPOCH', dest='TEST_EPOCH',
                        help='',
                        default=7,
                        type=int)

    args = parser.parse_args()
    return args


if __name__ == '__main__':
    args = parse_args()
    os.chdir(os.path.dirname(os.path.abspath(__file__)))
    __C = Cfgs(args.EXP_NAME, args.DATASET_PATH)
    args_dict = __C.parse_to_dict(args)

    cfg_file = "cfgs/{}_model.yml".format(args.MODEL)
    with open(cfg_file, 'r') as f:
        # An explicit Loader avoids the yaml.load deprecation warning.
        yaml_dict = yaml.load(f, Loader=yaml.FullLoader)

    args_dict = {**yaml_dict, **args_dict}

    __C.add_args(args_dict)
    __C.proc()

    print('Hyper Parameters:')
    print(__C)

    __C.check_path()
    os.environ['CUDA_VISIBLE_DEVICES'] = __C.GPU

    execution = Execution(__C)
    execution.run(__C.RUN_MODE)

    # execution.run('test', epoch=__C.TEST_EPOCH)