Make code public
commit 8e03ef1c38
49 changed files with 545354 additions and 0 deletions
3  .gitattributes  vendored  Normal file
@@ -0,0 +1,3 @@
*.pkl filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.csv filter=lfs diff=lfs merge=lfs -text
21  LICENSE  Normal file
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2023 Anonymous

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
174  README.md  Normal file
@@ -0,0 +1,174 @@
<div align="center">
<h1> MST-MIXER <img src="misc/mixer.png" width="3%" align="bottom">: Multi-Modal Video Dialog State Tracking in the Wild </h1>

**[Adnen Abdessaied][16], [Lei Shi][17], [Andreas Bulling][18]** <br> <br>
**ECCV 2024, Milan, Italy <img src="misc/italy.png" width="3%" align="center">** <br>
**[[Paper][19]]**

---------------------------
<img src="misc/teaser.png" width="70%" align="middle"><br><br>

</div>

# Citation
If you find our code useful or use it in your own projects, please cite our paper:

```bibtex
@InProceedings{Abdessaied_2024_eccv,
  author    = {Abdessaied, Adnen and Shi, Lei and Bulling, Andreas},
  title     = {{Multi-Modal Video Dialog State Tracking in the Wild}},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024}
}
```

# Table of Contents
* [Setup and Dependencies](#Setup-and-Dependencies)
* [Download Data](#Download-Data)
* [Training](#Training)
* [Response Generation](#Response-Generation)
* [Results](#Results)
* [Acknowledgements](#Acknowledgements)

# Setup and Dependencies
We implemented our model using Python 3.7 and PyTorch 1.12.0 (CUDA 11.3, CuDNN 8.3.2). We recommend setting up a virtual environment using Anaconda. <br>
1. Install [git lfs][1] on your system
2. Clone our repository to download a checkpoint of our best model and our code
```shell
git lfs install
git clone this_repo.git
```
3. Create a conda environment and install dependencies
```shell
conda create -n mst_mixer python=3.7
conda activate mst_mixer
conda install pytorch==1.12.0 torchvision==0.13.0 torchaudio==0.12.0 cudatoolkit=11.3 -c pytorch
conda install pyg -c pyg
conda install pytorch-scatter -c pyg  # pytorch >= 1.8.0
conda install pytorch-sparse -c pyg  # pytorch >= 1.8.0
conda install -c huggingface transformers
pip install evaluate wandb glog pyhocon attrs
```
# Download Data
## AVSD
1. Download the [AVSD-DSTC7][2], [AVSD-DSTC8][3] and [AVSD-DSTC10][10] data
2. Place the raw json files in ```raw_data/``` and the features in ```features/```
3. Preprocess and save the input features for faster training as indicated in ```custom_datasets/``` (a quick check of the expected layout is sketched below)
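
A minimal layout check (a sketch, not part of the released code): the json file names come from ```config/mst_mixer.conf```, and the feature sub-folders follow how ```custom_datasets/avsd.py``` resolves features; adjust the paths if yours differ.

```python
# Sketch: verify the expected AVSD data layout relative to the repository root.
# custom_datasets/avsd.py resolves features as features/<FeaType>/<ImageID>.npy
# (with a *_testset suffix for the test split).
import os

raw_files = [
    'raw_data/train_set4DSTC7-AVSD.json',
    'raw_data/valid_set4DSTC7-AVSD.json',
    'raw_data/test_set4DSTC7-AVSD.json',
    'raw_data/test_set4DSTC8-AVSD.json',
    'raw_data/test_set4DSTC10-AVSD.json',
]
feature_dirs = [
    'features/i3d_rgb', 'features/i3d_flow', 'features/vggish', 'features/sam',
    'features/i3d_rgb_testset', 'features/i3d_flow_testset',
    'features/vggish_testset', 'features/sam_testset',
]

for path in raw_files + feature_dirs:
    print('{:<35} {}'.format(path, 'ok' if os.path.exists(path) else 'MISSING'))
```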
## NExT-QA
1. For convenience, we include the features/data in this git repo.

# Training
We trained our model on 8 Nvidia Tesla V100-32GB GPUs. The default hyperparameters in ```config/mst_mixer.conf``` need to be adjusted if your setup differs from ours.
## AVSD
1. Set ```task=avsd``` in ```config/mst_mixer.conf```
2. ```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main.py \
   --mode train \
   --tag mst_mixer_avsd \
   --wandb_mode online \
   --wandb_project mst_mixer_avsd
```
To deactivate [wandb][4] logging, use ```--wandb_mode disabled```.
On a setup similar to ours, training takes roughly 20h to complete.

## NExT-QA
1. Set ```task=nextqa``` in ```config/mst_mixer.conf```
2. ```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main.py \
   --mode train \
   --tag mst_mixer_nextqa \
   --wandb_mode online \
   --wandb_project mst_mixer_nextqa
```

# Response Generation
## AVSD-DSTC7
1. Set ```dstc=7``` in the ```.conf``` file of your trained network. In the default setting, this file is located at ```logs/unique_training_tag/code/config/mst_mixer.conf```
2. Generate the responses
```shell
./generate_parallel_avsd.sh mst_mixer/mixer results_avsd_dstc7 generate logs/mst_mixer_avsd 7
```
3. All responses will be saved in ```output/dstc7/```
## AVSD-DSTC8
1. Set ```dstc=8``` in the ```.conf``` file of your trained network. In the default setting, this file is located at ```logs/unique_training_tag/code/config/mst_mixer.conf```
2. Generate the responses
```shell
./generate_parallel_avsd.sh mst_mixer/mixer results_avsd_dstc8 generate logs/mst_mixer_avsd 8
```
3. All responses will be saved in ```output/dstc8/```

## AVSD-DSTC10
1. Set ```dstc=10``` in the ```.conf``` file of your trained network. In the default setting, this file is located at ```logs/unique_training_tag/code/config/mst_mixer.conf```
2. Generate the responses
```shell
./generate_parallel_avsd.sh mst_mixer/mixer results_avsd_dstc10 generate logs/mst_mixer_avsd 10
```
3. All responses will be saved in ```output/dstc10/```

## NExT-QA
1. Generate the responses
```shell
./generate_parallel_nextqa.sh mst_mixer/mixer results_nextqa generate logs/mst_mixer_nextqa
```
2. All responses will be saved in ```output/nextqa/```
3. Evaluate using this [script][15]

# Results
To evaluate our best model on the different benchmarks:
## AVSD-DSTC7
Executing the [eval_tool][7] of AVSD-DSTC7 on the generated responses will output the following metrics

| Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr |
|:--------:|:------:|:------:|:------:|:------:|:------:|:-------:|:-----:|
| Prev. SOTA | 78.2 | 65.5 | 55.2 | 46.9 | 30.8 | 61.9 | 135.2 |
| MST_MIXER | **78.7** | **66.5** | **56.3** | **47.6** | **31.3** | **62.5** | **138.8** |

## AVSD-DSTC8
1. Set ```dstc=8``` in ```ckpt/code/mst_mixer.conf```
2. Run
```shell
./generate_parallel_avsd.sh mst_mixer/mixer results_avsd_dstc8_best_model generate ckpt/avsd 8
```
3. The responses will be saved in ```output/dstc8/```
4. Executing the [eval_tool][7] of AVSD-DSTC8 on the generated responses will output the following metrics

| Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr |
|:--------:|:------:|:------:|:------:|:------:|:------:|:-------:|:-----:|
| Prev. SOTA | 76.4 | 64.1 | 54.3 | 46.0 | 30.1 | 61.0 | 130.4 |
| MST_MIXER | **77.5** | **66.0** | **56.1** | **47.7** | **30.6** | **62.4** | **135.4** |

## AVSD-DSTC10
Executing the [eval_tool][11] of AVSD-DSTC10 on the generated responses will output the following metrics

| Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr |
|:--------:|:------:|:------:|:------:|:------:|:------:|:-------:|:-----:|
| Prev. SOTA | 69.3 | 55.6 | 45.0 | 37.2 | 24.9 | 53.6 | 91.2 |
| MST_MIXER | **70.0** | **57.4** | **47.6** | **40.0** | **25.7** | **54.5** | **99.8** |

## NExT-QA
Executing the [eval script][15] of NExT-QA on the generated responses will output the following metrics

| Model | WUPS_C | WUPS_T | WUPS_D | WUPS |
|:--------:|:------:|:------:|:------:|:------:|
| Prev. SOTA | 17.98 | 17.95 | 50.84 | 28.40 |
| MST_MIXER | **22.12** | **22.20** | **55.64** | **29.50** |

# Acknowledgements
We thank the authors of [RLM][8] for providing their [code][9], which greatly influenced this work.

[1]: https://git-lfs.com/
[2]: https://github.com/hudaAlamri/DSTC7-Audio-Visual-Scene-Aware-Dialog-AVSD-Challenge
[3]: https://github.com/dialogtekgeek/DSTC8-AVSD_official
[4]: https://wandb.ai/site
[5]: https://drive.google.com/drive/folders/1SlZTySJAk_2tiMG5F8ivxCfOl_OWwd_Q
[7]: https://drive.google.com/file/d/1EKfPtrNBQ5ciKRl6XggImweGRP84XuPi/view?usp=sharing
[8]: https://arxiv.org/abs/2002.00163
[9]: https://github.com/ictnlp/DSTC8-AVSD
[10]: https://drive.google.com/file/d/1zvC6FuPRVRiLQCXZcYpzYUI9r1tiWls6/view
[11]: https://github.com/ankitshah009/AVSD-DSTC10_baseline
[15]: https://github.com/doc-doc/NExT-OE/blob/main/eval_oe.py
[16]: https://adnenabdessaied.de/
[17]: https://perceptualui.org/people/shi/
[18]: https://perceptualui.org/people/bulling/
[19]: https://arxiv.org/abs/2407.02218
118  config/avsd_bart_base.json  Normal file
@@ -0,0 +1,118 @@
{
  "_name_or_path": "bart-base",
  "activation_dropout": 0.1,
  "activation_function": "gelu",
  "add_bias_logits": false,
  "add_final_layer_norm": false,
  "architectures": [
    "BartModel"
  ],
  "attention_dropout": 0.1,
  "bos_token_id": 0,
  "classif_dropout": 0.1,
  "classifier_dropout": 0.0,
  "d_model": 768,
  "decoder_attention_heads": 12,
  "decoder_ffn_dim": 3072,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 6,
  "decoder_start_token_id": 2,
  "dropout": 0.1,
  "early_stopping": true,
  "encoder_attention_heads": 12,
  "encoder_ffn_dim": 3072,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 6,
  "eos_token_id": 2,
  "forced_eos_token_id": 2,
  "forced_bos_token_id": 0,
  "gradient_checkpointing": false,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "max_position_embeddings": 1024,
  "model_type": "bart",
  "no_repeat_ngram_size": 3,
  "normalize_before": false,
  "normalize_embedding": true,
  "num_beams": 4,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "scale_embedding": false,
  "task_specific_params": {
    "summarization": {
      "length_penalty": 1.0,
      "max_length": 128,
      "min_length": 12,
      "num_beams": 4
    },
    "summarization_cnn": {
      "length_penalty": 2.0,
      "max_length": 142,
      "min_length": 56,
      "num_beams": 4
    },
    "summarization_xsum": {
      "length_penalty": 1.0,
      "max_length": 62,
      "min_length": 11,
      "num_beams": 6
    }
  },
  "torch_dtype": "float32",
  "transformers_version": "4.12.0.dev0",
  "use_cache": true,
  "vocab_size": 50265,

  "d_i3d_flow": 2048,
  "d_i3d_rgb": 2048,
  "d_sam": 512,
  "d_audio": 128,
  "top_k": 10,
  "num_nn": 4,
  "gnn_type": "appnp",
  "use_random_graphs": false,
  "integrate_all_gnn_features": true,
  "use_elbo_local": true,
  "use_elbo_global": true,
  "use_non_linear": true,
  "gnn_K": 2,
  "gnn_alpha": 0.1,
  "num_modalities": 6,
  "local_gnn_d_hidden": 768,
  "global_gnn_d_hidden": 768,
  "num_local_gnn_heads": 2,
  "num_global_gnn_heads": 4,
  "local_gnn_dropout": 0.1,
  "global_gnn_dropout": 0.1,
  "local_fc_dropout": 0.1,
  "global_fc_dropout": 0.1,
  "num_local_gnn_layers": 2,
  "num_global_gnn_layers": 2,
  "num_local_fc_layers": 2,
  "num_global_fc_layers": 2,
  "use_local_gnn_bn": true,
  "use_global_gnn_bn": true,
  "use_local_fc_bn": true,
  "use_global_fc_bn": true,
  "local_gnn_concat": true,
  "global_gnn_concat": true,
  "num_local_gr_learner_heads": 8,
  "num_global_gr_learner_heads": 8,
  "init_adj_ratio": 0.5,
  "adj_ratio": 0.5,
  "alpha": 0.9,
  "gnns_every": 2,
  "num_layers_state_fc_decoder": 2,
  "dropout_state_fc_decoder": 0.3
}
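
The standard BART fields in this file follow the usual Hugging Face layout, while the remaining keys (```d_i3d_flow```, ```d_sam```, ```num_modalities```, the ```gnn_*```/```local_*```/```global_*``` options, ...) configure the MST-MIXER graph components. A minimal sketch of reading the file, assuming it is loaded through ```transformers```' ```BartConfig``` (the repository's own model-loading code is not part of this excerpt):

```python
# Illustrative sketch only: standard BART fields map onto BartConfig, and unknown
# keys such as d_i3d_flow or num_modalities are kept as extra attributes on the
# resulting config object.
from transformers import BartConfig

config = BartConfig.from_json_file('config/avsd_bart_base.json')
print(config.d_model)         # 768, a regular BART field
print(config.d_i3d_flow)      # 2048, custom MST-MIXER key stored as an extra attribute
print(config.num_modalities)  # 6, matching the six input types handled in custom_datasets/avsd.py
                              # (I3D rgb, I3D flow, SAM objects, audio, history, question)
```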
116  config/avsd_bart_large.json  Normal file
@@ -0,0 +1,116 @@
{
  "activation_dropout": 0.1,
  "activation_function": "gelu",
  "add_bias_logits": false,
  "add_final_layer_norm": false,
  "architectures": [
    "BartModel"
  ],
  "attention_dropout": 0.1,
  "bos_token_id": 0,
  "classif_dropout": 0.1,
  "classifier_dropout": 0.0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 12,
  "decoder_start_token_id": 2,
  "dropout": 0.1,
  "early_stopping": true,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 12,
  "eos_token_id": 2,
  "forced_eos_token_id": 2,
  "forced_bos_token_id": 0,
  "gradient_checkpointing": false,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "max_position_embeddings": 1024,
  "model_type": "bart",
  "no_repeat_ngram_size": 3,
  "normalize_before": false,
  "num_beams": 4,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "scale_embedding": false,
  "task_specific_params": {
    "summarization": {
      "length_penalty": 1.0,
      "max_length": 128,
      "min_length": 12,
      "num_beams": 4
    },
    "summarization_cnn": {
      "length_penalty": 2.0,
      "max_length": 142,
      "min_length": 56,
      "num_beams": 4
    },
    "summarization_xsum": {
      "length_penalty": 1.0,
      "max_length": 62,
      "min_length": 11,
      "num_beams": 6
    }
  },
  "transformers_version": "4.7.0.dev0",
  "use_cache": true,
  "vocab_size": 50265,

  "d_i3d_flow": 2048,
  "d_i3d_rgb": 2048,
  "d_sam": 512,
  "d_audio": 128,
  "top_k": 10,
  "num_nn": 4,
  "gnn_type": "appnp",
  "use_random_graphs": false,
  "integrate_all_gnn_features": true,
  "use_elbo_local": true,
  "use_elbo_global": true,
  "use_non_linear": true,
  "gnn_K": 2,
  "gnn_alpha": 0.1,
  "num_modalities": 6,
  "local_gnn_d_hidden": 1024,
  "global_gnn_d_hidden": 1024,
  "num_local_gnn_heads": 2,
  "num_global_gnn_heads": 4,
  "local_gnn_dropout": 0.1,
  "global_gnn_dropout": 0.1,
  "local_fc_dropout": 0.1,
  "global_fc_dropout": 0.1,
  "num_local_gnn_layers": 1,
  "num_global_gnn_layers": 1,
  "num_local_fc_layers": 1,
  "num_global_fc_layers": 1,
  "use_local_gnn_bn": true,
  "use_global_gnn_bn": true,
  "use_local_fc_bn": true,
  "use_global_fc_bn": true,
  "local_gnn_concat": true,
  "global_gnn_concat": true,
  "num_local_gr_learner_heads": 8,
  "num_global_gr_learner_heads": 8,
  "init_adj_ratio": 0.5,
  "adj_ratio": 0.5,
  "alpha": 0.9,
  "gnns_every": 4,
  "num_layers_state_fc_decoder": 2,
  "dropout_state_fc_decoder": 0.3
}
94  config/mst_mixer.conf  Normal file
@@ -0,0 +1,94 @@
mixer {
    task = avsd
    #################################################################################
    # datasets
    # avsd

    avsd_processed = features/
    avsd_train = raw_data/train_set4DSTC7-AVSD.json
    avsd_val = raw_data/valid_set4DSTC7-AVSD.json
    avsd_test_dstc7 = raw_data/test_set4DSTC7-AVSD.json
    avsd_test_dstc8 = raw_data/test_set4DSTC8-AVSD.json
    avsd_test_dstc10 = raw_data/test_set4DSTC10-AVSD.json
    avsd_feature_path = features/
    avsd_i3d_rgb = features/i3d_rgb
    avsd_i3d_rgb_test = features/i3d_rgb_testset
    avsd_i3d_flow = features/i3d_flow_all
    avsd_i3d_flow_test = features/i3d_flow_testset
    avsd_audio = features/vggish_all
    avsd_audio_test = features/vggish_testset
    avsd_objects = features/sam
    avsd_objects_test = features/sam_testset

    dstc = 7

    # NextQA
    nextqa_root = processed/next_qa/annotations
    nextqa_vid_feat = processed/next_qa/vid_feat
    #################################################################################
    # Model
    bart_size = large  # base, large
    avsd_bart_base_config = config/avsd_bart_base.json
    avsd_bart_large_config = config/avsd_bart_large.json
    nextqa_bart_large_config = config/nextqa_bart_large.json

    #################################################################################
    # Logging & Checkpointing
    log_dir = logs
    output_dir_dstc7 = output/dstc7
    output_dir_dstc8 = output/dstc8
    output_dir_dstc10 = output/dstc10
    output_dir_nextqa = output/nextqa
    max_ckpt_to_keep = 5
    start_ckpt_for_generating = none
    loads_start_path = false
    next_logging_pct = 0.1
    save_ckpt = true
    skip_saving_ckpt = false
    stop_epochs = -1
    resets_min_val_loss = false
    restarts = false
    uses_new_optimizer = true
    sets_new_lr = false
    ################################################################################
    # Data processing
    expand_rnd = false
    cap_sum = cap_sum
    add_state_tokens = true
    bart_max_input_len = 1024
    num_workers = 0
    n_history = 3
    caption_drop_rate = 0.0
    vis_feat_length = 36
    #################################################################################
    # Training
    dp_type = ddp
    batch_size = 16
    num_epochs = 12
    warmup_ratio = 0.1
    batch_multiply = 1
    skip_eval = false
    stop_epoch = -1
    random_seed = 54
    learning_rate_bart = 1e-5
    learning_rate_other = 1e-4
    min_lr = 0
    clip_grad_value = 1.0
    print_output = false
    eval_first = false
    overfit_size = -1
    elbo_global_coeff = 100
    elbo_local_coeff = 100
    gen_coeff = 1
    #################################################################################
    # Generation
    gen_batch_size = 1
    beam_depth = 5
    max_generation_length = 20
    min_generation_length = 1
    length_penalty = 0.3
    #################################################################################
    # Misc.
    master_port = 5101
    use_cpu = false
}
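
The dataset script ```custom_datasets/avsd.py``` reads this file with ```pyhocon``` and selects the ```mixer``` block by name; a minimal sketch of inspecting (or overriding) values before training:

```python
# Minimal sketch mirroring custom_datasets/avsd.py: parse the HOCON file and
# select the `mixer` block, then read a few hyperparameters.
import pyhocon

config = pyhocon.ConfigFactory.parse_file('config/mst_mixer.conf')['mixer']
print(config['task'])                                   # 'avsd' or 'nextqa'
print(config['batch_size'])                             # 16 by default; adjust to your GPU memory
print(config['learning_rate_bart'], config['learning_rate_other'])
```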
116  config/nextqa_bart_large.json  Normal file
@@ -0,0 +1,116 @@
{
  "activation_dropout": 0.1,
  "activation_function": "gelu",
  "add_bias_logits": false,
  "add_final_layer_norm": false,
  "architectures": [
    "BartModel"
  ],
  "attention_dropout": 0.1,
  "bos_token_id": 0,
  "classif_dropout": 0.1,
  "classifier_dropout": 0.0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 12,
  "decoder_start_token_id": 2,
  "dropout": 0.1,
  "early_stopping": true,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 12,
  "eos_token_id": 2,
  "forced_eos_token_id": 2,
  "forced_bos_token_id": 0,
  "gradient_checkpointing": false,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "max_position_embeddings": 1024,
  "model_type": "bart",
  "no_repeat_ngram_size": 3,
  "normalize_before": false,
  "num_beams": 4,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "scale_embedding": false,
  "task_specific_params": {
    "summarization": {
      "length_penalty": 1.0,
      "max_length": 128,
      "min_length": 12,
      "num_beams": 4
    },
    "summarization_cnn": {
      "length_penalty": 2.0,
      "max_length": 142,
      "min_length": 56,
      "num_beams": 4
    },
    "summarization_xsum": {
      "length_penalty": 1.0,
      "max_length": 62,
      "min_length": 11,
      "num_beams": 6
    }
  },
  "transformers_version": "4.7.0.dev0",
  "use_cache": true,
  "vocab_size": 50265,

  "d_i3d_flow": 2048,
  "d_i3d_rgb": 2048,
  "d_sam": 512,
  "d_audio": 128,
  "top_k": 10,
  "num_nn": 4,
  "gnn_type": "appnp",
  "use_random_graphs": false,
  "integrate_all_gnn_features": true,
  "use_elbo_local": true,
  "use_elbo_global": true,
  "use_non_linear": true,
  "gnn_K": 2,
  "gnn_alpha": 0.1,
  "num_modalities": 3,
  "local_gnn_d_hidden": 1024,
  "global_gnn_d_hidden": 1024,
  "num_local_gnn_heads": 2,
  "num_global_gnn_heads": 4,
  "local_gnn_dropout": 0.1,
  "global_gnn_dropout": 0.1,
  "local_fc_dropout": 0.1,
  "global_fc_dropout": 0.1,
  "num_local_gnn_layers": 1,
  "num_global_gnn_layers": 1,
  "num_local_fc_layers": 1,
  "num_global_fc_layers": 1,
  "use_local_gnn_bn": true,
  "use_global_gnn_bn": true,
  "use_local_fc_bn": true,
  "use_global_fc_bn": true,
  "local_gnn_concat": true,
  "global_gnn_concat": true,
  "num_local_gr_learner_heads": 8,
  "num_global_gr_learner_heads": 8,
  "init_adj_ratio": 0.5,
  "adj_ratio": 0.5,
  "alpha": 0.9,
  "gnns_every": 4,
  "num_layers_state_fc_decoder": 2,
  "dropout_state_fc_decoder": 0.3
}
20  custom_datasets/README.md  Normal file
@@ -0,0 +1,20 @@
1. Download the raw [Charades train/val](https://prior.allenai.org/projects/charades) data
2. Download the raw [Charades test](https://ai2-public-datasets.s3-us-west-2.amazonaws.com/charades/Charades_vu17_test_480.tar) data
3. Install [SAM](https://github.com/facebookresearch/segment-anything.git)
4. Segment the frames
```shell
python segment.py --sam_ckpt path_to_sam_ckpt --avsd_root path_to_charades_trval_frames --crop_root path_to_save_the_trval_crops --mode segment --start start_idx --end end_idx
python segment.py --sam_ckpt path_to_sam_ckpt --avsd_root path_to_charades_test_frames --crop_root path_to_save_the_test_crops --mode segment --start start_idx --end end_idx
```
5. Embed the crops (a quick shape check of the saved embeddings is sketched after this list)
```shell
python segment.py --sam_ckpt path_to_sam_ckpt --crop_root path_to_save_the_trval_crops --mode embed --embed_root ../features/sam --start start_idx --end end_idx
python segment.py --sam_ckpt path_to_sam_ckpt --crop_root path_to_save_the_test_crops --mode embed --embed_root ../features/sam_testset --start start_idx --end end_idx
```
6. Preprocess and log the data
```shell
python dataset.py --split train
python dataset.py --split val
```
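
A quick shape check of the object embeddings produced in step 5 (a sketch; the video id below is hypothetical, use any id that exists in your output folder):

```python
# Sketch: inspect one saved object-embedding file from segment.py's embed mode.
import numpy as np

emb = np.load('../features/sam/SOME_VIDEO_ID.npy')  # hypothetical id
# embed_objects() pools the SAM image embedding of each crop and of its horizontal
# flip and concatenates them, so each row should be 512-dimensional, matching
# d_sam = 512 in the model configs.
print(emb.shape)  # (num_detected_objects <= 36, 512)
```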
0  custom_datasets/__init__.py  Normal file
401  custom_datasets/avsd.py  Normal file
@@ -0,0 +1,401 @@
import os
import pickle
import pyhocon
from copy import deepcopy
import json
from tqdm import tqdm
import numpy as np
import torch
from argparse import ArgumentParser
from torch.utils.data import Dataset, DataLoader
from transformers import BartTokenizer
from itertools import chain


ADDITIONAL_SPECIAL_TOKENS = [
    '<place_holder>', '<s0>', '<s1>', '<s2>', '<s3>', '<s4>', '<s5>']

SPECIAL_TOKENS_DICT = {
    'bos_token': '<s>',
    'eos_token': '</s>',
    'pad_token': '<pad>',
    'additional_special_tokens': ADDITIONAL_SPECIAL_TOKENS
}

S0_TOK = '<s0>'  # I3D_flow
S1_TOK = '<s1>'  # I3D_rgb
S2_TOK = '<s2>'  # sam obj
S3_TOK = '<s3>'  # audio
S4_TOK = '<s4>'  # history
S5_TOK = '<s5>'  # question


def tokenize(obj, tokenizer):
    if isinstance(obj, str):
        return tokenizer.convert_tokens_to_ids(tokenizer.tokenize(obj))
    if isinstance(obj, dict):
        return dict((n, tokenize(o, tokenizer)) for n, o in obj.items())
    return list(tokenize(o, tokenizer) for o in obj)


class AVSDDataset(Dataset):
    def __init__(self, config, split):

        super().__init__()
        self.config = config
        self.split = split
        self.bart_max_input_len = config['bart_max_input_len']
        self.bart_size = config['bart_size']
        self.cap_sum = config['cap_sum']
        self.tokenizer = BartTokenizer.from_pretrained('facebook/bart-{}'.format(self.bart_size))
        self.vocab_size = self.tokenizer.vocab_size

        self.tokenizer.add_special_tokens({'additional_special_tokens': ADDITIONAL_SPECIAL_TOKENS})
        self.vocab_size += len(ADDITIONAL_SPECIAL_TOKENS)
        self.tokenizer.save_pretrained(os.path.join(self.config['log_dir'], 'bart_tokenizer'))
        self.processed_dir = os.path.join(self.config['avsd_processed'], 'hist_with_{}_rounds'.format(self.config['n_history']), split)
        self.paths = list(map(lambda p: os.path.join(self.processed_dir, p), os.listdir(self.processed_dir)))

        if self.config['overfit'] > 0:
            self.paths = self.paths[:self.config['overfit_size']]

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        pth = self.paths[index]
        with open(pth, 'rb') as f:
            item = pickle.load(f)

        question_sep = self.tokenizer.convert_tokens_to_ids('<s5>')

        input_ids = item['input_ids']
        history_end = (input_ids == question_sep).nonzero(as_tuple=True)[0]

        history_interval = [0, history_end.item()]  # The last token is the question state token (not part of the history)
        question_interval = [history_end.item(), input_ids.size(0)]

        lm_labels = item['lm_labels']
        i3d_rgb = item['i3d_rgb']
        i3d_flow = item['i3d_flow']
        sam = item['sam']
        vgg = item['vgg']
        vid = item['vid']

        return input_ids, lm_labels, history_interval, question_interval, i3d_rgb, i3d_flow, sam, vgg, vid

    def padding(self, seq, pad_token, max_len=None):
        if max_len is None:
            max_len = max([i.size(0) for i in seq])
        if len(seq[0].size()) == 1:
            result = torch.ones((len(seq), max_len)).long() * pad_token
        else:
            result = torch.ones((len(seq), max_len, seq[0].size(-1))).float()
        for i in range(len(seq)):
            result[i, :seq[i].size(0)] = seq[i]
        return result

    def collate_fn(self, batch):
        input_ids_list, lm_labels_list, history_interval_list, question_interval_list, i3d_rgb_list, i3d_flow_list, sam_list, vggish_list, vid_ids_list = [], [], [], [], [], [], [], [], []
        for i in batch:
            input_ids_list.append(i[0])
            lm_labels_list.append(i[1])
            history_interval_list.append(i[2])
            question_interval_list.append(i[3])
            i3d_rgb_list.append(i[4])
            i3d_flow_list.append(i[5])
            sam_list.append(i[6])
            vggish_list.append(i[7])
            vid_ids_list.append(i[8])

        history_intervals = np.array(history_interval_list)
        question_intervals = np.array(question_interval_list)

        min_len_i3d_flow = min([feat.shape[0] for feat in i3d_flow_list])
        min_len_i3d_rgb = min([feat.shape[0] for feat in i3d_rgb_list])
        min_len_sam = min([feat.shape[0] for feat in sam_list])
        min_len_vggish = min([feat.shape[0] for feat in vggish_list])

        min_length = min([self.config['vis_feat_length'], min_len_i3d_flow, min_len_i3d_rgb, min_len_sam, min_len_vggish])

        # Sample equally-distant features from the visual features for each sample within the batch
        for i in range(len(i3d_rgb_list)):
            sample_idx_i3d_rgb = np.round(np.linspace(0, i3d_rgb_list[i].shape[0] - 1, min_length)).astype(int)
            i3d_rgb_list[i] = i3d_rgb_list[i][sample_idx_i3d_rgb, :]
        i3d_rgb = torch.from_numpy(np.array(i3d_rgb_list)).float()

        for i in range(len(i3d_flow_list)):
            sample_idx_i3d_flow = np.round(np.linspace(0, i3d_flow_list[i].shape[0] - 1, min_length)).astype(int)
            i3d_flow_list[i] = i3d_flow_list[i][sample_idx_i3d_flow, :]
        i3d_flow = torch.from_numpy(np.array(i3d_flow_list)).float()

        for i in range(len(sam_list)):
            sample_idx_sam = np.round(np.linspace(0, sam_list[i].shape[0] - 1, min_length)).astype(int)
            sam_list[i] = sam_list[i][sample_idx_sam, :]
        sam = torch.from_numpy(np.array(sam_list)).float()

        for i in range(len(vggish_list)):
            sample_idx_vggish = np.round(np.linspace(0, vggish_list[i].shape[0] - 1, min_length)).astype(int)
            vggish_list[i] = vggish_list[i][sample_idx_vggish, :]
        vggish = torch.from_numpy(np.array(vggish_list)).float()

        pad_token, i3d_flow_sep, i3d_rgb_sep, sam_sep, audio_sep, ph_token = self.tokenizer.convert_tokens_to_ids(
            ['<pad>', '<s0>', '<s1>', '<s2>', '<s3>', '<place_holder>'])

        # All the visual features will not be masked because we do not perform any padding on them
        video_mask = torch.ones((len(batch), min_length * 4 + 4)) == 1  # NOTE *4: 4 modalities | +4: the state tokens
        # Now we create a dummy input for the video tokens (sole purpose is to reserve the spot of the separators)
        dummy = torch.ones((len(batch), min_length)) * ph_token
        video_place_holder_ids = torch.cat(
            [torch.ones((len(batch), 1)) * i3d_rgb_sep, dummy,
             torch.ones((len(batch), 1)) * i3d_flow_sep, dummy,
             torch.ones((len(batch), 1)) * sam_sep, dummy,
             torch.ones((len(batch), 1)) * audio_sep, dummy,
             ], dim=-1).long()

        input_ids = self.padding(input_ids_list, pad_token)
        lm_labels = self.padding(lm_labels_list, -100)
        text_mask = input_ids != pad_token
        input_mask = torch.cat([video_mask, text_mask], dim=1)

        # Now we get the intervals of the visual input tokens
        # Here the intervals do not change across the batch dimension
        i3d_rgb_interval = [0, min_length + 1]  # the last token is not part of this modality
        i3d_flow_interval = [min_length + 1, 2 * min_length + 2]
        sam_interval = [2 * min_length + 2, 3 * min_length + 3]
        audio_interval = [3 * min_length + 3, 4 * min_length + 4]

        vis_state_vector_idx = [i3d_rgb_interval[0], i3d_flow_interval[0], sam_interval[0], audio_interval[0]]

        # adapt the question and history intervals -- shifted to the right by the visual input length
        history_intervals += 4 * min_length + 4
        question_intervals += 4 * min_length + 4
        history_intervals = history_intervals.tolist()
        question_intervals = question_intervals.tolist()

        history_state_vector_idx = [x[0] + 1 for x in history_intervals]  # +1 because the history starts with <s><s4> ...
        question_state_vector_idx = [x[0] for x in question_intervals]  # no shift: the question interval already starts at its state token <s5>

        batch = {
            'input_ids': input_ids,
            'video_place_holder_ids': video_place_holder_ids,
            'i3d_rgb': i3d_rgb,
            'i3d_flow': i3d_flow,
            'sam': sam,
            'vggish': vggish,
            'lm_labels': lm_labels,
            'input_mask': input_mask,
            'i3d_rgb_interval': i3d_rgb_interval,
            'i3d_flow_interval': i3d_flow_interval,
            'sam_interval': sam_interval,
            'audio_interval': audio_interval,
            'history_intervals': history_intervals,
            'question_intervals': question_intervals,
            'vis_state_vector_idx': vis_state_vector_idx,
            'history_state_vector_idx': history_state_vector_idx,
            'question_state_vector_idx': question_state_vector_idx
        }
        return batch


def get_dataset(config, split, tokenizer):
    if split != 'test':
        dialog_pth = config[f'avsd_{split}']
    else:
        dialog_pth = config['avsd_test_dstc{}'.format(config['dstc'])]
    n_history = config['n_history']
    dialog_data = json.load(open(dialog_pth, 'r'))
    dialog_list = []
    vid_set = set()
    undisclosed_only = split == 'test'
    pbar = tqdm(dialog_data['dialogs'])

    pbar.set_description('[INFO] Generating {} items | DSTC {}'.format(split, config['dstc']))
    for dialog in pbar:
        if config['dstc'] != 10:
            caption = [tokenize(dialog['caption'], tokenizer)] + [tokenize(dialog['summary'], tokenizer)]
        else:
            caption = [tokenize('no', tokenizer)]

        questions = [tokenize(d['question'], tokenizer) for d in dialog['dialog']]
        answers = [tokenize(d['answer'], tokenizer) for d in dialog['dialog']]
        vid = dialog["image_id"]
        vid_set.add(vid)
        if undisclosed_only:
            it = range(len(questions) - 1, len(questions))
        else:
            it = range(len(questions))
        qalist = []
        history = []
        if undisclosed_only:
            for n in range(len(questions) - 1):
                qalist.append(questions[n])
                qalist.append(answers[n])
            history = qalist[max(-len(qalist), -n_history * 2):]
        for n in it:
            if undisclosed_only:
                assert dialog['dialog'][n]['answer'] == '__UNDISCLOSED__'
            question = questions[n]
            answer = answers[n]
            history.append(question)
            if n_history == 0:
                item = {'vid': vid, 'history': [question], 'answer': answer, 'caption': caption}
            else:
                item = {'vid': vid, 'history': history, 'answer': answer, 'caption': caption}
            dialog_list.append(item)
            qalist.append(question)
            qalist.append(answer)
            history = qalist[max(-len(qalist), -n_history * 2):]

    all_features = {}
    fea_types = ['vggish', 'i3d_flow', 'i3d_rgb', 'sam']

    dataname = '<FeaType>/<ImageID>.npy'
    for ftype in fea_types:
        if undisclosed_only:
            basename = dataname.replace('<FeaType>', ftype + '_testset')
        else:
            basename = dataname.replace('<FeaType>', ftype)
        features = {}
        for vid in vid_set:
            filename = basename.replace('<ImageID>', vid)
            filepath = config['avsd_feature_path'] + filename
            features[vid] = filepath
        all_features[ftype] = features
    return dialog_list, all_features


def build_input_from_segments(caption, history_orig, reply, tokenizer, add_state_tokens=True, drop_caption=False):
    """ Build a sequence of input from 3 segments: caption (caption + summary), history and last reply """

    bos, eos, hist_state, ques_state = tokenizer.convert_tokens_to_ids(['<s>', '</s>', '<s4>', '<s5>'])
    sep = eos

    instance = {}
    instance["lm_labels"] = reply + [eos]
    caption = list(chain(*caption))

    # Add state tokens if applicable
    if add_state_tokens:
        caption.insert(0, hist_state)
        history = deepcopy(history_orig)
        history[-1].insert(0, ques_state)
    else:
        history = history_orig

    if not drop_caption:
        # sequence = [[bos] + list(chain(*caption))] + history + [reply + ([eos] if with_eos else [])]

        # NOTE It is important not to include the reply in the input of the encoder --> the decoder will just
        # learn to copy it --> low train/val loss but no learning is happening
        sequence = [[bos] + caption + [eos]] + [[sep] + s for s in history] + [[eos]]
    else:
        sequence = [[bos]] + [[hist_state]] + [[sep] + s for s in history] + [[eos]]

    instance["input_ids"] = list(chain(*sequence))
    return instance


def parse_args():
    parser = ArgumentParser(description='debug dataloader')
    parser.add_argument(
        '--split',
        type=str,
        default='train',
        help='train or val')

    parser.add_argument(
        '--model',
        type=str,
        default='mixer',
        help='model name to train or test')

    parser.add_argument(
        '--log_dataset',
        action='store_true',
        default=False,
        help='Whether or not to log the processed data')

    parser.add_argument(
        '--add_state_tokens',
        action='store_true',
        default=True,
        help='Whether or not to add state tokens')

    parser.add_argument(
        '--log_dir',
        type=str,
        default='processed/avsd',
        help='Output directory')

    args = parser.parse_args()

    return args


if __name__ == '__main__':
    args = parse_args()
    split = args.split

    config = pyhocon.ConfigFactory.parse_file(
        'config/mst_mixer.conf')[args.model]
    config['expand_rnd'] = False
    config['debugging'] = False
    config['overfit'] = False
    args.log_dir = os.path.join(args.log_dir, 'hist_with_{}_rounds'.format(config['n_history']))
    if args.log_dataset:
        log_dir = os.path.join(args.log_dir, split)
        if not os.path.isdir(log_dir):
            os.makedirs(log_dir)

        tokenizer = BartTokenizer.from_pretrained('facebook/bart-{}'.format(config['bart_size']))
        tokenizer.add_special_tokens({'additional_special_tokens': ADDITIONAL_SPECIAL_TOKENS})
        dialogs, features = get_dataset(config, split, tokenizer)
        pbar = tqdm(dialogs)
        pbar.set_description('[{}] Logging processed data'.format(split))
        counter = 0
        for dialog in pbar:
            vid = dialog['vid']
            his = dialog['history']
            cap = dialog['caption']
            ans = dialog['answer']

            if np.random.rand() < config['caption_drop_rate']:
                instance = build_input_from_segments(
                    cap, his, ans, tokenizer, add_state_tokens=args.add_state_tokens, drop_caption=True)
            else:
                instance = build_input_from_segments(
                    cap, his, ans, tokenizer, add_state_tokens=args.add_state_tokens, drop_caption=False)

            input_ids = torch.Tensor(instance["input_ids"]).long()
            lm_labels = torch.Tensor(instance["lm_labels"]).long()

            vgg = np.load(features["vggish"][vid])
            i3d_flow = np.load(features["i3d_flow"][vid])
            i3d_rgb = np.load(features["i3d_rgb"][vid])
            sam = np.load(features["sam"][vid])

            item = {
                'input_ids': input_ids,
                'lm_labels': lm_labels,
                'i3d_rgb': i3d_rgb,
                'i3d_flow': i3d_flow,
                'sam': sam,
                'vgg': vgg,
                'vid': vid
            }
            counter += 1
            pth = os.path.join(log_dir, str(counter) + '.pkl')
            with open(pth, 'wb') as f:
                pickle.dump(item, f, protocol=pickle.HIGHEST_PROTOCOL)
    else:
        avsd_dataset = AVSDDataset(config, 'val')
        avsd_dataloader = DataLoader(avsd_dataset, batch_size=4, shuffle=False, collate_fn=avsd_dataset.collate_fn)

        for i, data in enumerate(avsd_dataloader):
            print('{}/{}'.format(i, len(avsd_dataloader)))
        # print(avsd_dataset.max_len)  # NOTE: AVSDDataset does not define max_len

    print('[INFO] Done...')
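
A small sketch that makes the text layout built by ```build_input_from_segments``` concrete (it assumes ```custom_datasets/avsd.py``` is importable from the repository root and that the BART tokenizer can be downloaded; the dialog content is made up):

```python
# Sketch: inspect the encoder input produced by build_input_from_segments above.
from transformers import BartTokenizer

from custom_datasets.avsd import (ADDITIONAL_SPECIAL_TOKENS,
                                  build_input_from_segments, tokenize)

tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
tokenizer.add_special_tokens({'additional_special_tokens': ADDITIONAL_SPECIAL_TOKENS})

caption = [tokenize('a man opens a door', tokenizer), tokenize('he then walks away', tokenizer)]
history = [tokenize('what is the man doing ?', tokenizer),
           tokenize('he is opening a door', tokenizer),
           tokenize('does he say anything ?', tokenizer)]  # last entry = current question
answer = tokenize('no he does not', tokenizer)

instance = build_input_from_segments(caption, history, answer, tokenizer)
print(tokenizer.decode(instance['input_ids']))
# Roughly: <s><s4> caption summary </s></s> q1 </s> a1 </s><s5> current question </s>
print(tokenizer.decode(instance['lm_labels']))  # the answer followed by </s>
```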
211  custom_datasets/nextqa.py  Normal file
@@ -0,0 +1,211 @@
import os
import pandas as pd
import h5py
import json
import numpy as np
import torch
from torch.utils.data import Dataset
from transformers import BartTokenizer
from itertools import chain


ADDITIONAL_SPECIAL_TOKENS = [
    '<place_holder>', '<s0>', '<s1>', '<s2>', '<s3>', '<s4>', '<s5>']

SPECIAL_TOKENS_DICT = {
    'bos_token': '<s>',
    'eos_token': '</s>',
    'pad_token': '<pad>',
    'additional_special_tokens': ADDITIONAL_SPECIAL_TOKENS
}

S0_TOK = '<s0>'  # frame
S1_TOK = '<s1>'  # mot
S2_TOK = '<s2>'  # question


def load_file(file_name):
    annos = None
    if os.path.splitext(file_name)[-1] == '.csv':
        return pd.read_csv(file_name)
    with open(file_name, 'r') as fp:
        if os.path.splitext(file_name)[1] == '.txt':
            annos = fp.readlines()
            annos = [line.rstrip() for line in annos]
        if os.path.splitext(file_name)[1] == '.json':
            annos = json.load(fp)

    return annos


def tokenize(obj, tokenizer):
    if isinstance(obj, str):
        return tokenizer.convert_tokens_to_ids(tokenizer.tokenize(obj))
    if isinstance(obj, dict):
        return dict((n, tokenize(o, tokenizer)) for n, o in obj.items())
    return list(tokenize(o, tokenizer) for o in obj)


class NextQADataset(Dataset):
    def __init__(self, config, split):

        super().__init__()
        self.config = config
        self.split = split
        self.bart_max_input_len = config['bart_max_input_len']
        self.bart_size = config['bart_size']
        self.tokenizer = BartTokenizer.from_pretrained('facebook/bart-{}'.format(self.bart_size))
        self.vocab_size = self.tokenizer.vocab_size

        self.tokenizer.add_special_tokens({'additional_special_tokens': ADDITIONAL_SPECIAL_TOKENS})
        self.vocab_size += len(ADDITIONAL_SPECIAL_TOKENS)
        self.tokenizer.save_pretrained(os.path.join(self.config['log_dir'], 'bart_tokenizer'))

        sample_list_file = os.path.join(self.config['nextqa_root'], '{}.csv'.format(split))
        self.sample_list = load_file(sample_list_file)

        vid_feat_file = os.path.join(self.config['nextqa_vid_feat'], 'app_mot_{}.h5'.format(split))
        print('Load {}...'.format(vid_feat_file))
        self.frame_feats = {}
        self.mot_feats = {}
        with h5py.File(vid_feat_file, 'r') as fp:
            vids = fp['ids']
            feats = fp['feat']
            for vid, feat in zip(vids, feats):
                self.frame_feats[str(vid)] = feat[:, :2048]  # (16, 2048)
                self.mot_feats[str(vid)] = feat[:, 2048:]  # (16, 2048)

        if self.config['overfit_size'] > 0:
            self.sample_list = self.sample_list[:self.config['overfit_size']]

    def __len__(self):
        return len(self.sample_list)

    def get_video_feature(self, video_name):
        """
        :param video_name:
        :return:
        """

        app_feat = self.frame_feats[video_name]
        app_feat = torch.from_numpy(app_feat).type(torch.float32)

        mot_feat = self.mot_feats[video_name]
        mot_feat = torch.from_numpy(mot_feat).type(torch.float32)

        return app_feat, mot_feat

    def __getitem__(self, idx):
        cur_sample = self.sample_list.loc[idx]
        video_name, ques, ans, qid = str(cur_sample['video']), str(cur_sample['question']),\
            str(cur_sample['answer']), str(cur_sample['qid'])

        input_ids = tokenize(ques, self.tokenizer)
        lm_labels = tokenize(ans, self.tokenizer)

        app_feat, mot_feat = self.get_video_feature(video_name)

        bos, eos, ques_state = self.tokenizer.convert_tokens_to_ids(['<s>', '</s>', '<s2>'])

        # Add state tokens
        input_ids.insert(0, ques_state)
        lm_labels.append(eos)
        question_interval = [0, len(input_ids)]

        input_ids = torch.Tensor(input_ids).long()
        lm_labels = torch.Tensor(lm_labels).long()

        return input_ids, lm_labels, app_feat, mot_feat, question_interval, video_name

    def padding(self, seq, pad_token, max_len=None):
        if max_len is None:
            max_len = max([i.size(0) for i in seq])
        if len(seq[0].size()) == 1:
            result = torch.ones((len(seq), max_len)).long() * pad_token
        else:
            result = torch.ones((len(seq), max_len, seq[0].size(-1))).float()
        for i in range(len(seq)):
            result[i, :seq[i].size(0)] = seq[i]
        return result

    def collate_fn(self, batch):
        input_ids_list, lm_labels_list, app_feat_list, mot_feat_list, question_interval_list, vid_ids_list = [], [], [], [], [], []
        for i in batch:
            input_ids_list.append(i[0])
            lm_labels_list.append(i[1])
            app_feat_list.append(i[2])
            mot_feat_list.append(i[3])
            question_interval_list.append(i[4])
            vid_ids_list.append(i[5])

        app_feats = torch.stack(app_feat_list, dim=0).float()
        mot_feats = torch.stack(mot_feat_list, dim=0).float()

        question_intervals = np.array(question_interval_list)

        pad_token, app_sep, mot_sep, ph_token = self.tokenizer.convert_tokens_to_ids(
            ['<pad>', '<s0>', '<s1>', '<place_holder>'])

        # All the visual features will not be masked because we do not perform any padding on them
        video_mask = torch.ones((len(batch), 16 * 2 + 2)) == 1  # NOTE *2: 2 modalities | +2: the state tokens | each modality has length 16
        # Now we create a dummy input for the video tokens (sole purpose is to reserve the spot of the separators)
        dummy = torch.ones((len(batch), 16)) * ph_token
        video_place_holder_ids = torch.cat(
            [torch.ones((len(batch), 1)) * app_sep, dummy,
             torch.ones((len(batch), 1)) * mot_sep, dummy,
             ], dim=-1).long()

        input_ids = self.padding(input_ids_list, pad_token)
        lm_labels = self.padding(lm_labels_list, -100)
        text_mask = input_ids != pad_token
        input_mask = torch.cat([video_mask, text_mask], dim=1)

        # Now we get the intervals of the visual input tokens
        # Here the intervals do not change across the batch dimension
        app_interval = [0, 16 + 1]  # the last token is not part of this modality
        mot_interval = [16 + 1, 2 * 16 + 2]
        vis_state_vector_idx = [app_interval[0], mot_interval[0]]

        # adapt the question interval -- shifted to the right by the visual input length
        question_intervals += 2 * 16 + 2
        question_intervals = question_intervals.tolist()

        question_state_vector_idx = [x[0] for x in question_intervals]

        batch = {
            'input_ids': input_ids,
            'video_place_holder_ids': video_place_holder_ids,
            'app_feats': app_feats,
            'mot_feats': mot_feats,
            'lm_labels': lm_labels,
            'input_mask': input_mask,
            'app_interval': app_interval,
            'mot_interval': mot_interval,
            'question_intervals': question_intervals,
            'vis_state_vector_idx': vis_state_vector_idx,
            'question_state_vector_idx': question_state_vector_idx
        }
        return batch


def get_dataset(config, split):

    bart_max_input_len = config['bart_max_input_len']
    bart_size = config['bart_size']

    sample_list_file = os.path.join(config['nextqa_root'], '{}.csv'.format(split))
    sample_list = load_file(sample_list_file)

    vid_feat_file = os.path.join(config['nextqa_vid_feat'], 'app_mot_{}.h5'.format(split))
    print('Load {}...'.format(vid_feat_file))
    app_feats = {}
    mot_feats = {}
    with h5py.File(vid_feat_file, 'r') as fp:
        vids = fp['ids']
        feats = fp['feat']
        for vid, feat in zip(vids, feats):
            app_feats[str(vid)] = feat[:, :2048]  # (16, 2048)
            mot_feats[str(vid)] = feat[:, 2048:]  # (16, 2048)

    return sample_list, app_feats, mot_feats
custom_datasets/segment.py
Normal file
179
custom_datasets/segment.py
Normal file
|
@ -0,0 +1,179 @@
|
|||
from segment_anything import SamPredictor, SamAutomaticMaskGenerator, sam_model_registry
|
||||
from tqdm import tqdm
|
||||
from argparse import ArgumentParser
|
||||
import pickle
|
||||
import cv2
|
||||
import os
|
||||
import torch
|
||||
import numpy as np
|
||||
|
||||
|
||||
def parse_args():
|
||||
parser = ArgumentParser()
|
||||
|
||||
parser.add_argument(
|
||||
'--sam_ckpt',
|
||||
type=str,
|
||||
help='SAM checkpoint to be used'
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
'--avsd_root',
|
||||
type=str,
|
||||
help='Directory where the individual AVSD frames are located'
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
'--crop_root',
|
||||
type=str,
|
||||
help='Directory where the individual crops (objects) will be saved'
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
'--embed_root',
|
||||
type=str,
|
||||
help='Directory where the individual embeddings will be saved'
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
'--mode',
|
||||
type=str,
|
||||
choices=['segment', 'embed'],
|
||||
help='segment: segment the image into regions | embed: embed the image crops detected during segmentation'
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
'--start',
|
||||
type=int,
|
||||
default=0,
|
||||
help='Start index of the partition'
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
'--end',
|
||||
type=int,
|
||||
default=1968,
|
||||
help='End index of the partition'
|
||||
)
|
||||
|
||||
args = parser.parse_args()
|
||||
return args
|
||||
|
||||
|
||||
def partition_ids(avsd_ids, start, end):
|
||||
avsd_ids.sort()
|
||||
assert start < end
|
||||
assert start >= 0 and end <= len(avsd_ids)
|
||||
avsd_ids_partition = avsd_ids[start:end]
|
||||
return avsd_ids_partition
|
||||
|
||||
|
||||
def get_middle_frames(avsd_ids_partition, avsd_root):
|
||||
pbar = tqdm(avsd_ids_partition)
|
||||
pbar.set_description('[INFO] Preparing frames of {} videos'.format(len(avsd_ids_partition)))
|
||||
path_list = []
|
||||
for avsd_id in pbar:
|
||||
frames = os.listdir(os.path.join(avsd_root, avsd_id))
|
||||
if 'test' in avsd_root:
|
||||
frames.sort(key=lambda f: int(f.split('_')[-1].split('.')[0]))
|
||||
else:
|
||||
frames.sort(key=lambda f: int(f.split('-')[-1].split('.')[0]))
|
||||
middle_frame = frames[int(len(frames)/2)]
|
||||
middle_frame = os.path.join(avsd_root, avsd_id, middle_frame)
|
||||
path_list.append(middle_frame)
|
||||
return path_list
|
||||
|
||||
|
||||
def segment_images(sam, path_list, crop_root):
|
||||
mask_generator = SamAutomaticMaskGenerator(sam)
|
||||
pbar = tqdm(path_list)
|
||||
pbar.set_description('Detecting Objects')
|
||||
for pth in pbar:
|
||||
vid_id = pth.split('/')[-2]
|
||||
crop_dir = os.path.join(crop_root, vid_id)
|
||||
if not os.path.isdir(crop_dir):
|
||||
os.makedirs(crop_dir)
|
||||
|
||||
image = cv2.imread(pth)
|
||||
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
|
||||
masks = mask_generator.generate(image)
|
||||
masks.sort(key=lambda e: e['stability_score'], reverse=True)
|
||||
if len(masks) > 36:
|
||||
masks = masks[:36]
|
||||
for i, mask in enumerate(masks):
|
||||
crop = image[
|
||||
int(mask['bbox'][1]):int(mask['bbox'][1] + mask['bbox'][3] + 1),
|
||||
int(mask['bbox'][0]):int(mask['bbox'][0] + mask['bbox'][2] + 1),
|
||||
:
|
||||
]
|
||||
crop_flipped = cv2.flip(crop, 1) # Horizontal flip
|
||||
cv2.imwrite(os.path.join(crop_dir, f'obj_{i}.jpg'), crop)
|
||||
cv2.imwrite(os.path.join(crop_dir, f'obj_{i}_flipped.jpg'), crop_flipped)
|
||||
|
||||
print('[INFO] Done...')
|
||||
|
||||
|
||||
def embed_objects(sam, crop_ids, crop_root, embed_root):
|
||||
predictor = SamPredictor(sam)
|
||||
pbar = tqdm(crop_ids)
|
||||
pbar.set_description('Embedding Objects')
|
||||
for vid_id in pbar:
|
||||
embeds = []
|
||||
crop_dir = os.path.join(crop_root, vid_id)
|
||||
crop_paths = list(map(lambda p: os.path.join(crop_dir, p), os.listdir(crop_dir)))
|
||||
crop_paths = list(filter(lambda p: 'flipped' not in p, crop_paths))
|
||||
crop_paths.sort(key=lambda p: int(p.split('_')[-1].split('.')[0]))
|
||||
for cp in crop_paths:
|
||||
crop = cv2.imread(cp)
|
||||
crop = cv2.cvtColor(crop, cv2.COLOR_BGR2RGB)
|
||||
predictor.set_image(crop)
|
||||
embed_crop = predictor.get_image_embedding()
|
||||
embed_crop = embed_crop.mean(-1).mean(-1)
|
||||
|
||||
crop_flipped = cv2.flip(crop, 1)
|
||||
predictor.set_image(crop_flipped)
|
||||
embed_crop_flipped = predictor.get_image_embedding()
|
||||
embed_crop_flipped = embed_crop_flipped.mean(-1).mean(-1)
|
||||
|
||||
embed = torch.cat((embed_crop, embed_crop_flipped), dim=-1)
|
||||
# embed = embed.copy().cpu()
|
||||
embeds.append(embed)
|
||||
|
||||
embeds = torch.cat(embeds, 0).cpu().numpy()
|
||||
np.save(os.path.join(embed_root, f'{vid_id}.npy'), embeds)
|
||||
|
||||
print('[INFO] Done...')
|
||||
|
||||
|
||||
def segment(args, sam):
|
||||
avsd_ids = os.listdir(args.avsd_root)
|
||||
avsd_ids.sort()
|
||||
avsd_ids_partition = partition_ids(avsd_ids, args.start, args.end)
|
||||
path_list = get_middle_frames(avsd_ids_partition, args.avsd_root)
|
||||
|
||||
if not os.path.isdir(args.crop_root):
|
||||
os.makedirs(args.crop_root)
|
||||
segment_images(sam, path_list, args.crop_root)
|
||||
|
||||
|
||||
def embed(args, sam):
|
||||
crop_ids = os.listdir(args.crop_root)
|
||||
crop_ids.sort()
|
||||
crop_ids_partition = partition_ids(crop_ids, args.start, args.end)
|
||||
if not os.path.isdir(args.embed_root):
|
||||
os.makedirs(args.embed_root)
|
||||
embed_objects(sam, crop_ids_partition, args.crop_root, args.embed_root)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
args = parse_args()
|
||||
sam = sam_model_registry['vit_h |