# NSVD
This repository contains the official code of the paper:
## Neuro-Symbolic Visual Dialog [[PDF](https://perceptualui.org/publications/abdessaied22_coling.pdf)]
[Adnen Abdessaied](https://adnenabdessaied.de), [Mihai Bace](https://perceptualui.org/people/bace/), [Andreas Bulling](https://perceptualui.org/people/bulling/)
International Conference on Computational Linguistics (COLING), 2022 / Gyeongju, Republic of Korea :kr:
:loudspeaker: **Oral Presentation** :loudspeaker:
If you find our code useful or use it in your own projects, please cite our paper:
```bibtex
@inproceedings{abdessaied22_coling,
  author    = {Abdessaied, Adnen and Bâce, Mihai and Bulling, Andreas},
  title     = {{Neuro-Symbolic Visual Dialog}},
  booktitle = {Proceedings of the 29th International Conference on Computational Linguistics (COLING)},
  year      = {2022},
  month     = {oct},
  address   = {Gyeongju, Republic of Korea},
  publisher = {International Committee on Computational Linguistics},
  url       = {https://aclanthology.org/2022.coling-1.17},
  pages     = {192--217},
}
```
# Abstract
We propose Neuro-Symbolic Visual Dialog (NSVD) — the first method to combine deep learning and symbolic program execution for multi-round visually-grounded reasoning. NSVD significantly outperforms existing purely-connectionist methods on two key challenges inherent to visual dialog: long-distance co-reference resolution as well as vanishing question-answering performance. We demonstrate the latter by proposing a more realistic and stricter evaluation scheme in which we use predicted answers for the full dialog history when calculating accuracy. We describe two variants of our model and show that, using this new scheme, our best model achieves an accuracy of 99.72% on CLEVR-Dialog — a relative improvement of more than 10% over the state of the art — while only requiring a fraction of training data. Moreover, we demonstrate that our neuro-symbolic models have a higher mean first failure round, are more robust against incomplete dialog histories, and generalise better not only to dialogs that are up to three times longer than those seen during training but also to unseen question types and scenes.
# Method
<figure>
<p align="center"><img src="misc/method_overview.png" alt="missing"/></p>
<figcaption>Overview of our method NSVD.</figcaption>
</figure>
<figure>
<p align="center"><img src="misc/method_smaller.png" alt="missing"/></p>
<figcaption>Overview of concat and stack encoders.</figcaption>
</figure>
# Requirements
- PyTorch 1.3.1
- Python 3.6
- Ubuntu 18.04
# Raw Data
## Scene Data
We used CLEVR and Minecraft images in this project. The raw images have a large footprint, so we do not upload them here. However, we provide their JSON files as well as their derendered versions:
- Original clevr-dialog training and validation raw scenes: [download](https://dl.fbaipublicfiles.com/clevr/CLEVR_v1.0.zip)
- Raw scenes we used in our experiments: [download](https://1drv.ms/u/s!AlGoPLjLV-BOh1fdB30GscvRnFAt?e=Xtorzr)
- All derendered scenes: [download](https://1drv.ms/u/s!AlGoPLjLV-BOh0d00ynwnXQO14da?e=Ub6k33)
## Dialog Data
The dialog data we used can be found [here](https://1drv.ms/u/s!AlGoPLjLV-BOhzaYs3s2qSLbGTL_?e=oGGrxr).
You can also create your own data using the ``generate_dataset.py`` script.
# Preprocessing
## Scenes
The derendered scenes do not need any further preprocessing and can be directly used with our neuro-symbolic executor.
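A derendered scene is a symbolic list of objects and their attributes that the executor can query directly. The sketch below illustrates the idea with a toy scene; the attribute names (`color`, `shape`, `size`, `material`) follow the CLEVR scene-JSON convention and are an assumption about the derendered files, not their documented schema:

```python
import json

# Toy derendered scene (illustrative; attribute names assume the
# CLEVR scene-JSON convention).
scene = json.loads("""
{"objects": [
  {"color": "red",  "shape": "cube",   "size": "large", "material": "rubber"},
  {"color": "blue", "shape": "sphere", "size": "small", "material": "metal"}
]}
""")

def filter_objects(objects, **attrs):
    """Return the objects matching all given attribute values --
    the kind of symbolic filtering a program executor performs."""
    return [o for o in objects
            if all(o.get(k) == v for k, v in attrs.items())]

print(len(filter_objects(scene["objects"], shape="cube")))  # 1
```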
## Dialogs
To preprocess the dialogs, follow these steps:
```bash
cd preprocess_dialogs
```
For the stack encoder, execute
```bash
python preprocess.py --input_dialogs_json <path_to_raw_dialog_file> --input_vocab_json '' --output_vocab_json <path_where_to_save_the_vocab> --output_h5_file <path_of_the_output_file> --split <train/val/test> --mode stack
```
For the concat encoder, execute
```bash
python preprocess.py --input_dialogs_json <path_to_raw_dialog_file> --input_vocab_json '' --output_vocab_json <path_where_to_save_the_vocab> --output_h5_file <path_of_the_output_file> --split <train/val/test> --mode concat
```
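The two modes differ in how the dialog history is arranged for the encoder: `concat` flattens the whole history into one sequence, while `stack` keeps each round separate. A minimal sketch of the distinction (toy whitespace tokenisation; the actual `preprocess.py` uses its own vocabulary and H5 layout):

```python
def encode_concat(history, question):
    """Concat mode: flatten all previous rounds and the current
    question into a single token sequence."""
    tokens = []
    for q, a in history:
        tokens += q.split() + a.split()
    return tokens + question.split()

def encode_stack(history, question):
    """Stack mode: keep each round as its own token sequence, so the
    encoder sees the history as a stack of rounds."""
    return [q.split() + a.split() for q, a in history] + [question.split()]

history = [("what color is the cube ?", "red"),
           ("is there a sphere ?", "yes")]
question = "how many objects are there ?"

print(encode_concat(history, question))  # one flat sequence
print(encode_stack(history, question))   # one sequence per round
```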
# Training
First, change directory
```bash
cd ../prog_generator
```
## Caption Program Parser
To train the caption parser, execute
```bash
python train_caption_parser.py --mode train --run_dir <experiment_dir> --res_path <path_to_store_results> --dataPathTr <path_to_preprocessed_training_data> --dataPathVal <path_to_preprocessed_val_data> --dataPathTest <path_to_preprocessed_test_data> --vocab_path <path_where_to_save_the_vocab>
```
## Question Program Parser
To train the question program parser with the stack encoder, execute
```bash
python train_question_parser.py --mode train --run_dir <experiment_dir> --text_log_dir <log_dir_path> --dataPathTr <path_to_preprocessed_training_data> --dataPathVal <path_to_preprocessed_val_data> --dataPathTest <path_to_preprocessed_test_data> --scenePath <path_to_derendered_scenes> --vocab_path <path_where_to_save_the_vocab> --encoder_type 2
```
To train the question program parser with the concat encoder, execute
```bash
python train_question_parser.py --mode train --run_dir <experiment_dir> --text_log_dir <log_dir_path> --dataPathTr <path_to_preprocessed_training_data> --dataPathVal <path_to_preprocessed_val_data> --dataPathTest <path_to_preprocessed_test_data> --scenePath <path_to_derendered_scenes> --vocab_path <path_where_to_save_the_vocab> --encoder_type 1
```
## Baselines
- [MAC-XXX](https://github.com/ahmedshah1494/clevr-dialog-mac-net/tree/dialog-macnet)
- [HCN](https://github.com/jojonki/Hybrid-Code-Networks)
# Evaluation
To evaluate using the *Hist+GT* scheme, execute
```bash
python train_question_parser.py --mode test_with_gt --run_dir <experiment_dir> --text_log_dir <log_dir_path> --dataPathTr <path_to_preprocessed_training_data> --dataPathVal <path_to_preprocessed_val_data> --dataPathTest <path_to_preprocessed_test_data> --scenePath <path_to_derendered_scenes> --vocab_path <path_where_to_save_the_vocab> --encoder_type <1/2> --questionNetPath <path_to_pretrained_question_parser> --captionNetPath <path_to_pretrained_caption_parser> --dialogLen <total_number_of_dialog_rounds> --last_n_rounds <number_of_last_rounds_to_consider_in_history>
```
To evaluate using the *Hist+Pred* scheme, execute
```bash
python train_question_parser.py --mode test_with_pred --run_dir <experiment_dir> --text_log_dir <log_dir_path> --dataPathTr <path_to_preprocessed_training_data> --dataPathVal <path_to_preprocessed_val_data> --dataPathTest <path_to_preprocessed_test_data> --scenePath <path_to_derendered_scenes> --vocab_path <path_where_to_save_the_vocab> --encoder_type <1/2> --questionNetPath <path_to_pretrained_question_parser> --captionNetPath <path_to_pretrained_caption_parser> --dialogLen <total_number_of_dialog_rounds> --last_n_rounds <number_of_last_rounds_to_consider_in_history>
```
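The two schemes differ only in which answers populate the history: *Hist+GT* uses ground-truth answers for all previous rounds, while *Hist+Pred* feeds the model's own predictions back in, so an early mistake can propagate through the rest of the dialog. A toy simulation of that effect (the model here is a stand-in, not our parser):

```python
def run_dialog(model, questions, gt_answers, use_gt_history=True):
    """Answer a dialog round by round, building the history from
    ground-truth answers (Hist+GT) or from the model's own
    predictions (Hist+Pred). Returns per-dialog accuracy."""
    history, predictions = [], []
    for q, gt in zip(questions, gt_answers):
        pred = model(q, history)
        predictions.append(pred)
        history.append((q, gt if use_gt_history else pred))
    correct = sum(p == g for p, g in zip(predictions, gt_answers))
    return correct / len(questions)

# Stand-in model: answers correctly unless its history is already
# polluted by a wrong answer; it always gets round 2 wrong.
def toy_model(question, history):
    if any(a == "wrong" for _, a in history):
        return "wrong"
    return "wrong" if question == "q2" else "right"

qs, gts = ["q1", "q2", "q3", "q4"], ["right"] * 4
print(run_dialog(toy_model, qs, gts, use_gt_history=True))   # 0.75
print(run_dialog(toy_model, qs, gts, use_gt_history=False))  # 0.25
```

Under Hist+GT the single error at round 2 stays isolated; under Hist+Pred it contaminates every later round, which is why Hist+Pred is the stricter scheme.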
# Results
We achieve new state-of-the-art performance on CLEVR-Dialog.
## Hist+GT
| <center>Model</center> | <center>Accuracy</center> | <center>NFFR</center> |
| :---: | :---: | :---: |
| MAC-CQ | 97.34 | 0.92 |
| + CAA | 97.87 | 0.94 |
| + MTM | 97.58 | 0.92 |
| HCN | 75.88 | 0.34 |
| **NSVD-concat (Ours)** | 99.59 | 0.98 |
| **NSVD-stack (Ours)** | **99.72** | **0.99** |
## Hist+Pred
| <center>Model</center> | <center>Accuracy</center> | <center>NFFR</center> |
| :---: | :---: | :---: |
| MAC-CQ | 41.10 | 0.15 |
| + CAA | 89.39 | 0.75 |
| + MTM | 70.39 | 0.46 |
| HCN | 74.42 | 0.32 |
| **NSVD-concat (Ours)** | 99.59 | 0.98 |
| **NSVD-stack (Ours)** | **99.72** | **0.99** |
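NFFR summarises how long a model survives before its first mistake. A simplified sketch of one natural reading of the metric (first failure round normalised by dialog length, averaged over dialogs; this formulation is ours, see the paper for the exact definition):

```python
def nffr(per_dialog_correct):
    """Mean normalised first-failure round. Each entry is a list of
    per-round correctness flags; a dialog with no failure counts as
    failing only 'after' the last round, contributing 1.0."""
    scores = []
    for rounds in per_dialog_correct:
        first_fail = next((i for i, ok in enumerate(rounds) if not ok),
                          len(rounds))
        scores.append(first_fail / len(rounds))
    return sum(scores) / len(scores)

# One perfect 10-round dialog and one failing at round 5 (index 4):
print(nffr([[True] * 10,
            [True] * 4 + [False] + [True] * 5]))  # (1.0 + 0.4) / 2 = 0.7
```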
We refer to our paper for more results and experiments.
# Acknowledgements
We thank [Ahmed Shah](https://www.linkedin.com/in/mahmedshah/) for his MAC-XXX implementation, [Junki Ohmura](https://www.linkedin.com/in/junki/) for his HCN implementation, [Jiayuan Mao](https://jiayuanm.com/) for providing us with the Minecraft images, and finally [Satwik Kottur](https://satwikkottur.github.io/) for his CLEVR-Dialog [codebase](https://github.com/satwikkottur/clevr-dialog).
# Contributors
- [Adnen Abdessaied](https://adnenabdessaied.de)
For any questions or enquiries, don't hesitate to contact the above contributor.