# ActionDiffusion: An Action-aware Diffusion Model for Procedure Planning in Instructional Videos

*Lei Shi<sup>1</sup>, Paul Bürkner<sup>2</sup>, Andreas Bulling<sup>1</sup>*

1. University of Stuttgart
2. TU Dortmund University

IEEE/CVF Winter Conference on Applications of Computer Vision, 2025

Paper link: https://arxiv.org/abs/2403.08591

## Dataset

Download the pre-extracted features for each dataset.

### CrossTask

```
cd dataset/crosstask
wget https://www.di.ens.fr/~dzhukov/crosstask/crosstask_release.zip
wget https://vision.eecs.yorku.ca/WebShare/CrossTask_s3d.zip
unzip '*.zip'
```

### COIN

```
cd dataset/coin
wget https://vision.eecs.yorku.ca/WebShare/COIN_s3d.zip
unzip COIN_s3d.zip
```

### NIV

```
cd dataset/NIV
wget https://vision.eecs.yorku.ca/WebShare/NIV_s3d.zip
unzip NIV_s3d.zip
```

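
After downloading, a quick check that each dataset directory actually contains files can catch a failed download or unzip early. The following is a minimal sketch, not part of the repository: the function name `check_dataset_dirs` is hypothetical, and it only assumes the three directory names used in the commands above (`crosstask`, `coin`, `NIV`) under a common root.

```shell
# Hypothetical helper (not part of the repository): report whether each
# dataset directory under the given root exists and is non-empty.
check_dataset_dirs() {
    root="$1"
    failed=0
    for name in crosstask coin NIV; do
        dir="$root/$name"
        # `ls -A` prints nothing for an empty directory.
        if [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]; then
            echo "ok: $dir"
        else
            echo "missing or empty: $dir"
            failed=1
        fi
    done
    # Non-zero exit status if any directory was missing or empty.
    return "$failed"
}
```

From the repository root, this could be called as `check_dataset_dirs dataset`. It does not verify the contents of the archives, only that something was extracted.
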
## Train

### Task Prediction

Set the arguments in `train_mlp.sh` and train a task prediction model for each dataset. Set `--class_dim`, `--action_dim`, and `--observation_dim` to match the dataset. For each horizon `T={3,4,5,6}`, set `--horizon`, `--json_path_val`, and `--json_path_train` accordingly.

```
sh train_mlp.sh
```

Set the checkpoint path in `temp.py` via `--checkpoint_mlp`.

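
Because `--horizon`, `--json_path_train`, and `--json_path_val` change together for each value of `T`, it can help to generate them in one place. The sketch below is illustrative only: `horizon_args` is a hypothetical helper, and the JSON path templates are placeholders, not file names from the repository. It prints the flag strings rather than running any training.

```shell
# Hypothetical helper: print the horizon-dependent flags for one value of T.
# The two path templates are placeholders supplied by the caller; every
# "{T}" in a template is replaced with the horizon.
horizon_args() {
    t="$1"
    train_tpl="$2"
    val_tpl="$3"
    train_json=$(printf '%s' "$train_tpl" | sed "s/{T}/$t/g")
    val_json=$(printf '%s' "$val_tpl" | sed "s/{T}/$t/g")
    echo "--horizon $t --json_path_train $train_json --json_path_val $val_json"
}

# Print one flag set per horizon mentioned in this README.
for t in 3 4 5 6; do
    horizon_args "$t" "path/to/train_T{T}.json" "path/to/val_T{T}.json"
done
```

The printed flags would then be copied into `train_mlp.sh` for the corresponding run; the real JSON file names depend on the downloaded data.
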
### Diffusion Model

Set `dataset` and `horizon` in `train.sh` to the dataset and time horizon to train on. Set `mask_type` to `multi_add` to use the multiple-add noise mask or `single_add` to use the single-add noise mask. Set `attn` to `WithAttention` to use a UNet with attention or `NoAttention` to use a UNet without attention.

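
Since `mask_type` and `attn` each accept only two values, a small pre-flight check can catch a misspelled option before a long run starts. This is a sketch under that assumption: `validate_options` is a hypothetical function, not something the repository provides, and it checks only the values listed above.

```shell
# Hypothetical pre-flight check (not part of the repository): accept only
# the option values named in this README.
validate_options() {
    mask_type="$1"
    attn="$2"
    case "$mask_type" in
        multi_add|single_add) ;;
        *) echo "unknown mask_type: $mask_type" >&2; return 1 ;;
    esac
    case "$attn" in
        WithAttention|NoAttention) ;;
        *) echo "unknown attn: $attn" >&2; return 1 ;;
    esac
    echo "mask_type=$mask_type attn=$attn"
}
```

For example, `validate_options multi_add WithAttention` succeeds, while `validate_options multi_add withattention` fails because the match is case-sensitive.
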
To train the model, run

```
sh train.sh
```

To train the model without mask, run

```
sh train_no_mask.sh
```

## Inference

Set `dataset` and `horizon` in `inference.sh` to the dataset and time horizon to run inference on. Set `checkpoint_diff` to the path of the pre-trained model.
Set `mask_type` to `multi_add` to use the multiple-add noise mask or `single_add` to use the single-add noise mask. Set `attn` to `WithAttention` to use a UNet with attention or `NoAttention` to use a UNet without attention.

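
A wrong `checkpoint_diff` path is an easy mistake to make, so it may be worth confirming the file exists before launching. The sketch below assumes the checkpoint is a single file; `require_checkpoint` is a hypothetical helper, and it does not read `inference.sh`, so the same path has to be passed to it by hand.

```shell
# Hypothetical helper (not part of the repository): fail early when the
# checkpoint file is missing.
require_checkpoint() {
    ckpt="$1"
    if [ -z "$ckpt" ]; then
        echo "usage: require_checkpoint <path-to-checkpoint>" >&2
        return 2
    fi
    if [ ! -f "$ckpt" ]; then
        echo "checkpoint not found: $ckpt" >&2
        return 1
    fi
    echo "using checkpoint: $ckpt"
}
```

One possible use is to chain it with the inference script, so the script runs only if the check passes: `require_checkpoint path/to/checkpoint && sh inference.sh`.
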
To perform inference, run

```
sh inference.sh
```

To perform inference without the action mask, run

```
sh inference_no_mask.sh
```

To run inference with the distribution of the noise with action embedding, run

```
sh inference_dist.sh
```

## Acknowledgement

This repository is developed based on https://github.com/MCG-NJU/PDPP/tree/main/