# ActionDiffusion: An Action-aware Diffusion Model for Procedure Planning in Instructional Videos
*Lei Shi1, Paul Bürkner2, Andreas Bulling1*
1. University of Stuttgart
2. TU Dortmund University
IEEE/CVF Winter Conference on Applications of Computer Vision, 2025
Paper link: https://arxiv.org/abs/2403.08591
## Dataset
Download the pre-extracted features for each dataset.
### CrossTask
```
cd dataset/crosstask
wget https://www.di.ens.fr/~dzhukov/crosstask/crosstask_release.zip
wget https://vision.eecs.yorku.ca/WebShare/CrossTask_s3d.zip
unzip '*.zip'
```
### COIN
```
cd dataset/coin
wget https://vision.eecs.yorku.ca/WebShare/COIN_s3d.zip
unzip COIN_s3d.zip
```
### NIV
```
cd dataset/NIV
wget https://vision.eecs.yorku.ca/WebShare/NIV_s3d.zip
unzip NIV_s3d.zip
```
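Before training, it can help to confirm that every archive landed in the expected directory. The following is a minimal sketch and not part of this repository: the function name `check_archives` is hypothetical, and it only checks that the zip files named in the download commands above exist under a given root directory (assumed to be `dataset` by default).

```sh
# Hypothetical helper (not part of this repository): report which of the
# archives named in the download commands above are missing.
# Usage: check_archives [root]   (root defaults to "dataset")
check_archives() {
    root="${1:-dataset}"
    missing=0
    for f in \
        crosstask/crosstask_release.zip \
        crosstask/CrossTask_s3d.zip \
        coin/COIN_s3d.zip \
        NIV/NIV_s3d.zip
    do
        if [ ! -f "$root/$f" ]; then
            echo "missing: $root/$f"
            missing=1
        fi
    done
    # Exit status 0 if every archive is present, 1 otherwise.
    return "$missing"
}
```

Calling `check_archives` from the repository root prints one line per missing archive and returns a non-zero status if any is absent.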
## Train
### Task Prediction
Set the arguments in `train_mlp.sh` and train a task prediction model for each dataset. Set `--class_dim`, `--action_dim`, and `--observation_dim` according to the dataset. For each horizon `T={3,4,5,6}`, set `--horizon`, `--json_path_val`, and `--json_path_train` accordingly.
```
sh train_mlp.sh
```
Set the path of the trained task prediction checkpoint in `temp.py` via `--checkpoint_mlp`.
### Diffusion Model
Set `dataset` and `horizon` in `train.sh` to the dataset and time horizon to train on. Set `mask_type` to `multi_add` to use the multiple-add noise mask or to `single_add` to use the single-add noise mask. Set `attn` to `WithAttention` to use the UNet with attention or to `NoAttention` to use the UNet without attention.
To train the model, run:
```
sh train.sh
```
To train the model without the action mask, run:
```
sh train_no_mask.sh
```
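Because these settings are edited by hand in `train.sh`, a typo in `mask_type` or `attn` may only surface once training starts. The sketch below is a hypothetical pre-flight check, not part of this repository: the function name `check_settings` is an assumption, and it assumes the diffusion model uses the same horizons `T={3,4,5,6}` listed for task prediction.

```sh
# Hypothetical pre-flight check (not part of this repository): validate the
# values described above before editing them into train.sh.
# Assumption: valid horizons are 3, 4, 5, and 6, as for task prediction.
# Usage: check_settings <horizon> <mask_type> <attn>
check_settings() {
    case "$1" in
        3|4|5|6) ;;
        *) echo "horizon must be 3, 4, 5, or 6: got '$1'"; return 1 ;;
    esac
    case "$2" in
        multi_add|single_add) ;;
        *) echo "mask_type must be multi_add or single_add: got '$2'"; return 1 ;;
    esac
    case "$3" in
        WithAttention|NoAttention) ;;
        *) echo "attn must be WithAttention or NoAttention: got '$3'"; return 1 ;;
    esac
}
```

For example, `check_settings 3 multi_add WithAttention` succeeds silently, while `check_settings 3 multi_add attention` prints a message and returns a non-zero status.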
## Inference
Set `dataset` and `horizon` in `inference.sh` to the dataset and time horizon used for inference. Set `checkpoint_diff` to the path of the pre-trained model.
Set `mask_type` to `multi_add` to use the multiple-add noise mask or to `single_add` to use the single-add noise mask. Set `attn` to `WithAttention` to use the UNet with attention or to `NoAttention` to use the UNet without attention.
To perform inference, run:
```
sh inference.sh
```
To perform inference without the action mask, run:
```
sh inference_no_mask.sh
```
To infer with the distribution of the noise with action embedding, run:
```
sh inference_dist.sh
```
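The three inference variants above differ only in which script is launched. As a convenience, the choice can be wrapped in a small lookup. This is a sketch, not part of this repository: the function name `inference_script` and the mode names `mask`, `no_mask`, and `dist` are hypothetical labels introduced here, and only the script names come from this section.

```sh
# Hypothetical helper (not part of this repository): map an inference mode
# to the script named in this section. The mode names are illustrative.
# Usage: inference_script <mode>   where mode is mask, no_mask, or dist
inference_script() {
    case "$1" in
        mask)    echo "inference.sh" ;;
        no_mask) echo "inference_no_mask.sh" ;;
        dist)    echo "inference_dist.sh" ;;
        *)       echo "unknown mode: $1" >&2; return 1 ;;
    esac
}
```

With this in place, `sh "$(inference_script no_mask)"` would be equivalent to running `sh inference_no_mask.sh` directly.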
## Acknowledgement
This repository is developed based on https://github.com/MCG-NJU/PDPP/tree/main/.