# ActionDiffusion: An Action-aware Diffusion Model for Procedure Planning in Instructional Videos

*Lei Shi¹, Paul Bürkner², Andreas Bulling¹*

1. University of Stuttgart
2. TU Dortmund University

IEEE/CVF Winter Conference on Applications of Computer Vision, 2025

Paper link: https://arxiv.org/abs/2403.08591

## Dataset

Download the pre-extracted features.

### CrossTask

```
cd dataset/crosstask
wget https://www.di.ens.fr/~dzhukov/crosstask/crosstask_release.zip
wget https://vision.eecs.yorku.ca/WebShare/CrossTask_s3d.zip
unzip '*.zip'
```

### COIN

```
cd dataset/coin
wget https://vision.eecs.yorku.ca/WebShare/COIN_s3d.zip
unzip COIN_s3d.zip
```

### NIV

```
cd dataset/NIV
wget https://vision.eecs.yorku.ca/WebShare/NIV_s3d.zip
unzip NIV_s3d.zip
```

## Train

### Task Prediction

Set the arguments in `train_mlp.sh` and train a task prediction model for each dataset.
Set `--class_dim`, `--action_dim`, and `--observation_dim` to match the dataset.
For each horizon `T` in {3, 4, 5, 6}, set `--horizon`, `--json_path_val`, and `--json_path_train` accordingly.

```
sh train_mlp.sh
```

Set the checkpoint path in `temp.py` via `--checkpoint_mlp`.

### Diffusion Model

Set `dataset` and `horizon` in `train.sh` to the dataset and time horizon to train on.
Set `mask_type` to `multi_add` to use the multiple-add noise mask or `single_add` to use the single-add noise mask.
Set `attn` to `WithAttention` to use the UNet with attention or `NoAttention` to use the UNet without attention.

To train the model, run

```
sh train.sh
```

To train the model without the mask, run

```
sh train_no_mask.sh
```

## Inference

Set `dataset` and `horizon` in `inference.sh` to the dataset and time horizon to run inference on.
Set `checkpoint_diff` to the path of the pre-trained model.
Set `mask_type` to `multi_add` to use the multiple-add noise mask or `single_add` to use the single-add noise mask.
Set `attn` to `WithAttention` to use the UNet with attention or `NoAttention` to use the UNet without attention.
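Before launching inference, it may help to confirm that the checkpoint file and the dataset directory you configured actually exist. The following is a minimal sketch only: the function name `preflight_check` and its two arguments are hypothetical and not part of this repository, and the check assumes the checkpoint is a single file and the features live in one directory.

```shell
#!/bin/sh
# Hypothetical helper, not part of this repository.
# Usage: preflight_check <checkpoint_file> <dataset_dir>
preflight_check() {
    ckpt="$1"
    data_dir="$2"
    status=0

    # The checkpoint must be an existing, non-empty regular file.
    if [ ! -f "$ckpt" ] || [ ! -s "$ckpt" ]; then
        echo "checkpoint not found or empty: $ckpt" >&2
        status=1
    fi

    # The dataset directory must exist and contain at least one entry.
    if [ ! -d "$data_dir" ] || [ -z "$(ls -A "$data_dir" 2>/dev/null)" ]; then
        echo "dataset directory missing or empty: $data_dir" >&2
        status=1
    fi

    return "$status"
}
```

The function returns a non-zero status if either check fails, so it can gate a run, for example `preflight_check <checkpoint> dataset/crosstask && sh inference.sh`, where `<checkpoint>` stands for whatever path you set as `checkpoint_diff`.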

To perform inference, run

```
sh inference.sh
```

To perform inference without the action mask, run

```
sh inference_no_mask.sh
```

To infer with the distribution of the noise with action embedding, run

```
sh inference_dist.sh
```

## Acknowledgement

This repository is developed based on https://github.com/MCG-NJU/PDPP/tree/main/