# HaHeAE: Learning Generalisable Joint Representations of Human Hand and Head Movements in Extended Reality

## Abstract

Human hand and head movements are the most pervasive input modalities in extended reality (XR) and are significant for a wide range of applications. However, prior work on hand and head modelling in XR has only explored a single modality or focused on specific applications. We present HaHeAE, a novel self-supervised method for learning generalisable joint representations of hand and head movements in XR. At the core of our method is an autoencoder (AE) that uses a graph convolutional network-based semantic encoder and a diffusion-based stochastic encoder to learn the joint semantic and stochastic representations of hand-head movements. It also features a diffusion-based decoder to reconstruct the original signals. Through extensive evaluations on three public XR datasets, we show that our method 1) significantly outperforms commonly used self-supervised methods by up to 74.1% in terms of reconstruction quality and is generalisable across users, activities, and XR environments, 2) enables new applications, including interpretable hand-head cluster identification and variable hand-head movement generation, and 3) can serve as an effective feature extractor for downstream tasks. Together, these results demonstrate the effectiveness of our method and underline the potential of self-supervised methods for jointly modelling hand-head behaviours in extended reality.

## Environment:

- Ubuntu 22.04
- Python 3.8+
- PyTorch 1.8.1

## Usage:

Step 1: Create the environment.

```
conda env create -f ./environment/haheae.yaml -n haheae
conda activate haheae
```

Step 2: Follow the instructions at [Pose2Gaze][1] to process the datasets.

Step 3: Set `data_dir` in `config.py` and `main.py` to the path of the processed datasets. Run `train.sh` to evaluate the pre-trained models. To train the model from scratch, remove the pre-trained models and uncomment the training command (the command with `mode` set to "train"); an illustrative sketch of these commands is given in the Example commands section at the end of this README.

## Citation

```bibtex
@article{hu25haheae,
  author  = {Hu, Zhiming and Zhang, Guanhua and Yin, Zheming and Haeufle, Daniel and Schmitt, Syn and Bulling, Andreas},
  journal = {IEEE Transactions on Visualization and Computer Graphics},
  title   = {HaHeAE: Learning Generalisable Joint Representations of Human Hand and Head Movements in Extended Reality},
  year    = {2025}
}

@article{hu24pose2gaze,
  author  = {Hu, Zhiming and Xu, Jiahui and Schmitt, Syn and Bulling, Andreas},
  journal = {IEEE Transactions on Visualization and Computer Graphics},
  title   = {Pose2Gaze: Eye-body Coordination during Daily Activities for Gaze Prediction from Full-body Poses},
  year    = {2024}
}
```

## Acknowledgements

Our work is built on the codebases of [Diffusion Autoencoders][2] and [DisMouse][3]. We thank the authors for sharing their code.

[1]: https://github.com/CraneHzm/Pose2Gaze
[2]: https://diff-ae.github.io/
[3]: https://git.hcics.simtech.uni-stuttgart.de/public-projects/DisMouse
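
## Example commands

For orientation, below is a minimal sketch of the two invocations in `train.sh` that Step 3 refers to. The `--mode` flag and the `test` value are assumptions inferred from the usage notes above, not the repository's verified interface, and `data_dir` itself is set inside `config.py` and `main.py` as described in Step 3. Consult the actual `train.sh` for the exact commands.

```bash
#!/bin/bash
# Hypothetical contents of train.sh; flag names are assumptions,
# not the repository's verified interface.

# Evaluate the pre-trained models (the default behaviour described in Step 3).
python main.py --mode test

# To train from scratch: remove the pre-trained models, then uncomment
# the training command (the command with mode set to "train").
# python main.py --mode train
```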