Hybrid backbone: hPGA-DP wraps a diffusion policy with PGA components. A P-GATr state encoder converts observations (robot links and task objects) into multivector latents; a P-GATr action decoder maps denoised latents back to actions. Both preserve geometric structure and E(3)-equivariance.
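The encoder → denoiser → decoder pipeline can be sketched as follows. This is a minimal numpy stand-in, not the paper's implementation: the dimensions, the placeholder linear "networks" (`W_enc`, `W_dec`, `denoise`), and the simplified reverse-diffusion loop are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4 scene entities, 16-component PGA multivectors each.
N_ENTITIES, MV_DIM, LATENT_DIM, ACTION_DIM = 4, 16, 32, 7

# Placeholder weights standing in for the P-GATr encoder/decoder.
W_enc = rng.standard_normal((N_ENTITIES * MV_DIM, LATENT_DIM)) * 0.1
W_dec = rng.standard_normal((LATENT_DIM, ACTION_DIM)) * 0.1

def encode(obs_multivectors):
    """P-GATr-style state encoder (stand-in): multivectors -> latent."""
    return obs_multivectors.reshape(-1) @ W_enc

def denoise(z_t, cond, t):
    """Standard denoiser stand-in (a U-Net/Transformer in the paper):
    predicts the noise in z_t given the encoded observation and timestep."""
    return 0.5 * z_t + 0.1 * cond  # toy placeholder, not a real network

def decode(z0):
    """P-GATr-style action decoder (stand-in): denoised latent -> action."""
    return z0 @ W_dec

# One reverse-diffusion rollout over the latent action.
obs = rng.standard_normal((N_ENTITIES, MV_DIM))
cond = encode(obs)
z = rng.standard_normal(LATENT_DIM)      # start from pure noise
for t in reversed(range(10)):            # simplified denoising loop
    z = z - denoise(z, cond, t)          # subtract predicted noise
action = decode(z)
print(action.shape)
```

The point of the structure: the geometry-aware pieces sit only at the boundaries, so any off-the-shelf denoiser can be dropped into the middle unchanged.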
Geometric inductive bias via PGA: Points, planes, and rigid motions are represented as multivectors in 𝔾₃,₀,₁, so Euclidean transforms act natively instead of being relearned from data. This encodes spatial relations compactly and consistently across tasks.
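To make the multivector representation concrete, here is one common embedding of points and planes into the 16 basis blades of 𝔾₃,₀,₁ (the blade ordering and sign conventions below follow the widely used ganja.js/bivector.net convention; the paper's internal layout may differ).

```python
import numpy as np

# One common index layout for the 16 basis blades of G(3,0,1):
BLADES = ["1", "e0", "e1", "e2", "e3", "e01", "e02", "e03", "e12", "e31",
          "e23", "e021", "e013", "e032", "e123", "e0123"]
IDX = {b: i for i, b in enumerate(BLADES)}

def point(x, y, z):
    """Embed a Euclidean point as the PGA trivector
    P = x*e032 + y*e013 + z*e021 + e123 (homogeneous weight 1)."""
    mv = np.zeros(16)
    mv[IDX["e032"]] = x
    mv[IDX["e013"]] = y
    mv[IDX["e021"]] = z
    mv[IDX["e123"]] = 1.0
    return mv

def plane(a, b, c, d):
    """Embed the plane a*x + b*y + c*z + d = 0 as the PGA vector
    a*e1 + b*e2 + c*e3 + d*e0."""
    mv = np.zeros(16)
    mv[IDX["e1"]], mv[IDX["e2"]], mv[IDX["e3"]], mv[IDX["e0"]] = a, b, c, d
    return mv

p = point(1.0, 2.0, 3.0)
ground = plane(0.0, 0.0, 1.0, 0.0)  # the z = 0 plane
```

Because rotations and translations act on these multivectors by a single uniform operation (the sandwich product with a motor), the network does not have to learn separate rules for transforming points versus planes.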
Standard denoiser in the middle: Between encoder and decoder, a conventional U-Net or Transformer handles the diffusion denoising steps. This keeps the proven generative power of standard backbones while operating on geometry-informed latents.
Training recipe: Noise is added to action latents; the denoiser predicts it conditioned on encoded observations. The decoder is supervised only on later denoising steps (loss masking with threshold η) so it learns from well-denoised latents instead of pure noise. Total loss combines noise-prediction MSE and a masked decoder reconstruction term.
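The combined objective can be sketched per-sample as below. The noise schedule, the loss weight `lam`, and the interpretation of η as a timestep threshold are assumptions for illustration; only the overall shape (noise-prediction MSE plus a masked decoder reconstruction term) follows the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

def training_loss(z0, cond, t, T, eta, denoiser, decoder, target_action,
                  lam=1.0):
    """One-sample sketch of the hPGA-DP objective (notation assumed).

    loss = ||eps_hat - eps||^2 + lam * mask(t) * ||decode(z0_hat) - a||^2
    where mask(t) = 1 only for later (well-denoised) steps, t <= eta.
    """
    eps = rng.standard_normal(z0.shape)
    alpha_bar = 1.0 - t / T                      # toy linear noise schedule
    z_t = np.sqrt(alpha_bar) * z0 + np.sqrt(1.0 - alpha_bar) * eps
    eps_hat = denoiser(z_t, cond, t)
    loss_noise = np.mean((eps_hat - eps) ** 2)

    # Decoder is supervised only on well-denoised latents (late steps),
    # via the standard x0-estimate recovered from the predicted noise.
    mask = 1.0 if t <= eta else 0.0
    z0_hat = (z_t - np.sqrt(1.0 - alpha_bar) * eps_hat) / np.sqrt(alpha_bar)
    loss_dec = mask * np.mean((decoder(z0_hat) - target_action) ** 2)
    return loss_noise + lam * loss_dec

# Usage with trivial stand-in networks (illustrative only):
denoiser = lambda z_t, cond, t: np.zeros_like(z_t)
decoder = lambda z0: z0[:7]
loss = training_loss(np.zeros(32), np.zeros(32), t=5, T=100, eta=10,
                     denoiser=denoiser, decoder=decoder,
                     target_action=np.zeros(7))
```

Masking keeps the decoder's gradients clean: early in the reverse chain the latent is mostly noise, and reconstructing actions from it would only teach the decoder to hallucinate.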
Fig. 2: Top: simulation tasks in robosuite, with colored 3D bounding boxes indicating task-relevant objects. Bottom left: success rates for diffusion policies with different network backbones for various tasks, and mean epoch training time (MET) for each network on all tasks together. Bottom right: plot of success rate for state-based policies with U-Net, Transformer, hPGA-U, and hPGA-T for 100 training epochs of the Stack task.
Our experiments show that by "priming" the network with geometric knowledge, hPGA-DP reaches peak success rates in significantly fewer training epochs compared to vanilla Diffusion Policies.
Fig. 3: Top left: the dual-arm system for real-world experiments. Top right: top and bottom row show the block stacking task and drawer interaction task respectively. Bottom: results for real-world experiments. SR: success rate, CT: cumulative training time measured in minutes.
In the real-world tasks, hPGA-DP achieves higher success rates within the same epoch budget than U-Net or Transformer baselines. Although each epoch is slightly slower, the hybrid reaches target performance in fewer epochs, yielding lower cumulative training time.
@article{sun2025hybrid,
title={Hybrid diffusion policies with projective geometric algebra for efficient robot manipulation learning},
author={Sun, Xiatao and Wang, Yuxuan and Yang, Shuo and Chen, Yinxing and Rakita, Daniel},
journal={arXiv preprint arXiv:2507.05695},
year={2025}
}