Hybrid backbone: hPGA-DP wraps a diffusion policy with PGA components. A P-GATr state encoder converts observations (robot links and task objects) into multivector latents; a P-GATr action decoder maps denoised latents back to actions. Both preserve geometric structure and E(3)-equivariance.
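The encoder → denoiser → decoder pipeline can be sketched as follows. This is a minimal numpy stand-in, not the paper's implementation: the dimensions, the placeholder linear "networks" (`W_enc`, `W_dec`, `denoise`), and the simplified reverse-diffusion loop are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4 scene entities, 16-component PGA multivectors each.
N_ENTITIES, MV_DIM, LATENT_DIM, ACTION_DIM = 4, 16, 32, 7

# Placeholder weights standing in for the P-GATr encoder/decoder.
W_enc = rng.standard_normal((N_ENTITIES * MV_DIM, LATENT_DIM)) * 0.1
W_dec = rng.standard_normal((LATENT_DIM, ACTION_DIM)) * 0.1

def encode(obs_multivectors):
    """P-GATr-style state encoder (stand-in): multivectors -> latent."""
    return obs_multivectors.reshape(-1) @ W_enc

def denoise(z_t, cond, t):
    """Standard denoiser stand-in (a U-Net/Transformer in the paper):
    predicts the noise in z_t given the encoded observation and timestep."""
    return 0.5 * z_t + 0.1 * cond  # toy placeholder, not a real network

def decode(z0):
    """P-GATr-style action decoder (stand-in): denoised latent -> action."""
    return z0 @ W_dec

# One reverse-diffusion rollout over the latent action.
obs = rng.standard_normal((N_ENTITIES, MV_DIM))
cond = encode(obs)
z = rng.standard_normal(LATENT_DIM)      # start from pure noise
for t in reversed(range(10)):            # simplified denoising loop
    z = z - denoise(z, cond, t)          # subtract predicted noise
action = decode(z)
print(action.shape)
```

The point of the structure: the geometry-aware pieces sit only at the boundaries, so any off-the-shelf denoiser can be dropped into the middle unchanged.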
Geometric inductive bias via PGA: Points, planes, and rigid motions are represented as multivectors in 𝔾₃,₀,₁, so Euclidean transforms act natively instead of being relearned from data. This encodes spatial relations compactly and consistently across tasks.
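To make the multivector representation concrete, here is one common embedding of points and planes into the 16 basis blades of 𝔾₃,₀,₁ (the blade ordering and sign conventions below follow the widely used ganja.js/bivector.net convention; the paper's internal layout may differ).

```python
import numpy as np

# One common index layout for the 16 basis blades of G(3,0,1):
BLADES = ["1", "e0", "e1", "e2", "e3", "e01", "e02", "e03", "e12", "e31",
          "e23", "e021", "e013", "e032", "e123", "e0123"]
IDX = {b: i for i, b in enumerate(BLADES)}

def point(x, y, z):
    """Embed a Euclidean point as the PGA trivector
    P = x*e032 + y*e013 + z*e021 + e123 (homogeneous weight 1)."""
    mv = np.zeros(16)
    mv[IDX["e032"]] = x
    mv[IDX["e013"]] = y
    mv[IDX["e021"]] = z
    mv[IDX["e123"]] = 1.0
    return mv

def plane(a, b, c, d):
    """Embed the plane a*x + b*y + c*z + d = 0 as the PGA vector
    a*e1 + b*e2 + c*e3 + d*e0."""
    mv = np.zeros(16)
    mv[IDX["e1"]], mv[IDX["e2"]], mv[IDX["e3"]], mv[IDX["e0"]] = a, b, c, d
    return mv

p = point(1.0, 2.0, 3.0)
ground = plane(0.0, 0.0, 1.0, 0.0)  # the z = 0 plane
```

Because rotations and translations act on these multivectors by a single uniform operation (the sandwich product with a motor), the network does not have to learn separate rules for transforming points versus planes.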
Standard denoiser in the middle: Between encoder and decoder, a conventional U-Net or Transformer handles the diffusion denoising steps. This keeps the proven generative power of standard backbones while operating on geometry-informed latents.
Training recipe: Noise is added to action latents; the denoiser predicts it conditioned on encoded observations. The decoder is supervised only on later denoising steps (loss masking with threshold η) so it learns from well-denoised latents instead of pure noise. Total loss combines noise-prediction MSE and a masked decoder reconstruction term.
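The combined objective can be sketched per-sample as below. The noise schedule, the loss weight `lam`, and the interpretation of η as a timestep threshold are assumptions for illustration; only the overall shape (noise-prediction MSE plus a masked decoder reconstruction term) follows the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

def training_loss(z0, cond, t, T, eta, denoiser, decoder, target_action,
                  lam=1.0):
    """One-sample sketch of the hPGA-DP objective (notation assumed).

    loss = ||eps_hat - eps||^2 + lam * mask(t) * ||decode(z0_hat) - a||^2
    where mask(t) = 1 only for later (well-denoised) steps, t <= eta.
    """
    eps = rng.standard_normal(z0.shape)
    alpha_bar = 1.0 - t / T                      # toy linear noise schedule
    z_t = np.sqrt(alpha_bar) * z0 + np.sqrt(1.0 - alpha_bar) * eps
    eps_hat = denoiser(z_t, cond, t)
    loss_noise = np.mean((eps_hat - eps) ** 2)

    # Decoder is supervised only on well-denoised latents (late steps),
    # via the standard x0-estimate recovered from the predicted noise.
    mask = 1.0 if t <= eta else 0.0
    z0_hat = (z_t - np.sqrt(1.0 - alpha_bar) * eps_hat) / np.sqrt(alpha_bar)
    loss_dec = mask * np.mean((decoder(z0_hat) - target_action) ** 2)
    return loss_noise + lam * loss_dec

# Usage with trivial stand-in networks (illustrative only):
denoiser = lambda z_t, cond, t: np.zeros_like(z_t)
decoder = lambda z0: z0[:7]
loss = training_loss(np.zeros(32), np.zeros(32), t=5, T=100, eta=10,
                     denoiser=denoiser, decoder=decoder,
                     target_action=np.zeros(7))
```

Masking keeps the decoder's gradients clean: early in the reverse chain the latent is mostly noise, and reconstructing actions from it would only teach the decoder to hallucinate.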
Fig. 2: Top: simulation tasks in robosuite, with colored 3D bounding boxes indicating task-relevant objects. Bottom left: success rates for diffusion policies with different network backbones for various tasks, and mean epoch training time (MET) for each network on all tasks together. Bottom right: plot of success rate for state-based policies with U-Net, Transformer, hPGA-U, and hPGA-T for 100 training epochs of the Stack task.
Our experiments show that by "priming" the network with geometric knowledge, hPGA-DP reaches peak success rates in significantly fewer training epochs compared to vanilla Diffusion Policies.
Fig. 3: Top left: the dual-arm system for real-world experiments. Top right: top and bottom row show the block stacking task and drawer interaction task respectively. Bottom: results for real-world experiments. SR: success rate, CT: cumulative training time measured in minutes.
In the real-world tasks, hPGA-DP achieves higher success rates within the same epoch budget than U-Net or Transformer baselines. Although each epoch is slightly slower, the hybrid reaches target performance in fewer epochs, yielding lower cumulative training time.
@article{sun2025hybrid,
title={Hybrid diffusion policies with projective geometric algebra for efficient robot manipulation learning},
author={Sun, Xiatao and Wang, Yuxuan and Yang, Shuo and Chen, Yinxing and Rakita, Daniel},
journal={arXiv preprint arXiv:2507.05695},
year={2025}
}