Diffusion Model with Cross Attention as an Inductive Bias for Disentanglement

Part of Advances in Neural Information Processing Systems 37 (NeurIPS 2024) Main Conference Track

Bibtex Paper

Authors

Tao Yang, Cuiling Lan, Yan Lu, Nanning Zheng

Abstract

Disentangled representation learning strives to extract the intrinsic factors within the observed data. Factoring these representations in an unsupervised manner is notably challenging and usually requires tailored loss functions or specific structural designs. In this paper, we introduce a new perspective and framework, demonstrating that diffusion models with cross-attention itself can serve as a powerful inductive bias to facilitate the learning of disentangled representations. We propose to encode an image into a set of concept tokens and treat them as the condition of the latent diffusion model for image reconstruction, where cross attention over the concept tokens is used to bridge the encoder and the U-Net of the diffusion model. We analyze that the diffusion process inherently possesses the time-varying information bottlenecks. Such information bottlenecks and cross attention act as strong inductive biases for promoting disentanglement. Without any regularization term in the loss function, this framework achieves superior disentanglement performance on the benchmark datasets, surpassing all previous methods with intricate designs. We have conducted comprehensive ablation studies and visualization analyses, shedding a light on the functioning of this model. We anticipate that our findings will inspire more investigation on exploring diffusion model for disentangled representation learning towards more sophisticated data analysis and understanding.