Joint-Embedding vs. Reconstruction: When Should You Use Each? (huguesva.github.io)

🤖 AI Summary
Researchers presented a theoretical comparison (NeurIPS 2025) of two core self-supervised learning paradigms: reconstruction (e.g., denoising autoencoders, MAE) and joint-embedding (e.g., SimCLR, BYOL, DINO), using linear models to obtain closed-form solutions. They formalize reconstruction (SSL-RC) as minimizing input-space reconstruction error, and joint-embedding (SSL-JE) as minimizing the pairwise distance between embeddings of augmented views under an orthonormality constraint that rules out collapse; a sketch of both objectives appears below.

By deriving SVD/eigendecomposition solutions parameterized directly by the augmentation distribution, the paper shows that both paradigms need augmentations aligned with the task-relevant signal (quantified by an alignment parameter α) even in the infinite-data limit: misaligned augmentations cannot be compensated for with more samples. The key technical and practical takeaway is that the two families have different robustness thresholds in α.

Reconstruction methods are biased toward explaining the high-variance components of the input, so they work well when noise is low or the signal dominates the variance (e.g., language, where tokens are semantic). Joint-embedding predicts in latent space and never has to reconstruct irrelevant, high-variance features, so it tolerates weaker alignment and is more robust when irrelevant features carry large variance (images with texture or background noise, histopathology, Earth observation, video). Linear-model theory and experiments (MNIST with synthetic noise, ImageNet-1k corruptions) support preferring joint-embedding under strong noise or for semantically diffuse, high-dimensional inputs; reconstruction can be preferable when the important features naturally dominate input variance.
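To make the two objectives concrete, here is one plausible linear instantiation consistent with the summary; the paper's exact parameterization (e.g., how the anti-collapse constraint is imposed) may differ. W is a linear encoder, V a linear decoder, and x̃, x₁, x₂ are augmented views of the same input x.

```latex
% SSL-RC: reconstruct the input x from an augmented view \tilde{x},
% measuring the error in input space.
\min_{V,\,W}\; \mathbb{E}_{x,\tilde{x}}\,\bigl\| x - V W \tilde{x} \bigr\|_2^2

% SSL-JE: pull the embeddings of two augmented views together, with an
% orthonormality constraint on the encoder to rule out the collapsed
% solution W = 0.
\min_{W}\; \mathbb{E}_{x_1, x_2}\,\bigl\| W x_1 - W x_2 \bigr\|_2^2
\qquad \text{s.t.}\; W W^{\top} = I_k
```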
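And a minimal NumPy sketch of the closed forms on toy data, under simplifying assumptions that are mine, not the paper's: a plain linear autoencoder stands in for SSL-RC (its optimum spans the top-k eigenvectors of the input covariance, by Eckart-Young), and the constrained invariance objective stands in for SSL-JE (its optimum spans the bottom-k eigenvectors of the covariance of the view difference). The signal/nuisance axes and the `augment` helper are illustrative inventions; augmentations here are perfectly aligned with the task (the favorable α regime), perturbing only the irrelevant axis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a low-variance "signal" axis plus a high-variance "nuisance"
# axis (think texture/background noise dominating the pixel variance).
n, k = 5000, 1
signal = rng.normal(0.0, 1.0, size=n)      # task-relevant, variance 1
nuisance = rng.normal(0.0, 3.0, size=n)    # irrelevant, variance 9
X = np.stack([signal, nuisance], axis=1)   # columns: [signal, nuisance]

def augment(X):
    """Aligned augmentation: perturb only the irrelevant axis."""
    noise = np.zeros_like(X)
    noise[:, 1] = rng.normal(0.0, 1.0, size=len(X))
    return X + noise

X1, X2 = augment(X), augment(X)

# SSL-RC (linear autoencoder): the optimum spans the top-k eigenvectors
# of the input covariance (Eckart-Young), i.e. the highest-variance
# input directions -- here the nuisance axis wins.
cov_x = np.cov(X, rowvar=False)
_, eigvecs = np.linalg.eigh(cov_x)          # eigenvalues ascending
W_rc = eigvecs[:, -k:].T                    # top-k directions

# SSL-JE: minimize E||W x1 - W x2||^2 s.t. W W^T = I. The optimum spans
# the bottom-k eigenvectors of Cov(x1 - x2): the directions least
# perturbed by augmentation -- here the signal axis wins.
cov_delta = np.cov(X1 - X2, rowvar=False)
_, eigvecs_d = np.linalg.eigh(cov_delta)
W_je = eigvecs_d[:, :k].T                   # bottom-k directions

print("SSL-RC direction:", np.round(W_rc, 3))  # ±[0, 1]: nuisance axis
print("SSL-JE direction:", np.round(W_je, 3))  # ±[1, 0]: signal axis
```

Because the nuisance axis carries the larger variance (9 vs. 1), the reconstruction solution latches onto it, while the joint-embedding solution picks the augmentation-invariant signal axis: the linear-model version of the summary's claim about robustness when irrelevant features dominate input variance.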