The Smart Way to Compare and Engineer Data Distributions
Imagine you have a pile of earth and you need to move it to fill a hole of the same size. Optimal Transport (OT) is the mathematical framework for finding the most efficient plan to do this, minimizing the total "work" (mass × distance). In data science, OT applies this same principle to find the cheapest way to "transform" one probability distribution into another.
The power of modern OT comes from a key conceptual shift, from Monge's original formulation to Kantorovich's relaxation, that made the problem solvable and vastly more applicable.

Monge's formulation: a strict one-to-one mapping from source points to targets. It fails whenever a single source point needs to supply multiple destinations, because it forbids "mass splitting."

Kantorovich's relaxation: a flexible one-to-many "transport plan" (a joint distribution over source-target pairs) that allows mass to be split, guaranteeing a solution always exists and turning the discrete problem into a linear program.
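To make the relaxation concrete, here is a minimal sketch using the POT library (https://pythonot.github.io); the masses and positions are illustrative. The optimal plan splits the first source's mass across two targets, which no one-to-one map could do.

```python
# Minimal Kantorovich OT sketch with the POT library (pip install pot).
# The toy masses and positions below are illustrative.
import numpy as np
import ot

a = np.array([0.6, 0.4])           # source masses (sum to 1)
b = np.array([0.3, 0.3, 0.4])      # target masses (sum to 1)

# Cost matrix: M[i, j] = distance from source i to target j.
x_src = np.array([[0.0], [1.0]])
x_tgt = np.array([[0.0], [0.5], [1.0]])
M = ot.dist(x_src, x_tgt, metric='euclidean')

# Exact transport plan: P[i, j] = mass shipped from i to j.
P = ot.emd(a, b, M)
print(P)                # row 0 is split across two targets
print((P * M).sum())    # total "work": mass x distance
```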
The Wasserstein distance (derived from OT) is powerful for machine learning because it stays well defined even when two distributions have non-overlapping supports, a case where divergences such as KL or Jensen-Shannon fail and provide no useful signal for model training.

Wasserstein distance: a finite value that shrinks smoothly as the distributions move closer. Provides a useful gradient for optimization.
KL / JS divergence: infinite or constant on non-overlapping supports. Gradients vanish, halting model training.
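A small sketch of this contrast, using SciPy's 1-D Wasserstein distance and a histogram-based KL estimate; the sample data is illustrative.

```python
# Wasserstein vs. KL on two distributions with disjoint supports.
import numpy as np
from scipy.stats import wasserstein_distance

p_samples = np.random.uniform(0.0, 1.0, size=1000)
q_samples = np.random.uniform(5.0, 6.0, size=1000)

# Wasserstein: finite, and it shrinks smoothly as the supports approach,
# which is exactly what gives a model a usable gradient.
print(wasserstein_distance(p_samples, q_samples))  # ~5.0

# KL on disjoint histograms: q == 0 wherever p > 0, so the
# p * log(p / q) terms blow up and the divergence is infinite.
bins = np.linspace(0.0, 6.0, 61)
p_hist, _ = np.histogram(p_samples, bins=bins, density=True)
q_hist, _ = np.histogram(q_samples, bins=bins, density=True)
mask = p_hist > 0
with np.errstate(divide='ignore'):
    kl = np.sum(p_hist[mask] * np.log(p_hist[mask] / q_hist[mask]))
print(kl)  # inf: no useful training signal
```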
Solving OT exactly means solving a linear program whose cost grows roughly cubically with the number of points, which is prohibitively slow at scale. Modern algorithms like Sinkhorn add entropic regularization and find an approximate solution thousands of times faster, using simple, GPU-friendly matrix-scaling iterations, making OT practical for large datasets.
[Chart: relative computational cost of exact vs. approximate OT algorithms; lower is better.]
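At its core, Sinkhorn is just alternating row and column rescaling of a kernel matrix. A from-scratch sketch (the problem size and regularization strength are illustrative):

```python
# Minimal Sinkhorn iterations for entropic-regularized OT.
import numpy as np

def sinkhorn(a, b, M, reg=0.01, n_iters=200):
    """Approximate the OT plan between histograms a and b for cost M."""
    K = np.exp(-M / reg)                 # Gibbs kernel from the cost matrix
    u = np.ones_like(a)
    for _ in range(n_iters):             # alternating scaling updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]   # the regularized transport plan

rng = np.random.default_rng(0)
n = 100
a = np.full(n, 1.0 / n)                  # uniform source histogram
b = np.full(n, 1.0 / n)                  # uniform target histogram
x = rng.normal(size=(n, 1))
y = rng.normal(loc=1.0, size=(n, 1))
M = (x - y.T) ** 2                       # squared-distance ground cost
M /= M.max()                             # normalize for numerical stability
P = sinkhorn(a, b, M)
print((P * M).sum())                     # approximate transport cost
```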
With these pieces in place, OT now powers applications across machine learning:

✨ Generative modeling (WGAN): uses the Wasserstein distance as a loss function to stabilize GAN training, preventing mode collapse and generating more diverse, high-quality synthetic data.
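A sketch of the core idea, the critic loss that estimates the Wasserstein-1 distance via its Kantorovich-Rubinstein dual; the tiny PyTorch critic below is illustrative, not a full training loop.

```python
# Minimal WGAN critic-loss sketch in PyTorch (toy critic and batches).
import torch
import torch.nn as nn

critic = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))

real = torch.randn(128, 2)          # stand-in for real data
fake = torch.randn(128, 2) + 3.0    # stand-in for generator output

# The critic maximizes E[critic(real)] - E[critic(fake)], which (for a
# 1-Lipschitz critic) approximates the Wasserstein-1 distance.
critic_loss = critic(fake).mean() - critic(real).mean()

# The Lipschitz constraint is enforced after each optimizer step, e.g.
# by weight clipping (original WGAN) or a gradient penalty (WGAN-GP).
for p in critic.parameters():
    p.data.clamp_(-0.01, 0.01)
```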
➕ Data augmentation: computes Wasserstein barycenters to create realistic new samples by "averaging" multiple data points in a geometrically meaningful way, avoiding out-of-distribution examples.
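A minimal sketch with POT's entropic barycenter solver; the histogram shapes and regularization are illustrative. Unlike the naive pointwise average, which is bimodal and out-of-distribution, the barycenter is a single in-between bump.

```python
# Wasserstein barycenter of two 1-D histograms with POT.
import numpy as np
import ot

n = 100
x = np.arange(n, dtype=np.float64)

# Two Gaussian-like histograms centered at different locations.
h1 = np.exp(-0.5 * ((x - 30.0) / 5.0) ** 2); h1 /= h1.sum()
h2 = np.exp(-0.5 * ((x - 70.0) / 5.0) ** 2); h2 /= h2.sum()
A = np.vstack((h1, h2)).T          # one histogram per column

M = ot.utils.dist0(n)              # ground cost between histogram bins
M /= M.max()

# A single bump near bin 50, not the bimodal pointwise mean.
bary = ot.bregman.barycenter(A, M, reg=1e-3, weights=np.array([0.5, 0.5]))
```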
🌉 Domain adaptation: learns a transport map that aligns a labeled source domain with an unlabeled target domain, allowing models to generalize across different data distributions and reducing domain shift.
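A minimal sketch with POT's ot.da module; the toy source and target clouds are illustrative.

```python
# OT-based domain adaptation sketch with POT.
import numpy as np
import ot

rng = np.random.default_rng(0)
Xs = rng.normal(size=(200, 2))            # labeled source samples
ys = (Xs[:, 0] > 0).astype(int)           # their labels
Xt = rng.normal(loc=2.0, size=(200, 2))   # unlabeled, shifted target

# Learn an entropic transport map from source to target.
mapper = ot.da.SinkhornTransport(reg_e=1.0)
mapper.fit(Xs=Xs, ys=ys, Xt=Xt)

# Transport source points into the target domain; a classifier trained
# on (Xs_mapped, ys) should now generalize to the target distribution.
Xs_mapped = mapper.transform(Xs=Xs)
```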
🔗 Structured and multi-modal alignment: uses Gromov-Wasserstein to align data with different structures (like graphs), and Fused Gromov-Wasserstein to align multi-modal data by considering both features and structure.
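A minimal Gromov-Wasserstein sketch with POT; the random point clouds are illustrative. Note that the two spaces have different dimensions: GW never compares points across spaces, only their internal distance structures.

```python
# Gromov-Wasserstein alignment of two structures via their distance matrices.
import numpy as np
import ot

rng = np.random.default_rng(0)
X1 = rng.normal(size=(20, 3))      # nodes of structure 1 (3-D features)
X2 = rng.normal(size=(30, 5))      # nodes of structure 2 (5-D features)

C1 = ot.dist(X1, X1); C1 /= C1.max()   # intra-structure distances
C2 = ot.dist(X2, X2); C2 /= C2.max()
p = ot.unif(20)                        # uniform node weights
q = ot.unif(30)

# Coupling that matches nodes whose pairwise-distance patterns agree.
T = ot.gromov.gromov_wasserstein(C1, C2, p, q, loss_fun='square_loss')
```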
✂️ Coreset selection: identifies a small, representative subset of a large dataset by minimizing the OT distance to the full data, enabling more efficient model training with minimal performance loss.
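One illustrative way to realize this, a greedy sketch (not a specific published algorithm) that scores candidate subsets with POT's exact solver:

```python
# Greedy OT-based coreset selection sketch (illustrative, brute-force).
import numpy as np
import ot

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))       # the full dataset
k = 10                              # coreset budget
selected = []

for _ in range(k):
    best_j, best_cost = None, np.inf
    for j in range(len(X)):         # try each remaining point
        if j in selected:
            continue
        cand = selected + [j]
        M = ot.dist(X[cand], X)     # cost: candidate subset vs. full data
        cost = ot.emd2(ot.unif(len(cand)), ot.unif(len(X)), M)
        if cost < best_cost:
            best_j, best_cost = j, cost
    selected.append(best_j)         # keep the point that helps most

coreset = X[selected]               # train on this small subset
```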
⚖️ Fairness: aligns model output distributions across demographic groups to a common "fair" target (their barycenter), mitigating bias while preserving accuracy.
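A minimal 1-D sketch of the barycenter-repair idea. In one dimension the barycenter has a closed form, average the groups' quantile functions, and the optimal map is a monotone quantile re-mapping; the score distributions below are illustrative.

```python
# Barycenter-based fairness repair for 1-D model scores (sketch).
import numpy as np

rng = np.random.default_rng(0)
scores_a = rng.normal(0.4, 0.10, size=1000)   # model scores, group A
scores_b = rng.normal(0.6, 0.15, size=1000)   # model scores, group B

# 1-D Wasserstein barycenter: average the groups' quantile functions.
qs = np.linspace(0.0, 1.0, 101)
bary_q = 0.5 * (np.quantile(scores_a, qs) + np.quantile(scores_b, qs))

def repair(scores):
    """Map scores through their empirical CDF, then through the
    barycenter's quantile function (the optimal 1-D transport map)."""
    ranks = np.argsort(np.argsort(scores)) / (len(scores) - 1)
    return np.interp(ranks, qs, bary_q)

fair_a = repair(scores_a)   # both groups now share one score distribution
fair_b = repair(scores_b)
```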