The Smart Way to Compare and Engineer Data Distributions
Imagine you have a pile of earth and you need to move it to fill a hole of the same size. Optimal Transport (OT) is the mathematical framework for finding the most efficient plan to do this, minimizing the total "work" (mass × distance). In data science, OT applies this same principle to find the cheapest way to "transform" one probability distribution into another.
The power of modern OT comes from a key conceptual shift, from Monge's original formulation to Kantorovich's relaxation, that made the problem solvable and vastly more applicable.

Monge's formulation: a strict one-to-one mapping from source points to targets. It fails whenever a single source point needs to supply multiple destinations, because it forbids "mass splitting."

Kantorovich's relaxation: a flexible one-to-many "transport plan" (a joint distribution over source-target pairs) that allows mass to be split, guaranteeing a solution always exists and turning the discrete problem into a linear program.
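To make the relaxation concrete, here is a minimal sketch using the POT library (https://pythonot.github.io); the masses and positions are illustrative. The optimal plan splits the first source's mass across two targets, which no one-to-one map could do.

```python
# Minimal Kantorovich OT sketch with the POT library (pip install pot).
# The toy masses and positions below are illustrative.
import numpy as np
import ot

a = np.array([0.6, 0.4])           # source masses (sum to 1)
b = np.array([0.3, 0.3, 0.4])      # target masses (sum to 1)

# Cost matrix: M[i, j] = distance from source i to target j.
x_src = np.array([[0.0], [1.0]])
x_tgt = np.array([[0.0], [0.5], [1.0]])
M = ot.dist(x_src, x_tgt, metric='euclidean')

# Exact transport plan: P[i, j] = mass shipped from i to j.
P = ot.emd(a, b, M)
print(P)                # row 0 is split across two targets
print((P * M).sum())    # total "work": mass x distance
```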
The Wasserstein distance (derived from OT) is powerful for machine learning because it stays well defined even when two distributions have non-overlapping supports, a case where divergences such as KL or Jensen-Shannon fail and provide no useful signal for model training.

Wasserstein distance: a finite value that shrinks smoothly as the distributions move closer. Provides a useful gradient for optimization.
KL / JS divergence: infinite or constant on non-overlapping supports. Gradients vanish, halting model training.
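A small sketch of this contrast, using SciPy's 1-D Wasserstein distance and a histogram-based KL estimate; the sample data is illustrative.

```python
# Wasserstein vs. KL on two distributions with disjoint supports.
import numpy as np
from scipy.stats import wasserstein_distance

p_samples = np.random.uniform(0.0, 1.0, size=1000)
q_samples = np.random.uniform(5.0, 6.0, size=1000)

# Wasserstein: finite, and it shrinks smoothly as the supports approach,
# which is exactly what gives a model a usable gradient.
print(wasserstein_distance(p_samples, q_samples))  # ~5.0

# KL on disjoint histograms: q == 0 wherever p > 0, so the
# p * log(p / q) terms blow up and the divergence is infinite.
bins = np.linspace(0.0, 6.0, 61)
p_hist, _ = np.histogram(p_samples, bins=bins, density=True)
q_hist, _ = np.histogram(q_samples, bins=bins, density=True)
mask = p_hist > 0
with np.errstate(divide='ignore'):
    kl = np.sum(p_hist[mask] * np.log(p_hist[mask] / q_hist[mask]))
print(kl)  # inf: no useful training signal
```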
Solving OT exactly means solving a linear program whose cost grows roughly cubically with the number of points, which is prohibitively slow at scale. Modern algorithms like Sinkhorn add entropic regularization and find an approximate solution thousands of times faster, using simple, GPU-friendly matrix-scaling iterations, making OT practical for large datasets.
[Chart: relative computational cost of exact vs. approximate OT algorithms; lower is better.]
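At its core, Sinkhorn is just alternating row and column rescaling of a kernel matrix. A from-scratch sketch (the problem size and regularization strength are illustrative):

```python
# Minimal Sinkhorn iterations for entropic-regularized OT.
import numpy as np

def sinkhorn(a, b, M, reg=0.01, n_iters=200):
    """Approximate the OT plan between histograms a and b for cost M."""
    K = np.exp(-M / reg)                 # Gibbs kernel from the cost matrix
    u = np.ones_like(a)
    for _ in range(n_iters):             # alternating scaling updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]   # the regularized transport plan

rng = np.random.default_rng(0)
n = 100
a = np.full(n, 1.0 / n)                  # uniform source histogram
b = np.full(n, 1.0 / n)                  # uniform target histogram
x = rng.normal(size=(n, 1))
y = rng.normal(loc=1.0, size=(n, 1))
M = (x - y.T) ** 2                       # squared-distance ground cost
M /= M.max()                             # normalize for numerical stability
P = sinkhorn(a, b, M)
print((P * M).sum())                     # approximate transport cost
```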
With these pieces in place, OT now powers applications across machine learning:

✨ Generative modeling (WGAN): uses the Wasserstein distance as a loss function to stabilize GAN training, preventing mode collapse and generating more diverse, high-quality synthetic data.
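A sketch of the core idea, the critic loss that estimates the Wasserstein-1 distance via its Kantorovich-Rubinstein dual; the tiny PyTorch critic below is illustrative, not a full training loop.

```python
# Minimal WGAN critic-loss sketch in PyTorch (toy critic and batches).
import torch
import torch.nn as nn

critic = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))

real = torch.randn(128, 2)          # stand-in for real data
fake = torch.randn(128, 2) + 3.0    # stand-in for generator output

# The critic maximizes E[critic(real)] - E[critic(fake)], which (for a
# 1-Lipschitz critic) approximates the Wasserstein-1 distance.
critic_loss = critic(fake).mean() - critic(real).mean()

# The Lipschitz constraint is enforced after each optimizer step, e.g.
# by weight clipping (original WGAN) or a gradient penalty (WGAN-GP).
for p in critic.parameters():
    p.data.clamp_(-0.01, 0.01)
```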
➕ Data augmentation: computes Wasserstein barycenters to create realistic new samples by "averaging" multiple data points in a geometrically meaningful way, avoiding out-of-distribution examples.
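A minimal sketch with POT's entropic barycenter solver; the histogram shapes and regularization are illustrative. Unlike the naive pointwise average, which is bimodal and out-of-distribution, the barycenter is a single in-between bump.

```python
# Wasserstein barycenter of two 1-D histograms with POT.
import numpy as np
import ot

n = 100
x = np.arange(n, dtype=np.float64)

# Two Gaussian-like histograms centered at different locations.
h1 = np.exp(-0.5 * ((x - 30.0) / 5.0) ** 2); h1 /= h1.sum()
h2 = np.exp(-0.5 * ((x - 70.0) / 5.0) ** 2); h2 /= h2.sum()
A = np.vstack((h1, h2)).T          # one histogram per column

M = ot.utils.dist0(n)              # ground cost between histogram bins
M /= M.max()

# A single bump near bin 50, not the bimodal pointwise mean.
bary = ot.bregman.barycenter(A, M, reg=1e-3, weights=np.array([0.5, 0.5]))
```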
🌉 Domain adaptation: learns a transport map that aligns a labeled source domain with an unlabeled target domain, allowing models to generalize across different data distributions and reducing domain shift.
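A minimal sketch with POT's ot.da module; the toy source and target clouds are illustrative.

```python
# OT-based domain adaptation sketch with POT.
import numpy as np
import ot

rng = np.random.default_rng(0)
Xs = rng.normal(size=(200, 2))            # labeled source samples
ys = (Xs[:, 0] > 0).astype(int)           # their labels
Xt = rng.normal(loc=2.0, size=(200, 2))   # unlabeled, shifted target

# Learn an entropic transport map from source to target.
mapper = ot.da.SinkhornTransport(reg_e=1.0)
mapper.fit(Xs=Xs, ys=ys, Xt=Xt)

# Transport source points into the target domain; a classifier trained
# on (Xs_mapped, ys) should now generalize to the target distribution.
Xs_mapped = mapper.transform(Xs=Xs)
```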
🔗 Structured and multi-modal alignment: uses Gromov-Wasserstein to align data with different structures (like graphs), and Fused Gromov-Wasserstein to align multi-modal data by considering both features and structure.
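A minimal Gromov-Wasserstein sketch with POT; the random point clouds are illustrative. Note that the two spaces have different dimensions: GW never compares points across spaces, only their internal distance structures.

```python
# Gromov-Wasserstein alignment of two structures via their distance matrices.
import numpy as np
import ot

rng = np.random.default_rng(0)
X1 = rng.normal(size=(20, 3))      # nodes of structure 1 (3-D features)
X2 = rng.normal(size=(30, 5))      # nodes of structure 2 (5-D features)

C1 = ot.dist(X1, X1); C1 /= C1.max()   # intra-structure distances
C2 = ot.dist(X2, X2); C2 /= C2.max()
p = ot.unif(20)                        # uniform node weights
q = ot.unif(30)

# Coupling that matches nodes whose pairwise-distance patterns agree.
T = ot.gromov.gromov_wasserstein(C1, C2, p, q, loss_fun='square_loss')
```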
✂️ Coreset selection: identifies a small, representative subset of a large dataset by minimizing the OT distance to the full data, enabling more efficient model training with minimal performance loss.
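One illustrative way to realize this, a greedy sketch (not a specific published algorithm) that scores candidate subsets with POT's exact solver:

```python
# Greedy OT-based coreset selection sketch (illustrative, brute-force).
import numpy as np
import ot

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))       # the full dataset
k = 10                              # coreset budget
selected = []

for _ in range(k):
    best_j, best_cost = None, np.inf
    for j in range(len(X)):         # try each remaining point
        if j in selected:
            continue
        cand = selected + [j]
        M = ot.dist(X[cand], X)     # cost: candidate subset vs. full data
        cost = ot.emd2(ot.unif(len(cand)), ot.unif(len(X)), M)
        if cost < best_cost:
            best_j, best_cost = j, cost
    selected.append(best_j)         # keep the point that helps most

coreset = X[selected]               # train on this small subset
```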
⚖️ Fairness: aligns model output distributions across demographic groups to a common "fair" target (their barycenter), mitigating bias while preserving accuracy.
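A minimal 1-D sketch of the barycenter-repair idea. In one dimension the barycenter has a closed form, average the groups' quantile functions, and the optimal map is a monotone quantile re-mapping; the score distributions below are illustrative.

```python
# Barycenter-based fairness repair for 1-D model scores (sketch).
import numpy as np

rng = np.random.default_rng(0)
scores_a = rng.normal(0.4, 0.10, size=1000)   # model scores, group A
scores_b = rng.normal(0.6, 0.15, size=1000)   # model scores, group B

# 1-D Wasserstein barycenter: average the groups' quantile functions.
qs = np.linspace(0.0, 1.0, 101)
bary_q = 0.5 * (np.quantile(scores_a, qs) + np.quantile(scores_b, qs))

def repair(scores):
    """Map scores through their empirical CDF, then through the
    barycenter's quantile function (the optimal 1-D transport map)."""
    ranks = np.argsort(np.argsort(scores)) / (len(scores) - 1)
    return np.interp(ranks, qs, bary_q)

fair_a = repair(scores_a)   # both groups now share one score distribution
fair_b = repair(scores_b)
```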