I. Weak Supervision
When you have a large pool of unlabeled data, Weak Supervision (WS) is your starting point. It's a powerful technique for programmatically generating noisy labels at scale, transforming your domain expertise into a massive training set.
The Programmatic Labeling Pipeline
1. Write Labeling Functions (LFs). Encode domain knowledge as functions (e.g., using keywords, patterns, or LLM prompts) that vote on labels or abstain.
2. Run the Generative Label Model. This model analyzes the agreements and disagreements among LFs to estimate their accuracies and correlations, without any ground truth.
3. Produce Probabilistic Labels. The output is a full training set with "soft" labels (e.g., 90% Class A, 10% Class B), capturing the model's confidence.
4. Train the Discriminative End Model. A powerful model (e.g., a Transformer) is trained on these probabilistic labels. It learns to generalize beyond the simple heuristics of the LFs, resulting in a robust, high-performance final model.
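The pipeline above can be sketched end-to-end in a few lines. This is a deliberately simplified stand-in: the "label model" here just averages non-abstaining votes, whereas a real generative label model (e.g., Snorkel's) estimates per-LF accuracies and correlations; the spam/ham task and the specific LFs are invented for illustration.

```python
import numpy as np

# Hypothetical spam-detection task: 1 = spam, 0 = ham, -1 = abstain.
SPAM, HAM, ABSTAIN = 1, 0, -1

def lf_keyword_free(text):        # keyword heuristic
    return SPAM if "free" in text.lower() else ABSTAIN

def lf_keyword_meeting(text):     # domain knowledge: meeting emails are ham
    return HAM if "meeting" in text.lower() else ABSTAIN

def lf_many_exclamations(text):   # pattern heuristic
    return SPAM if text.count("!") >= 2 else ABSTAIN

lfs = [lf_keyword_free, lf_keyword_meeting, lf_many_exclamations]

docs = [
    "FREE money!!! Click now!",
    "Agenda for tomorrow's meeting",
    "Free trial, no meeting required",
]

# Label matrix: one row per document, one column per LF.
L = np.array([[lf(d) for lf in lfs] for d in docs])

def soft_labels(L):
    """Toy label model: P(spam) = mean of the non-abstaining votes."""
    probs = []
    for row in L:
        votes = row[row != ABSTAIN]
        probs.append(0.5 if len(votes) == 0 else votes.mean())
    return np.array(probs)

print(soft_labels(L))   # first doc: both spam LFs fire, so P(spam) = 1.0
```

These soft labels would then be fed (e.g., as targets for a cross-entropy loss) to the discriminative end model in step 4.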
II. Active Learning
Active Learning (AL) addresses the labeling bottleneck from a different angle. Instead of labeling more data noisily, AL helps you label less data intelligently, maximizing model improvement while minimizing human annotation cost.
Query Strategies
The "brain" of an active learner is its query strategy: the rule that decides which unlabeled points are worth a human's time. The main strategy families are uncertainty sampling (query the points the model is least sure about), query-by-committee (query where an ensemble of models disagrees), and diversity- or density-based sampling (query points that best cover the input distribution).
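As a minimal sketch of the uncertainty-sampling family, here are its three classic variants; `pool_probs` is an illustrative stand-in for any classifier's predicted class probabilities over the unlabeled pool:

```python
import numpy as np

def least_confident(probs):
    """Index of the pool point whose top-class probability is lowest."""
    return int(np.argmin(probs.max(axis=1)))

def smallest_margin(probs):
    """Index of the point with the smallest gap between the top two classes."""
    s = np.sort(probs, axis=1)
    return int(np.argmin(s[:, -1] - s[:, -2]))

def max_entropy(probs):
    """Index of the point with the highest predictive entropy."""
    ent = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    return int(np.argmax(ent))

# Predicted probabilities for a 3-point unlabeled pool (3 classes).
pool_probs = np.array([
    [0.98, 0.01, 0.01],   # confident
    [0.40, 0.31, 0.29],   # spread across all classes
    [0.47, 0.46, 0.07],   # ambiguous between the top two classes
])

print(least_confident(pool_probs),   # 1
      smallest_margin(pool_probs),   # 2
      max_entropy(pool_probs))       # 1
```

Note that the variants can disagree: margin sampling favors points confused between exactly two classes, while entropy and least-confidence favor points spread across all of them.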
III. Generative Methods & Optimal Transport
When you need to fill gaps, cover edge cases, or simply create more data, generative methods are the tool of choice. Optimal Transport (OT) provides a principled, geometric framework for creating high-fidelity synthetic data.
Principled Augmentation with OT
Naive Interpolation (e.g., Mixup)
Simply averaging data points can create unrealistic samples that fall "off" the true data manifold.
Wasserstein Barycenters (OT)
OT finds a geometric "average" that respects the data's structure, producing realistic, in-distribution samples.
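The contrast is easy to see in one dimension, where the Wasserstein barycenter of two equally weighted empirical measures reduces to averaging quantile-matched (sorted) samples; higher-dimensional barycenters need an OT solver such as those in the POT library. The bimodal toy data below is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# A bimodal 1-D "dataset": two tight clusters at -2 and +2; nothing lives near 0.
def sample(n):
    modes = rng.choice([-2.0, 2.0], size=n)
    return modes + 0.05 * rng.standard_normal(n)

a, b = sample(1000), sample(1000)

# Naive interpolation (Mixup-style): average randomly paired points.
mixup = 0.5 * (a + b)

# 1-D Wasserstein barycenter: sort both samples (match quantiles), then average.
barycenter = 0.5 * (np.sort(a) + np.sort(b))

def near_zero_frac(x):
    """Fraction of samples that landed off the data manifold (near 0)."""
    return np.mean(np.abs(x) < 1.0)

print(f"mixup off-manifold:      {near_zero_frac(mixup):.2f}")   # roughly half
print(f"barycenter off-manifold: {near_zero_frac(barycenter):.2f}")  # near zero
```

Mixup pairs points from opposite modes about half the time, producing averages near 0 where no real data lives; the quantile-matched barycenter pairs like with like, so its samples stay on the two modes.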
The OT Data-Centric Toolkit
WGANs
Uses Wasserstein distance to stabilize GAN training, generating higher-quality synthetic data.
Domain Adaptation
Aligns data distributions from a source domain to a target domain, bridging the "domain gap".
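A toy numpy sketch of this idea: compute an entropy-regularized (Sinkhorn) coupling between source and target samples, then move each source point via the barycentric mapping. Real workloads would use a library such as POT (`ot.da`); the domains and parameters here are illustrative.

```python
import numpy as np

def sinkhorn(cost, reg=0.5, iters=200):
    """Entropy-regularized OT plan between two uniform empirical measures."""
    n, m = cost.shape
    K = np.exp(-cost / reg)
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    v = np.ones(m)
    for _ in range(iters):          # alternate marginal rescalings
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(1)
source = rng.standard_normal((50, 2))          # source domain around (0, 0)
target = rng.standard_normal((60, 2)) + 5.0    # target domain shifted to (5, 5)

# Squared-Euclidean cost, normalized so the regularization scale is sensible.
cost = ((source[:, None, :] - target[None, :, :]) ** 2).sum(-1)
P = sinkhorn(cost / cost.max())

# Barycentric mapping: move each source point to the P-weighted mean of targets.
adapted = (P @ target) / P.sum(axis=1, keepdims=True)
print(adapted.mean(axis=0))   # close to the target mean, roughly (5, 5)
```

The adapted source points now live in the target domain, so a model trained on them (with the original source labels) better matches target-domain inputs.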
Coreset Selection
Finds a small, representative subset of a large dataset for more efficient model training.
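One simple instantiation of coreset selection is greedy k-center (farthest-point) sampling, which repeatedly adds the point farthest from the current subset so the selection spreads over the dataset's geometry. A sketch on made-up clustered data:

```python
import numpy as np

def k_center_greedy(X, k, seed=0):
    """Greedy k-center: pick k indices that spread out over X."""
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(len(X)))]
    # Distance from every point to its nearest selected point.
    dists = np.linalg.norm(X - X[selected[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dists))          # farthest point joins the coreset
        selected.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(X - X[nxt], axis=1))
    return selected

rng = np.random.default_rng(0)
# Three well-separated clusters of 100 points each; a good 3-point coreset hits each one.
X = np.concatenate([rng.standard_normal((100, 2)) + c
                    for c in ([0, 0], [20, 0], [0, 20])])
idx = k_center_greedy(X, k=3)
print(sorted(i // 100 for i in idx))   # one index per cluster: [0, 1, 2]
```

The same greedy rule underlies coreset-based active learning, where distances are computed in a model's embedding space rather than raw input space.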
IV. The Unified Framework
These techniques are most powerful when combined. A practical rule of thumb: start with weak supervision to turn unlabeled data into a large, noisy training set; use active learning to spend scarce human annotation where it improves the model most; and apply generative and OT-based methods to fill coverage gaps and synthesize edge cases.