The Data Labeling Bottleneck
Modern AI requires vast amounts of labeled data, but manual annotation is slow, expensive, and dependent on scarce expert time. This section introduces Weak Supervision, a paradigm that breaks this bottleneck by enabling the programmatic creation of large-scale training sets.
Supervised Learning
Relies on massive, perfectly labeled "gold" datasets. Extremely high cost.
Semi-Supervised & Active Learning
Use a small labeled set to learn from a large unlabeled pool. Medium cost.
Weak Supervision
Uses noisy, high-level heuristics to programmatically label data. Very low cost.
Chart: relative annotation cost of the three paradigms, from extremely high (supervised) to very low (weak supervision).
Anatomy of Weakness
"Weakness" in supervision isn't a single concept. It's a formal taxonomy of label imperfections. Understanding the type of weakness you face is the first step toward choosing the right solution. Click each tab to explore the different types.
The Programmatic Workflow
Modern weak supervision follows a powerful pipeline. First, experts encode domain knowledge as heuristic "Labeling Functions". Next, a generative label model learns to combine and denoise their outputs into a single probabilistic label per data point. Finally, a discriminative end model is trained on these probabilistic labels to generalize to new, unseen data. A minimal code sketch follows the three steps below.
① Develop Labeling Functions (LFs)
Experts encode domain knowledge as code (heuristics, regex, keywords) to programmatically label data.
② Train Generative Label Model
A model learns LF accuracies and correlations to produce a single, denoised probabilistic label per data point.
③ Train Discriminative End Model
A powerful model (e.g., a transformer) is trained on the probabilistic labels to learn rich features and generalize.
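A minimal sketch of all three steps using the open-source Snorkel library (v0.9-style API). Here df_train is an assumed pandas DataFrame with a text column, and the two LFs are illustrative toys, not a recommended set:

from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel
from snorkel.utils import probs_to_preds

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

# Step 1: encode domain knowledge as Labeling Functions.
@labeling_function()
def lf_horrible(x):
    return NEGATIVE if "horrible" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_great(x):
    return POSITIVE if "great" in x.text.lower() else ABSTAIN

lfs = [lf_horrible, lf_great]

# Apply the LFs to get an (n_examples, n_lfs) label matrix; -1 marks abstains.
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df_train)

# Step 2: the generative label model estimates LF accuracies and emits
# one denoised probabilistic label per data point.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train, n_epochs=500, seed=42)
probs = label_model.predict_proba(L_train)

# Step 3: train any discriminative end model on these labels,
# either on the soft probabilities or collapsed to hard labels.
preds = probs_to_preds(probs)

The end model never sees the LFs themselves, only features and (probabilistic) labels, which is what lets it generalize beyond the heuristics' coverage.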
The Labeling Function (LF) Lab
Labeling Functions are the heart of weak supervision. They are pieces of code that programmatically assign labels to data. Effective LF development balances three key factors: Precision (how accurate its labels are), Coverage (how much of the data it labels), and Development Effort. These trade-offs can be measured directly, as the sketch below shows.
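A short sketch using Snorkel's LFAnalysis, assuming a small hand-labeled development set: L_dev is the LF label matrix on that set and Y_dev its gold labels:

from snorkel.labeling import LFAnalysis

# Per-LF coverage (fraction of points it labels) and empirical accuracy,
# estimated against the gold development labels.
summary = LFAnalysis(L=L_dev, lfs=lfs).lf_summary(Y=Y_dev)
print(summary[["Coverage", "Emp. Acc."]])

High-precision, low-coverage LFs (e.g., tight regexes) and lower-precision, high-coverage LFs (e.g., broad keywords) are both useful; the label model weighs each by its estimated accuracy.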
The LLM Revolution
Large Language Models have become a new class of ultra-flexible Labeling Functions. Instead of writing code, experts can now write natural-language prompts. This dramatically lowers the barrier to entry but introduces new challenges, especially correlated errors across prompted LFs.
Traditional vs. Prompted LFs
The shift from code to prompts democratizes weak supervision.
Traditional (code): if re.search("review.*horrible", text): return "NEGATIVE"
Prompted (natural language): "Is the following review NEGATIVE? Review: [...]"
The Correlation Challenge
Since all prompted LFs come from the same LLM, their errors are highly correlated. Advanced techniques like the Structure Refining Module are needed to model these dependencies by analyzing the similarity of the prompts themselves.
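A toy illustration of the underlying idea, not the Structure Refining Module's actual implementation: embed the prompts and treat highly similar pairs as candidate dependencies for the label model. TF-IDF stands in here for simplicity; a real system would more plausibly use learned sentence embeddings:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

prompts = [
    "Is the following review NEGATIVE?",
    "Does this review express a negative opinion?",
    "Does the review mention the price?",
]

vecs = TfidfVectorizer().fit_transform(prompts)
sim = cosine_similarity(vecs)

THRESHOLD = 0.3  # illustrative cutoff, not a tuned value
for i in range(len(prompts)):
    for j in range(i + 1, len(prompts)):
        if sim[i, j] > THRESHOLD:
            print(f"LFs {i} and {j} look correlated (cosine sim {sim[i, j]:.2f})")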
Synergistic AI: A Unified Approach
Weak Supervision is not an isolated technique. It forms a core part of the modern data-centric AI toolkit, creating powerful synergies when combined with other methods like Active Learning and Data Augmentation.
Weak Supervision
Role: Generate Labels.
Operates on unlabeled data to create a large, noisily labeled training set at low cost.
Active Learning
Role: Select Data.
Intelligently selects the most informative data points for expensive, high-quality manual labeling.
Data Augmentation
Role: Create Data.
Expands an existing labeled dataset by applying label-preserving transformations to improve robustness.
Powerful Combination: Start with Weak Supervision to label a large dataset. Use Active Learning to find and fix errors in the most critical areas. Then use Data Augmentation on the cleaned, labeled set to train a final, highly robust model. The label model's probabilistic output can even supply the selection signal for the active-learning step, as sketched below.
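A minimal sketch of that bridge, assuming a binary task and that probs is the (n, 2) array returned by label_model.predict_proba in the earlier sketch:

import numpy as np

def most_uncertain(probs, k=200):
    # For binary probabilities the max class probability is 0.5 when the
    # label model is maximally unsure; pick the k points closest to that,
    # as the best candidates for expensive manual annotation.
    margin = probs.max(axis=1) - 0.5
    return np.argsort(margin)[:k]

indices_to_annotate = most_uncertain(probs)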