Synthetic Data
Snippet
General Trends
- Early Jan 2026
- Shift from quantity (big crawl) to quality (synthetic textbook)
- Moved past the idea of simply scraping more of the internet.
- The bottleneck isn't the number of tokens but the density of logic within those tokens.
What
- Early Jan 2026
- Highly structured information generated by a model to teach another model (or itself) a specific skill.
- Distillation: A massive model (like GPT-4) generates detailed explanations or solutions. A smaller model (like Phi or Orca) is trained on these "reasoning traces" to punch above its weight class.
- Self-Correction/Verification: A model generates 100 attempts at a math problem. An external verifier (like a Python code interpreter or a mathematical prover) checks which one is right. The model is then trained on the correct path (see the sketch after this list).
- Programmatic Synthesis: For Vision-Language Models (VLMs), synthetic data often comes from Game Engines (Unreal/Unity) or Simulators (NVIDIA Isaac). This provides "perfect" ground truth for depth, physics, and spatial relationships that humans are bad at labeling.
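A minimal sketch of the distillation + verification pattern above, assuming a hypothetical `call_teacher` wrapper around your LLM client; the extraction and test-running helpers are illustrative, not any particular library's API.

```python
import re

def call_teacher(prompt: str) -> str:
    """Hypothetical wrapper around a large teacher model (e.g. GPT-4)."""
    raise NotImplementedError("plug in your LLM client here")

def extract_code(trace: str) -> str:
    """Naive extraction: keep everything from the first 'def ' onward
    (assumes the teacher was prompted to end with plain Python code)."""
    m = re.search(r"^def .*", trace, flags=re.S | re.M)
    return m.group(0) if m else trace

def verify(candidate_code: str, test: str) -> bool:
    """Execution-based check: does the generated code pass a known test?"""
    scope: dict = {}
    try:
        exec(candidate_code, scope)  # define the candidate solution
        exec(test, scope)            # run the assert; raises on failure
        return True
    except Exception:
        return False

def make_training_pairs(problem: str, test: str, n_attempts: int = 100):
    """Keep (problem, reasoning trace) pairs whose final code verifies;
    a smaller student model is then fine-tuned on the survivors."""
    pairs = []
    for _ in range(n_attempts):
        trace = call_teacher(
            f"Solve step by step, then end with plain Python code:\n{problem}"
        )
        if verify(extract_code(trace), test):
            pairs.append({"prompt": problem, "completion": trace})
    return pairs
```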
Landmark Results
- Phi Series (Textbooks Are All You Need)
- Microsoft’s Phi-1 (1.3B) and Phi-2 (2.7B) proved that small models could beat 175B-parameter models in coding if they were trained on "textbook-quality" data.
- How it works: Instead of learning from messy StackOverflow threads, the model learns from synthetic textbooks generated by GPT-4 that explain concepts from first principles.
- The Rebuttal to Scarcity: This suggests that we don't need more data; we need cleaner data. Synthetic generation allows us to "densify" the signal.
- The Orca Series (Explanation Tuning)
- Method: It uses "System Instructions" to force a teacher model to explain its thought process step-by-step (Chain of Thought). The student model then learns to mimic the reasoning path, not just the final token (see the sketch below).
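A minimal sketch of explanation tuning, assuming a generic chat-style API; `chat` and the exact system wording are illustrative.

```python
# Orca-style explanation tuning: the system instruction forces the teacher
# to expose its chain of thought, and the student is fine-tuned on
# (question, full explanation) pairs rather than (question, answer) pairs.

SYSTEM_INSTRUCTION = (
    "You are a helpful assistant. Think step by step and justify every "
    "step before giving the final answer."
)

def chat(system: str, user: str) -> str:
    """Hypothetical wrapper around any chat-completion API."""
    raise NotImplementedError("plug in your teacher model client here")

def build_explanation_pair(question: str) -> dict:
    explanation = chat(SYSTEM_INSTRUCTION, question)
    # The student learns to imitate the reasoning path, not just the answer.
    return {"prompt": question, "completion": explanation}
```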
- STaR (Self-Taught Reasoner)
- This is a recursive loop (sketched below). The model:
- Attempts to solve a problem.
- If it fails, it is given the hint/answer and asked to generate a rationale for why that answer is correct.
- The rationales that lead to correct answers are added back into the training set.
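A minimal sketch of one STaR iteration, assuming a `generate` callable that wraps the model and a set of problems with known answers; the "Answer: <x>" matching convention is illustrative.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Problem:
    question: str
    answer: str  # known ground truth, used for checking and for hints

def star_round(generate: Callable[[str], str],
               problems: list[Problem]) -> list[tuple[str, str]]:
    """One STaR iteration: returns (question, rationale) pairs to train on."""
    new_data = []
    for p in problems:
        rationale = generate(
            f"Q: {p.question}\nThink step by step. End with 'Answer: <x>'."
        )
        if rationale.strip().endswith(f"Answer: {p.answer}"):
            new_data.append((p.question, rationale))  # solved unaided
        else:
            # Rationalization: reveal the answer, ask why it is correct,
            # and keep the rationale only if it actually reaches it.
            hinted = generate(
                f"Q: {p.question}\nThe correct answer is {p.answer}. "
                "Explain step by step why. End with 'Answer: <x>'."
            )
            if hinted.strip().endswith(f"Answer: {p.answer}"):
                new_data.append((p.question, hinted))
    return new_data  # fine-tune on these, then run the loop again
```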
How to Construct Synthetic Data
- Rejection sampling involves sampling many candidate outputs from an RL-trained model and keeping only the best-performing ones for further training (see the sketch below).
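A minimal sketch of rejection sampling, assuming hypothetical `sample` (policy model) and `rate` (reward model / auto-rater) wrappers; the k and bar values are arbitrary.

```python
def sample(prompt: str) -> str:
    """Hypothetical call to the (often RL-trained) policy model."""
    raise NotImplementedError

def rate(prompt: str, completion: str) -> float:
    """Hypothetical reward model / auto-rater score in [0, 1]."""
    raise NotImplementedError

def rejection_sample(prompts: list[str], k: int = 16, bar: float = 0.8):
    """Sample k candidates per prompt, keep only the best above the bar."""
    kept = []
    for prompt in prompts:
        scored = [(rate(prompt, c), c)
                  for c in (sample(prompt) for _ in range(k))]
        score, best = max(scored, key=lambda t: t[0])
        if score >= bar:  # reject everything below the quality bar
            kept.append({"prompt": prompt, "completion": best})
    return kept  # this becomes the next round's fine-tuning set
```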
- Tricks:
- Good filtering (e.g. an auto-rater model, or a verifiability test like compilation)
- Highly orthogonal data (data deserts: areas where the internet is thin, such as complex logic, multi-step math, and niche scientific reasoning)
- Instruction reversal: take a high-quality piece of text and ask the AI "what prompt would have generated this?"
- To avoid model collapse, use "Diverse Prompting", i.e. don't just ask for 1,000 math problems; ask for math problems in the style of a 19th-century pirate, a NASA engineer, a 5th grader, so the model's latent space remains broad.
- Execution-based feedback: make sure generated code is run and verified in real time during data generation (see the combined sketch after this list).
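A combined sketch of three of the tricks above (diverse prompting, execution-based feedback, instruction reversal), assuming a hypothetical `llm` wrapper; the personas and prompt wording are illustrative.

```python
import random

PERSONAS = ["a 19th-century pirate", "a NASA engineer", "a 5th grader"]

def llm(prompt: str) -> str:
    """Hypothetical wrapper around your generator model."""
    raise NotImplementedError("plug in your model client here")

def runs_clean(code: str) -> bool:
    """Execution-based filter: keep only code that compiles and runs."""
    try:
        exec(compile(code, "<synthetic>", "exec"), {})
        return True
    except Exception:
        return False

def diverse_verified_problems(topic: str, n: int) -> list[dict]:
    """Diverse prompting + real-time execution check during generation."""
    data = []
    while len(data) < n:
        persona = random.choice(PERSONAS)
        prompt = (
            f"As {persona}, invent a short {topic} word problem. Reply with "
            "Python only: the problem as a comment, a function that solves "
            "it, and an assert that checks the result."
        )
        candidate = llm(prompt)
        if runs_clean(candidate):  # verified in real time, as generated
            data.append({"prompt": prompt, "completion": candidate})
    return data

def reverse_instruction(passage: str) -> str:
    """Instruction reversal: mine a prompt from existing high-quality text."""
    return llm(
        "What prompt would most plausibly have generated the following "
        f"text?\n\n{passage}"
    )
```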
- Robotics
- Ego-centric data; video as physics: the AI needs to understand cause and effect in the real world.
- First-person video of humans performing tasks (folding laundry, using a drill) to learn "affordances": what an object is for
- High frequency sensor logs (torque, touch, balance)
- Massive amounts of unlabeled video used to predict the 'next frame' of reality, teaching the AI the 'laws of physics' by observation (see the sketch below)
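A minimal sketch of the next-frame objective in PyTorch; the tiny conv net, shapes, and random stand-in frames are illustrative only.

```python
import torch
import torch.nn as nn

predictor = nn.Sequential(              # toy stand-in for a real video model
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1),
)
opt = torch.optim.Adam(predictor.parameters(), lr=1e-4)

video = torch.randn(16, 3, 64, 64)       # 16 consecutive RGB frames (fake data)
frames, targets = video[:-1], video[1:]  # predict frame t+1 from frame t

pred = predictor(frames)
loss = nn.functional.mse_loss(pred, targets)  # "did you foresee reality?"
loss.backward()
opt.step()  # with enough real video, lowering this loss means learning physics
```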
- Specialised Domains
- Science
- Data is siloed or unstructured (handwritten notes vs genomic sequences)
- Cross-modal datasets that align different types of information, e.g. a dataset pairing a medical image like an X-ray with a genomic sequence and a physician's audio notes
- The winners will be whoever can crack Federated Learning (training on data without moving it; see the sketch below) or generate high-fidelity synthetic patients that preserve the statistical signal without exposing personal identities.
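A minimal sketch of the core of federated averaging (FedAvg) in NumPy: each site trains locally on data that never leaves it, and only weights travel. The toy linear model and random data are illustrative.

```python
import numpy as np

def local_step(weights, X, y, lr=0.1):
    """One gradient step on a site's private (X, y); data stays on-premise."""
    grad = 2 * X.T @ (X @ weights - y) / len(y)  # linear-regression gradient
    return weights - lr * grad

def fedavg_round(global_w, sites):
    """Only weight vectors move between sites and server, never the data."""
    local_ws = [local_step(global_w.copy(), X, y) for X, y in sites]
    return np.mean(local_ws, axis=0)             # server-side averaging

# Toy run: three "hospitals" with private data, five communication rounds.
rng = np.random.default_rng(0)
sites = [(rng.normal(size=(50, 4)), rng.normal(size=50)) for _ in range(3)]
w = np.zeros(4)
for _ in range(5):
    w = fedavg_round(w, sites)
```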
How to Use Synthetic Data
- Open question: how is synthetic data mixed into the training process? Is it mostly pre-training? Post-training? RL-based post-training? Which one?