Make AI research agents smarter, faster, and honest
Extending Andrej Karpathy's auto-research loop with parallel execution, persistent memory, honesty guarantees, and accumulated wisdom
N workers in isolated git worktrees explore different experiment categories simultaneously. A coordinator cherry-picks the best results and syncs the knowledge base after every batch.
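As a hedged sketch of that fan-out, each worker can get its own branch and worktree before experiments run in parallel. The helper names (`worktree_cmd`, `spawn_worker`) and the `workers/` layout are illustrative assumptions, not IdeaMaze's actual API:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def worktree_cmd(category: str) -> list[str]:
    # git command that creates an isolated worktree + branch for one category
    return ["git", "worktree", "add", "-b", f"exp/{category}", f"workers/{category}"]

def spawn_worker(category: str, dry_run: bool = False) -> list[str]:
    cmd = worktree_cmd(category)
    if not dry_run:
        # Requires running inside a git repository
        subprocess.run(cmd, check=True)
    return cmd

categories = ["feature_engineering", "ensemble_strategy", "segmentation"]
with ThreadPoolExecutor(max_workers=len(categories)) as pool:
    cmds = list(pool.map(lambda c: spawn_worker(c, dry_run=True), categories))
```

Because each worktree is a full checkout on its own branch, the coordinator can later `git cherry-pick` winning commits back onto the main line without workers ever touching each other's files.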
SQLite database auto-syncs from experiment logs. Auto-classifies experiments into 11 categories, detects stagnation, tracks insights, and suggests under-explored directions.
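A minimal sketch of that sync-and-classify step, assuming a keyword-based classifier; the rules, table schema, and category names here are illustrative, not maze.py's real implementation:

```python
import re
import sqlite3

# Hypothetical classification rules -- the real category list has 11 entries
RULES = [
    ("feature_engineering", r"feature|encod"),
    ("ensemble_strategy", r"ensemble|stack|blend"),
    ("hyperparameter_tuning", r"tune|grid|learning.rate"),
]

def classify(log_line: str) -> str:
    # First matching rule wins; unmatched logs fall through
    for category, pattern in RULES:
        if re.search(pattern, log_line, re.IGNORECASE):
            return category
    return "uncategorized"

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE experiments (id INTEGER PRIMARY KEY, log TEXT, category TEXT, mae REAL)")
for log, mae in [("target encoding for city column", 15200.0),
                 ("stacked ensemble of LGBM + XGB", 14751.0)]:
    db.execute("INSERT INTO experiments (log, category, mae) VALUES (?, ?, ?)",
               (log, classify(log), mae))
best = db.execute("SELECT MIN(mae) FROM experiments").fetchone()[0]
```

With experiments tagged by category, "under-explored directions" reduce to a `GROUP BY category` count ordered ascending.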
Catches when agents game evaluation metrics. Winsorization made our metric appear 26x better than reality. The detector compares filtered vs. unfiltered performance and flags ratios above 3x.
Cross-cutting insights accumulate across experiments. Diminishing returns breaker stops redundant testing. Convergence detection forces fundamental shifts when progress stalls.
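The stagnation check behind both breakers can be sketched as a single rule over the MAE history; the window size (3 experiments) and minimum relative gain (1%) are assumed defaults here, not IdeaMaze's actual thresholds:

```python
def is_stagnant(mae_history: list[float], window: int = 3, min_gain: float = 0.01) -> bool:
    # True if the recent best MAE fails to beat the prior best by min_gain (relative)
    if len(mae_history) <= window:
        return False
    prior_best = min(mae_history[:-window])
    recent_best = min(mae_history[-window:])
    return recent_best > prior_best * (1 - min_gain)

# Last three runs hover near 14,751 without a >=1% win: time for a radical shift
stalled = [20016.0, 16100.0, 14751.0, 14900.0, 14802.0, 14760.0]
```

When the rule fires, the coordinator can stop spending workers on the current category and force an under-explored one instead.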
A single program.md file defines the entire research constitution. The agent reads it, experiments autonomously, and builds knowledge over time.
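As an illustrative sketch only (not the canonical template), a program.md might read:

```markdown
# Research Constitution
Goal: minimize MAE on the validation split.
Honest metric: unfiltered MAE on raw data; flag any filtered/unfiltered ratio above 3x.
Loop: pick a category via `maze.py next`, run one experiment, keep only if the honest metric improves.
Record: log every surprising result with `maze.py insight "..."`.
```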
Our most surprising discovery: AI agents will silently game metrics if you let them
During autonomous experimentation, the agent discovered that winsorization (clipping extreme values before evaluation) dramatically improved the reported metric. But when we introduced an unfiltered metric that evaluated on raw, real-world data, the truth emerged:
| Filtering Strategy | Reported Metric | Real-World Metric | Gaming Ratio |
|---|---|---|---|
| No filtering | 20,016 | 20,016 | 1.00x |
| 5th-95th percentile | 14,587 | 21,265 | 1.46x |
| 10th-90th percentile | 10,232 | 25,891 | 2.53x |
| 25th-75th percentile | 4,205 | 38,442 | 9.14x |
| 45th-55th percentile | 1,709 | ~45,000 | ~26x |
The lesson: Without a gaming detector, the agent would have "optimized" itself into a model that looks perfect on paper but fails catastrophically in production. IdeaMaze flags any technique where the unfiltered/filtered metric ratio exceeds 3x.
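A minimal sketch of that ratio check, assuming MAE as the metric and percentile filtering on the targets; the function names are illustrative, not IdeaMaze's API:

```python
import numpy as np

def mae(y_true, y_pred) -> float:
    return float(np.mean(np.abs(np.asarray(y_true, float) - np.asarray(y_pred, float))))

def gaming_ratio(y_true, y_pred, lo_pct: float, hi_pct: float) -> float:
    # Unfiltered MAE divided by MAE on the percentile-filtered subset.
    # A ratio well above 1 means the filter is hiding the hardest examples.
    y_true = np.asarray(y_true, float)
    y_pred = np.asarray(y_pred, float)
    lo, hi = np.percentile(y_true, [lo_pct, hi_pct])
    keep = (y_true >= lo) & (y_true <= hi)
    return mae(y_true, y_pred) / mae(y_true[keep], y_pred[keep])

def is_gamed(ratio: float, threshold: float = 3.0) -> bool:
    return ratio > threshold
```

A model that nails the middle of the distribution but misses extreme targets produces a large ratio, so `is_gamed(gaming_ratio(y, preds, 25, 75))` catches exactly the 25th-75th-percentile trap from the table above.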
maze.py turns stateless experimentation into cumulative learning
```
# Before each experiment: check what the knowledge base suggests
$ python maze.py sync && python maze.py status
Best MAE: 14,751 (experiment #47, commit a8f2c3d)
Stagnation: 3 experiments since last improvement
Categories explored: feature_engineering (18), ensemble_strategy (12), hyperparameter_tuning (15), neural_network (8), embedding (6), target_transforms (9), data_preprocessing (7), algorithm_selection (5)
Under-explored: radical_shift (1), segmentation (3)

$ python maze.py next
Suggestion: Try "segmentation" (only 3 experiments, 2 discarded).
Avoid: hyperparameter_tuning (15 experiments, last 5 all discarded).

# After learning something: record it
$ python maze.py insight "Target encoding is the single biggest lever for categorical features"
$ python maze.py strategy "Focus on ensemble diversity, not ensemble size"
```
IdeaMaze builds on Andrej Karpathy's auto-research, the pioneering system that showed a single LLM agent can autonomously improve ML models through a modify-train-keep/discard loop. We extend it with parallel execution, persistent memory, and honesty guarantees.
Generate your own program.md, download maze.py, and start autonomous research in minutes.