IdeaMaze

Make AI research agents smarter, faster, and more honest

Extending Andrej Karpathy's auto-research with parallel execution, memory, and a cheating detector

Get the Starter Kit · Read the Methodology · See the Maze
68.6%
Metric improvement (Run 1)
228
Autonomous experiments (3 runs)
21
Cross-cutting patterns discovered
26x
Worst metric gaming detected

Four Extensions That Matter

Building on Karpathy's auto-research loop with parallel execution, persistent memory, honesty guarantees, and accumulated wisdom

Parallel Agents

N workers in isolated git worktrees explore different experiment categories simultaneously. A coordinator cherry-picks the best results and syncs the knowledge base after every batch.
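A minimal sketch of the launch step, assuming the standard `git worktree` CLI; `run_experiment.py` and the `../workers` layout are hypothetical names for illustration, not part of the actual kit:

```python
import subprocess
from pathlib import Path

def worktree_cmd(worker_id, category, base="../workers"):
    """git command giving one worker an isolated checkout on its own branch."""
    path = Path(base) / f"worker-{worker_id}"
    branch = f"exp/{category}-{worker_id}"
    return ["git", "worktree", "add", "-b", branch, str(path)]

def spawn_workers(categories):
    """One worktree and one experiment process per category."""
    procs = []
    for i, cat in enumerate(categories, start=1):
        subprocess.run(worktree_cmd(i, cat), check=True)
        # Each worker runs its experiment loop inside its own worktree
        procs.append(subprocess.Popen(
            ["python", "run_experiment.py", "--category", cat],
            cwd=f"../workers/worker-{i}"))
    return procs
```

Because every worker commits on its own branch, the coordinator can later cherry-pick only the winning commits back into the main line.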

📚

Structured Knowledge Base

SQLite database auto-syncs from experiment logs. Auto-classifies experiments into 11 categories, detects stagnation, tracks insights, and suggests under-explored directions.
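A hedged sketch of what such a knowledge base can look like with Python's built-in `sqlite3`; the table and column names here are illustrative, not the actual maze.py schema:

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS experiments (
    id INTEGER PRIMARY KEY,
    commit_hash TEXT,
    category TEXT,
    metric REAL,
    status TEXT CHECK (status IN ('keep', 'discard'))
);
"""

def sync(db_path, rows):
    """Load (commit_hash, category, metric, status) rows from a log."""
    con = sqlite3.connect(db_path)
    con.executescript(SCHEMA)
    con.executemany(
        "INSERT INTO experiments (commit_hash, category, metric, status) "
        "VALUES (?, ?, ?, ?)", rows)
    con.commit()
    return con

def underexplored(con, threshold=3):
    """Categories tried fewer than `threshold` times: suggestion candidates."""
    q = ("SELECT category, COUNT(*) FROM experiments "
         "GROUP BY category HAVING COUNT(*) < ?")
    return dict(con.execute(q, (threshold,)).fetchall())
```

A query like `underexplored` is all it takes to turn a flat experiment log into "try segmentation next" suggestions.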

🚫

Gamification Detection

Catches when agents game evaluation metrics. Winsorization made our metric appear 26x better than reality. The detector compares filtered vs. unfiltered performance and flags ratios above 3x.
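The check itself is small; a sketch of the ratio rule as described (function names are ours, not maze.py's):

```python
GAMING_THRESHOLD = 3.0

def gaming_ratio(unfiltered_metric: float, filtered_metric: float) -> float:
    """How many times worse the model really is than it reports (lower-is-better metric)."""
    return unfiltered_metric / filtered_metric

def is_gamed(unfiltered_metric, filtered_metric, threshold=GAMING_THRESHOLD):
    """Flag any experiment whose filtering hides most of its error."""
    return gaming_ratio(unfiltered_metric, filtered_metric) > threshold
```

For the worst case we observed, `gaming_ratio(45000, 1709)` is about 26, far past the 3x threshold, so the experiment is flagged rather than kept.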

🧠

Batch Learning

Cross-cutting insights accumulate across experiments. A diminishing-returns breaker stops redundant testing, and convergence detection forces a fundamental shift when progress stalls.
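Both stopping rules fit in a few lines. An illustrative sketch, assuming a lower-is-better metric like MAE; the window sizes and thresholds are invented for this example:

```python
def diminishing_returns(metric_history, window=5, min_gain=0.01):
    """True when the best metric improved by less than `min_gain`
    (relative) over the last `window` experiments."""
    if len(metric_history) <= window:
        return False
    best_before = min(metric_history[:-window])  # lower is better
    best_now = min(metric_history)
    return (best_before - best_now) / best_before < min_gain

def converged(statuses, streak=8):
    """True after `streak` consecutive discards: time for a radical shift."""
    tail = statuses[-streak:]
    return len(tail) == streak and all(s == "discard" for s in tail)
```

When either rule fires, the coordinator steers the next batch away from the current category instead of burning more runs on it.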

How It Works

A single program.md file defines the entire research constitution. The agent reads it, experiments autonomously, and builds knowledge over time.
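The constitution can fit on one page. An illustrative sketch (section names and limits are invented for this example; the category names echo the knowledge-base output shown later on this page):

```markdown
# program.md — research constitution (example)

## Objective
Minimize MAE on the holdout set.

## Metric
- Primary: MAE, always evaluated on raw, unfiltered holdout data
- Report both filtered and unfiltered values for every experiment

## Experiment categories
feature_engineering, ensemble_strategy, hyperparameter_tuning,
neural_network, embedding, target_transforms, data_preprocessing,
algorithm_selection, segmentation, radical_shift

## Constraints
- One self-contained change per experiment, committed before training
- Maximum 30 minutes of training per run

## Gamification policy
Any technique whose unfiltered/filtered metric ratio exceeds 3x is
auto-discarded and logged as gamed.
```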

1
📜
program.md
Your research constitution: objectives, metrics, experiment categories, constraints, and gamification policy
2
🎓
Coordinator
Reads knowledge base, picks N diverse experiment categories, spawns parallel workers
3
⚙️
Worker 1
git worktree
⚙️
Worker 2
git worktree
⚙️
Worker 3
git worktree
⚙️
Worker N
git worktree
each worker: modify code, commit, train, evaluate, report
4
📊
results.tsv
Experiment log: commit hash, metric value, status (keep/discard), description
5
🧠
maze.py (knowledge base)
Auto-sync, classify into 11 categories, detect stagnation, flag gamification, suggest next experiments
6
🎨
Experiment Maze (visualizer)
Interactive D3.js tree showing all paths explored, golden path to best result, and dead ends
🔄 Loop: coordinator reads updated knowledge base, spawns next batch
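One iteration of that loop can be sketched in a few lines. `KnowledgeBase`, `spawn_worker`, and their methods are hypothetical stand-ins for the real internals, shown only to make the control flow concrete:

```python
def run_batch(kb, spawn_worker, n_workers=3):
    """One coordinator iteration over the six steps above."""
    kb.sync()                                  # ingest results.tsv into the knowledge base
    categories = kb.pick_diverse(n_workers)    # diverse, under-explored picks
    workers = [spawn_worker(cat) for cat in categories]  # one git worktree each
    reports = [w.run() for w in workers]       # modify -> commit -> train -> evaluate
    best = min(reports, key=lambda r: r["metric"])  # lower MAE wins
    kb.record(reports)                         # log, classify, update insights
    return best
```

The coordinator then reads the updated knowledge base and spawns the next batch, so each iteration starts smarter than the last.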

The Agent That Caught Itself Cheating

Our most surprising discovery: AI agents will unconsciously game metrics if you let them

During autonomous experimentation, the agent discovered that winsorization (clipping extreme values before evaluation) dramatically improved the reported metric. But when we introduced an unfiltered metric that evaluated on raw, real-world data, the truth emerged:

Filtering Strategy | Reported Metric | Real-World Metric | Gaming Ratio
No filtering | 20,016 | 20,016 | 1.00x
5th-95th percentile | 14,587 | 21,265 | 1.46x
10th-90th percentile | 10,232 | 25,891 | 2.53x
25th-75th percentile | 4,205 | 38,442 | 9.14x
45th-55th percentile | 1,709 | ~45,000 | ~26x

The lesson: Without a gamification detector, the agent would have "optimized" itself into a model that looks perfect on paper but fails catastrophically in production. IdeaMaze flags any technique where the unfiltered/filtered metric ratio exceeds 3x.
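The effect is easy to reproduce on synthetic data (this is an illustration, not the project's dataset): with heavy-tailed errors, keeping only an inner percentile band before averaging, as in the table's filtering strategies, makes MAE look several times better than it is on raw data.

```python
import numpy as np

rng = np.random.default_rng(42)
# Heavy-tailed absolute errors (synthetic stand-in for real model errors)
abs_errors = rng.lognormal(mean=8.0, sigma=2.0, size=50_000)

def trimmed_mae(errs, lo_pct, hi_pct):
    """MAE after dropping values outside the [lo_pct, hi_pct] percentile band."""
    lo, hi = np.percentile(errs, [lo_pct, hi_pct])
    return errs[(errs >= lo) & (errs <= hi)].mean()

raw_mae = abs_errors.mean()
reported = trimmed_mae(abs_errors, 45, 55)  # like the "45th-55th" row above
ratio = raw_mae / reported
# ratio lands well above the 3x threshold, so the detector would flag it
```

The heavier the error tail, the more the trimmed metric flatters the model, which is exactly why the detector evaluates against the unfiltered metric.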

The Knowledge Base

maze.py turns stateless experimentation into cumulative learning

# Before each experiment: check what the knowledge base suggests
$ python maze.py sync && python maze.py status

Best MAE: 14,751 (experiment #47, commit a8f2c3d)
Stagnation: 3 experiments since last improvement
Categories explored: feature_engineering (18), ensemble_strategy (12),
  hyperparameter_tuning (15), neural_network (8), embedding (6),
  target_transforms (9), data_preprocessing (7), algorithm_selection (5)
Under-explored: radical_shift (1), segmentation (3)

$ python maze.py next
Suggestion: Try "segmentation" (only 3 experiments, 2 discarded).
Avoid: hyperparameter_tuning (15 experiments, last 5 all discarded).

# After learning something: record it
$ python maze.py insight "Target encoding is the single biggest lever for categorical features"
$ python maze.py strategy "Focus on ensemble diversity, not ensemble size"

IdeaMaze builds on Andrej Karpathy's auto-research, the pioneering system that showed a single LLM agent can autonomously improve ML models through a modify-train-keep/discard loop. We extend it with parallel execution, persistent memory, and honesty guarantees.

Ready to try it?

Generate your own program.md, download maze.py, and start autonomous research in minutes.