Make AI research agents smarter, faster, and honest
Extending Andrej Karpathy's auto-research loop with parallel execution, persistent memory, honesty guarantees, and accumulated wisdom
N workers in isolated git worktrees explore different experiment categories simultaneously. A coordinator cherry-picks the best results and syncs the knowledge base after every batch.
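As a hedged sketch of that fan-out, each worker can get its own branch and worktree before experiments run in parallel. The helper names (`worktree_cmd`, `spawn_worker`) and the `workers/` layout are illustrative assumptions, not IdeaMaze's actual API:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def worktree_cmd(category: str) -> list[str]:
    # git command that creates an isolated worktree + branch for one category
    return ["git", "worktree", "add", "-b", f"exp/{category}", f"workers/{category}"]

def spawn_worker(category: str, dry_run: bool = False) -> list[str]:
    cmd = worktree_cmd(category)
    if not dry_run:
        # Requires running inside a git repository
        subprocess.run(cmd, check=True)
    return cmd

categories = ["feature_engineering", "ensemble_strategy", "segmentation"]
with ThreadPoolExecutor(max_workers=len(categories)) as pool:
    cmds = list(pool.map(lambda c: spawn_worker(c, dry_run=True), categories))
```

Because each worktree is a full checkout on its own branch, the coordinator can later `git cherry-pick` winning commits back onto the main line without workers ever touching each other's files.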
SQLite database auto-syncs from experiment logs. Auto-classifies experiments into 11 categories, detects stagnation, tracks insights, and suggests under-explored directions.
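A minimal sketch of that sync-and-classify step, assuming a keyword-based classifier; the rules, table schema, and category names here are illustrative, not maze.py's real implementation:

```python
import re
import sqlite3

# Hypothetical classification rules -- the real category list has 11 entries
RULES = [
    ("feature_engineering", r"feature|encod"),
    ("ensemble_strategy", r"ensemble|stack|blend"),
    ("hyperparameter_tuning", r"tune|grid|learning.rate"),
]

def classify(log_line: str) -> str:
    # First matching rule wins; unmatched logs fall through
    for category, pattern in RULES:
        if re.search(pattern, log_line, re.IGNORECASE):
            return category
    return "uncategorized"

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE experiments (id INTEGER PRIMARY KEY, log TEXT, category TEXT, mae REAL)")
for log, mae in [("target encoding for city column", 15200.0),
                 ("stacked ensemble of LGBM + XGB", 14751.0)]:
    db.execute("INSERT INTO experiments (log, category, mae) VALUES (?, ?, ?)",
               (log, classify(log), mae))
best = db.execute("SELECT MIN(mae) FROM experiments").fetchone()[0]
```

With experiments tagged by category, "under-explored directions" reduce to a `GROUP BY category` count ordered ascending.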
Catches when agents game evaluation metrics. Winsorization made our metric appear 26x better than reality. The detector compares filtered vs. unfiltered performance and flags ratios above 3x.
Cross-cutting insights accumulate across experiments. Diminishing returns breaker stops redundant testing. Convergence detection forces fundamental shifts when progress stalls.
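The stagnation check behind both breakers can be sketched as a single rule over the MAE history; the window size (3 experiments) and minimum relative gain (1%) are assumed defaults here, not IdeaMaze's actual thresholds:

```python
def is_stagnant(mae_history: list[float], window: int = 3, min_gain: float = 0.01) -> bool:
    # True if the recent best MAE fails to beat the prior best by min_gain (relative)
    if len(mae_history) <= window:
        return False
    prior_best = min(mae_history[:-window])
    recent_best = min(mae_history[-window:])
    return recent_best > prior_best * (1 - min_gain)

# Last three runs hover near 14,751 without a >=1% win: time for a radical shift
stalled = [20016.0, 16100.0, 14751.0, 14900.0, 14802.0, 14760.0]
```

When the rule fires, the coordinator can stop spending workers on the current category and force an under-explored one instead.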
A single program.md file defines the entire research constitution. The agent reads it, experiments autonomously, and builds knowledge over time.
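As an illustrative sketch only (not the canonical template), a program.md might read:

```markdown
# Research Constitution
Goal: minimize MAE on the validation split.
Honest metric: unfiltered MAE on raw data; flag any filtered/unfiltered ratio above 3x.
Loop: pick a category via `maze.py next`, run one experiment, keep only if the honest metric improves.
Record: log every surprising result with `maze.py insight "..."`.
```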
Our most surprising discovery: AI agents will silently game metrics if you let them
During autonomous experimentation, the agent discovered that winsorization (clipping extreme values before evaluation) dramatically improved the reported metric. But when we introduced an unfiltered metric that evaluated on raw, real-world data, the truth emerged:
| Filtering Strategy | Reported Metric | Real-World Metric | Gaming Ratio |
|---|---|---|---|
| No filtering | 20,016 | 20,016 | 1.00x |
| 5th-95th percentile | 14,587 | 21,265 | 1.46x |
| 10th-90th percentile | 10,232 | 25,891 | 2.53x |
| 25th-75th percentile | 4,205 | 38,442 | 9.14x |
| 45th-55th percentile | 1,709 | ~45,000 | ~26x |
The lesson: Without a gaming detector, the agent would have "optimized" itself into a model that looks perfect on paper but fails catastrophically in production. IdeaMaze flags any technique where the unfiltered/filtered metric ratio exceeds 3x.
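A minimal sketch of that ratio check, assuming MAE as the metric and percentile filtering on the targets; the function names are illustrative, not IdeaMaze's API:

```python
import numpy as np

def mae(y_true, y_pred) -> float:
    return float(np.mean(np.abs(np.asarray(y_true, float) - np.asarray(y_pred, float))))

def gaming_ratio(y_true, y_pred, lo_pct: float, hi_pct: float) -> float:
    # Unfiltered MAE divided by MAE on the percentile-filtered subset.
    # A ratio well above 1 means the filter is hiding the hardest examples.
    y_true = np.asarray(y_true, float)
    y_pred = np.asarray(y_pred, float)
    lo, hi = np.percentile(y_true, [lo_pct, hi_pct])
    keep = (y_true >= lo) & (y_true <= hi)
    return mae(y_true, y_pred) / mae(y_true[keep], y_pred[keep])

def is_gamed(ratio: float, threshold: float = 3.0) -> bool:
    return ratio > threshold
```

A model that nails the middle of the distribution but misses extreme targets produces a large ratio, so `is_gamed(gaming_ratio(y, preds, 25, 75))` catches exactly the 25th-75th-percentile trap from the table above.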
maze.py turns stateless experimentation into cumulative learning
```
# Before each experiment: check what the knowledge base suggests
$ python maze.py sync && python maze.py status
Best MAE: 14,751 (experiment #47, commit a8f2c3d)
Stagnation: 3 experiments since last improvement
Categories explored: feature_engineering (18), ensemble_strategy (12), hyperparameter_tuning (15), neural_network (8), embedding (6), target_transforms (9), data_preprocessing (7), algorithm_selection (5)
Under-explored: radical_shift (1), segmentation (3)

$ python maze.py next
Suggestion: Try "segmentation" (only 3 experiments, 2 discarded).
Avoid: hyperparameter_tuning (15 experiments, last 5 all discarded).

# After learning something: record it
$ python maze.py insight "Target encoding is the single biggest lever for categorical features"
$ python maze.py strategy "Focus on ensemble diversity, not ensemble size"
```
IdeaMaze builds on Andrej Karpathy's auto-research, the pioneering system that showed a single LLM agent can autonomously improve ML models through a modify-train-keep/discard loop. We extend it with parallel execution, persistent memory, and honesty guarantees.
Generate your own program.md, download maze.py, and start autonomous research in minutes.