How to Build a Perfect Plan
Before writing a single line of code, I spent two hours planning. The result: a plan that saves me days of debugging. Here is how I do it.
Most people open Claude and say “build me X.” Then they debug for three days.
I spent two hours planning before executing anything. The result: a 28-bead dependency graph with 5 decision gates, 8 failure scenarios with recovery cascades, priority-weighted insights, and an execution strategy designed to survive context compaction across sessions.
Here’s exactly how we did it, step by step, so you can replicate this for any complex project.
The Problem
We needed to fine-tune an open-source LLM on 70,000+ evaluated questions from our production system. The pipeline: extract data from Supabase, filter by quality, reconstruct training prompts, convert to JSONL, upload to a training platform, train, evaluate, deploy, integrate — with rollback if anything goes wrong.
This is exactly the kind of project that fails when you just start coding. Too many unknowns. Too many places where a bad assumption early on wastes days of work downstream.
So we didn’t code. We planned.
Step 1: Set the Rules of Engagement
The first prompt established three non-negotiable constraints:
“Do not execute. Right now, all we’re doing is building a plan. The plan needs to be perfect. We’ll spend a couple of hours on this.”
This matters more than it sounds. Without this constraint, Claude will start writing extraction scripts by minute two. You’ll get working code for the wrong pipeline.
We also chose a task tracking methodology up front. We used Beads — a dependency-aware task graph designed for AI agents. The key property: each bead blocks the next. You literally cannot skip ahead. This prevents the most common AI failure mode: jumping to the interesting part while leaving a landmine in the foundation.
Technique: Declare “plan only” mode explicitly. Claude optimizes for helpfulness. If you don’t tell it to stop, it’ll start executing while you’re still thinking. Pin it to planning.
Step 2: Research Before Designing
Before writing a single bead, we made Claude read the relevant external resources:
The Beads article — Understanding the task tracking methodology
The Beads repository — Implementation details, CLI commands, workflow patterns
The training platform’s documentation — API reference, data format requirements, supported models, pricing
The production codebase — Full analysis of the existing system’s architecture, what makes it work, what transfers to training data
Each of these was a separate research bead. Claude fetched the articles, read the repo, explored the docs, and analyzed the codebase — all before the first design decision was made.
We also provided an internal ML training report from prior experiments on the same infrastructure. Claude extracted 12 key findings, ranked them by impact, and identified which ones changed the plan’s hyperparameters.
Technique: Feed Claude primary sources, not summaries. Don’t say “we’re using LoRA rank 64.” Say “here’s the training report from our last 5 experiments — extract what’s relevant.” Claude will find things you missed.
Step 3: Log Every Prompt
This was a simple but crucial decision. We created PLAN-user-prompts.md and recorded every prompt verbatim, along with the key decisions extracted from each one.
Why? Because we came back to them later. When the plan was nearly complete, we audited every single prompt against the bead graph to check: “Did we actually cover everything the user asked for?”
The audit caught one gap: the success target had been set at 95% based on the ML report, but the user had explicitly said 99% in a later prompt. Without the prompt log, this would have been lost to context compaction.
### Prompt 7 — More data, images, Evaluator versions
> also 15K+ high-scoring outputs seems too low...
> if a question comes with an image, we cannot have our training
> get a model trained for image generation...
> we would ideally need only the later, better version, Evaluator 2.3.10...
**Key decisions:**
- Scan ALL sources, not just Supabase
- Image questions: extract text value but strip image URLs
- Evaluator version gate: prefer v2.3.10 and v2.3.8

At the end, we ran through all 13 prompts and verified each one against the bead graph. No gaps.
Technique: Create a prompt log file and audit it before finalizing. Your prompts are your requirements spec. If you don’t write them down, you lose them to context compaction.
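Here is a minimal sketch of that log-and-audit loop in Python. It is an illustration, not the tooling we used: `log_prompt`, `audit`, and the decision-to-bead `coverage` mapping are hypothetical names, and the entry layout only mirrors the Prompt 7 excerpt above.

```python
from pathlib import Path


def log_prompt(log_path: Path, number: int, title: str, prompt: str, decisions: list[str]) -> None:
    """Append one prompt, verbatim, plus its key decisions to the log file."""
    quoted = "\n".join(f"> {line}" for line in prompt.splitlines())
    bullets = "\n".join(f"- {d}" for d in decisions)
    entry = f"### Prompt {number}: {title}\n{quoted}\n**Key decisions:**\n{bullets}\n\n"
    # Append mode: earlier entries are never rewritten.
    with log_path.open("a", encoding="utf-8") as f:
        f.write(entry)


def audit(decisions: list[str], coverage: dict[str, list[str]]) -> list[str]:
    """Return every logged decision that no bead covers (the gaps).

    `coverage` is a hypothetical, hand-maintained mapping from a decision
    to the IDs of the beads that implement it.
    """
    return [d for d in decisions if not coverage.get(d)]
```

An empty list from `audit` corresponds to "no gaps"; anything else is a requirement the plan has silently dropped.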
Step 4: Build the Bead Graph
With research done and requirements logged, we built the dependency graph. The first version had 16 beads in a linear chain:
Extract → Filter → Analyze → Convert → Validate → Split → Upload → Train → Monitor → Evaluate → Deploy → Integrate
Each bead had:
Priority (P0/P1/P2)
Blocked by (which beads must complete first)
Acceptance criteria (what “done” means, with evidence)
“If it fails” (specific recovery action)
The bead definitions weren’t vague. Compare:
Bad bead: “Prepare training data” — what does done mean?
Good bead:
ft-0001.8a — Reconstruct curriculum facts
Priority: P0 | Blocked by: ft-0001.7
What: Build a lookup function that takes (course, unit, topic)
and returns date-prefixed curriculum fact statements.
Done when: ≥ 80% of examples get relevant curriculum facts.
If < 80%: see Scenario F3.
The bead has a number, a dependency, a success metric, and a failure pointer. When you close it, the evidence is: “reconstruction success rate = 87%, report at data/pilot_reconstruction_report.md.”
Technique: Every bead needs a “done when” with a measurable condition and an “if it fails” with a named scenario. Beads without acceptance criteria are wishes, not tasks.
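To make the blocking rule concrete, here is a small Python sketch of a bead graph. It is not the Beads CLI's data model: the `Bead` class and the `ready` and `close` functions are hypothetical stand-ins that only illustrate the two properties described above, namely that a bead cannot close while its blockers are open and that closing requires evidence.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Bead:
    bead_id: str
    priority: str                     # "P0", "P1", or "P2"
    done_when: str                    # measurable acceptance criterion
    if_fails: str                     # named failure scenario, e.g. "F3"
    blocked_by: list[str] = field(default_factory=list)
    evidence: Optional[str] = None    # set when the bead is closed

    @property
    def closed(self) -> bool:
        return self.evidence is not None


def ready(beads: dict[str, Bead]) -> list[str]:
    """IDs of open beads whose blockers are all closed, highest priority first."""
    unblocked = [
        b for b in beads.values()
        if not b.closed and all(beads[dep].closed for dep in b.blocked_by)
    ]
    return [b.bead_id for b in sorted(unblocked, key=lambda b: (b.priority, b.bead_id))]


def close(beads: dict[str, Bead], bead_id: str, evidence: str) -> None:
    """Close a bead only if it is unblocked and evidence is supplied."""
    if bead_id not in ready(beads):
        raise ValueError(f"{bead_id} is blocked or already closed")
    if not evidence.strip():
        raise ValueError("closing a bead requires evidence")
    beads[bead_id].evidence = evidence
```

With ft-0001.8a blocked by ft-0001.7, `ready` returns only ft-0001.7 until that bead is closed with evidence, which is the "cannot skip ahead" behavior in miniature.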
Step 5: Add Decision Gates
Between phases, we placed explicit stop-and-evaluate points. Not “check if things look good” — structured decision tables:
Five gates total:
Feasibility checkpoint (after 50-example pilot)
Data sufficiency (after full analysis)
Training health (during training — loss curves, overfitting)
Model quality (after Evaluator evaluation)
Production readiness (after integration + canary)
Each gate has three columns: Pass (continue), Adjust (change parameters and retry), Abort (stop and escalate). No ambiguity.
Technique: Gates are not checkpoints. They’re decision tables with concrete thresholds. “Looks good” is not a gate. “p95 token count < 4096” is a gate.
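As a rough sketch, a gate like the token-count one can be written as a function that returns exactly one of three decisions. The 4096 pass threshold comes from the example above; the `abort_at` ceiling of 8192, the nearest-rank percentile, and all function names are assumptions for illustration, not values from our actual decision tables.

```python
from enum import Enum


class Decision(Enum):
    PASS = "pass"      # continue to the next phase
    ADJUST = "adjust"  # change parameters and retry
    ABORT = "abort"    # stop and escalate


def p95(values: list[int]) -> int:
    """Nearest-rank 95th percentile, using integer arithmetic only."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def token_length_gate(token_counts: list[int], pass_below: int = 4096,
                      abort_at: int = 8192) -> Decision:
    """Decide Pass/Adjust/Abort from the p95 token count of the dataset."""
    observed = p95(token_counts)
    if observed < pass_below:
        return Decision.PASS
    if observed < abort_at:
        return Decision.ADJUST
    return Decision.ABORT
```

The point of the shape is that there is no fourth outcome: the gate cannot return "looks good".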
Step 6: Write Failure Scenarios
We named 8 specific failure scenarios, each with a detection mechanism and a numbered recovery cascade:
Scenario F4: Token overflow (DBQs exceed context window)
Detection: Analysis shows p95 > 4096 tokens
Recovery cascade:
1. Set max_context_length to 8192
2. If still insufficient: truncate curriculum facts
3. If still over: train separate DBQ model at 8192
4. If cost prohibitive: keep Gemini for DBQs only
The cascades are numbered because order matters. You try the cheapest fix first. You don’t jump to “use a different model” when “increase context length” might work.
Technique: Name your failures. Give them IDs. Write cascading recovery steps in priority order. An unnamed failure is a panic. A named failure is a procedure.
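A recovery cascade maps naturally onto an ordered list that is walked until one step works. The Python sketch below assumes each step can be expressed as a callable that reports whether the failure is resolved; `RecoveryStep`, `Scenario`, and `recover` are hypothetical names, and in practice many steps are manual actions rather than functions.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RecoveryStep:
    description: str
    attempt: Callable[[], bool]  # returns True if the failure is resolved


@dataclass
class Scenario:
    scenario_id: str              # e.g. "F4"
    name: str
    cascade: list[RecoveryStep]   # ordered cheapest fix first


def recover(scenario: Scenario) -> Optional[str]:
    """Try each step in order and stop at the first one that works.

    Returns the description of the step that resolved the failure,
    or None if the whole cascade is exhausted (time to escalate).
    """
    for step in scenario.cascade:
        if step.attempt():
            return step.description
    return None
```

Because the loop stops at the first success, later and more expensive steps are never attempted unless the cheaper ones have already failed.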
Step 7: Prioritize the Insights
Not all research findings are equal. We ranked every insight into tiers:
Tier 1 — Plan fails without these:
Curriculum facts in training prompts are MANDATORY
Data quality > quantity (filter aggressively)
Token length determines context window (measure BEFORE training)
Tier 2 — Significantly affects quality:
Specific hyperparameters from prior experiments
Keep post-processing pipeline active after fine-tuning
Tier 3 — Incremental:
DPO as optional second stage
Difficulty-specific prompt text
Does not apply:
Multi-stage pipeline (runtime architecture, not learnable)
Image generation (separate model, unrelated)
This stack drives every downstream decision. When two options conflict, the higher-tier insight wins.
Technique: Rank your insights explicitly. When you have 15 “important” findings, you have zero priorities. Force-rank them into tiers with clear “plan fails without this” vs “nice to have” boundaries.
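The "higher tier wins" rule is simple enough to write down as code. This is a sketch under assumed names (`Tier`, `Insight`, `resolve`); the tier labels follow the stack above, and refusing to pick between two same-tier insights is our illustrative choice, not a rule from the plan.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    PLAN_FAILS_WITHOUT = 1
    SIGNIFICANT = 2
    INCREMENTAL = 3
    DOES_NOT_APPLY = 4


@dataclass(frozen=True)
class Insight:
    text: str
    tier: Tier


def resolve(a: Insight, b: Insight) -> Insight:
    """Return the insight that wins a conflict: the one with the lower tier number."""
    if a.tier == b.tier:
        # A tie means the ranking is not finished yet.
        raise ValueError("same tier: force-rank these two before deciding")
    return min(a, b, key=lambda insight: insight.tier)
```

For example, a conflict between "Data quality > quantity" (Tier 1) and "DPO as optional second stage" (Tier 3) resolves to the Tier 1 insight.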
Step 8: Get External Feedback
This was the highest-ROI step. We took the completed plan and sent it to a separate Claude instance for adversarial review. It came back with 10 criticisms.
We evaluated each one honestly:
8 adopted (feasibility spike, split fat beads, pilot training, canary deploy)
2 rejected (separate bead for manual audit — folded into existing bead; separate failure taxonomy bead — folded into evaluation output)
The best suggestion was adding a Phase 0.5: Feasibility Spike — a 50-example pilot BEFORE building the full data pipeline. This catches catastrophic assumptions (curriculum facts can’t be reconstructed, token lengths are wrong, model isn’t tunable) in 30 minutes instead of after 4 hours of pipeline work.
The second-best suggestion was splitting the hardest bead (ft-0001.8, “Convert to JSONL + reconstruct curriculum facts”) into 6 sub-beads. When one monolith bead fails, you don’t know which part failed. When one of 6 narrow beads fails, you know exactly which step broke.
Technique: Get adversarial feedback on your plan BEFORE executing. Send it to a separate Claude instance, a colleague, or a review tool. Accept most critiques, reject the ones that add ceremony without value, and document why for each.
Step 9: Design for Compaction Survival
This is specific to Claude Code but the principle applies to any long-running AI session.
Claude Code’s context window has a hard limit. When you approach it, the system compacts your conversation into a 9-section summary.
Your carefully built context — the beads, the gates, the failure scenarios — gets compressed into a paragraph.
Our execution strategy accounts for this:
Write progress to files, not conversation. After each bead closes, append to data/progress.md. This survives any compaction.
State the current bead explicitly at the start of every major action. The compaction summary captures “current work state”, so make it obvious.
Keep the plan file as ground truth. After compaction, re-read PLAN-finetune-qwen.md to recover context rather than relying on the summary.
Use agents for heavy processing. Data analysis goes into a subagent that returns a 200-word summary. The 70K-record dataset never enters the main context.
Technique: If your session will outlive a single context window, write state to files. The conversation is ephemeral. The filesystem is persistent.
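A minimal Python sketch of the "progress to files" habit follows. The path data/progress.md comes from our strategy; the line format, the UTC timestamp, and the function names `record_progress` and `last_closed` are assumptions made for this example.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def record_progress(progress_path: Path, bead_id: str, evidence: str) -> None:
    """Append one line per closed bead; earlier lines are never rewritten."""
    progress_path.parent.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with progress_path.open("a", encoding="utf-8") as f:
        f.write(f"- {stamp} closed {bead_id}: {evidence}\n")


def last_closed(progress_path: Path) -> Optional[str]:
    """Bead ID from the most recent progress line, or None if nothing is logged."""
    if not progress_path.exists():
        return None
    lines = [
        line for line in progress_path.read_text(encoding="utf-8").splitlines()
        if " closed " in line
    ]
    if not lines:
        return None
    # Line shape: "- <timestamp> closed <bead_id>: <evidence>"
    return lines[-1].split(" closed ", 1)[1].split(":", 1)[0]
```

After a compaction, reading the file with `last_closed` recovers where the session stopped without trusting the conversation summary.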
The Result
After two hours of planning, zero hours of coding, we have:
28 beads in a dependency graph across 5 phases
5 decision gates with Pass/Adjust/Abort tables
8 failure scenarios with numbered recovery cascades
A priority-weighted insight stack (Tier 1/2/3 + “does not apply”)
A parallel execution map (which beads can run simultaneously)
A session handoff protocol (how to resume if interrupted)
A prompt log (every user intent recorded and audited)
Complete data inventory (3 sources, 70K+ items, breakdowns by type/course/version)
The plan was reviewed, challenged, revised, and committed to version control. When we start executing, every bead has a clear definition, a measurable done condition, and a failure recovery path.
No ambiguity. No “figure it out when we get there.” No skipping ahead to the interesting part.
The Methodology, Generalized
You can use this for any complex project. Here’s the recipe:
1. Declare planning mode
Tell Claude: “Do not execute. We are only planning.”
2. Research first
Feed primary sources — articles, docs, reports, codebases. Let Claude extract insights, not guess.
3. Log every prompt
Create a prompt log. Your prompts are your requirements. Audit them before finalizing.
4. Build dependency-aware beads
Each bead: priority, blocked-by, acceptance criteria, failure pointer. Use Beads or any dependency graph system.
5. Add decision gates
Between phases. Pass/Adjust/Abort with concrete thresholds. Not “looks good.”
6. Write failure scenarios
Name them. Give them IDs. Write cascading recovery in priority order.
7. Rank your insights
Tier 1 (plan fails without) → Tier 2 (significant) → Tier 3 (incremental) → Does not apply.
8. Get adversarial feedback
Send the plan to a separate reviewer. Accept most, reject with reasons, document both.
9. Design for session survival
Progress to files. Current state explicit. Plan file as ground truth.
10. Commit before executing
Version control your plan. It’s as important as the code it produces.
Resources
Beads: Task Tracking for AI Agents. The dependency-aware task graph methodology we used. Prevents AI from skipping steps.
Beads Repository. The CLI tool: bd ready, bd update --claim, bd close --reason "evidence".
How to Use Claude Code Like a Claude Code Engineer. The internal engineering techniques (compaction survival, parallel execution, context budget) we applied to the execution strategy.






