AI-Assisted Legacy Migration: Lessons Learned
A retrospective on heavily using AI tools for a large codebase migration, from initial failures to a workflow that actually works.

"This is going to be incredible," I thought as I watched Claude Code analyze thousands of lines of legacy code. "I'll just give it the legacy endpoint to analyze, and the AI will migrate everything. This migration that would take months? Done in a few weeks."
Four hours later, I was staring at a screen filled with hallucinations that confidently referenced functions that didn't exist, implemented patterns we'd never used, and would have broken production spectacularly. It was Day 1, and my illusions had just crashed against reality.
Welcome to my seven-week journey discovering how to use AI for a large legacy codebase migration. Not the YouTube influencer version promising miracles, but the messy, frustrating, and ultimately productive reality of this transformation.
The Challenge: Welcome to the Real World
Forget the blog posts about using ChatGPT to build a todo app. This was a production code migration: extracting features from a legacy monolith and rebuilding them in a TypeScript monorepo. Not a greenfield project where mistakes are cheap. Not a prototype where "good enough" suffices.
Real migration. Real stakes. Real complexity:
- Interconnected systems where changing one thing breaks three others you didn't know existed
- Domain knowledge buried in comments from 2011, scattered across multiple Git repositories, living only in senior developers' heads
- Architectural decisions that require understanding why something was built a certain way before you can rebuild it differently
- Quality standards where "it works on my machine" isn't acceptable
This is where AI tools promise the most value - and where most of them fail spectacularly.
Week 1: When Hope Meets Reality
September 18: The Divine Prompt
My strategy was simple: create one perfect prompt containing everything. The complete feature specification. Legacy code context. New architecture patterns. Migration requirements. Everything.
I hit Enter and watched Claude Code think. And think. And think some more.
Ten minutes later, it delivered a complete migration plan with detailed implementation steps, beautiful formatting, and confident explanations. The problem? It was complete hallucination.
The AI had invented a caching layer we didn't have, referenced a state management pattern we'd never used, and proposed migrating database calls to an ORM we weren't using. It was impressively wrong, and I wasted too many hours trying to figure out why nothing matched our actual codebase. The AI wasn't analyzing our code - it was pattern-matching against its training data and hoping for the best.
September 19: Breaking the Fantasy
Okay. Plan B: instead of one mega-prompt, break it down. Analysis prompt. Planning prompt. Review prompt. Test specification prompt.
Results immediately improved. Hallucinations went from "complete fiction" to "mostly accurate with creative interpretation". Progress!
But here's the problem: "mostly accurate" isn't acceptable in production code. You can't deploy an 85% correct feature. That remaining 15%? That's where the bugs live.
I was spending more time checking AI outputs than I would have spent reading the code myself.
Week 2: The Uncomfortable Truth
September 22: When I Became the Bottleneck
After three days fighting hallucinations, the pattern became clear: every time I actually understood the legacy code before prompting the AI, the output was dramatically better.
It was almost embarrassing how obvious this should have been.
When I blindly asked the AI to analyze code:
"Analyze this authentication module and create a migration plan"
→ AI invents patterns, makes assumptions, confident but wrong
When I analyzed it first, then asked the AI:
"This auth module uses JWT tokens with a custom refresh strategy stored in Redis.
The legacy implementation has three edge cases: [specific cases].
Create a migration plan that handles these cases using our new auth service."
→ AI generates an accurate and usable plan
The difference wasn't the AI. The difference was me doing my job first.
Four hard-earned rules emerged:
- Ambiguity kills accuracy - Every vague prompt produced hallucinations, while every specific prompt produced usable code.
- You can't skip the understanding phase - AI amplifies your understanding of the code; it can't replace it.
- AI generates noise that needs filtering - Specs required manual pruning because AI would include every tangentially related detail, burying the essential in the incidental.
- Validation remains non-negotiable - Thorough human review of every spec, every plan, every assumption was indispensable.
The new workflow: AI Analysis → Human Analysis → Detailed Prompt → AI Generation → Human Validation
Think pair programming, where you're the senior dev and the AI is the talented junior who needs clear directions.
The Auto-Accept Mistake
I also learned an important lesson: disable auto-accept for AI modifications. Watching AI automatically modify files while I grabbed coffee might seem productive, but in reality it was just gambling with production code.
Week 3: The Bill Arrives
September 29: Let's Talk About Time
Time to face the music. I started tracking actual hours:
Per feature migration:
- 25 minutes: Planning (AI reading legacy, AI writing specs, human review)
- 2.5 hours: Implementation with AI (reviewing each change)
- Unknown: Debugging architectural drift
Those 2.5 hours of "pair programming" with AI? Half of that time went to:
- Correcting course when AI went crazy
- Explaining why we can't "just refactor everything to use the latest patterns"
- Undoing changes because AI's context window filled up and it forgot what we were doing
The brutal truth: AI was making architectural decisions it had no business making.
Over-engineering is the enemy of productivity
October: The Month Everything Broke (Then Got Better)
October 3: The Multi-Model Epiphany
The initial specifications I'd created? Useful for framing, but too shallow. AI kept making bad assumptions because my specs were too rough.
Then I had an idea: what if I had multiple AI models analyze the legacy code independently, then merged their insights?
The experiment:
- Claude analyzes the authentication flow
- Codex analyzes the authentication flow
- Cursor analyzes the authentication flow
- Claude merges the three perspectives by checking the legacy and validating findings
Claude noticed error handling patterns, Codex caught edge cases, and Cursor identified performance considerations. None of them captured everything alone, but together they offered complete coverage. It was like having three junior devs do a code review - each catches different issues.
The multi-model approach certainly took more time upfront, but the quality improvement was dramatic. Each model's errors were systematically caught by the others, creating a particularly effective cross-validation system.
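Mechanically, the merge step is just prompt composition: hand the merging model every independent analysis plus the legacy source, and instruct it to keep only what it can verify. Here's a minimal sketch, with illustrative names rather than any real API:

```typescript
// Hypothetical sketch of the cross-validation step: given independent analyses,
// build a merge prompt that forces the merging model to check every claim
// against the actual legacy source before keeping it.
interface ModelAnalysis {
  model: string;    // e.g. "claude", "codex", "cursor"
  findings: string; // the raw analysis text produced by that model
}

function buildMergePrompt(analyses: ModelAnalysis[], legacySource: string): string {
  const sections = analyses
    .map((a) => `## Analysis from ${a.model}\n${a.findings}`)
    .join("\n\n");

  return [
    "You are merging independent analyses of the same legacy module.",
    "For every claim, verify it against the legacy source below.",
    "Drop anything you cannot confirm in the source; flag disagreements explicitly.",
    "",
    sections,
    "",
    "## Legacy source",
    legacySource,
  ].join("\n");
}
```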
October 13: Rock Bottom
I decided to test "autonomous AI development". Give it a clear spec, let it run, see what happens.
The results were... educational.
What AI delivered:
- Branded types that looked correct but wrapped the wrong underlying types
- Variables with names suggesting one purpose but types implying another
- An abstraction layer so over-engineered it required abstractions for the abstractions
The worst part? The code compiled, tests passed, and it looked professional. But it was subtly and insidiously wrong - the kind of wrong that slips into production and causes bugs a month later when someone tries to extend it.
I spent more time fixing "functional" AI code than I would have spent writing it correctly from scratch. This was my lowest point, when I wondered if all this was really saving time or if I was just creating technical debt at AI speed.
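To make "subtly wrong" concrete, here's a reconstructed example of the branded-types failure mode. The names are invented for illustration, but the pattern is exactly what kept showing up:

```typescript
// Branded IDs: the brand only exists at the type level, so one careless cast
// erases all the safety the pattern is supposed to provide.
type UserId = string & { readonly __brand: "UserId" };
type OrderId = string & { readonly __brand: "OrderId" };

function fetchOrdersForUser(userId: UserId): OrderId[] {
  // Lookup elided; only the signatures matter for this example.
  return [];
}

// What the AI generated: it satisfied the compiler by casting whatever string
// was in scope, so an order ID flowed straight into a user-ID parameter.
const rawOrderId = "ord_42";
const orders = fetchOrdersForUser(rawOrderId as unknown as UserId); // compiles, wrong

// What it should have done: construct the brand at a single validated boundary.
function toUserId(value: string): UserId {
  if (!value.startsWith("usr_")) throw new Error(`not a user id: ${value}`);
  return value as UserId;
}
```

Everything type-checks, nothing is right, and no test written against the same casts will ever notice.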
October 17: The Code Review Apocalypse
I was generating code faster than the team could review it.
Monday morning standup:
- "I have 3 PRs ready for review"
- "Each PR changes 50-70 files"
- "They're all blocked waiting for review"
The team couldn't keep up. PRs were piling up like snow in a storm.
Small frequent PRs? Became blocking. Everyone spent the day doing code reviews instead of writing code.
Large infrequent PRs? Unreadable. "Can you review these 80 file changes?" Guaranteed team hatred.
The irony: AI had made me so fast that I'd become the team's bottleneck.
We needed a new strategy. Nested feature branches? Parallel development flows? Better separation of concerns so PRs could be reviewed independently?
All of the above, probably.
Late October: The Orchestration Revelation
October 22: The Five-Step Salvation
After several weeks of chaos, the pattern finally crystallized. Every successful migration followed the same five steps:
- Define precise scope (not "migrate auth" but "migrate JWT refresh token handling with Redis fallback")
- Understand it completely (actually read the legacy code, don't just skim it)
- AI plans in PLAN mode (get the strategy before any code)
- Validate everything (AI makes good suggestions and terrible assumptions)
- Generate with supervision (auto-accept always off, review each change)
Simple. Obvious in hindsight. But it took a month of mistakes to learn.
October 23: The Slash Command Breakthrough
Late night coding session. Wait, what's this?
Claude Code can trigger slash commands itself in sub-agents.
My brain immediately went to: "What if I orchestrated multiple AI agents?"
At 2 AM, I had a working prototype:
- /scout-explorer → multiple agents analyze the codebase in parallel via Claude, Codex and Cursor
- /specification-plan → synthesize findings into a coherent spec
- /build-prd → generate a detailed implementation plan
First test: Analysis took three times longer than a manual prompt, but quality was significantly better and, for the first time, the PRD felt genuinely complete. I had something.
October 24: When Magic Stops Working
The problem: The orchestrator was fragile. It worked well when everything went as planned, but as soon as something went wrong - AI not following instructions, token limits, or ambiguous prompts - everything collapsed.
Complex prompts to Codex via CLI tools were unreliable. Multi-agent multi-step coordination only worked one in three times. Depending on Claude Code to handle orchestration proved impossible.
The dream of "set it and forget it" automation died that day. I needed something more stable, something I truly controlled, that wouldn't randomly decide to interpret my instructions creatively.
November: When It Finally Clicks
November 2: Building My Own Tool
I was tired of depending on AI's whims for orchestration. Time to build something truly reliable. In one evening, I built a CLI that orchestrates the entire analysis workflow. Nothing intelligent or fancy, just something reliable and predictable.
The flow:
$ migration-cli analyze
� "Which repositories?" [interactive prompt]
� "Which AI models?" [Claude Sonnet, Claude Haiku, Codex GPT5, Cursor Composer1 - multiple choice]
� "Task description?" [specific, not vague]
� "Entry points?" [where to start the analysis]
� "Scope?" [what's included, what's not]
[CLI runs in the background]
� Clones relevant legacy repos
� Launches chosen models in parallel
� Each analyzes independently
� Merges results into coherent analysis
� Generates PRD
� Breaks down into atomic tasks
� Parallel review of each task
� Generates test suite
Done.
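Stripped to its skeleton, the orchestration is nothing more than interactive questions, a parallel fan-out, and a merge pass. The sketch below is illustrative rather than the real tool; the provider-specific model runner is stubbed out and every name is invented:

```typescript
// Minimal sketch of the orchestration CLI. runModel() is a placeholder:
// in practice it shells out to each provider's CLI or API.
import * as readline from "node:readline/promises";
import { stdin, stdout } from "node:process";

interface AnalysisRequest {
  repositories: string[];
  models: string[];
  task: string;
  entryPoints: string[];
  scope: string;
}

async function runModel(model: string, request: AnalysisRequest): Promise<string> {
  return `analysis of "${request.task}" by ${model}`;
}

async function main(): Promise<void> {
  const rl = readline.createInterface({ input: stdin, output: stdout });

  const request: AnalysisRequest = {
    repositories: (await rl.question("Which repositories? ")).split(","),
    models: (await rl.question("Which AI models? ")).split(","),
    task: await rl.question("Task description? "),
    entryPoints: (await rl.question("Entry points? ")).split(","),
    scope: await rl.question("Scope? "),
  };
  rl.close();

  // Each model analyzes independently, in parallel.
  const analyses = await Promise.all(
    request.models.map((model) => runModel(model.trim(), request)),
  );

  // One merge pass cross-validates the findings; the rest of the pipeline
  // (PRD, atomic tasks, reviews, tests) runs on the merged result.
  const merged = await runModel("merge", {
    ...request,
    task: `Cross-validate and merge:\n${analyses.join("\n---\n")}`,
  });

  console.log(merged);
}

main();
```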
The first run took 15 minutes of analysis and produced a complete PRD without any hallucinations. The second run maintained the same quality. By the tenth run, consistency was still there. I'd finally achieved the repeatability I'd been seeking.
November 3: The Context Window Revelation
I discovered that AI quality collapses after about 60% context window usage. It's not a gradual degradation but a real cliff: at 50% everything works well, at 60% hallucinations start appearing.
The solution:
- Stored relevant task context in a .work-in-progress.md file
- Removed context-hungry MCP servers (looking at you, Atlassian, eating 18% of context just by existing)
With this approach, migrations could run as long as needed without any quality degradation.
November 4: After Analysis, The Code
The missing piece: standardized prompts.
I created three instruction files:
- 01_plan.md - How to analyze and plan
- 02_code.md - How to implement (with specific code style rules)
- 03_review.md - How to review for common errors
Each task now starts the same way:
- Read the task and PRD and follow the 01_plan.md instructions
- Human review of the plan
- Start implementation in pair programming with AI via the 02_code.md instructions
Added .work-in-progress.md to track current task state. When context gets heavy? Clear everything except the WIP file. Fresh context, no lost progress.
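The WIP file itself is nothing fancy. A tiny helper keeps it current so a fresh session can pick up exactly where the last one stopped (a simplified sketch; the sections are just what worked for me):

```typescript
// Tiny helper around the .work-in-progress.md file: write the current task
// state before clearing context, read it back when starting a fresh session.
import { readFileSync, writeFileSync, existsSync } from "node:fs";

const WIP_PATH = ".work-in-progress.md";

interface WorkInProgress {
  task: string;
  decisionsMade: string[];
  nextSteps: string[];
}

export function saveWip(state: WorkInProgress): void {
  const content = [
    "# Work in progress",
    "",
    "## Current task",
    state.task,
    "",
    "## Decisions already made",
    ...state.decisionsMade.map((d) => `- ${d}`),
    "",
    "## Next steps",
    ...state.nextSteps.map((s) => `- ${s}`),
    "",
  ].join("\n");
  writeFileSync(WIP_PATH, content);
}

export function loadWip(): string {
  // The raw markdown is what gets pasted into the fresh AI session.
  return existsSync(WIP_PATH) ? readFileSync(WIP_PATH, "utf8") : "";
}
```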
The rule that changed everything: Clear AI context religiously.
Don't let conversations grow. Don't let context windows fill. Restart frequently with only essential information.
It seemed wasteful at first. "But I just explained the architecture!"
Then I realized: explaining the architecture to fresh context takes 2 minutes. Debugging hallucinations from bloated context takes 2 hours.
Easy choice.
The Final Workflow: What Actually Works
November 6. Seven weeks in. The workflow finally crystallized into something repeatable and reliable.
Phase 1: Analysis & PRD (The Foundation)
Launch the CLI for complete analysis:
- Multiple AI models work in parallel
- Cross-validation of findings
- Generation of detailed PRD
- Breakdown into atomic tasks
Then stop and validate.
Don't trust blindly. Spend time reviewing:
- Does this PRD match reality?
- Are these tasks truly atomic?
- What edge cases did AI miss?
- Where are the hidden dependencies?
Fix what's wrong. Approve what's right. Copy to the new codebase.
This validation phase takes about 45 minutes and prevents weeks of bad assumptions and false starts that could compromise the entire project.
Phase 2: Implementation (The Part You Can't Rush)
For each task, follow these steps religiously:
1. Read the legacy code yourself
Not optional. Not "AI will handle it". You read it. You understand it. Then you guide the AI.
AI without your understanding = hallucinations. AI with your understanding = productivity.
2. Generate the plan with 01_plan.md
Let AI propose the approach. Reference the PRD. Reference the task. Let it think about implementation.
Then review every assumption and steer AI at the slightest deviation.
3. If starting a new feature: Code the architecture yourself
This is critical. Don't let AI design your architecture.
AI will create something that works in isolation but doesn't integrate with your system. It will add abstractions you don't need. It will make choices that seem good now but hurt later. Doing the architecture yourself is extremely cost-effective: 30 minutes of manual work consistently saved me 3 hours of later refactoring.
4. Let AI implement the details with 02_code.md
Now AI shines. Architecture exists. Patterns are clear. Let it generate the implementation.
But keep auto-accept OFF.
Review every file. Question anything that looks "clever". AI loves being clever. But you need code that's boring and maintainable.
This part is tedious. You'll review dozens of files. You'll reject changes that work but don't integrate.
Do it anyway. This is where architectural drift is caught.
5. AI reviews its own work with 03_review.md
AI catches different errors than humans. Use it.
6. Retrospective
AI made the same mistake three times? Update your prompts. Teach it your ORM patterns. Document your team's conventions. Make the prompts smarter.
The workflow isn't fast. But it's consistent.
Every feature follows the same process. Every task receives the same scrutiny. Quality stays high even when velocity increases.
The Hard-Earned Lessons
After seven weeks, dozens of tasks, and countless mistakes, here's what I learned about AI-assisted development at large codebase scale.
What AI Actually Excels At
Exploring large codebases - AI can read thousands of lines faster than any human. Point it at a complex module and ask "what does this do?" Genuinely useful.
Pattern recognition - When properly guided, AI spots patterns you'd miss. It sees the same error handling approach repeated in 20 files while you're still on file 3.
Multi-model analysis - Different models catch different issues. Like code review with multiple devs, each brings a different perspective.
What AI Consistently Fails At
Making architectural decisions - AI will design elaborate systems that work in isolation but don't integrate with your codebase. Every time you let AI decide architecture, you create refactoring work for later.
Maintaining consistency - Across multiple files, AI drifts. File 1 uses one pattern. File 5 uses a different pattern. File 10 invents a third pattern. You have to catch that.
Managing its own context - After 60% context usage, quality collapses. AI doesn't know this. It will keep generating garbage confidently.
Refactoring without breaking things - Ask AI to "improve" code and watch it create technical debt at impressive speed. It will add abstractions you don't need and remove patterns you depend on.
The Four Rules That Really Matter
1. Context Management Is Everything
AI quality has a hard limit at ~60% context usage. Respect this limit or pay the price in hallucinations.
Solutions:
- Clear context frequently
- Start new conversations often
- Remove context-hungry MCP servers
- Use work-in-progress files to track state between sessions
2. Human Understanding Is Non-Negotiable
You can't outsource understanding to AI. Period.
AI amplifies what you know. It doesn't replace knowing things. Every time I tried to skip understanding the legacy code, I paid for it later in debugging time.
3. Structure Beats Intelligence
Ad-hoc prompting produces inconsistent results. Standardized workflows produce reliable results.
Build templates. Create checklists. Document your process. Make it boring and repeatable.
4. You're the Architect
AI is a talented junior dev who writes clean code but makes questionable design decisions.
You design the architecture. AI implements the details. Reverse that and you'll hate your codebase in three months.
Let's Talk About Time
Everyone wants to know: "Does AI actually save time?"
The answer isn't simple. Here's the real breakdown:
| Phase | Time Investment |
|---|---|
| Initial "Full AI" attempts | Days of wasted effort |
| Workflow refinement | ~7 weeks of experimentation |
| Current workflow per feature | 25 min planning + 2-3 hours implementation |
| Manual corrections | Significantly reduced but still necessary |
| Tooling development (CLI) | Initial cost, ongoing productivity gains |
The brutal truth? The learning curve was real. During the first week, AI was slower than manual work because of the time spent fixing its errors. By the second week, we reached parity. But by the fourth week, with a polished workflow, the AI-assisted process was running about 4x faster than manual work.
Is it worth it?
For this type of complex migration, the answer is clearly yes. AI brings concrete benefits: it reads and explores thousands of lines of code while I grab coffee, which is genuinely valuable. Writing the 15th CRUD endpoint myself would be soul-crushing, while AI does it in seconds. Multi-model analysis, roughly the equivalent of putting three junior devs on the problem, just runs on my laptop. Finally, AI generates detailed, standardized task lists that would take me hours to produce manually.
But here's what the "AI will 10x your productivity" crowd won't tell you: you still have to review every line, every file, every architectural decision. You still have to design the architecture yourself - AI can only implement it. And you still have to debug weird issues, knowing that AI-generated bugs are often more subtle and insidious than ones you'd write yourself.
The productivity gain is real. But it's more like 2.5-4x for careful developers, not 10x.
If You're About To Do This: Read This First
Planning a similar migration? Learn from my mistakes.
The Seven Rules For AI-Assisted Migration
1. Don't start with autonomous AI
I know it's tempting. "Just let AI handle it!"
Don't. You'll waste days fixing hallucinations that seemed correct at first glance.
Start with tight supervision. Gradually loosen as you learn what AI succeeds at and where it fails.
2. Invest in tooling from the start
The evening I spent building the CLI paid for itself in a week.
Custom tooling seems like overhead until you launch your tenth feature migration and everything just works.
3. Multiple models > Single model
Different AI models catch different errors. Run them in parallel. Cross-validate findings.
Yes, it costs more. It also catches hallucinations before they reach production.
4. Auto-accept is a trap
Watching AI auto-commit changes while you grab coffee might seem productive, but it's a dangerous illusion. In reality, it's gambling with your codebase and taking unnecessary risks. You must review every change, without exception.
5. You design the architecture
30 minutes of manual architecture work saves 3 hours of refactoring AI-generated abstractions that don't integrate with your system.
6. Clear context like your code quality depends on it
Because it does. After 60% context window usage, AI quality collapses. You have to restart frequently with a clean slate to keep results reliable.
7. Standardize everything
Create prompt templates. Build checklists. Document the workflow.
Consistency beats intelligence when you're migrating dozens of features.
The Code Review Crisis
Warning: you'll generate code faster than your team can review it.
This creates a new bottleneck. Solutions that worked for us:
Nested branch strategies - Don't wait for PR approval to start the next feature. Branch from your branch.
Smaller, atomic PRs - 50 files is too much. 5-8 is manageable.
Heavy test coverage - Automated tests reduce review burden. If tests pass, reviewer focuses on architecture.
Scheduled review sessions - Block one-hour windows for deep review work. Quick 15-minute reviews miss subtle issues.
When to Skip This Whole Approach
This workflow makes sense for complex migrations, but would be overkill for simpler contexts. For small projects, just use Cursor and move fast without too much structure. For prototypes where architecture can evolve freely, ship and iterate without constraints. If you're learning a new technology, take time to understand for yourself first before introducing AI. Finally, for simple CRUD applications with standard patterns, AI naturally shines without needing all this structure.
Keep the structured workflow for problems that truly need it.
The Conclusion
Here's what nobody selling AI tools wants to admit:
AI doesn't replace developers. It changes what we spend our time on.
Before AI: I spent 50% of my time reading legacy code, 10% thinking about architecture, and 40% writing code.
With AI: I spend 10% writing code, 60% reviewing AI output and making architectural decisions, and 30% reading legacy code.
The net result? 2-4x faster than before.
The difference? I no longer write boilerplate. I no longer implement the fifteenth CRUD endpoint by hand. I no longer type repetitive test cases.
Instead, I spend my time designing systems, reviewing implementations, catching subtle bugs, and making strategic decisions. It's actually a significant improvement over before.
Writing boilerplate was never the interesting part of development. Designing elegant systems and solving complex problems, on the other hand, is exactly why I got into this field.
AI handles the boring parts. I handle the interesting parts.
After seven weeks and dozens of migrated tasks, this division of labor finally feels right.
In Summary
AI-assisted development at scale requires more structure, not less.
You can't just point AI at a problem and hope for the best. You need:
- Tight human supervision
- Aggressive context management
- Standardized workflows
- Custom tooling for your specific use case
Build the structure. Learn the tools. Understand their limits. Review everything.
Do that, and you'll see real 2-4x productivity gains on complex work.
Skip these steps, and you'll spend more time debugging hallucinations than you would have spent writing the code manually.
The future of development isn't autonomous AI, but rather developers who master the art of directing AI effectively in well-structured workflows. That's where real productivity gains lie. And honestly, that future is already here - you just need to be ready to do the necessary work to reach it.
Seven weeks in, dozens of tasks migrated, and the workflow finally feels productive. Not magical. Not autonomous. Just... productive. Sometimes that's enough.