16 min · FMKTech Team

AI Coding in Brownfield Projects: Making AI Work in Real-World Codebases

AI coding tools excel at greenfield projects but struggle with complex, legacy codebases. Learn the systematic approaches that transform AI from a prototype toy into a production workhorse for brownfield development.

AI · Legacy Code · Software Engineering · Brownfield Development

Introduction: The Brownfield Problem

Picture this: You've just watched an AI coding agent build a complete Next.js app in 45 minutes. Your engineering manager walks into your office with stars in their eyes, ready to unleash this magic on your 300,000-line legacy codebase. You know what happens next.

Here's the uncomfortable truth about AI coding tools in late 2024: they're amazing at building new projects from scratch and terrible at working with real-world codebases.

The Stanford study on AI's impact on developer productivity confirmed what many engineering teams already knew:

  1. AI tools work great for greenfield projects and small changes
  2. In large, established codebases, they often make developers less productive
  3. A lot of "AI-generated code" is just reworking the slop that was shipped last week

The common response ranges from pessimism ("this will never work for real code") to cautious optimism ("maybe when models get smarter"). But there's a third option: systematic workflows that make today's models work in brownfield projects.

In the past few months, teams using these techniques have:

  • Fixed bugs in 300,000-line Rust codebases they'd never seen before (in under an hour)
  • Shipped 35,000 lines of working code adding major features to complex systems (in 7 hours)
  • Maintained code quality that passes expert review from senior engineers

This isn't theoretical. This is production code, in unfamiliar codebases, often in languages the developers barely know.

How? They solved the brownfield problem.

At FMKTech, we help organizations implement these systematic workflows to make AI coding agents work in real-world, legacy codebases—not just impressive demos. Whether you're exploring AI-assisted development for your team or looking to scale existing implementations, understanding how to navigate brownfield complexity is essential for getting actual value from AI coding tools.

What Makes Brownfield Different?

Greenfield: The AI Comfort Zone

When you start a new project:

  • No existing code to understand
  • You define all the patterns and conventions
  • The entire codebase fits in context easily
  • Mistakes are cheap—just delete and regenerate

AI coding agents excel here. Give Claude a spec for a new Next.js app and watch it build a working prototype in an hour. The context is clean, the requirements are clear, and there's no legacy to navigate.

Brownfield: The Real-World Challenge

When you work in an existing codebase, everything changes.

Implicit knowledge: Architectural decisions made years ago, documented nowhere

Three years ago, someone chose to split user sessions across Redis and PostgreSQL. Half the session data lives in each. Why? Maybe Redis kept running out of memory. Maybe there was a reliability concern. Maybe it was just a bad decision. The code doesn't say, the commits don't explain, and the person who made the choice left the company.

So when the AI sees this pattern, it looks like a mistake. Redundant. Inefficient. The "obvious fix" is to consolidate everything into one data store. Except that breaks session failover in subtle ways that only show up under load.

Without understanding the implicit architectural decisions—and why they exist—AI will "fix" things that shouldn't be fixed.

Hidden dependencies: Change one file, break three others in non-obvious ways

You update the authentication timeout logic in auth/timeout-manager.ts. Reasonable change. Tests pass. Ship it.

Except now the mobile app's offline sync is broken. The timeout manager is called by middleware, which is also used by the sync endpoint, which has special timing requirements that weren't documented. The connection between your change and mobile sync isn't in any import statement—it's a runtime dependency that only exists under specific conditions.

AI can't see these hidden dependencies without systematic exploration. It makes the "obvious" change and introduces subtle breakage.
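
To make that concrete, here is a minimal TypeScript sketch of the kind of coupling no import statement reveals (the Express setup, helper names, and constants are all hypothetical):

    import express from "express";

    // All names below are hypothetical, for illustration only.
    declare const SESSION_TIMEOUT_MS: number;
    declare function requireSession(opts: { timeoutMs: number }): express.RequestHandler;
    declare function applyOfflineBatch(batch: unknown): Promise<void>;

    const app = express();

    // middleware/auth.ts: every route behind this middleware inherits the timeout.
    app.use(requireSession({ timeoutMs: SESSION_TIMEOUT_MS }));

    // routes/sync.ts: nothing here imports timeout-manager.ts, yet shortening the
    // timeout breaks this endpoint, because applying an offline batch can take minutes.
    app.post("/sync", async (req, res) => {
      await applyOfflineBatch(req.body);
      res.sendStatus(204);
    });

A dependency graph built from import statements never shows this edge; only tracing the actual request path does.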

Inconsistent patterns: The auth system was rewritten twice, both versions still exist

Your codebase has three different patterns for error handling. The old way from 2020 (throwing exceptions), the new way from 2023 (returning Result types), and the "we were migrating but stopped halfway" hybrid that exists in 60% of the code.

When AI generates new code, which pattern should it follow? If it picks the "modern" Result pattern but integrates with a file still using exceptions, you get a mixing anti-pattern that makes the code even more inconsistent.

Without research to identify which patterns exist and which to follow, AI will pick what looks best in isolation—which might be exactly the wrong choice for your codebase.
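
Here is a hypothetical TypeScript illustration of that mixing anti-pattern (the `Result` type, `db`, and function names are invented for this example):

    // Two error-handling conventions living in one codebase.
    interface User { id: string; name: string }
    declare const db: { findUser(id: string): User | undefined };

    // The 2020 pattern: throw on failure.
    function loadUserLegacy(id: string): User {
      const user = db.findUser(id);
      if (!user) throw new Error(`User ${id} not found`);
      return user;
    }

    // The 2023 pattern: return a Result instead of throwing.
    type Result<T, E = Error> = { ok: true; value: T } | { ok: false; error: E };

    function loadUser(id: string): Result<User> {
      const user = db.findUser(id);
      return user
        ? { ok: true, value: user }
        : { ok: false, error: new Error(`User ${id} not found`) };
    }

    // The mixing anti-pattern: Result-style code wrapping throwing code, so every
    // caller now has to reason about both conventions at once.
    function loadUserMixed(id: string): Result<User> {
      try {
        return { ok: true, value: loadUserLegacy(id) };
      } catch (error) {
        return { ok: false, error: error as Error };
      }
    }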

Massive scale: 100,000+ lines across hundreds of files

There are 47 files with "auth" in the name. Which ones are relevant to fixing the timeout bug? The answer depends on understanding the architecture, which requires reading multiple files to build a mental model.

If you let AI search sequentially, it might read 20 files before finding the right one. That's 10,000+ lines in context before doing any useful work. By the time it finds the relevant code, context is 70% full and the model is operating in what one researcher calls "the dumb zone."

The scale problem isn't just about size—it's about the combinatorial explosion of where to look and what matters.

Context overflow: Just loading relevant files exceeds the context window

Even after you identify the right files, reading them all might exceed your context budget. The session manager is 1,200 lines. The timeout logic is 800 lines. The middleware that calls it is 950 lines. The tests are another 1,500 lines.

You just loaded 4,450 lines (roughly 45,000 tokens at ~10 tokens per line), and you haven't started implementing anything yet. If your context window is 200k tokens (~150k in practice before degradation), you're at roughly 30% utilization before writing a single line of code.

Without surgical reading strategies—line ranges, compaction, parallel research—context overflow prevents you from even getting started properly.

Tribal knowledge: "Oh yeah, don't touch that module, it's fragile"

Some parts of the codebase have a reputation. The payment processing code is "fragile." The websocket handler is "weird." The cache invalidation logic is "don't ask, it just works."

This tribal knowledge is never written down. It exists in slack messages, code review comments, and the collective anxiety of the team. AI has no access to this context.

So it confidently refactors the "fragile" payment code to be "cleaner," and introduces a race condition that was fixed three years ago. The commit message from that fix says "handle edge case," which tells you nothing about what edge case or why it matters.

This is where AI coding tools typically fail. They can't find the right files among hundreds of options. They miss critical dependencies and break existing functionality. They don't understand the context and history behind design decisions. They produce code that "works" in isolation but violates codebase conventions. They waste the context window searching randomly through files until they hit the performance cliff.

Sound familiar? If you've tried using AI coding assistants on a real codebase, you've probably lived this frustration. The good news is that it's solvable—and the solution isn't waiting for GPT-5.

The Four Challenges of Brownfield AI Coding

Challenge 1: The Discovery Problem

The situation: You need to fix a bug or add a feature. The AI needs to find the 5-10 relevant files among 500+ files in the codebase.

What fails:

You: "Fix the authentication timeout bug"

AI: [Searches]
- Reads auth/handler.ts
- Reads auth/middleware.ts
- Reads auth/tokens.ts
- Reads auth/sessions.ts
- Reads auth/utils.ts
- Reads config/auth.ts
...
[30 files later, context is 70% full, hasn't found the actual bug]

What works: Structured research phase with specialized discovery.

Challenge 2: The Context Explosion Problem

The situation: Understanding how a feature works requires reading 20+ files, each with 500+ lines.

What fails:

AI reads:
- Full file 1 (842 lines)
- Full file 2 (1,243 lines)
- Full file 3 (673 lines)
- Full file 4 (891 lines)

Context: 85% full
Quality: In the "dumb zone"
Useful work: None yet

What works: Surgical reading with specific line ranges and compaction.
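
As a minimal sketch of what surgical reading means in practice (the helper and the line range are illustrative, not any specific tool's API):

    import { readFileSync } from "node:fs";

    // Pull only the lines the agent actually needs, instead of loading an
    // entire 1,200-line module into context.
    function readLines(path: string, start: number, end: number): string {
      const lines = readFileSync(path, "utf8").split("\n");
      return lines.slice(start - 1, end).join("\n");
    }

    // e.g. just the timeout logic, not the whole file:
    const timeoutLogic = readLines("src/auth/timeout-manager.ts", 60, 120);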

Challenge 3: The Missing Context Problem

The situation: The code makes sense only if you understand why it was written that way.

What fails:

AI: "This error handling seems redundant, I'll simplify it"
[Removes error handling that was added to fix a race condition]
[Reintroduces a bug that took 3 days to debug last time]

What works: Research-first workflow that captures historical context and architectural constraints.

Challenge 4: The Integration Problem

The situation: New code must integrate with existing patterns, conventions, and infrastructure.

What fails:

AI: [Implements feature in a clean, modern way]
Senior engineer: "This doesn't follow our patterns, uses wrong
error handling, and breaks our testing conventions"

What works: Pattern discovery and explicit integration requirements in planning.

The Solution: Systematic Brownfield Workflows

The breakthrough isn't waiting for smarter models. It's applying systematic workflows designed specifically for brownfield development.

The Three-Phase Approach

Instead of throwing an AI agent at a brownfield codebase and hoping, structure your work into three distinct phases:

  1. Research Phase: Understand before touching
  2. Planning Phase: Design before implementing
  3. Implementation Phase: Execute with clarity

Each phase has specific goals, context budgets, and human review points. Let's dive into each.

Phase 1: Research - Understanding Before Acting

The research phase is about one thing: finding the truth about how the codebase actually works.

Research Goals

Good brownfield research answers:

  1. Where is the relevant code? (Specific file:line references)
  2. How does it currently work? (Data flow, key functions, patterns)
  3. Why was it built this way? (Architectural context, constraints)
  4. What are the integration points? (Dependencies, side effects)
  5. How is it tested? (Test patterns, coverage, verification approach)

The Parallel Research Technique

Don't let AI search sequentially. Spawn multiple specialized research agents in parallel:

[Launch simultaneously]

Agent 1: "Find all files related to authentication flow"
→ Uses codebase-locator to identify files
→ Returns: 12 files with auth-related code

Agent 2: "Analyze the session management implementation"
→ Uses codebase-analyzer to understand current approach
→ Returns: Detailed explanation of session handling

Agent 3: "Find existing patterns for timeout handling"
→ Uses pattern-finder to locate similar features
→ Returns: Examples of timeout implementation patterns

Agent 4: "Search thoughts/ directory for auth-related decisions"
→ Uses thoughts-locator to find historical context
→ Returns: Previous research, architectural decisions

[Each agent works in isolation with its own context]
[Main agent receives only compact findings]
[Total time: 5-8 minutes, not 30+ minutes sequential]
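
If you drive this from a script rather than a chat session, the dispatch is plain concurrent fan-out. The sketch below assumes a hypothetical `runSubagent` helper standing in for whatever agent-spawning mechanism your tooling provides:

    // Each question runs in an isolated subagent context; only the compact
    // findings come back to the main agent.
    interface Finding { question: string; summary: string }

    declare function runSubagent(agent: string, prompt: string): Promise<Finding>;

    async function researchAuthTimeout(): Promise<Finding[]> {
      // Fan out all four research tasks at once instead of sequentially.
      return Promise.all([
        runSubagent("codebase-locator", "Find all files related to the authentication flow"),
        runSubagent("codebase-analyzer", "Analyze the session management implementation"),
        runSubagent("pattern-finder", "Find existing patterns for timeout handling"),
        runSubagent("thoughts-locator", "Search thoughts/ for auth-related decisions"),
      ]);
    }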

Research Output Structure

The research phase must produce a structured document, not a conversation history:

# Authentication Timeout Bug Research

**Date**: 2025-01-15
**Codebase**: UserService (250k LOC)
**Git Commit**: abc123def

## Issue Understanding
User sessions don't timeout correctly when grace period expires,
allowing access with expired tokens.

## Current Implementation

### Session Management
- Entry point: `src/auth/session-handler.ts:45`
- Token validation: `src/auth/token-validator.ts:123`
- Timeout logic: `src/auth/timeout-manager.ts:67`
- Database layer: `src/db/session-store.ts:89`

### Data Flow
1. Request arrives at session handler (session-handler.ts:45)
2. Token extracted and validated (token-validator.ts:130)
3. Timeout check against session store (timeout-manager.ts:72)
4. **BUG LOCATION**: Grace period check fails at timeout-manager.ts:85

### Bug Root Cause
The `isSessionValid()` function at `src/auth/timeout-manager.ts:85`
checks expiry but doesn't properly handle the case where grace
period has also expired. Returns true when it should return false.

## Integration Points
- Used by: API gateway middleware (middleware/auth.ts:34)
- Depends on: Session store, JWT library, Redis cache
- Side effects: Logs to audit system on timeout

## Testing Patterns
- Unit tests: `tests/auth/timeout.test.ts`
- Integration tests: `tests/integration/auth-flow.test.ts`
- Pattern: Test-first approach, all auth changes require tests

## Architectural Constraints
- Must maintain backwards compatibility with v1 token format
- Cannot change database schema (migration freeze until Q2)
- Redis cache must stay in sync with DB

## Historical Context
From `thoughts/shared/research/2024-12-auth-refactor.md`:
- Grace period was added to handle clock skew between services
- Previous timeout bug was in different location (fixed in PR #1234)
- Team debated removing grace period but kept it for reliability

## Recommended Approach
Fix timeout-manager.ts:85 to properly check both expiry and grace
period. Add test cases for edge case. No schema changes needed.

Critical insight: This research doc is 100 lines. Getting to this point might have involved reading 50 files and 10,000+ lines of code across multiple subagents. But all that noise stays in the subagent contexts—the main context only sees these 100 lines.

Think of it like sending scouts ahead instead of marching your entire army through every possible path. The scouts explore, report back with a map, and then you advance with confidence.

The Human Review Checkpoint

Before proceeding to planning, a human must read the research. This is not optional.

Check:

  • ✅ Are the identified files actually correct?
  • ✅ Does the root cause analysis make sense?
  • ✅ Are integration points and dependencies captured?
  • ✅ Are there architectural constraints we must respect?
  • ✅ Does the recommended approach seem sound?

If anything is wrong or missing, iterate on research before planning. A bad line of research leads to thousands of bad lines of code.

Time investment: 5-10 minutes reading research.
Time saved: 2-3 hours of implementation going in the wrong direction.

Phase 2: Planning - Design Before Building

With solid research in hand, planning becomes straightforward. The goal is to create a detailed, phase-by-phase implementation guide.

Planning Goals

A good brownfield plan specifies:

  1. Exactly what to change (Specific files and line ranges)
  2. The order of changes (Phased to minimize risk)
  3. How to verify each step (Automated and manual success criteria)
  4. What NOT to change (Explicit scope boundaries)
  5. Integration requirements (Following existing patterns)

The Phased Implementation Structure

Break work into small, verifiable phases. Here's what the implementation plan looks like:

Plan Structure:

# Authentication Timeout Fix - Implementation Plan

## Overview
Fix the grace period timeout bug in session validation by updating
the timeout manager to properly check both expiry and grace period.

## Current State (from research)
Bug in `src/auth/timeout-manager.ts:85` where `isSessionValid()`
returns true for sessions where both timeout and grace period expired.

## Desired End State
Sessions with expired grace periods are correctly rejected.
Existing tests pass. New edge case tests added and passing.

## What We're NOT Doing
- NOT changing database schema
- NOT modifying token format
- NOT changing API contracts
- NOT touching session handler or token validator

## Phase 1: Add Failing Test

### Changes
File: `tests/auth/timeout.test.ts`
Add test case for expired session with expired grace period

    test('rejects session when grace period expired', async () => {
      const session = createExpiredSessionWithExpiredGrace();
      const result = await isSessionValid(session);
      expect(result).toBe(false);
    });

### Success Criteria

**Automated:**
- [ ] Test compiles: `npm run build:test`
- [ ] Test fails as expected: `npm test timeout.test.ts`

**Manual:**
- [ ] Test accurately represents the bug scenario
- [ ] Test is clear and maintainable

**Checkpoint**: Confirm test fails before proceeding to Phase 2

---

## Phase 2: Fix Timeout Logic

### Changes
File: `src/auth/timeout-manager.ts:85-92`

Update `isSessionValid()` to check both conditions:

    function isSessionValid(session: Session): boolean {
      const now = Date.now();

      // Check session expiry
      if (now > session.expiresAt) {
        // Session expired, check grace period
        if (now > session.expiresAt + GRACE_PERIOD_MS) {
          return false; // Grace period also expired
        }
      }

      return true;
    }

### Success Criteria

**Automated:**
- [ ] All unit tests pass: `npm test auth/`
- [ ] New test passes: `npm test timeout.test.ts`
- [ ] Type checking passes: `npm run type-check`
- [ ] Linting passes: `npm run lint`

**Manual:**
- [ ] Logic handles edge cases correctly
- [ ] No performance degradation
- [ ] Error messages are clear

**Checkpoint**: All automated checks must pass before Phase 3

---

## Phase 3: Add Integration Test

### Changes
File: `tests/integration/auth-flow.test.ts`

Add end-to-end test for complete auth flow with timeout:

    test('full auth flow with grace period timeout', async () => {
      // Create session, wait for timeout + grace period
      // Verify API request is rejected
      // Verify audit log is created
    });

### Success Criteria

**Automated:**
- [ ] Integration tests pass: `npm test integration/`
- [ ] Coverage check passes: `npm run coverage`

**Manual:**
- [ ] Test in staging environment
- [ ] Verify audit logging works correctly
- [ ] Check Redis cache invalidation

---

## Testing Strategy

### Unit Tests
- Edge case: Exactly at grace period boundary
- Edge case: Just before grace period expires
- Edge case: Way past grace period

### Integration Tests
- Full request flow with expired session
- Verify downstream effects (logging, cache)

### Manual Testing
1. Deploy to staging
2. Create session and wait for timeout
3. Verify access is denied
4. Check audit logs
5. Test with clock skew scenarios

## Performance Considerations
- No additional database queries
- Cache behavior unchanged
- Existing performance characteristics maintained

## Rollback Plan
If issues arise:
1. Revert timeout-manager.ts changes
2. Keep new tests (marked as skipped)
3. Create new bug ticket with findings

## References
- Research: `thoughts/shared/research/2025-01-15-auth-timeout-bug.md`
- Related PR: #1234 (previous timeout fix)
- Architecture doc: `thoughts/shared/plans/auth-architecture.md`

The Human Review Checkpoint

Before implementation, a human must review the plan. Again, not optional.

Check:

  • ✅ Are phases properly scoped and testable?
  • ✅ Do success criteria cover edge cases?
  • ✅ Is the approach technically sound?
  • ✅ Are integration requirements clear?
  • ✅ Is anything missing?

Time investment: 10-15 minutes reviewing the plan.
Time saved: 3-4 hours of implementation doing the wrong thing.

Phase 3: Implementation - Execute with Clarity

With research and plan complete, implementation becomes almost mechanical. The AI has everything it needs in a clean context.

Implementation Approach

Start a fresh session with:

  1. The implementation plan (compact!)
  2. Reference to research (read only if needed)
  3. Clear instruction to work phase-by-phase

You: "Implement the authentication timeout fix following this plan:
thoughts/shared/plans/2025-01-15-auth-timeout-fix.md

Work through each phase sequentially. After each phase:
1. Run all automated verification
2. Stop and wait for my confirmation before proceeding

Read the referenced research only if you need clarification.
Do not deviate from the plan without asking first."

Why This Works

The implementation agent starts with:

  • Clean context: No research noise, no planning back-and-forth
  • Clear instructions: Exactly what to do, in what order
  • Success criteria: Knows when each phase is done
  • Constraints: Knows what NOT to change
  • Patterns: Research identified conventions to follow

Handling Implementation Issues

If the agent gets stuck or produces wrong results:

  1. Don't let it spin: Stop after 2-3 failed attempts
  2. Compact the issue: Document what's failing and why
  3. Maybe research is wrong: Go back to research, investigate
  4. Maybe plan is wrong: Update plan with new understanding
  5. Start fresh: New session with updated context

Real-World Example: 35,000 Lines in 7 Hours

Let's look at a concrete example: adding cancellation support to BAML (a 300k line Rust codebase).

Research Phase (2 hours)

Parallel research agents investigated:

  • How async operations currently work
  • Where cancellation needs to be supported
  • Rust cancellation patterns (tokio, futures)
  • Testing patterns for async code
  • Performance implications

Output: 300-line research document identifying:

  • 47 files that need changes
  • 8 key integration points
  • 3 architectural constraints
  • Existing async patterns to follow
  • Test patterns to replicate

Planning Phase (1 hour)

With research in hand, planning identified:

  • 6 implementation phases
  • Specific changes for each file
  • Test-first approach for each phase
  • Success criteria (automated + manual)
  • Rollback strategy

Output: 400-line implementation plan

Implementation Phase (4 hours)

Following the plan phase-by-phase:

  • Phase 1: Add cancellation token type (30 min)
  • Phase 2: Thread tokens through async executor (45 min)
  • Phase 3: Update function runtime (1 hour)
  • Phase 4: Add cancellation points (1 hour)
  • Phase 5: Update tests (45 min)
  • Phase 6: Integration testing (30 min)

Result: 35,000 lines of working code, tests passing

The Key Insight

Without research and planning:

  • Would have spent 4+ hours just finding the right files
  • Context would have filled with noise
  • Would have missed architectural constraints
  • Would have violated existing patterns
  • Senior engineers estimated: 3-5 days of work

With systematic workflow:

  • Total time: 7 hours (including research and planning)
  • Code quality: Passed senior engineer review
  • First-time success rate: ~80% (vs. ~20% without workflow)

Brownfield Best Practices

These aren't theoretical guidelines—they're battle-tested lessons from teams shipping production code in real brownfield codebases. Ignore them at your peril.

1. Always Research First

Here's the mistake everyone makes: "This looks simple, it's just updating a config value. Let me skip research and jump straight to implementation."

Then 90 minutes later, you're three failed implementations deep, the AI has read 40 files, context is a mess, and you still don't have working code. The "simple" config value turns out to be referenced in seven places, three of them dynamically, and changing it breaks a migration script that runs on deployment.

The 10-15 minutes you "saved" by skipping research just cost you an hour of frustration.

Even for "simple" tasks, spend 10-15 minutes on research. Where is the code? How does it actually work? What patterns exist? What are the constraints? What depends on this?

This isn't optional overhead—it's the foundation that makes everything else work. Research is what transforms AI from a random code generator into a targeted implementation tool.

Never skip this. The 10 minutes invested saves hours later. Every single time.

2. Read Everything the AI Produces

Your highest-leverage time isn't reviewing the final code. It's reviewing the research and plans before any code is written.

Think about the failure modes:

  • Bad research → bad plan → bad implementation → hours wasted
  • Good research → bad plan → bad implementation → one hour wasted
  • Good research → good plan → bad implementation → 15 minutes wasted

If you catch problems at the research phase, you prevent thousands of lines of code from going in the wrong direction. If you catch problems at the planning phase, you prevent hours of implementation churn. If you wait until reviewing code, you've already paid the full cost of the mistake.

This creates a leverage pyramid:

  • Reviewing code: 1x value (you catch bugs before they ship)
  • Reviewing plan: 10x value (you catch wrong approach before implementation)
  • Reviewing research: 100x value (you catch wrong understanding before planning)

The teams shipping high-quality brownfield code spend 30-40% of their time reviewing research and plans. The teams struggling spend 90% of their time reviewing code.

3. Make Scope Boundaries Explicit

AI has no natural sense of scope. Given a bug to fix, it might decide to refactor the entire module "while it's here." Given a feature to add, it might redesign your architecture to be "cleaner."

You need explicit boundaries. In every plan, include a "What We're NOT Doing" section:

  • NOT changing database schema
  • NOT modifying API contracts
  • NOT touching the fragile legacy module
  • NOT refactoring while fixing bugs

This prevents scope creep and keeps AI focused on the actual task. It also forces you to think about what's in scope vs. what's a separate concern.

When AI goes off-script during implementation, you have a clear reference point: "The plan says we're NOT refactoring error handling. Stay focused on the timeout fix."

4. Use Test-Driven Implementation

In brownfield code, tests are your safety net. They're how you know the change worked and you didn't break anything else.

The test-first pattern works exceptionally well with AI:

  1. Write failing test (captures desired behavior)
  2. Implement fix (make test pass)
  3. Verify no regressions (all other tests still pass)

Why does this work so well? Because tests provide clear, executable success criteria. "Make this test pass" is unambiguous in a way that "fix the bug" is not.

Tests also catch the sneaky breakages. Your change might fix the immediate issue but break an edge case that only shows up under specific conditions. Without tests, you'd find out in production. With tests, you find out in 30 seconds.

For brownfield work, your implementation plan should start with "Phase 1: Add failing test." Don't skip this. If you implement first and test later, you'll find out your implementation was wrong after you've already paid the context cost.

5. Embrace the Iteration

Research will sometimes be wrong. You'll identify the wrong files or miss a key dependency. Plans will need adjustment. You'll discover during planning that research missed a constraint. Implementation will hit unexpected issues. The code won't compile because of a type incompatibility research didn't catch.

This is normal. This is how brownfield development works, even with humans. The difference is that systematic workflows handle iteration gracefully:

  • Bad research? Iterate on research with better steering. Spawn more focused subagents. Read the specific files you now realize matter.
  • Bad plan? Update the plan and start implementation fresh with clean context. Don't try to "fix" implementation with a contaminated context.
  • Implementation stuck? Compact the progress you made, document what's not working, restart the phase with updated understanding.

The anti-pattern is "keep pushing forward with wrong understanding." AI will happily keep trying failed approaches forever. You need to catch when iteration is needed and restart with better information.

Watch for signs you need to iterate: same error appearing multiple times, AI reading the same files repeatedly, implementation taking 3x longer than planned. These are signals to stop, compact learnings, and restart with updated context.

6. Maintain Consistent Naming

AI agents rely heavily on pattern matching and semantic search. Consistent naming amplifies their ability to find relevant code. Inconsistent naming cripples it.

If you call authentication "auth" in some files, "authentication" in others, "authN" in a few places, and "user_verification" in the legacy module, AI can't find all the relevant code. It'll search for "auth," find half the files, miss the rest, and implement based on incomplete understanding.

The same principle applies to file naming patterns and directory structures. If all your route handlers are in routes/, service logic is in services/, and data access is in repositories/, AI can learn these patterns and navigate efficiently. If they're scattered randomly, every task starts with expensive discovery.

You don't need perfect consistency—that's unrealistic in brownfield code. But you need enough consistency that patterns are discoverable. When you refactor or add new code, nudge toward consistency rather than away from it.

This pays off exponentially. Each improvement in naming consistency makes every future research phase slightly faster and more accurate.

7. Document Architectural Decisions

The best investment you can make in brownfield AI productivity is lightweight architectural documentation.

Not 50-page specifications. Not exhaustive API docs. Just short documents that capture:

  • Why things are structured the way they are
  • What constraints exist and why
  • Patterns to follow for common tasks
  • History of key decisions

Example: A 200-line document that explains why sessions are split across Redis and Postgres, what the performance characteristics are, and what the failure modes are. This becomes a critical input during research for any auth-related changes.
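
A sketch of what such a document might contain (the specifics below are invented for illustration):

# Session Storage: Why Redis + PostgreSQL

## Why the split exists
- Hot session tokens live in Redis for fast lookups
- Durable session metadata lives in PostgreSQL for audit and recovery

## Constraints
- Redis is not the source of truth; it may be flushed at any time
- Failover rebuilds Redis from PostgreSQL on restart; don't break this path

## Patterns to follow
- Write PostgreSQL first, then Redis
- Invalidate the Redis key on any session mutation

## History
- Split introduced after repeated Redis memory pressure incidents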

These docs don't need to be perfect or complete. Start with the highest-pain areas—the parts of the codebase where AI consistently gets confused or makes mistakes. Document just enough context to avoid repeating those mistakes.

Over time, your research documents become architectural documentation. The research from fixing the auth timeout bug becomes the reference for the next auth-related feature. You're building institutional knowledge as a side effect of systematic workflows.

The teams with the best brownfield AI productivity have accumulated 20-30 of these lightweight docs. Each one represents a few hours of initial investment and saves dozens of hours across future tasks.

Common Brownfield Failure Modes

Every team learning brownfield AI coding hits these failure modes. The smart teams learn from others' mistakes instead of making them all themselves.

Failure Mode 1: Skipping Research

You're looking at what seems like a straightforward bug. The error message is clear. The fix is obvious. You can practically see the three-line change in your head.

So you skip research and tell the AI: "Fix the timeout bug in the auth system."

The AI starts searching. It reads auth/handler.ts. Makes sense—that's where requests come in. Then it reads auth/middleware.ts. Also reasonable—that's where validation happens. Then auth/tokens.ts, auth/sessions.ts, auth/utils.ts, config/auth.ts.

Twenty files later, context is 60% full, and the AI still hasn't found the actual bug. It takes a guess at a fix. The fix addresses a symptom but not the root cause. Tests pass—barely—because the tests don't cover the edge case. You ship it.

Three days later, the bug is back. Different trigger, same root cause. You spent two hours on a fix that didn't fix anything, and now you're starting over.

What went wrong? The bug wasn't in the auth handler or middleware or any of the places that seemed obvious. It was in a timeout manager utility that's called indirectly through three layers of abstraction. Research would have found this in 10 minutes by systematically tracing the execution path.

Without research, AI searches randomly and burns context on dead ends. With research, it goes directly to the right location with surgical precision.

Failure Mode 2: Not Reading Research

You've learned the lesson from Failure Mode 1. You're doing research! Your AI produces a detailed research document—300 lines explaining how the auth system works, where the bug is, what needs to change.

You skim the executive summary. Looks good. You feed it directly to planning. Planning produces an implementation plan. You feed that directly to implementation.

Three hours of implementation later, nothing works. The changes don't compile. When you fix the compilation errors, tests fail in weird ways. You're stuck debugging cryptic error messages.

Finally, you actually read the research document. Turns out the AI misidentified the root cause. The research looked thorough, but it traced the wrong execution path. It followed the happy path and missed the edge case where the bug actually happens.

If you'd read the research carefully before planning, you would have caught this in five minutes. Now you've wasted three hours of implementation, contaminated your context with failed attempts, and need to start over with better research.

This is the most expensive failure mode because it compounds. Bad research → bad plan → bad implementation → complete restart. You pay the full cost of the entire workflow based on a wrong foundation.

The fix isn't just "read the research"—it's read it skeptically. Does the execution path make sense? Are the identified files actually relevant? Does the root cause analysis explain all the symptoms? Challenge the research. Verify the key claims. Make sure the foundation is solid.

Failure Mode 3: Vague Success Criteria

Your implementation plan says:

  • Phase 1: Fix the timeout logic
  • Phase 2: Make sure it works
  • Phase 3: Clean up any issues

This seems fine—it's a plan! But watch what happens during implementation.

The AI fixes the timeout logic. Is it done with Phase 1? It runs the tests. Some pass, some fail. Are the failures expected? The plan doesn't say. It tries to fix the failing tests. Now different tests fail. It keeps iterating, changing both the fix and the tests, never sure if it's making progress or going in circles.

By the time you check in, the AI has made 15 different attempts, context is completely contaminated with failed experiments, and you have no idea which version was closest to correct.

The problem wasn't the AI—it was the plan. "Make sure it works" isn't actionable. The AI doesn't know what "works" means. Does it mean all existing tests pass? Does it mean the specific bug is fixed? Does it mean performance is maintained? Does it mean the code follows existing patterns?

Compare with specific success criteria:

  • Phase 1: Fix timeout logic
    • ALL unit tests in auth/ pass
    • Specifically: timeout.test.ts line 47 (the failing test) passes
    • Type checking passes: npm run type-check
    • No changes to any other files

Now the AI knows exactly when it's done. Pass all these checks? Move to Phase 2. Fail any check? Keep iterating, but only on Phase 1 scope.

Vague criteria lead to scope creep, endless iteration, and context contamination. Specific criteria lead to focused work and clear completion.

Failure Mode 4: No Scope Boundaries

The plan says: "Fix the authentication timeout bug." The AI reads the timeout code and notices the error handling isn't great. The patterns are inconsistent. The logging could be better. As long as it's fixing the timeout, why not clean this up?

So it refactors the error handling. And updates the logging. And standardizes the patterns. And extracts some helper functions. The diff is now 400 lines across 8 files instead of 20 lines in 1 file.

Your senior engineer reviews the PR: "I can't approve this. The timeout fix looks fine, but all this refactoring introduces risk. We don't refactor auth code without extensive testing. Break this into two PRs—the fix and the refactoring—and we'll need a separate testing plan for the refactoring."

You're blocked. The work is done but not shippable. You need to untangle the timeout fix from the refactoring, which is harder than it sounds because they're now intertwined.

This happens because AI has no inherent sense of scope. Without explicit boundaries, it follows local optimization: make each piece of code as good as it can be. But this violates the brownfield principle: minimize change surface area.

The fix is simple but requires discipline. Every plan needs a "What We're NOT Doing" section:

  • NOT refactoring error handling
  • NOT updating logging patterns
  • NOT extracting helper functions
  • NOT touching any code outside timeout-manager.ts

These boundaries feel restrictive—isn't it wasteful to leave suboptimal code when we're already there? But in brownfield development, scope discipline is safety. Each additional change is additional risk. Fix one thing at a time. Ship it. Then consider improvements separately.

Without boundaries, AI optimizes for code elegance. With boundaries, it optimizes for safe, shippable changes. In brownfield work, the second is vastly more valuable.

When Brownfield AI Still Struggles

Let's be honest: systematic workflows don't solve everything. Some brownfield scenarios remain genuinely difficult, even with perfect research and planning. Knowing when you're hitting the limits of current AI capabilities is as important as knowing the techniques themselves.

1. Deeply Nested Dependencies

When changing one file affects 20+ files in non-obvious ways, even good research might miss hidden dependencies.

Mitigation: Start with comprehensive integration tests before making changes.

2. Undocumented Legacy Behavior

When the code does something for a reason that's lost to history, AI might "fix" behavior that was actually intentional.

Mitigation: Be extra careful with code that seems "weird." Research why it exists before changing it.

3. Performance-Critical Code

When performance depends on subtle implementation details, AI might introduce regressions while maintaining functionality.

Mitigation: Include performance testing in success criteria. Benchmark before and after.

4. Complex State Machines

When correctness depends on subtle state transitions, AI might introduce bugs that only appear in edge cases.

Mitigation: Extensive property-based testing. Manual review of state transitions.
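
As a sketch of that mitigation, here is a property-based test for the session-timeout example from earlier, written with the fast-check library and a Jest-style runner (the import path, `Session` shape, and `GRACE_PERIOD_MS` export are assumptions):

    import fc from "fast-check";
    import { isSessionValid, GRACE_PERIOD_MS } from "../src/auth/timeout-manager";

    // Property: any session that expired more than GRACE_PERIOD_MS ago is invalid,
    // no matter how far past the boundary it is. Randomized inputs probe the
    // boundary far more thoroughly than a handful of hand-picked cases.
    test("sessions past the grace period are never valid", () => {
      fc.assert(
        fc.property(fc.integer({ min: 1, max: 86_400_000 }), (msPastGrace) => {
          const session = { expiresAt: Date.now() - GRACE_PERIOD_MS - msPastGrace };
          // Cast: the real Session type likely carries more fields than this sketch needs.
          return isSessionValid(session as any) === false;
        })
      );
    });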

Measuring Success

Here's the thing about measuring brownfield AI effectiveness: the traditional metrics lie to you.

If you only track "time to ship code," you'll optimize for the wrong thing. You'll skip research, rush planning, and end up with code that technically works but introduces subtle bugs or violates architectural patterns. Your PR gets merged, metrics look good, and three weeks later you're debugging a production incident.

What actually matters is understanding where your workflow is breaking down and whether you're building sustainable velocity.

Think of it like measuring software development productivity—lines of code is a terrible metric, but cycle time combined with quality indicators tells the real story.

Process Metrics: Finding Your Bottlenecks

Time spent on research vs. implementation

This ratio tells you if you're doing enough discovery work. Teams new to brownfield AI often spend 10% on research, 90% on implementation, then wonder why the AI keeps going in circles.

Mature teams invert this: 40% research, 30% planning, 30% implementation. It feels slow at first—you're not shipping code!—but your actual time-to-working-PR drops dramatically because you're not wasting cycles on wrong implementations.

If you're spending less than 30% on research, you're probably shipping slop.

Number of research → plan iterations

This measures how often you have to redo research because the plan revealed gaps. One or two iterations is normal—you discover what you don't know when you try to plan. Five iterations means your research wasn't systematic enough.

Watch this number over time. It should decrease as you get better at asking the right research questions.

Plan adherence rate during implementation

Does implementation actually follow the plan, or does the AI constantly go off-script? High adherence (80%+) means your plan was specific enough and your research was accurate. Low adherence means either the plan was vague, the research was wrong, or your implementation context got contaminated.

When adherence drops, don't blame the AI—your research or planning probably missed something critical.

Quality Metrics: Are You Building Better?

First-time PR approval rate

This is your north star metric. It tells you if the research → plan → implement workflow is actually working.

Before systematic workflows, teams might see 20-30% first-time approval. The rest need significant rework. With mature workflows, you should hit 70-80% first-time approval.

If you're stuck below 50%, your research phase isn't capturing enough context about patterns and constraints.

Regression bugs introduced

Every regression is a research failure. You changed something without understanding its dependencies, or you "fixed" behavior that was actually intentional.

Track these ruthlessly. When a regression happens, ask: what did research miss? Update your research checklist to catch that category next time.

Code review feedback volume

How much feedback does the AI-generated code get? More importantly, what kind?

"This doesn't follow our patterns" → Research didn't capture patterns "This breaks edge case X" → Planning didn't think through edge cases "This is too complex" → Implementation went off-plan and over-engineered

Use feedback themes to improve your research and planning templates.

Efficiency Metrics: Getting Faster Without Breaking Things

Time from task start to mergeable PR

This is your cycle time. It includes research, planning, implementation, and iteration. Track it, but don't optimize it directly—optimize the process metrics above, and cycle time will improve as a side effect.

Early on, brownfield tasks might take 2-3x longer with systematic workflows than "just jump in and code." That's fine. You're learning the process. After a few weeks, you'll match your old velocity. After a few months, you'll be 3-5x faster.

Context utilization during implementation

How full is the AI's context window during implementation? If you're regularly hitting 70-80% utilization, your implementation agent is doing too much—probably because research and planning weren't specific enough, forcing implementation to do discovery work.

Healthy implementation runs at 40-50% context utilization. The rest is reserved for actually thinking about the code.

Token cost per feature

This matters for two reasons: it affects your budget, and it serves as a proxy for wasted work.

If token costs spike on a task, you probably spent a lot of context on failed attempts, random exploration, or implementation agents doing research. High token costs usually correlate with low first-time approval rates—you're spinning, not shipping.

Team Metrics: Building Organizational Capability

Knowledge sharing (via research docs)

Here's a hidden benefit of systematic brownfield workflows: the research docs you create become onboarding materials, architecture documentation, and institutional knowledge.

Count how many times research docs get referenced by other team members. If no one ever reads them except the person who created them, they're not valuable enough—probably too verbose or unstructured.

Good research docs get referenced 5-10 times over the next few months by teammates working on related features.

Onboarding time for new engineers

How long does it take a new engineer to make their first meaningful contribution to a brownfield codebase? With traditional "read the code and figure it out" approaches, this can take weeks or months.

With accumulated research docs from systematic AI workflows, new engineers can be productive in days. The research docs answer "where is this?", "how does this work?", and "why was this built this way?"—exactly what new engineers need.

If onboarding time isn't decreasing, your research docs aren't capturing the right context.

Confidence in AI-generated code

Measure this qualitatively through team discussions. Do engineers trust the AI's output, or do they treat it as suspect by default?

Low confidence is usually a symptom of skipping research or not reviewing plans. When engineers see AI produce high-quality code consistently—because it's following solid research and plans—confidence builds quickly.

High confidence doesn't mean blind trust. It means the team knows the process works and focuses review time on verification rather than suspicion.

The Mental Shift Required

Working effectively with AI in brownfield codebases requires a significant mental shift. This is often harder than learning the actual techniques, because it challenges deeply ingrained habits about how software development works.

From: Direct Implementation

Old: Dive into code, start changing things, iterate until it works

New: Research → Plan → Implement, with human review at each phase

From: Code as Primary Artifact

Old: Code is what matters, everything else is documentation

New: Research and plans are as important as code, sometimes more

From: Individual Context

Old: All the context is in my head while I work

New: Context is explicitly captured in documents for AI and team

From: Trusting Your Understanding

Old: I understand this code, I can change it safely

New: AI researches code, I verify its understanding before proceeding

From: Implementation Focus

Old: Spend 90% of time writing code, 10% planning

New: Spend 40% research, 30% planning, 30% implementation

The Uncomfortable Truth

Here's what separates teams shipping production brownfield code with AI from teams struggling:

They spend more time NOT coding.

I know, I know. You became an engineer to write code, not documents. But hear me out.

More time on:

  • Researching how code actually works
  • Planning changes in detail
  • Reviewing research and plans
  • Iterating on understanding

Less time on:

  • Directly writing code
  • Debugging AI-generated code
  • Rewriting slop from previous attempts

The paradox: By spending more time on research and planning, they ship code faster and at higher quality.

It's like the old woodcutter saying: "Give me six hours to chop down a tree and I will spend the first four sharpening the axe." Except in this case, the axe is an AI agent and you're sharpening it with research documents.

Conclusion: Brownfield is Solvable

The difference between teams successfully using AI in brownfield codebases and teams giving up isn't the codebases they work with or the models they use.

It's whether they:

  1. Research systematically before implementing
  2. Plan comprehensively with clear phases and success criteria
  3. Review research and plans, not just code
  4. Maintain context discipline throughout the process
  5. Treat specs as source code, not throwaway artifacts

Brownfield AI coding isn't a model problem. It's a workflow problem. And workflow problems have workflow solutions.

The teams that figured this out are already shipping 5-10x more code than their peers in complex, legacy codebases. Not because they have better models, but because they built systematic processes that work with today's models.

The future of brownfield development isn't waiting for smarter AI. It's building smarter workflows around the AI we have.

Ready to Implement AI Coding in Your Brownfield Codebase?

At FMKTech, we help engineering teams implement these systematic workflows to unlock AI coding productivity in real-world, legacy codebases. Whether you're working with 10-year-old monoliths, microservices spaghetti, or undocumented code that "just works," we can help you:

  • Design research and planning workflows tailored to your codebase architecture
  • Build quality feedback loops that prevent AI from shipping slop
  • Create architectural documentation that makes AI agents effective
  • Train your team on brownfield AI techniques that actually work
  • Navigate from pilot projects to production-ready AI-assisted development

We don't just consult—we work alongside your engineers to implement these workflows on real tasks in your actual codebase. Because the only way to learn brownfield AI coding is by doing it.

Interested in exploring how AI coding agents can work in your legacy systems? Contact us to discuss your specific codebase challenges and how systematic workflows can transform AI from frustrating to productive.



This article is based on real-world experiences from teams shipping production code with AI in complex brownfield codebases, including detailed workflows from HumanLayer, results from the BAML project, and learnings from the AI That Works community. All techniques are battle-tested in 100k+ LOC production systems.