How I Actually Use AI Coding Agents on Real Projects (and Where I Still Don’t Trust Them)

A practical, senior-level look at using AI coding agents like Claude Code on production codebases: where they save real time, where I keep full control, and how I structure work so the output is actually shippable.

A year ago, if someone told me they were shipping production code where an AI agent wrote most of it, I'd have asked "which production." A landing page, sure. A real backend with real users and real data? I wasn't convinced.

I'm a lot less skeptical now. Not because the models suddenly got smarter overnight — they got better, but that's not the main thing. What changed is how I work with them.

So this isn't a "10x your output" post. It's closer to notes from someone who's been doing this daily for months: what an agent is genuinely good at on a real codebase, what it still gets wrong, and the habits that keep the wrong parts from costing me anything.

I work across a Next.js site, an Expo app with a NestJS backend and PostgreSQL, and a few Chrome extensions (Cuelio, Crowra, TableSnap). Different stacks, different stakes, same agent. The lessons mostly transfer.

The mistake I made early on

My first instinct was to treat the agent like a junior dev I could hand a ticket to and walk away.

"Here's what I need. Go build it."

For small, boring, low-risk stuff, that's actually fine. But the moment a task touches architecture, data flow, or "how we do things in this codebase," that approach falls apart in a specific way: you get code that compiles, looks reasonable on a skim, and is wrong in some quiet way you only notice a week later.

It took me a while to realize this isn't really an "AI" problem. "Here's what I need, go build it" is a bad spec for a human too — we just don't feel it, because a human teammate fills in the gaps using context they already have. They remember why the last attempt at this broke. They know which shortcut is fine here and which one bit us in production six months ago.

An agent doesn't have any of that unless you hand it over. And it won't sit there confused — it'll just fill the gap with whatever sounds plausible. Plausible and correct-for-your-codebase are not the same thing, and the gap between them is exactly where the subtle bugs live.

What actually changed for me wasn't "trust it more" or "trust it less." It was treating context as part of the job, not an optional extra. I now think of the agent as a very capable engineer who started an hour ago and hasn't read the codebase yet — because, functionally, that's what it is on every single task.

That one mental shift did more for the quality of the output than any prompt trick I've tried.

[Figure: AI coding agent workflow, from plan to scoped task with context, agent execution, human review, and ship]

Where it actually earns its keep

Once I stopped expecting it to "just know" things, a few areas turned out to be genuinely strong — stronger than I expected.

Codebase archaeology. "Where does this value actually get computed?" "What breaks if I change this function's signature?" "Why does this component refetch on every route change?" These used to be ten minutes of grep and scrolling through files I half-remembered. Now it's one prompt and an answer with file paths and line numbers I can check in seconds. Honestly, this alone would justify the workflow change for me — it's the thing I do most, all day, on every project.

Mechanical refactors. Renaming a prop across forty components, migrating a batch of API routes to Server Actions, updating every call site after a helper's signature changes. Correctness here is mostly about not missing a spot, and an agent doesn't get sloppy on file thirty-one the way I do at 4pm.

First drafts of well-specified features. If I can describe the shape of something precisely — inputs, outputs, edge cases, where it slots into the existing structure — the agent gets to a working first version faster than I'd type it myself. Not the final version. Maybe 80% there. But 80% there, on the first try, is a great place to start steering from.

Wiring things across the stack. Add a field, and it needs to exist in the PostgreSQL schema, the NestJS DTO and service, the sync payload, the Expo screen, and the Next.js dashboard. That used to be an afternoon of repetitive, low-creativity work. Now the agent reads how one similar field is wired and drafts the whole chain in a few minutes.
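As a rough illustration of what "one field through the chain" means, here is a minimal sketch. The `visited_at` column, the `PlaceRow` and `PlaceDto` shapes, and both helper functions are hypothetical names made up for this example, not the actual schema.

```typescript
// Hypothetical database row shape (snake_case, as a Postgres driver might return it).
interface PlaceRow {
  id: number;
  name: string;
  visited_at: Date | null; // the newly added column
}

// Hypothetical API payload shape shared by the backend and the clients.
interface PlaceDto {
  id: number;
  name: string;
  visitedAt: string | null; // ISO 8601 string on the wire
}

// Backend side: map the row to the payload, keeping null as null.
function toDto(row: PlaceRow): PlaceDto {
  return {
    id: row.id,
    name: row.name,
    visitedAt: row.visited_at ? row.visited_at.toISOString() : null,
  };
}

// Client side: render the new field, including the "never set" case.
function formatVisited(dto: PlaceDto): string {
  return dto.visitedAt === null
    ? "Not visited yet"
    : `Visited ${dto.visitedAt.slice(0, 10)}`;
}
```

Each layer is trivial on its own. The work is in touching all of them consistently, which is why a previously wired field makes such a useful template.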

Edge cases I forgot. Empty arrays, null timestamps, duplicate sync writes, what happens if the device comes back online mid-write. I don't agree that every single one matters, but having the list in front of me beats trying to brainstorm it cold.

Debugging across boundaries. When something breaks between, say, the Apple Watch app and the dashboard, the agent can hold the whole chain in its head at once: mobile write, NestJS endpoint, Postgres constraint, frontend read. I can do that too; it just takes me longer to load it all back in after I've been doing something else for an hour.

None of this is the agent designing my product. It's the agent closing the gap between "I know what needs to happen" and "it exists in the codebase."

Where I don't let it run unsupervised

This is the part that matters more, because it's where things go wrong if you get lazy.

Architecture and data model decisions. One table with a status column, or two tables and a join? Does this state live in the URL, a server component, or client state? These choices ripple for months. The agent will give you an answer — often a perfectly reasonable one — but it doesn't carry "we'll be living with this for two years and three other features will depend on it." That weight is mine, so the call stays mine. The agent can help me think it through. It doesn't get a vote on the final answer.

Anything touching auth or data access. This is where I'm closest to paranoid, on purpose. Code can look like it checks permissions correctly and not actually do it — especially across an ownership chain like "user owns a place, place belongs to a shared group, group membership decides who can see it." I read every line of this myself. Doesn't matter who or what wrote it.
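To make "looks right but isn't" concrete, here is a minimal sketch of that ownership chain. The `Place` and `GroupMembership` shapes and both function names are assumptions for illustration; the real schema and checks are not shown in this post.

```typescript
// Hypothetical shapes for the ownership chain described above.
interface Place {
  id: string;
  ownerId: string;
  groupId: string | null; // null means the place is not shared with any group
}

interface GroupMembership {
  groupId: string;
  userId: string;
}

// Reads like a permission check, but only asks "is this user in SOME group?"
// Any member of any group passes, regardless of which group the place is in.
function canViewPlaceLoose(
  userId: string,
  place: Place,
  memberships: readonly GroupMembership[],
): boolean {
  return place.ownerId === userId || memberships.some((m) => m.userId === userId);
}

// Follows the whole chain: owner, or member of the place's own group.
function canViewPlace(
  userId: string,
  place: Place,
  memberships: readonly GroupMembership[],
): boolean {
  if (place.ownerId === userId) return true;
  if (place.groupId === null) return false; // unshared places are owner-only
  return memberships.some(
    (m) => m.userId === userId && m.groupId === place.groupId,
  );
}
```

Both versions compile, and both pass a happy-path test with the owner and a legitimate group member. Only a test with a member of an unrelated group tells them apart, which is the kind of case that is easy to skip on a skim.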

Scope creep. Agents are eager to be helpful, which sounds nice until you ask for a one-line fix and get the fix plus a "while I was in there" refactor, plus a new abstraction "for reusability," plus a validation layer nobody asked for. On a side project that's mildly annoying. On anything shared, it turns a five-minute review into a forty-minute one. A diff that's bigger than I expected is my cue to stop and ask why before I read another line.
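The "is this diff bigger than I asked for" question can even be checked mechanically. The sketch below is an assumption about how one might do it, not a tool I'm describing from the workflow above: `checkDiffScope` and the path prefixes are hypothetical, and the list of changed files would come from wherever your version control reports it.

```typescript
interface ScopeReport {
  withinScope: boolean;
  unexpected: string[]; // changed files outside the agreed scope
}

// Compare the files a change touched against the path prefixes that were
// explicitly in scope for the task.
function checkDiffScope(
  changedFiles: readonly string[],
  allowedPrefixes: readonly string[],
): ScopeReport {
  const unexpected = changedFiles.filter(
    (file) => !allowedPrefixes.some((prefix) => file.startsWith(prefix)),
  );
  return { withinScope: unexpected.length === 0, unexpected };
}

// Usage: a one-line fix in the sync module that also touched a shared helper.
const report = checkDiffScope(
  ["src/sync/sync.service.ts", "src/utils/helpers.ts"],
  ["src/sync/"],
);
// report.unexpected now lists "src/utils/helpers.ts" as the file to ask about.
```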

Naming and abstractions, over time. It'll happily produce a working useThing hook or another helpers.ts. What it's much worse at is noticing "we already have three things that almost do this, and this is now a fourth, slightly different one." That kind of drift doesn't show up in any single diff — it shows up six months later when you're staring at the folder structure wondering how it got like this. Catching that is still on me.

Anything where "it compiles and the demo looks fine" isn't the actual bar. Sync logic that has to survive a phone losing signal mid-write. An extension that has to behave when the tab closes mid-request. A migration that has to run safely against a table with real rows in it. The agent can write code for all of these — if I describe the scenario. It won't reliably think of the scenario itself, because it's never been the one paged at 2am when this kind of thing goes sideways. I have. That experience is still doing work here, even if I'm not the one typing.

If there's a pattern across all of these, it's this: the agent is excellent at execution once a problem is well-defined, and it's not the one who should decide what the problem is or what "done" actually means. That's still the job.

[Figure: Where an AI coding agent is trusted to execute versus where a human keeps the decision. Architecture, security, scope, and abstractions stay with the engineer.]

How I actually work, day to day

A handful of habits made the biggest difference, and none of them are clever.

I plan before I let it touch any files. For anything that isn't trivial, I have it read the relevant code first and tell me how it's going to approach the change before writing a line. Reading a plan takes ten seconds. Reading a thousand-line diff to discover we disagreed from the start takes a lot longer, and by then it's already written.

I scope things the way I'd brief a contractor I trust but haven't worked with yet — specific files, specific behavior, specific things not to break. "This has to keep working offline." "Don't touch the API shape, the mobile app depends on it." "Match the pattern in lib/posts.ts, don't invent a new one." Every piece of context I make explicit is one less thing it has to guess, and guessing is where things drift.

I try to keep diffs small enough to actually read. If something's going to touch twenty files, I'd rather it happen in a few steps I can review properly than one big change I skim because I'm tired by file twelve.

I let it run its own checks. Type checker, linter, tests, even firing up the dev server to look at the actual page — having it do this and fix what it finds before showing me anything has probably been the single biggest time saver. The "run it, copy the error, paste it back" loop basically disappears.

And I review its code differently than I'd review a teammate's PR — not less carefully, just differently. With a person's PR I'm partly checking whether they understood the requirement. With the agent, I already know what I asked for, so the review is: did it do that, did it do only that, and does it look like it belongs in this codebase or like it was dropped in from somewhere else. That last one is the one I see people skip most.

The rest — what to build, what not to build, when "good enough" is actually good enough versus a problem waiting to happen — stays mine. The agent doesn't have an opinion on whether a feature is worth building in the first place, and on a side project, that question is most of the actual work.

A real example: the Expo/NestJS sync layer

The clearest case from my own work is the sync architecture for the Expo app — Apple Watch and iPhone capture GPS places, sync through NestJS, land in PostgreSQL, show up later in the Next.js dashboard.

Here's what the agent handled well:

  • scaffolding the NestJS DTOs and service methods once I'd nailed down the sync payload shape
  • writing the Postgres migration for the new sync columns, matching the style of the existing migrations
  • generating the Expo-side API client and TypeScript types from the NestJS contract
  • a first pass at tests for "what happens if the same place gets synced twice"
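The "synced twice" case in the last bullet can be sketched with an in-memory stand-in for the database. Everything here is an assumption for illustration: the `clientId` field, the payload shape, and the `InMemoryPlaceStore` class are hypothetical, and this is not the idempotency strategy actually used in the app.

```typescript
// Hypothetical sync payload. clientId is assumed to be generated on the device
// once per captured place and reused on every retry.
interface SyncPlacePayload {
  clientId: string;
  name: string;
  lat: number;
  lng: number;
}

interface StoredPlace extends SyncPlacePayload {
  serverId: number;
}

// In-memory stand-in for the table, keyed by clientId.
class InMemoryPlaceStore {
  private byClientId = new Map<string, StoredPlace>();
  private nextId = 1;

  // Writing the same clientId again returns the existing record instead of
  // creating a duplicate.
  upsert(payload: SyncPlacePayload): { place: StoredPlace; created: boolean } {
    const existing = this.byClientId.get(payload.clientId);
    if (existing) return { place: existing, created: false };

    const place: StoredPlace = { ...payload, serverId: this.nextId++ };
    this.byClientId.set(payload.clientId, place);
    return { place, created: true };
  }

  count(): number {
    return this.byClientId.size;
  }
}
```

In PostgreSQL the same idea is commonly expressed as a unique constraint on the client-generated ID combined with `INSERT ... ON CONFLICT`, so that a retried write cannot produce a second row.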

What I kept for myself:

  • how conflicts get resolved when the same place is edited offline on two devices: last-write-wins, per-field merge, or manual resolution. That's a product decision wearing a technical costume, and it has UX consequences I had to sit with myself
  • the idempotency strategy for sync writes, because getting it wrong means duplicate places quietly appearing, and that's the kind of bug that makes people stop trusting the app
  • reading the auth check on the sync endpoint myself, line by line, since that's the line between "your data" and someone else's
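To show why the conflict question in the first bullet is a product decision, here is a minimal sketch comparing two of the options. The `PlaceVersion` shape with per-field timestamps is a hypothetical model, and the sketch assumes timestamps from different devices are comparable, which real device clocks do not guarantee.

```typescript
interface PlaceFields {
  name: string;
  notes: string;
}

// Hypothetical version record: each field carries its own last-edited time.
interface PlaceVersion {
  fields: PlaceFields;
  updatedAt: { name: number; notes: number };
}

// Last-write-wins: whichever version has the most recent edit replaces the other whole.
function lastWriteWins(a: PlaceVersion, b: PlaceVersion): PlaceVersion {
  const latest = (v: PlaceVersion) => Math.max(v.updatedAt.name, v.updatedAt.notes);
  return latest(b) > latest(a) ? b : a;
}

// Per-field merge: each field is taken from whichever version edited it most recently.
function mergePerField(a: PlaceVersion, b: PlaceVersion): PlaceVersion {
  const pick = (key: keyof PlaceFields): PlaceVersion =>
    b.updatedAt[key] > a.updatedAt[key] ? b : a;
  const nameSource = pick("name");
  const notesSource = pick("notes");
  return {
    fields: { name: nameSource.fields.name, notes: notesSource.fields.notes },
    updatedAt: { name: nameSource.updatedAt.name, notes: notesSource.updatedAt.notes },
  };
}

// The phone renamed the place at t=10; the watch added notes at t=20.
const phone: PlaceVersion = {
  fields: { name: "Cafe Luna", notes: "" },
  updatedAt: { name: 10, notes: 0 },
};
const watch: PlaceVersion = {
  fields: { name: "Cafe", notes: "good espresso" },
  updatedAt: { name: 0, notes: 20 },
};
```

With these inputs, last-write-wins keeps the watch version and silently drops the rename, while the per-field merge keeps both edits. Neither is wrong in code terms; which one users should experience is the part that can't be delegated.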

By lines of code, the agent probably wrote more than I did. But that's not really the split that matters. It handled the surface area; I handled the handful of decisions where being wrong is expensive. That's roughly the ratio I aim for on most real features.

The same pattern shows up in the Chrome extensions

Cuelio, Crowra, and TableSnap are smaller and more contained than the Expo/NestJS side, which makes the same pattern easier to see at a different scale.

With Crowra — a side-panel tool that audits a page for SEO, schema, and AI-readiness signals — the agent is great at adding a new check once the existing ones establish the pattern. "Also flag pages missing an og:image" is mechanical: the existing checks are basically the spec, and getting one wrong has a small blast radius.
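Here is a minimal sketch of why "the existing checks are basically the spec". None of this is Crowra's actual code: the `PageSnapshot`, `Finding`, and `Check` types and the check names are hypothetical, chosen only to show the shape of the pattern.

```typescript
// Hypothetical snapshot of the parts of a page the checks look at.
interface MetaTag {
  name?: string;
  property?: string;
  content?: string;
}

interface PageSnapshot {
  url: string;
  metaTags: MetaTag[];
}

interface Finding {
  checkId: string;
  message: string;
}

// Every check has the same shape: take a page, return a finding or null.
type Check = (page: PageSnapshot) => Finding | null;

const hasContent = (tag: MetaTag | undefined): boolean =>
  tag !== undefined && (tag.content ?? "").trim() !== "";

// An existing check that establishes the pattern.
const missingDescription: Check = (page) =>
  hasContent(page.metaTags.find((t) => t.name === "description"))
    ? null
    : { checkId: "missing-description", message: "Page has no meta description." };

// The new check: same shape, different tag.
const missingOgImage: Check = (page) =>
  hasContent(page.metaTags.find((t) => t.property === "og:image"))
    ? null
    : { checkId: "missing-og-image", message: "Page has no og:image." };

const checks: Check[] = [missingDescription, missingOgImage];

function audit(page: PageSnapshot): Finding[] {
  return checks
    .map((check) => check(page))
    .filter((finding): finding is Finding => finding !== null);
}
```

Once one check exists in this form, the next one is mostly a matter of copying the shape, which is why the blast radius of a mistake stays small.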

What I keep for myself is deciding whether a check is worth adding at all. A side panel like this lives or dies on not being noisy, and every new check is a tradeoff between "more thorough" and "more overwhelming." That's a product call, and the agent has no stake in it either way.

With TableSnap, the constraint that matters most is that table data never leaves the browser — it's local-first by design. That's exactly the kind of rule an agent will respect if you say it out loud, and might quietly step around if you don't, the moment "add an export option" starts to sound like it'd be easier with some convenient API call. I don't read that as the agent being careless. It's a reminder that a constraint that only lives in my head doesn't exist as far as it's concerned.
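For a sense of what a local-only export can look like, here is a minimal sketch of CSV generation that needs no network call at all. It is not TableSnap's implementation: `escapeCsvCell` and `toCsv` are hypothetical names, and the quoting follows the common CSV convention of doubling embedded quotes.

```typescript
// Quote a cell only when it contains a comma, a quote, or a line break.
function escapeCsvCell(value: string): string {
  return /[",\r\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

// Turn rows of cells into a CSV string, entirely in memory.
function toCsv(rows: readonly (readonly string[])[]): string {
  return rows.map((row) => row.map(escapeCsvCell).join(",")).join("\r\n");
}

// Usage with a small table that includes the awkward cases.
const csv = toCsv([
  ["name", "note"],
  ["Cafe, Luna", 'say "hi"'],
]);
```

In a browser, the resulting string can be wrapped in a `Blob` and offered as a download through an object URL, so the data never has to leave the page. Spelling that out in the task is what turns the constraint from something in my head into something the agent can follow.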

The one rule that covers most of this

If I had to boil it down: give the agent a well-defined problem with a short feedback loop, and hold on to the decisions that are expensive to get wrong.

"Well-defined" means it isn't guessing at your conventions or priorities, because if it has to guess, it will — confidently, plausibly, and sometimes wrong in ways that are hard to spot until later. "Short feedback loop" means it can check its own work against the types, the tests, the running app, before you're the one finding the bug. And "expensive to get wrong" is a short list — architecture, data integrity, security boundaries, scope. Almost everything else is fair game.

That list isn't fixed, either. A year ago mine was longer. It'll probably be shorter again next year. But right now, on real projects with real users and real data, that's roughly where I draw the line — and it's why I can lean on these tools heavily without it feeling like a gamble.

Conclusion

Whether an AI agent can write code isn't really the interesting question anymore. It can, and it's gotten good at it fast.

What's more interesting is what changes about the job once that's true. For me, the answer has been: less than the hype would have you believe, and more than I expected when I started.

The work shifts from typing out the implementation to specifying the problem precisely, checking the result honestly, and owning the decisions that are hard to undo. That was always the harder half of the job. It's just that now it's nearly all of it — and if I'm honest, it's the half I find more interesting anyway.
