Loops work when you spend the proper amount of time to understand what you want ahead of time. The prerequisite is clarity — enough clarity that you could write a careful specification that you could hand off to a junior colleague.
Often, it takes 5-6 broken crappy versions of a thing until you understand that. There is no accelerating the 5-6 broken crappy versions - there’s no agent tech that’s going to help your meat brain avoid thinking time.
So most of my time is iterating between these two phases: I don’t understand what I want, I need to read and write and play with code, okay it’s been long enough I think I know what I want (it is extremely easy to deceive yourself) … okay now I do actually know what I want and I can write a loop.
Many people think they can jump ahead with agents. You cannot fake understanding or clarity. It is painfully obvious when someone skipped that meat brain understanding phase.
mccoyb
This ties into something I have been saying for months: LLMs are great at finishing tasks, but bad at aesthetics and taste.
There are two kinds of work: One is goal-driven work, where we have a goal to achieve, and we care very little about how we get there. Security is a perfect example; if you want to exploit a system, you rarely care about how beautiful the exploit is, all you want is access to those super secret nuclear plans. Research is also like this; "research-quality" code was famously terrible, even before the age of AI.
The other kind of work is taste-driven work. People think that, when they're adding a feature to a large codebase, their goal is to add that feature, but that is often not the case. Keeping the codebase amenable to future changes is often far, far more important than this specific feature, and that requires taste. Note that maintainability and code quality aren't synonymous; code quality is just a means to an end, and that end is maintainability.
miki123211
My experience is that I am bottlenecked on specs. The agent loop is less of a thing for me now.
If I can get a clear understanding of what I want to build, communicate that to Claude Code in planning mode with the goal to write an actionable spec (not code, plan to write the spec) then I tend to get very good results once the agent goes to implement.
But this strategy, while effective, puts a big load on me to write the specs. The agent tends to knock each one out of the park (usually 2 to 3 follow ups based on code review) but then I'm back at the stage that requires the spec.
Another issue for me is that when I step away, if the agent finishes a task and could technically start on an existing spec (no overlap on files so no conflict possible) it doesn't know it can just create a new branch and start. Before I go to bed I'll often say "do task X and once done and pushed start on task Y". But I haven't had luck beyond that. Often I find that it starts on Y and has a question and then the agent is idle the rest of the time.
The final issue is dependency coupled with the above. For example, today I was writing a background job processor. Obviously, the jobs that are in subsequent tasks require the system. That happens with some frequency. Even the specs need to be refreshed after the implementation to take any details that were resolved at coding time into account.
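To make the dependency point concrete, here is a minimal sketch of how a harness could work out which specs are safe to start next. The spec names and the dependency map are hypothetical, made up for illustration; only `graphlib.TopologicalSorter` from the Python standard library is real.

```python
from graphlib import TopologicalSorter

# Hypothetical specs: each key maps to the set of specs it depends on.
SPEC_DEPENDENCIES = {
    "job-processor": set(),
    "email-digest-job": {"job-processor"},
    "report-export-job": {"job-processor"},
    "settings-page": set(),
}


def ready_batches(dependencies):
    """Yield batches of specs whose dependencies have all been completed."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the specs depend on each other
    while sorter.is_active():
        batch = sorted(sorter.get_ready())
        yield batch
        # In a real harness, this is where the agent would implement the batch
        # and the dependent specs would be refreshed before continuing.
        sorter.done(*batch)


batches = list(ready_batches(SPEC_DEPENDENCIES))
for number, batch in enumerate(batches, start=1):
    print(f"batch {number}: {', '.join(batch)}")
```

Everything in one batch is independent of the rest of that batch, so in principle each spec could go on its own branch. It does not solve the "agent has a question and sits idle" problem.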
But I am just on the cusp of wanting the outer loop. The gate is almost entirely on spec creation and PR review. In places where those gates don't matter, I want the agent to keep chugging away.
As an aside, I strongly believe we need to start using tools that are better for LLMs even if they are worse for us. For example, Rust is annoying because the compiler is so strict. Bad for me, great for LLMs.
stillpointlab
The author is spot on about the paradigm change of software as a lifeform. Living things provide us with genuine interactions and experiences of learning and growing, without forcing us to understand the code - You can learn to work with animals and plants without understanding their genetics at all. I believe this is how our relationship with software must develop, and in order to get there, we'll need to learn to design and develop software in a completely new way. I've been testing this hypothesis in my spare time, hacking together a server-browser system I call Mycelium. It's a bit like OpenClaw, except you can use it to create private local Webs, and print custom 2D Electron browsers to view and work in these webs.
livingsoft
>Yet even with a lot of manual steering, that type of code does not come out of LLMs naturally, and even if the code comes out naturally like that, they will still attempt to handle now impossible errors.
This is something I’ve struggled to fight against in many PR reviews. Especially once already written, convincing someone that their excessive null checking is harmful is an uphill battle. Short of better modeling (and languages that allow for sum types to enable it), I haven’t been able to come up with a universally convincing argument against this kind of “shotgun parsing.”
Maybe it really just isn’t that big of a deal? But when actually reading through and refactoring a codebase I’ve always found it frustrating to manage these unnecessary checks. Sometimes they’re nearly impossible to delete safely once present without first adding some kind of logging or broad investigation.
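For what it's worth, the argument I find easiest to show in a review is a sketch like the one below: validate once at the boundary, then pass around a type that cannot be malformed. The `User` type and its fields are hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    name: str
    email: str


def parse_user(raw: dict) -> User:
    """Validate once at the boundary; raise if the input is malformed."""
    name = raw.get("name")
    email = raw.get("email")
    if not isinstance(name, str) or not name:
        raise ValueError("user needs a non-empty name")
    if not isinstance(email, str) or "@" not in email:
        raise ValueError("user needs an email address")
    return User(name=name, email=email)


def greeting(user: User) -> str:
    # No None checks here: a User cannot be built without both fields,
    # so downstream code has nothing left to defend against.
    return f"Hello {user.name} <{user.email}>"


print(greeting(parse_user({"name": "Ada", "email": "ada@example.com"})))
```

Once every caller goes through `parse_user`, a null check inside `greeting` is provably dead code, which makes it much easier to argue for deleting it.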
mmillin
Code is part of a shared and built understanding of an information system.
If these loopers mean we all have to move at this continuous wave of software happening, then we get to the highest levels of logical information system design and it's all human judgement and balancing of business requirements to fit a given niche in a company or market. So all the programmers have to become business analysts/market researchers/businessmen...except the specific niches where AI tooling can't really clank well...or the end of the subsidized AI token era makes all this looping too expensive to continue. This feels like expert systems and Symbolics Lisp machines redux, where we briefly ran into the fact that it's not so much the code itself not being able to do stuff, it's that your company's org always gets shipped, so if you can't change your company org, your software only has so much flexibility.
Dataflow diagrams and domain knowledge / domain modeling / ubiquitous languages may become the metalanguage that we start to use to set quality, functional, and non-functional standards and conventions. We make the "looper clankers" ensure that they fulfill those data / behavior / performance contracts before saying what "done" is, because "done" is no longer just code that compiles, code that builds, code that deploys, or even code that sits in production; it's code that fulfills all of the user requirements, operator requirements, and maintainer requirements. So, the language used may be required to make us all turn into business analysts and software architects more than syntax knowers. The revenge of UML and the return of declarative / logical design / BDD triumphing?
(Typo scan by gemma4-12b but I didn't let it alter my message)
Multicomp
What does any of that mean in practice? It's just rambling about abstract concepts that seem to be designed to hint at a bigger picture, when it's just getting AI to write code for you.
Is this where it's going? Having to mystify our roles so it seems like we're still the thought leaders when actually we're just becoming pseudo-teachers that try and herd our group of AI idiots to the right conclusion for us so we don't have to, without ever giving away that it's just all techno-babble?
weego
> the right fix is not "handle every malformed case." ... [LLMs] will still attempt to handle now impossible errors.
This is the number one code smell from LLMs and I don't know why they are so obsessed with it. In Python, it often comes as `hasattr` checks on types that are defined to have that attribute, in a code base that is fully type-checked.
Why do they do that? Is it from pre-training or reinforcement? If the latter, can the labs please fix this?
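A minimal illustration of the smell, using a made-up `Invoice` type (the names are hypothetical, not from any real code base):

```python
from dataclasses import dataclass


@dataclass
class Invoice:
    total_cents: int


def total_defensive(invoice: Invoice) -> int:
    # The smell: the declared type already guarantees `total_cents` exists,
    # so the fallback branch is dead code a reader still has to reason about.
    if hasattr(invoice, "total_cents") and invoice.total_cents is not None:
        return invoice.total_cents
    return 0


def total_direct(invoice: Invoice) -> int:
    # Trust the declared type and let the type checker catch misuse.
    return invoice.total_cents


print(total_defensive(Invoice(1250)), total_direct(Invoice(1250)))
```

The defensive version is also worse at runtime: hand it the wrong object and it quietly returns `0` instead of failing, which hides the bug the check was supposedly guarding against.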
boscillator
> My current status is that I have not had much success with this way of working for code I deeply care about
If something is judgement heavy, "code I care deeply about", then I don't really agree with the direction of travel here. Don't try to delegate decisions you care deeply about.
I do like the framing of agent loop vs harness loop, but only delegate stuff that you can accurately specify in advance, that usually means stuff that's repeatable in my case ("hey go see how i did X, do that but for Y"), and that inherently means stuff that's predictable.
For stuff where lack of my judgement as input is just going to cause me to say "no", we're down to collaborating in the "agent loop" as Armin puts it. And that's totally fine. It's fast, but also safe.
Remember before AI coding assistants, sometimes you'd get an engineer join your team who was SUPER productive, your peers would be jealous "oh yeah but you guys only got all that done because you have X on your team!" - they didn't live the curse of having that kind of person around - if you don't have them PERFECTLY aligned, then they run off at breakneck speed in the wrong direction.
CraigJPerry
> the code it produces is slop, but that’s more the fault of the model than the harness not being a good judge on if a step in the workflow resulted in a net improvement or completion.
I don’t know. I’ve invested heavily in building internal tools that scaffold code and lint the filled-in architecture/code design. That, combined with a ratchet pattern (new rules are allowed to land with existing violations across the code base, which then get fixed asymptotically), is working pretty well.
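For anyone who hasn't seen a ratchet, here is a rough sketch of the idea, not my actual tooling. The rule names and counts are invented for illustration: each rule's violation count may stay the same or go down, never up, and a brand-new rule starts from whatever count it has today.

```python
def ratchet(baseline: dict[str, int], current: dict[str, int]) -> tuple[list[str], dict[str, int]]:
    """Compare violation counts per lint rule against a stored baseline.

    Returns (regressions, tightened_baseline). A rule regresses when it has
    more violations than its baseline. A rule missing from the baseline is
    new, so its current count becomes its starting baseline.
    """
    regressions = sorted(
        rule for rule, count in current.items()
        if count > baseline.get(rule, count)
    )
    # The baseline only ever moves down, which is what makes it a ratchet.
    tightened = {
        rule: min(count, baseline.get(rule, count))
        for rule, count in current.items()
    }
    return regressions, tightened


# Hypothetical rule names and counts.
baseline = {"no-cross-layer-import": 12, "handlers-return-result": 3}
current = {"no-cross-layer-import": 9, "handlers-return-result": 4, "no-bare-except": 27}

regressions, new_baseline = ratchet(baseline, current)
print("regressions:", regressions)
print("new baseline:", new_baseline)
```

In CI you would fail the build when `regressions` is non-empty and commit the tightened baseline otherwise, so the agents can never add to the pile while the old violations get burned down.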
Example - all modules have tightly scoped design primitives (I’m using hexagonal architecture for the backend, for example). And all code has BDD tests, which is what I spend much of my time reviewing, since reviewing cases written in human sentences is easier than looking at so many files of code.
There is a relentless upkeep to draft rules that respond to the workarounds the agents come up with to adhere to the design I want, but it’s slowly approaching perfect. What has helped here tremendously is that I use hooks to run an LLM-as-a-judge over the decisions the LLMs make, and then have them review/raise the questionable ones after a first pass is completed. In general, this is snuffing out the slop effectively.
All to say, someone asked me recently what model I prefer. In this approach, the model doesn’t really matter to me because the code is consistently what I want. I’ll choose a model because it has better MCP speed (Codex), or a more thorough scope (Claude Code).
Where this IS true is when we’re building a net new pattern. The agents are not great at it. BUT most code can fit into the few patterns I’ve created, and for what can’t, you lock down a new pattern to enforce over a couple of iterations. Almost everything, at least in SaaS, follows a template.
dirtbag__dad