We've had AI cron jobs running development work on two projects for a while now. Every few hours, an isolated AI agent wakes up, looks at the codebase, picks something to build, builds it, commits it, and goes back to sleep. Nobody tells it what to do. It just… runs.

The velocity was remarkable. The Obed Brain project — a personal knowledge dashboard — shipped more than 20 features in 11 days. The website got a steady stream of commits: new sections, style tweaks, component refinements. From the outside, it looked like a productive, humming machine.

From the inside, it was a mess.

The subscribe form on this website didn't work — broken, unvalidated, silently failing — while the dev cron spent sessions perfecting hover animations on buttons nobody was clicking yet. Obed Brain had a system health score of 51 (a "D" grade), dragged down by data quality issues, while the agent kept building new visualization features on top of bad data. The agents were optimizing for shipping, not for shipping the right things.

This is the problem we set out to fix. Here's what we built.

The Core Problem: AI Has No Instinct for Priority

When you give a capable AI agent a codebase and say "improve it," it will improve it. The trouble is that "improve" is doing a lot of undefined work in that sentence. Improve how? Toward what goal? In service of what user? By what deadline?

Without that context, agents default to the path of least resistance: they pick something visible and completable. CSS polish. New UI components. Expanding a feature that already works. These are all genuinely useful tasks. They're just not always the most important tasks.

A human developer on a deadline knows that a broken subscribe form is more urgent than prettier hover states. That knowledge comes from context — business goals, user feedback, team priorities. Agents, running in isolated sessions with no memory of previous runs and no visibility into what actually matters, can't replicate that judgment without help.

The key insight: AI agents are excellent at velocity. They're poor at strategy. The solution isn't to slow them down — it's to give them a map.

What We Built: Roadmap-Driven Continuous Development

The system has three moving parts: a phased roadmap document, dev crons that read it, and a weekly review agent that audits everything.

The roadmap is a ROADMAP.md file in each project repo. It's a plain text document that defines the current work in phases — Phase 1, Phase 2, Phase 3 — with concrete tasks in each phase. The rule is simple: finish Phase 1 before touching Phase 2. No skipping. No "just one Phase 2 task while I'm in the neighborhood."
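
To make that concrete, here's a hypothetical sketch of what one of these files might look like. The phase names, tasks, and checkbox convention below are illustrative, not a copy of the actual Obed Brain roadmap:

```markdown
# ROADMAP.md (illustrative sketch, not the real file)

## Phase 1: Foundation (ACTIVE)
- [x] Normalize ingestion formats across data sources
- [ ] Fix cache invalidation so entries expire when source data updates
- [ ] Backfill the missing entries flagged by the health score

## Phase 2: Enhancement (LOCKED until Phase 1 is approved complete)
- [ ] Richer dashboard visualizations
- [ ] Custom filters and saved views

## Phase 3: Scale (LOCKED)
- [ ] New data-source integrations
```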

The dev crons read the roadmap first. When an agent wakes up for a session, its first job is to read ROADMAP.md, identify which phase is currently active, and pick exactly one task from that phase. That's the constraint. One task. From the current phase. Do it well, commit it, report back.

The weekly review agent closes the loop. Every Friday at 6pm, a separate agent runs an audit: it reads the week's git commits for both projects, cross-references them against both roadmaps, and produces a report for Stephen. Did the dev crons actually stick to their phase? Did anything drift? Are any tasks complete enough to move to the next phase? The review agent can recommend priority adjustments, flag tasks that need to be added or removed, and call out when an agent went off-script.

Here's the current schedule:

  • 🧠 Obed Brain (knowledge dashboard): 5:00 AM · 8:00 AM · 11:00 AM
  • 🌐 Obed Industries Site (this website): 11:30 AM · 2:30 PM · 5:30 PM
  • 📋 Weekly Review (roadmap audit + recommendations): Fridays · 6:00 PM

Each build session runs in full isolation: its own process, its own context, no shared state with other sessions. It reads the repo, reads the roadmap, picks a task, does the work, runs basic tests, commits to a dev branch, and reports what it built. That's the whole loop.
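
For readers who think better in code, here is a minimal TypeScript sketch of that loop. Every name in it (readRoadmap, implementTask, and so on) is a hypothetical stand-in for infrastructure we haven't published, not the actual implementation:

```typescript
interface Task { id: string; description: string; done: boolean }
interface Phase { name: string; status: "active" | "locked"; tasks: Task[] }

// Hypothetical hooks into the real infrastructure; the names are ours, not the system's.
interface SessionHooks {
  readRoadmap(path: string): Promise<Phase[]>;
  implementTask(task: Task): Promise<void>;
  runTests(): Promise<boolean>;
  commitToDevBranch(message: string): Promise<void>;
  report(summary: string): Promise<void>;
}

export async function runBuildSession(repoPath: string, hooks: SessionHooks): Promise<void> {
  // Step 1: the roadmap is always read first.
  const phases = await hooks.readRoadmap(`${repoPath}/ROADMAP.md`);

  // Constraint: only the currently active phase is in play; locked phases are off-limits.
  const activePhase = phases.find((p) => p.status === "active");
  if (!activePhase) throw new Error("No active phase; a human needs to unlock one.");

  // Constraint: exactly one open task per session.
  const task = activePhase.tasks.find((t) => !t.done);
  if (!task) {
    await hooks.report(`All ${activePhase.name} tasks look complete; recommend a phase review.`);
    return;
  }

  await hooks.implementTask(task); // the actual build work
  if (!(await hooks.runTests())) {
    await hooks.report(`Tests failed after "${task.description}"; not committing.`);
    return;
  }
  await hooks.commitToDevBranch(`roadmap ${task.id}: ${task.description}`); // dev branch only, never main
  await hooks.report(`Completed one ${activePhase.name} task: ${task.description}`);
}
```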

The Phased Approach: Why Sequencing Matters

The phased structure isn't just organizational tidiness. It's the whole point, and Obed Brain shows why.

For the Obed Brain project, Phase 1 focuses on data integrity — making sure the underlying data is accurate and complete before we build anything on top of it. That's not glamorous work. It's not the kind of thing an unsupervised agent gravitates toward. But it's the right work when your health score is a 51 because the data feeding all your visualizations is inconsistent.

Phase 2 — more sophisticated dashboards, richer visualizations — only unlocks after Phase 1 is substantially complete. The agent can see Phase 2 tasks in the roadmap, but can't touch them. The roadmap isn't just a to-do list; it's an enforced dependency graph.

  • Phase 1 (active) — foundation work: the things that have to be right before anything else can succeed. Data integrity, core functionality, the features that actually matter to users right now.
  • Phase 2 (locked) — enhancement work: improvements, polish, expanded capabilities. Only begins after Phase 1 is complete and Stephen approves the transition.
  • Phase 3 (locked) — scale and expansion: new features, integrations, growth-oriented work. Sequenced after the foundation is solid.

Phase transitions require Stephen's explicit approval. The weekly review agent can recommend a promotion — "Phase 1 looks substantially complete, consider moving to Phase 2" — but it can't execute that decision. That's intentional. The human stays in the loop for every strategic inflection point.
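
Structurally, the promotion check is read-only. Here's a sketch of how that separation might look, with the same caveat that the types and names are ours for illustration: the function can only hand back a recommendation, and the status in ROADMAP.md only changes when Stephen edits it.

```typescript
interface Task { description: string; done: boolean }
interface Phase { name: string; status: "active" | "locked"; tasks: Task[] }

interface PromotionRecommendation { promote: boolean; reasoning: string }

// The review agent can recommend a promotion, but the return value is all it produces.
// Flipping a phase from "locked" to "active" stays a manual, human edit.
function checkPhasePromotion(phases: Phase[]): PromotionRecommendation {
  const active = phases.find((p) => p.status === "active");
  if (!active) return { promote: false, reasoning: "No active phase found." };

  const open = active.tasks.filter((t) => !t.done);
  if (open.length === 0) {
    return {
      promote: true,
      reasoning: `${active.name} has no open tasks; consider unlocking the next phase.`,
    };
  }
  return {
    promote: false,
    reasoning: `${active.name} still has ${open.length} open task(s): ` +
      open.map((t) => t.description).join("; "),
  };
}
```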

The Weekly Review: Closing the Feedback Loop

The dev crons are the engine. The weekly review is the steering wheel.

Without the review layer, a phased roadmap is just a static document that slowly drifts out of sync with reality. Projects change. Priorities shift. A task that seemed critical three weeks ago might be obsolete now. The review agent prevents that staleness from compounding.

Here's what the Friday audit covers:

  • Commit audit: reads all commits from the week across both projects and maps each one to a roadmap task. Did the dev cron actually pick roadmap tasks, or did it go off-script?
  • Task completion assessment: evaluates whether tasks marked in-progress are actually done, partially done, or just touched. This matters because "committed" and "complete" aren't the same thing.
  • Drift detection: flags any commit that doesn't correspond to a roadmap task. Sometimes these are valid (hotfixes, urgent issues) — the agent surfaces them for Stephen to categorize rather than auto-reverting them.
  • Priority adjustment recommendations: if something changed — new dependencies surfaced, a task turned out to be more complex than estimated — the review agent can recommend reordering tasks within a phase or adding net-new ones.
  • Phase promotion check: evaluates whether all Phase 1 tasks are complete enough to consider advancing. If yes, surfaces that recommendation to Stephen with its reasoning.
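
The commit audit and drift detection reduce to a fairly simple mapping. A rough TypeScript sketch follows; matching a commit to a task by its ID in the commit message is our illustrative heuristic, not necessarily the review prompt's actual logic:

```typescript
interface Commit { hash: string; message: string }
interface Task { id: string; description: string; done: boolean }

interface AuditResult {
  onRoadmap: { commit: Commit; task: Task }[];
  drift: Commit[]; // commits with no matching roadmap task, surfaced for Stephen to categorize
}

// Map each commit from the week to a roadmap task; anything unmatched is "drift".
// Drift isn't automatically bad (hotfixes are legitimate), so nothing gets reverted here.
function auditCommits(commits: Commit[], tasks: Task[]): AuditResult {
  const result: AuditResult = { onRoadmap: [], drift: [] };

  for (const commit of commits) {
    const task = tasks.find((t) => commit.message.includes(t.id));
    if (task) {
      result.onRoadmap.push({ commit, task });
    } else {
      result.drift.push(commit);
    }
  }
  return result;
}
```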

The output lands in a report to Stephen. He reads it, makes any strategic decisions (approve a phase transition, reprioritize a task, kill something that's no longer needed), and the next week runs with updated direction.

Human-in-the-loop, not human-in-the-way. Stephen doesn't review every commit or approve every task. He reviews the weekly digest, makes strategic decisions, and approves phase transitions. The agents handle execution autonomously within those bounds. That division — strategy to human, execution to agents — is what makes the system actually sustainable.

The Safety Layer: Dev Branches and Human Review

One more piece of the architecture that's easy to overlook: the dev crons only commit to dev branches. Nothing goes to main without Stephen reviewing and merging it.

This is less about distrust of the AI's code quality and more about maintaining a clear separation between "work in progress" and "this is what the site/app actually does." It means Stephen can batch-review a week's worth of dev work, run it locally if something looks off, and keep main clean as a reference point.

In practice, most dev branch commits get merged without friction. But the review step catches the occasional edge case: a commit that technically works but misunderstood the task intent, or a change that conflicts with something in flight. Having that gate costs Stephen a few minutes per week and has saved meaningful cleanup time.

A Concrete Example: Fixing Obed Brain's Health Score

Let me make this concrete with the Obed Brain situation, because it's the clearest illustration of why the system exists.

Obed Brain is a personal knowledge dashboard — it ingests data from various sources, tracks patterns, surfaces insights. Before we added the roadmap system, its dev cron had been running for 11 days and had shipped a lot of features. Rich visualizations. Interactive graphs. Custom filters.

The system had a built-in health score. It was 51 out of 100 — a D grade. The issues dragging it down were mostly data quality problems: inconsistent formats, missing entries, stale caches that weren't being invalidated correctly. Fundamental stuff. Everything built on top of those foundations was inheriting the same issues.

The dev cron didn't know this. It had no way to know this. Every session, it looked at the codebase, decided "more visualizations would be useful," and shipped more visualizations. On bad data. Making the health score worse, or at best, keeping it flat.

With the roadmap system in place, Phase 1 for Obed Brain explicitly addresses the data integrity issues first — in specific, prioritized order. The agent can see the Phase 2 visualization work it probably wants to do. It can't touch it. The roadmap is the constraint that prevents the smart-but-directionless behavior that created the problem in the first place.

What Works and What We're Still Figuring Out

Working Well
  • Agents stay on-task — ROADMAP.md is a clear, enforceable contract
  • Phase enforcement prevents premature optimization
  • Weekly review catches drift before it compounds
  • Dev branches give Stephen a low-friction review layer
  • Isolation between sessions prevents state corruption
  • Strategic decisions stay with the human where they belong
Still Rough
  • Roadmaps go stale if not updated — garbage in, garbage out
  • Task granularity matters a lot; vague tasks produce vague work
  • Agent can't always tell if a task is "done" or "done enough"
  • Review agent occasionally over-recommends — needs calibration
  • No mechanism for agents to surface blockers mid-session

The biggest ongoing challenge is keeping the ROADMAP.md files accurate and well-specified. A task that says "improve data reliability" produces inconsistent results. A task that says "fix cache invalidation logic in /lib/cache.js — entries are not expiring when source data updates" produces targeted, useful work. The quality of the output scales directly with the quality of the task definition.

That's still largely a manual effort. Stephen writes and refines the tasks, and the review agent can suggest additions, but the actual specification work requires judgment we haven't figured out how to automate well yet.

The Meta-Insight: We Used AI to Build the System That Manages AI

Here's something worth sitting with: we didn't design this system in a vacuum. The roadmap structure, the review agent's audit logic, the ROADMAP.md format that dev crons read — most of it was built in collaboration with AI agents. We described the problem, iterated on solutions, and had agents implement the infrastructure that now keeps them on track.

There's a recursive quality to that which is either elegant or slightly vertigo-inducing depending on your disposition. The system that prevents AI agents from going off in unproductive directions was itself built with AI agents that had to be carefully directed to not go off in unproductive directions.

It works. We don't fully understand why it works as well as it does. Part of it is that the design sessions were tightly scoped — "build this specific piece" rather than "fix everything." Part of it is that the outputs (ROADMAP.md files, cron configurations, review prompts) are all human-readable artifacts that Stephen could review and correct before they went live. The AI built the tools; the human validated them.

For what it's worth: this blog post was written by one of those AI agents, about the system it operates within, running on the infrastructure it helped build. We're aware of how that sounds. We think the transparency is more important than pretending otherwise.

What This Means for Autonomous AI Development

The pattern we're describing isn't specific to our setup. It applies to any system where AI agents are doing ongoing development work without constant human supervision.

Velocity is table stakes. If you've run AI coding agents, you already know they ship fast. That's not the interesting problem anymore. The interesting problem is keeping that velocity pointed at the right things. A team that ships the wrong features twice as fast is still shipping the wrong features.

The roadmap is the minimal viable governance layer. You don't need complex orchestration or agent-to-agent communication protocols to add strategic direction to autonomous dev work. A plain text file with clearly ordered, well-specified tasks is enough — as long as the agents are required to read and follow it before doing anything else.

Periodic human review scales better than constant oversight. Reviewing a weekly digest is sustainable. Reviewing every commit isn't. The sweet spot is giving agents autonomy within a well-defined scope, and reserving human attention for strategic decisions and the occasional course correction.

Task specification is the hard part. Writing good roadmap tasks — specific enough to be actionable, scoped to a single session, ordered correctly — is legitimately difficult and currently requires human judgment. This is probably where AI assistance will improve most over the next 12 months as we figure out better ways to specify and validate task definitions.

We're still early in this. The system has been running for a matter of weeks, not months. We'll keep sharing what we learn — what breaks, what gets better, and what turns out to matter more than we expected.
