Harness
The agent engineering blueprint

The harness comes
before the loop._

Most people who build agents fight the loop. The loop is rarely the problem. The folder underneath it — the .claude/ directory — is what's half-built, and that's why the run stalls two or three iterations in.

Open .claude/ in any project that actually works and you'll find about seven pieces carrying the load: CLAUDE.md, settings.json, hooks/, agents/, skills/, .mcp.json, and a state file such as MEMORY.md. This page walks all seven — then climbs onto the loop that rides on top: what a loop actually is, the six blocks every durable one shares, five copy-paste assets that stand one up, the ways loops fail, and a 30-day plan to earn each layer of autonomy. No framework. No subscription. Exact paths, exact contents.

7 harness files 6 loop blocks 5 build assets 30-day rollout

Two layers, one setup

the mental model

Name the layer and you fix the diagnosis. Miss it and you rewrite prompts to solve problems that live somewhere else entirely.

The harness is the .claude/ folder. It's fixed — it doesn't change from one run to the next. The loop is what executes inside it: set a goal, take an action, verify the result, write something to memory, then decide whether to continue or stop.

Think of the harness as the kitchen and the loop as the recipe. A kitchen with no recipe is idle floor space; a recipe with no kitchen is a wish. Each is useless without the other — and, more usefully, each fails in its own way.

Which layer is broken?

Token blowups, prompt fatigue, permissions the agent keeps re-asking for — those are harness problems. Loops that never converge, verify steps that wave garbage through, scheduled runs that drift off-goal — those are loop problems. When you treat the whole thing as one undifferentiated "agent setup," you never see that split, so you keep applying loop fixes to harness bugs.

The order is not negotiable: harness first, loop second. The harness decides what each iteration is allowed to do. Permissions decide whether the loop can write to disk. Subagents decide whether verification gets a clean context. Skills decide whether the loop can specialize. Hooks decide whether the loop even fires on the trigger you intended. Leave those decisions open and the loop guesses — and a guessing loop fabricates: files that don't exist, commands it never confirmed, tests that pass without testing anything. Lock the harness down and the guessing stops.

H1

CLAUDE.md

repo root

The first file Claude Code reads on launch. Its contents become standing context for the whole session, so this is where the project's shape lives.

Put the essentials here: directory layout, language and framework, the commands that genuinely work, the conventions the agent must honour, and — explicitly — the things it must never do. Keep it at the repo root, not tucked away in a docs folder.

markdown · CLAUDE.md
# Project: my-app
Stack: Next.js 14, TypeScript, Postgres, Tailwind.
Layout: `app/` (routes), `lib/` (helpers), `db/migrations/`.

## Commands
- `pnpm dev`      — local
- `pnpm test`     — vitest
- `pnpm db:migrate` — apply migrations

## Never
- Edit `db/migrations/*` after merge.
- Add deps without justification in the PR body.
- Bypass `lib/auth/` to reach user data.

Prevents: the planner re-deriving the project's shape from scratch on every iteration.

The failure here is bloat. The paper Less Context, Better Agents (arXiv 2606.10209) recorded task completion falling from 91.6% to 71% for no reason other than oversized standing context. Keep this file under 300 lines and prune it weekly — every extra paragraph is a tax charged on every future turn. For working reference shapes, centminmod/my-claude-code-setup ships three side by side.

H2

settings.json

.claude/settings.json

Home for the tool allowlist, environment variables, and hook registrations. Scope resolves managed > project > local > user, so a repo's rules always win over your personal defaults.

The change that pays for itself in an afternoon is an allow array for read-only Bash and MCP calls, paired with a deny list that keeps the destructive operations gated.

json · .claude/settings.json
{
  "permissions": {
    "allow": [
      "Bash(ls:*)",
      "Bash(git status:*)",
      "Bash(git diff:*)",
      "Bash(cat:*)",
      "Read(*)"
    ],
    "deny": [
      "Bash(rm -rf:*)",
      "Bash(git push --force:*)"
    ]
  }
}

Prevents: the agent stalling on a permission prompt for every ls, git status, and cat — while destructive commands still stop and ask.

Keep secrets out of this file. Put them in .claude/settings.local.json and add that to .gitignore. The full key reference lives in the Claude Code settings docs.

H3

hooks/

registered in settings.json

Deterministic scripts that fire on tool events: PreToolUse before a tool runs, PostToolUse after it, Stop when the agent ends a turn. Each is a matcher pattern plus a shell command.

The canonical first hook: match every Edit or Write and pipe the touched file through a formatter. After this, every edit lands in a known, consistent state.

json · settings.json → hooks
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          { "type": "command",
            "command": "npx prettier --write \"$CLAUDE_FILE_PATH\"" }
        ]
      }
    ]
  }
}

Prevents: every run being a matter of vibes. Hooks are your policy floor — keep them silent on success and loud only on failure.

H4

agents/

.claude/agents/*.md

Markdown files with YAML frontmatter. The main agent calls them through the Task tool, and each one runs in a fresh context window.

A minimal verifier is the highest-value one to build first. Its whole job is to read the goal and the diff and return a verdict — nothing else.

markdown · .claude/agents/verifier.md
---
name: verifier
description: Reviews a diff against the goal spec. Invoke after every change.
model: haiku
tools: [Read, Grep, Bash]
---

You are a verifier. Read the goal spec in `PROMPT.md`. Read the diff.
Return a JSON verdict: {passes: bool, failures: [{line, reason}]}.
Do not propose fixes. Do not run code. Do not be polite.

Prevents: the loudest failure of all — a reviewer living inside the maker's own context, which always agrees with itself. Pulling review into a fresh context breaks that.

For ready-made shapes, wshobson/agents (37K stars) offers 194 of them. For an adversarial verifier with eleven named shortcut checks — relaxed tests, swallowed errors, fake renames — see moonrunnerkc/swarm-orchestrator.

H5

skills/

.claude/skills/<name>/SKILL.md

Folders each containing a SKILL.md with YAML frontmatter. They load progressively: at session start only the name and description enter context; the full body loads only once the agent decides the trigger matches.

markdown · skills/db-migration-writer/SKILL.md
---
name: db-migration-writer
description: Writes Postgres migration files for this repo. Use when the
  user asks to add or alter a table, column, index, or constraint.
when_to_use: schema change requested, new feature needs a column,
  index missing on a hot query path
---

# Steps
1. Read `db/schema.sql` to confirm current state.
2. Write to `db/migrations/NNN_<verb>_<noun>.sql`.
3. Include both up and down. Test with `pnpm db:migrate --dry`.
4. Never touch existing migration files.

Prevents: a fifty-skill library costing fifty skills' worth of tokens on every single prompt.

Build skills reactively, not speculatively — three skills written the third time you hit the same task beat fifty scaffolded from a tutorial. Canonical pattern: anthropics/skills (155K stars). A maximal pre-built kit: affaan-m/ECC (222K stars).

H6

.mcp.json

repo root

The Model Context Protocol is the spec that lets the loop call live external tools. Servers are declared here. Three rules: only servers this work uses, prefer official ones for anything credentialed, and never install five "just in case."

json · .mcp.json
{
  "mcpServers": {
    "github": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-github"],
      "env": { "GITHUB_TOKEN": "${GITHUB_TOKEN}" }
    },
    "context7": {
      "command": "npx",
      "args": ["-y", "@upstash/context7-mcp"]
    }
  }
}

Prevents: stale-API hallucinations (via live library docs) and unaudited write access — never enable a write-scoped server before you have a hook logging every call.

Anthropic's set: modelcontextprotocol/servers (87K stars). Code-host integration: github/github-mcp-server (31K stars). Live docs: upstash/context7 (58K stars). Discovery index: punkpeye/awesome-mcp-servers (89K stars).

H7

state + memory

MEMORY.md + vault/

The seventh piece, and the one most people skip until a third project goes sideways. A MEMORY.md index at a known path, plus a vault holding project canon.

text · layout
~/.claude/memory/
  MEMORY.md            # index, links to the topic files below
  user-prefs.md        # voice, terse-vs-verbose, preferences
  project-decisions.md # "Postgres over Mongo on 2026-03-12, here's why"
  feedback-recent.md   # corrections you keep re-applying

~/vault/               # project canon — stable across sessions
  architecture.md
  api-spec.md
  post-mortems/

Prevents: re-applying the same correction every Tuesday. Memory holds what changes across sessions; the vault holds what doesn't.

The trap is treating memory as append-only — do that and it becomes the rot itself, so prune it every session. The theory behind why this matters is Anthropic's writing on context engineering, which names the failure mode: context rot. For production-grade session compression (a 200K-token transcript down to a 4K-token recap without dropping load-bearing facts), see thedotmack/claude-mem (84K stars).

L1

A loop is not a cron job

the control system

Both a cron job and a loop run on a schedule. The difference is the decision-maker inside. A cron job replays a fixed script. A loop hands each tick to an agent that reads the current state, chooses the next action, does it, checks the result, and decides what happens next.

FIG.01 — THE CONTROL LOOPruns until the goal holds · or the budget is spent
fail → write lesson, retry find_work inbox · board · tests builder makes the change reviewer grades it · leaner model commit + PR if the verdict passed remember append state to disk BUDGET CAP — an independent hard stop. spend > cap → halt(), whatever the goal condition says.

For a couple of years the default workflow was manual ping-pong: write a prompt, wait, skim the output, correct, send the next one. Your attention was the motor — step away from the keyboard and the agent stalled mid-turn. Loop engineering is the point where you stop being the motor. You write a small program — a shell loop, a hosted task, a few hundred lines of code — and that program is what holds the conversation with the model.

The shift in one line A sharp prompt fixes one turn. It says nothing about which turn comes next, what counts as done, who signs off before an irreversible action, or how spend stays capped. The loop is everything wrapped around the prompt.

The inner loop and the outer loop

Inner loop — reason → act → observe → repeat. An agent that can only suggest isn't looping. One that can run code, read the result, fix it, and run again is.

Outer loop — the control system you design around it: the trigger that starts a run, the reviewer that grades it, the memory that outlives a single run, and the stop rules that decide when it quits.

When is a loop actually worth it?

If the task is…Do this
A one-offJust prompt the model. A loop is overhead you don't need.
Repeating + clear pass/failBuild a loop. This is the sweet spot — verification is possible and the payoff compounds.
Vague, no crisp "done"Don't hand it to a while-loop. Pin down the goal first, then decide.
Rule of thumb Start simple. Add autonomy only when it pays for itself. Prompt engineering isn't dead — the leverage point just moved up a level, from the phrasing of one turn to the design of the system that runs many.
L2

The six building blocks

what every loop shares

Every loop that survives production solves the same responsibilities. Tools name them differently, but the parts line up. Two of these — verification and memory — are what separate a loop you can trust from one that just makes mistakes faster.

01 · TRIGGER

Automation

Something starts a run that isn't you typing — a schedule, a webhook, an event. It must ship bundled with a stop condition.

/loop · /schedule · headless run -p
02 · ISOLATION

Worktrees

Give each agent its own checkout so parallel runs don't overwrite each other. Skip it only if you ever run exactly one agent.

git worktree per sub-agent
03 · KNOWLEDGE

Skills

A markdown file holding the context you keep re-explaining — structure, rules, formats. Loaded on demand, written once.

SKILL.md · matched by its description
04 · REACH

Connectors

The tools the agent can touch — APIs, databases, an inbox, a task board, MCP servers. A loop locked to local files is a glorified script.

MCP · REST · filesystem
05 · TRUST

Sub-agents (maker–checker)

The agent doing the work must not be the one grading it. A separate reviewer — often on a leaner model — checks it against the rules.

builder proposes · reviewer disposes
06 · MEMORY

The state spine

On-disk state that survives runs, crashes, and context resets. Read at the start of every run, appended at the end.

markdown · JSON · SQLite · board
Why those two are highlighted The maker–checker split is what makes a loop safe to run unattended — the grader doesn't have to be smarter than the builder, just different, with its own rubric. And the memory spine is how a correction sticks: when the agent repeats a mistake, have it write the lesson into the state file so every future run inherits the fix.
L3

Build your first loop

five copy-paste assets

Five assets, in order. Fill the [brackets], run the charter once while you watch, then wire the pieces around it. Every block below has a working copy button.

Step 1 — The loop charter (paste this first)

This is the prompt that turns a one-off into a repeatable, self-checking unit of work. It is deliberately strict about what "done" means and how many items to touch per run.

charter.txt · paste into your agent
# LOOP CHARTER
ROLE: You run one pass of a maintenance loop. You are not chatting.

FIND: Look at [source: failing tests / inbox / board column].
      Pick at most [N = 1] item(s) to work on this run.
      If there is nothing to do, output "NOTHING_TO_DO" and stop.

DO:   Make the smallest change that resolves the item.
      Touch only files related to it.

CHECK: The item is done ONLY when [tests in path/ pass AND lint is clean].
       Run the check. Do not claim success without running it.

RECORD: Append one line to [STATE.md]: what you did + the check result.
        If you hit the same failure twice, write the lesson to [RULES.md].

STOP: After [N] item(s), or if the check fails twice on the same item.

Step 2 — A goal condition that actually stops

A goal-conditioned loop runs across turns until a condition is true. The key: a separate, faster model grades the condition after each turn — the writer doesn't get to mark its own homework. Write the condition so a grader can verify it from the run output, not from a vibe.

goal · set the stop condition
/goal all tests under test/auth pass and `npm run lint` exits 0
      and STATE.md has a new line for every item processed
4 constraints that silently break a goal 1. The evaluator reads the transcript, not your files — make the proof visible in the run output.  2. The condition must be objectively checkable (an exit code, a test result).  3. The stop signal must be unambiguous or the loop never stops.  4. Cap spend separately — a bad condition burns tokens until you notice.

For a fuller goal that lives on disk and the loop re-reads every iteration — call it PROMPT.md, AGENTS.md, whatever — spell out done-when, never-touch, and stop-if explicitly:

markdown · PROMPT.md
# Goal
Migrate `users.password` from bcrypt to argon2id across the codebase.

# Done when
- All new password writes use argon2id (`lib/auth/hash.ts`).
- Existing bcrypt hashes rehash on next successful login.
- Test suite green: `pnpm test auth`.

# Never touch
- `db/migrations/*` already merged.
- Anything under `legacy/`.
- The session cookie format.

# Stop if
- More than 3 files outside `lib/auth/` need edits.
- A test that already passes starts failing.

Prevents: drift after ~three iterations, and the worst outcome of all — failure that looks like progress. Code gets written, tests pass, and the goal it solved isn't yours.

Step 3 — Split the maker from the checker

Two agent definitions. The builder proposes a change; the reviewer independently grades it on a leaner model with its own rubric. This single split is what lets you trust a loop running unattended.

markdown · agents/reviewer.md
---
name: reviewer
model: [a fast, cheaper model]
description: Independently grades the builder's change. Never edits code.
---

You are the checker, not the author. Grade the change against:
  1. Does it pass the stated CHECK condition? Run it yourself.
  2. Does it touch only files relevant to the item?
  3. Does it violate any rule in RULES.md?

Output exactly one of:
  PASS  — with the check output that proves it
  FAIL  — with the single reason and the rule or test it broke

Prevents: a reviewer living inside the maker's own context, which always agrees with itself. A different context, on a different model, breaks that.

Step 4 — The memory spine

Boring on purpose. A flat file the loop reads at the start and appends at the end. This is how a run survives a crash, a context reset, or a 03:00 process kill.

markdown · STATE.md — read at start, append at end
# LOOP STATE — append-only log

## Done
- 2026-06-27 14:02  fixed test/auth/login.spec  → PASS (12/12)
- 2026-06-27 14:09  refactored token refresh    → PASS (lint clean)

## Open
- test/auth/session.spec  → FAILED twice, escalated to human

## Lessons (mirror critical ones into RULES.md)
- Never edit generated/ — it is overwritten on build.

Step 5 — The reference loop (the whole thing)

The canonical blueprint people cite: morning triage → merged PR. Design it once, never prompt it again. Note the budget cap — it is independent of the goal and non-negotiable.

python · loop.py — the outer loop, in any language
while not goal_met:
    state   = read_memory("STATE.md")      # survives restarts
    work    = find_next_item(state)         # inbox · board · failing tests
    if work is None:
        break

    branch  = new_worktree()                # isolate this unit of work
    result  = builder(work, skills, connectors)
    verdict = reviewer(result, rubric, tests)   # different agent · leaner model

    if verdict.passed:
        commit(branch); open_pr(branch)
        append_memory(state, done=work)
    else:
        append_memory(state, lesson=verdict.reason)   # write the lesson back

    if spend_so_far > budget_cap:           # the line that saves you money
        halt("budget cap hit")

Prefer the shell? The same loop in bash — fresh context each turn, state on disk, a separate verify pass, exit when the plan says done:

bash · run.sh
#!/usr/bin/env bash
# minimal loop runner: fresh context each turn, state on disk
set -euo pipefail

while true; do
  # plan + act in a fresh context
  claude -p "Read PROMPT.md, IMPLEMENTATION_PLAN.md. Do the next step. Commit on green."

  # verify in a fresh context (different subagent)
  if claude -p "/verify"; then
    echo "iter ok"
  else
    echo "verify failed, will retry"
  fi

  # exit when the spec says done
  grep -q "^STATUS: done$" IMPLEMENTATION_PLAN.md && break
  sleep 5
done

What it looks like running

A single unattended pass. The builder makes a change, the reviewer grades it independently, state gets appended, and the budget guard watches the whole time.

loop — pass 7 of N · unattended
loop> find_work() → test/auth/session.spec failing (1 item) builder> editing src/auth/session.ts … running checks builder> tests 12/12 pass · lint clean reviewer> independent grade on fast model … reviewer> PASS — touched only session.ts, no rule broken loop> commit a1f9c2 · opened PR #284 loop> append STATE.md ✓ guard> spend 0.42 / 2.00 cap — ok, continuing loop> find_work() → NOTHING_TO_DO · goal met, exiting
Token economics A self-checking, retrying loop runs multiple turns per item, so cost compounds far faster than a single prompt. Start with small batches (that's what the N in the charter is for), lean on prompt caching, keep context tight, and always set the hard cap.

Smallest reference: ghuntley/how-to-ralph-wiggum (1.7K stars) — a PROMPT.md plus an IMPLEMENTATION_PLAN.md state file the loop rewrites in place. CLI starters and patterns: cobusgreyling/loop-engineering (3K stars). Production TypeScript with verifyCompletion: vercel-labs/ralph-loop-agent (805 stars). A full installable Plan → Work → Review → Release cycle: Chachamaru127/claude-code-harness (2.9K stars).

L4

Sub-agent fan-out

claude-agent-sdk

When one goal branches into many independent sub-jobs — analyse 10 articles, fix 5 files, search 8 sources — the loop spawns parallel subagents and an orchestrator synthesizes the results. One bloated context can't do this; ten small ones can.

python · claude-agent-sdk
# claude-agent-sdk-python style fan-out
from claude_agent_sdk import Agent, run_parallel

orchestrator = Agent.load(".claude/agents/orchestrator.md")
workers = [Agent.load(".claude/agents/researcher.md") for _ in range(8)]

results = run_parallel([
    w.run(source=src) for w, src in zip(workers, sources)
])

synthesis = orchestrator.run(inputs=results)

Prevents: the orchestrator drowning — one context loaded with ten jobs' worth of source material is the exact shape that triggers context rot.

Anthropic's own multi-agent research eval measured +90.2% over a single-agent baseline. Official SDK: anthropics/claude-agent-sdk-python (7.4K stars). Heaviest public fan-out kit (60+ agent types, 314 MCP tools): ruvnet/ruflo (61K stars).

L5

Scheduler + persistence

cron · launchd

What fires the loop when you're not in the chair — cron, launchd, systemd, a queue runner. Keep the scheduler deliberately dumber than the agent. The moment it tries to think (branch on state, decide whether to skip), it starts failing silently for days.

bash · crontab
# run the loop every 30 min, log to disk
*/30 * * * * cd ~/my-loop && ./run.sh >> logs/$(date +\%Y-\%m-\%d).log 2>&1
xml · launchd plist (macOS)
<key>StartCalendarInterval</key>
<dict>
  <key>Minute</key><integer>0</integer>
</dict>
<key>WorkingDirectory</key><string>/Users/me/my-loop</string>
<key>ProgramArguments</key>
<array><string>/bin/bash</string><string>run.sh</string></array>

Prevents: the scheduler waking up to an agent that forgot the goal. Every iteration must serialize what it did, tried, and will do next.

Pattern for promoting ad-hoc sessions into scheduled runs: Kanevry/session-orchestrator.

L6

How loops fail

eight guards · three names

An unattended loop without these guards isn't leverage — it's unattended mistakes. Each failure has a one-line fix.

No stop condition → the loop runs forever. Fix: bundle a stop rule with every trigger.
Evaluator reads files, not the transcript → false "done". Fix: make the proof visible in the run output.
The author grades its own work → mistakes pass review. Fix: a separate reviewer agent.
No isolation → concurrent agents overwrite each other. Fix: a worktree per agent.
No memory spine → the loop forgets across runs. Fix: persist state; read and append each run.
No budget cap → spend compounds silently. Fix: a hard cap, independent of the goal.
Vague goal → unattended loop, unattended mistakes. Fix: don't loop tasks without pass/fail.
Comprehension debt → the loop shipped a fix you don't understand. Fix: read the traces; stay the engineer.

Three of them have names

Three failure shapes come up so often they've earned labels. Learn them and you'll spot them in your own logs.

(a) Confident garbage

The verify step is missing or weak

Wrong outputs pass the gate and compound across iterations. Each bad result becomes the input the next iteration builds on. (Guards 02 and 03 above.)

(b) Context rot

A single long context degrades past a threshold

Anthropic's term. Accuracy collapses somewhere around 200K tokens of accumulated history — the model quietly gets worse the longer the transcript runs.

(c) Ralph Wiggum loop

State on disk didn't capture progress

The same iteration repeats because the agent re-plans a step it already finished. Nothing recorded that the work was done, so it does it again. (Guard 05 above.)

The measured difference

before — single-context loop · 1.48M tokens · 71% completion · 3 hidden hallucinations/run
after  — prune-and-summarize + verifier subagent · 553K tokens · 91.6% completion · every figure traceable

Those numbers are from Less Context, Better Agents (arXiv 2606.10209): full-history at 71% completion versus prune-and-summarize at 91.6%, on a fraction of the tokens. For the catalogue of shortcuts agents take to fake "done" — relaxed tests, swallowed errors, fake renames, stub returns, deleting the comment instead of fixing the bug — moonrunnerkc/swarm-orchestrator names eleven of them.

One complete iteration

the full wiring

A minimal setup wires all seven harness files into a working loop. The wiring is one-directional: the harness defines the rules, the loop runs inside them, and the state file connects iteration N to N+1.

text · project directory
my-loop/
├── .claude/
│   ├── CLAUDE.md            # standing context for every session
│   ├── settings.json        # allow array + PostToolUse prettier hook
│   ├── agents/
│   │   ├── verifier.md      # Haiku, reviews diffs in fresh context
│   │   └── reviewer.md      # maker–checker grader, leaner model
│   └── skills/
│       └── db-migration-writer/
│           └── SKILL.md     # one skill, used three+ times
├── .mcp.json                # github MCP, context7 MCP
├── PROMPT.md                # goal spec (loop reads each iteration)
├── STATE.md                 # memory spine (loop appends each run)
├── MEMORY.md                # cross-session preferences
├── run.sh                   # the loop runner (Plan → Act → Verify)
└── logs/                    # persistence, one file per cron tick

A single iteration walks every harness file and every loop piece in order:

Cron fires run.sh, which calls claude -p. (scheduler)
Claude Code reads CLAUDE.md and settings.json. (harness 1, 2)
The PostToolUse hook applies on every edit. (harness 3)
It reads PROMPT.md and STATE.md. (goal + memory)
It plans and acts — the builder makes the change. (plan · act)
It dispatches the reviewer subagent in a fresh context to grade it. (harness 4 + verify)
It appends the result to STATE.md; a lesson goes to RULES.md. (memory spine)
It updates MEMORY.md if a new preference was learned, then exits. (harness 7)
The budget guard halts early if spend exceeds the cap. (budget cap)
Cron waits for the next tick. (scheduler)

Pull any harness file and a specific loop step degrades. No CLAUDE.md, and the planner re-derives the project shape every iteration. No reviewer subagent, and the verify step runs in the main context and always passes. No STATE.md, and the same correction gets re-applied every Tuesday. Build the seven once; the loop runs indefinitely.

What to do tonight

the decision ladder

Open your folder, count the files, then take the one next step that matches what's already there.

bash
ls -la .claude/
If you see nothing, or only settings.json
Start with CLAUDE.md. Keep it under 300 lines. Copy a shape from centminmod/my-claude-code-setup.
If you have CLAUDE.md + settings.json but no agents/
Add a verifier subagent. Pull review out of the main context. Shape: wshobson/agents.
If you have agents/ but no skills/
Promote one frequent task to a skill — the prompt you've copy-pasted three times this week. Read three SKILL.md files from anthropics/skills first.
If you have all seven files but no loop running
Pick one repeating job, paste the charter, and put a Plan-Act-Verify loop on top. Closest installable start: Chachamaru127/claude-code-harness.

Then do exactly one thing: open the matching repo in a new tab and clone it. The harness is the floor — without it, every loop runs over a hole.

The 30-day rollout

earn each layer

Don't build the full system on day one. Earn each layer of autonomy before you add the next.

Days 1–5

Run it as a plain prompt

Pick one annoying, repeatable, pass/fail task you do by hand. Paste the charter, fill the brackets, run it once while you watch.

Days 6–12

Add a goal that stops

Give it an objectively checkable stop condition. Confirm a grader can verify "done" from the output. Run it once, watching.

Days 13–20

Split maker from checker

Add the builder + reviewer sub-agents and a memory file. Now the loop checks its own work and remembers across runs.

Days 21–25

Isolate and connect

Add worktrees so parallel runs don't collide, and connectors so "find the work" can reach your real inbox, board, or repo.

Days 26–30

Schedule, cap, and let go

Put it on a schedule, set the hard budget cap, and let it run unattended — spot-checking the traces daily, not babysitting every turn.

Where this stops being a terminal trick The moment a loop must run at 03:00 with no session open, survive a crash mid-run, pause for days waiting on human approval, or give each agent its own machine — you've outgrown the TUI and you're in runtime territory: agent servers, durable execution, and isolated sandboxes.