The harness comes before the loop.
Most people who build agents fight the loop. The loop is rarely the problem. The folder underneath it — the .claude/ directory — is what's half-built, and that's why the run stalls two or three iterations in.
Open .claude/ in any project that actually works and you'll find about seven pieces carrying the load: CLAUDE.md, settings.json, hooks/, agents/, skills/, .mcp.json, and a state file such as MEMORY.md. This page walks all seven — then climbs onto the loop that rides on top: what a loop actually is, the six blocks every durable one shares, five copy-paste assets that stand one up, the ways loops fail, and a 30-day plan to earn each layer of autonomy. No framework. No subscription. Exact paths, exact contents.
Two layers, one setup
the mental model
Name the layer and you fix the diagnosis. Miss it and you rewrite prompts to solve problems that live somewhere else entirely.
The harness is the .claude/ folder. It's fixed — it doesn't change from one run to the next. The loop is what executes inside it: set a goal, take an action, verify the result, write something to memory, then decide whether to continue or stop.
Think of the harness as the kitchen and the loop as the recipe. A kitchen with no recipe is idle floor space; a recipe with no kitchen is a wish. Each is useless without the other — and, more usefully, each fails in its own way.
Which layer is broken?
Token blowups, prompt fatigue, permissions the agent keeps re-asking for — those are harness problems. Loops that never converge, verify steps that wave garbage through, scheduled runs that drift off-goal — those are loop problems. When you treat the whole thing as one undifferentiated "agent setup," you never see that split, so you keep applying loop fixes to harness bugs.
The order is not negotiable: harness first, loop second. The harness decides what each iteration is allowed to do. Permissions decide whether the loop can write to disk. Subagents decide whether verification gets a clean context. Skills decide whether the loop can specialize. Hooks decide whether the loop even fires on the trigger you intended. Leave those decisions open and the loop guesses — and a guessing loop fabricates: files that don't exist, commands it never confirmed, tests that pass without testing anything. Lock the harness down and the guessing stops.
CLAUDE.md
repo root
The first file Claude Code reads on launch. Its contents become standing context for the whole session, so this is where the project's shape lives.
Put the essentials here: directory layout, language and framework, the commands that genuinely work, the conventions the agent must honour, and — explicitly — the things it must never do. Keep it at the repo root, not tucked away in a docs folder.
# Project: my-app
Stack: Next.js 14, TypeScript, Postgres, Tailwind.
Layout: `app/` (routes), `lib/` (helpers), `db/migrations/`.

## Commands
- `pnpm dev`: local
- `pnpm test`: vitest
- `pnpm db:migrate`: apply migrations

## Never
- Edit `db/migrations/*` after merge.
- Add deps without justification in the PR body.
- Bypass `lib/auth/` to reach user data.
Prevents: the planner re-deriving the project's shape from scratch on every iteration.
The failure here is bloat. The paper Less Context, Better Agents (arXiv 2606.10209) recorded task completion falling from 91.6% to 71% for no reason other than oversized standing context. Keep this file under 300 lines and prune it weekly — every extra paragraph is a tax charged on every future turn. For working reference shapes, centminmod/my-claude-code-setup ships three side by side.
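The 300-line budget is easy to enforce mechanically. Here is a minimal sketch, assuming the file is checked as plain text; the function names and the idea of running this from CI or a pre-commit step are illustrative assumptions, not part of Claude Code.

```python
MAX_LINES = 300  # the budget suggested in the text above


def count_lines(text: str) -> int:
    """Count lines of standing context, ignoring trailing newlines."""
    return len(text.rstrip("\n").splitlines())


def claude_md_over_budget(text: str, max_lines: int = MAX_LINES) -> bool:
    """Return True when the file has grown past the line budget."""
    return count_lines(text) > max_lines
```

Calling `claude_md_over_budget(open("CLAUDE.md").read())` during the weekly prune gives a yes/no answer instead of a judgment call.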
settings.json
.claude/settings.json
Home for the tool allowlist, environment variables, and hook registrations. Precedence resolves managed > local > project > user, so a repo's rules win over your user-level defaults, while a personal settings.local.json can still override the shared project file.
The change that pays for itself in an afternoon is an allow array for read-only Bash and MCP calls, paired with a deny list that keeps the destructive operations gated.
{
"permissions": {
"allow": [
"Bash(ls:*)",
"Bash(git status:*)",
"Bash(git diff:*)",
"Bash(cat:*)",
"Read(*)"
],
"deny": [
"Bash(rm -rf:*)",
"Bash(git push --force:*)"
]
}
}
Prevents: the agent stalling on a permission prompt for every ls, git status, and cat — while destructive commands still stop and ask.
Keep secrets out of this file. Put them in .claude/settings.local.json and add that to .gitignore. The full key reference lives in the Claude Code settings docs.
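A static check can catch an allow list that has drifted into dangerous territory. The sketch below is a lint over the JSON shape shown above. It does not model how Claude Code actually matches permission rules, and `DESTRUCTIVE_PREFIXES` is a hypothetical policy you would adapt.

```python
import json

# Hypothetical policy: rule prefixes that should never appear in the allow list.
DESTRUCTIVE_PREFIXES = ("Bash(rm", "Bash(git push --force")


def lint_permissions(settings_json: str) -> list:
    """Return human-readable problems; an empty list means nothing was flagged."""
    perms = json.loads(settings_json).get("permissions", {})
    allow = perms.get("allow", [])
    deny = perms.get("deny", [])
    problems = []
    for rule in allow:
        if rule.startswith(DESTRUCTIVE_PREFIXES):
            problems.append(f"destructive rule in allow list: {rule}")
        if rule in deny:
            problems.append(f"rule is both allowed and denied: {rule}")
    if not deny:
        problems.append("deny list is empty")
    return problems
```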
hooks/
registered in settings.json
Deterministic scripts that fire on tool events: PreToolUse before a tool runs, PostToolUse after it, Stop when the agent ends a turn. Each is a matcher pattern plus a shell command.
The canonical first hook: match every Edit or Write and pipe the touched file through a formatter. After this, every edit lands in a known, consistent state.
{
"hooks": {
"PostToolUse": [
{
"matcher": "Edit|Write",
"hooks": [
{ "type": "command",
"command": "jq -r '.tool_input.file_path' | xargs -I{} npx prettier --write {}" }
]
}
]
}
}
Prevents: every run being a matter of vibes. Hooks are your policy floor — keep them silent on success and loud only on failure.
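If a hook outgrows a one-line shell command, the same logic fits in a small script. This sketch assumes the hook receives the tool event as a JSON object on stdin with the edited path under `tool_input.file_path`. Confirm that shape against the hooks reference for your version before relying on it. The function names are illustrative.

```python
import json
from typing import List, Optional


def edited_path(payload: str) -> Optional[str]:
    """Extract the touched file path from a hook payload, or None if absent."""
    try:
        event = json.loads(payload)
    except json.JSONDecodeError:
        return None
    if not isinstance(event, dict):
        return None
    tool_input = event.get("tool_input")
    if not isinstance(tool_input, dict):
        return None
    path = tool_input.get("file_path")
    return path if isinstance(path, str) and path else None


def formatter_command(payload: str) -> Optional[List[str]]:
    """Build the formatter argv for an edit event; None means do nothing and stay silent."""
    path = edited_path(payload)
    if path is None:
        return None
    return ["npx", "prettier", "--write", path]
```

A wrapper would read `sys.stdin.read()`, pass it to `formatter_command`, and run the result with `subprocess.run(..., check=True)`, so the hook is silent on success and loud on failure.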
agents/
.claude/agents/*.md
Markdown files with YAML frontmatter. The main agent calls them through the Task tool, and each one runs in a fresh context window.
A minimal verifier is the highest-value one to build first. Its whole job is to read the goal and the diff and return a verdict — nothing else.
---
name: verifier
description: Reviews a diff against the goal spec. Invoke after every change.
model: haiku
tools: [Read, Grep, Bash]
---
You are a verifier. Read the goal spec in `PROMPT.md`. Read the diff.
Return a JSON verdict: {passes: bool, failures: [{line, reason}]}.
Do not propose fixes. Do not run code. Do not be polite.
Prevents: the loudest failure of all — a reviewer living inside the maker's own context, which always agrees with itself. Pulling review into a fresh context breaks that.
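A verdict is only useful if the loop refuses to trust a malformed one. The sketch below parses the verifier's reply and fails closed. It assumes the reply is strict JSON with the `passes` and `failures` keys from the prompt above; the `Verdict` class and `parse_verdict` name are illustrative.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Verdict:
    passes: bool
    failures: list = field(default_factory=list)


def parse_verdict(raw: str) -> Verdict:
    """Parse the verifier's JSON reply, treating anything malformed as a failure."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return Verdict(False, [{"line": None, "reason": "verdict was not valid JSON"}])
    if not isinstance(data, dict) or not isinstance(data.get("passes"), bool):
        return Verdict(False, [{"line": None, "reason": "verdict missing boolean 'passes'"}])
    failures = data.get("failures") or []
    if not isinstance(failures, list):
        failures = [{"line": None, "reason": "'failures' was not a list"}]
    # A pass that still lists failures is contradictory, so fail closed.
    return Verdict(data["passes"] and not failures, failures)
```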
For ready-made shapes, wshobson/agents (37K stars) offers 194 of them. For an adversarial verifier with eleven named shortcut checks — relaxed tests, swallowed errors, fake renames — see moonrunnerkc/swarm-orchestrator.
skills/
.claude/skills/<name>/SKILL.md
Folders each containing a SKILL.md with YAML frontmatter. They load progressively: at session start only the name and description enter context; the full body loads only once the agent decides the trigger matches.
---
name: db-migration-writer
description: Writes Postgres migration files for this repo. Use when the user asks to add or alter a table, column, index, or constraint.
when_to_use: schema change requested, new feature needs a column, index missing on a hot query path
---
# Steps
1. Read `db/schema.sql` to confirm current state.
2. Write to `db/migrations/NNN_<verb>_<noun>.sql`.
3. Include both up and down. Test with `pnpm db:migrate --dry`.
4. Never touch existing migration files.
Prevents: a fifty-skill library costing fifty skills' worth of tokens on every single prompt.
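Progressive loading is the reason a large library stays cheap. The sketch below mimics the idea with a cheap index built from frontmatter and a body fetched only on a match. It is not Claude Code's loader; the parser handles flat `key: value` frontmatter only, and all function names are illustrative.

```python
def split_frontmatter(text: str) -> tuple:
    """Split a SKILL.md into (frontmatter fields, body). Flat 'key: value' lines only."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    meta = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return meta, "\n".join(lines[i + 1:])
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return {}, text  # no closing delimiter: treat the whole file as body


def skill_index(skills: dict) -> list:
    """Build the session-start index: name and description only, no bodies."""
    index = []
    for text in skills.values():
        meta, _ = split_frontmatter(text)
        index.append({"name": meta.get("name", ""), "description": meta.get("description", "")})
    return index


def load_body(skills: dict, name: str) -> str:
    """Load a skill's full body only once its trigger has matched."""
    for text in skills.values():
        meta, body = split_frontmatter(text)
        if meta.get("name") == name:
            return body
    raise KeyError(name)
```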
Build skills reactively, not speculatively — three skills written the third time you hit the same task beat fifty scaffolded from a tutorial. Canonical pattern: anthropics/skills (155K stars). A maximal pre-built kit: affaan-m/ECC (222K stars).
.mcp.json
repo root
The Model Context Protocol is the spec that lets the loop call live external tools. Servers are declared here. Three rules: only servers this work uses, prefer official ones for anything credentialed, and never install five "just in case."
{
"mcpServers": {
"github": {
"command": "npx",
"args": ["-y", "@modelcontextprotocol/server-github"],
"env": { "GITHUB_PERSONAL_ACCESS_TOKEN": "${GITHUB_TOKEN}" }
},
"context7": {
"command": "npx",
"args": ["-y", "@upstash/context7-mcp"]
}
}
}
Prevents: stale-API hallucinations (via live library docs) and unaudited write access — never enable a write-scoped server before you have a hook logging every call.
Anthropic's set: modelcontextprotocol/servers (87K stars). Code-host integration: github/github-mcp-server (31K stars). Live docs: upstash/context7 (58K stars). Discovery index: punkpeye/awesome-mcp-servers (89K stars).
state + memory
MEMORY.md + vault/
The seventh piece, and the one most people skip until a third project goes sideways. A MEMORY.md index at a known path, plus a vault holding project canon.
~/.claude/memory/
  MEMORY.md             # index, links to the topic files below
  user-prefs.md         # voice, terse-vs-verbose, preferences
  project-decisions.md  # "Postgres over Mongo on 2026-03-12, here's why"
  feedback-recent.md    # corrections you keep re-applying
~/vault/                # project canon, stable across sessions
  architecture.md
  api-spec.md
  post-mortems/
Prevents: re-applying the same correction every Tuesday. Memory holds what changes across sessions; the vault holds what doesn't.
The trap is treating memory as append-only — do that and it becomes the rot itself, so prune it every session. The theory behind why this matters is Anthropic's writing on context engineering, which names the failure mode: context rot. For production-grade session compression (a 200K-token transcript down to a 4K-token recap without dropping load-bearing facts), see thedotmack/claude-mem (84K stars).
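Pruning can start as something very simple. The sketch below drops dated bullet entries older than a cutoff and keeps everything else. It assumes entries look like `- YYYY-MM-DD ...`, which is a format chosen for illustration, and age is only one possible pruning rule; a real pass would also merge duplicates and promote stable facts into the vault.

```python
import re
from datetime import date, timedelta

DATED = re.compile(r"^- (\d{4})-(\d{2})-(\d{2})\b")  # assumed entry format


def prune_memory(text: str, today: date, keep_days: int = 30) -> str:
    """Drop dated bullet entries older than keep_days; keep headings and undated lines."""
    cutoff = today - timedelta(days=keep_days)
    kept = []
    for line in text.splitlines():
        match = DATED.match(line)
        if match:
            try:
                entry_date = date(*map(int, match.groups()))
            except ValueError:
                entry_date = None  # not a real calendar date: keep the line
            if entry_date is not None and entry_date < cutoff:
                continue  # stale entry
        kept.append(line)
    return "\n".join(kept)
```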
A loop is not a cron job
the control system
Both a cron job and a loop run on a schedule. The difference is the decision-maker inside. A cron job replays a fixed script. A loop hands each tick to an agent that reads the current state, chooses the next action, does it, checks the result, and decides what happens next.
For a couple of years the default workflow was manual ping-pong: write a prompt, wait, skim the output, correct, send the next one. Your attention was the motor — step away from the keyboard and the agent stalled mid-turn. Loop engineering is the point where you stop being the motor. You write a small program — a shell loop, a hosted task, a few hundred lines of code — and that program is what holds the conversation with the model.
The inner loop and the outer loop
Inner loop — reason → act → observe → repeat. An agent that can only suggest isn't looping. One that can run code, read the result, fix it, and run again is.
Outer loop — the control system you design around it: the trigger that starts a run, the reviewer that grades it, the memory that outlives a single run, and the stop rules that decide when it quits.
When is a loop actually worth it?
| If the task is… | Do this |
|---|---|
| A one-off | Just prompt the model. A loop is overhead you don't need. |
| Repeating + clear pass/fail | Build a loop. This is the sweet spot — verification is possible and the payoff compounds. |
| Vague, no crisp "done" | Don't hand it to a while-loop. Pin down the goal first, then decide. |
The six building blocks
what every loop shares
Every loop that survives production handles the same responsibilities. Tools name them differently, but the parts line up. Two of these, verification and memory, are what separate a loop you can trust from one that just makes mistakes faster.
Automation
Something starts a run that isn't you typing — a schedule, a webhook, an event. It must ship bundled with a stop condition.
Worktrees
Give each agent its own checkout so parallel runs don't overwrite each other. Skip it only if you never run more than one agent at a time.
Skills
A markdown file holding the context you keep re-explaining — structure, rules, formats. Loaded on demand, written once.
Connectors
The tools the agent can touch — APIs, databases, an inbox, a task board, MCP servers. A loop locked to local files is a glorified script.
Sub-agents (maker–checker)
The agent doing the work must not be the one grading it. A separate reviewer — often on a leaner model — checks it against the rules.
The state spine
On-disk state that survives runs, crashes, and context resets. Read at the start of every run, appended at the end.
Build your first loop
five copy-paste assets
Five assets, in order. Fill the [brackets], run the charter once while you watch, then wire the pieces around it.
Step 1 — The loop charter (paste this first)
This is the prompt that turns a one-off into a repeatable, self-checking unit of work. It is deliberately strict about what "done" means and how many items to touch per run.
# LOOP CHARTER
ROLE: You run one pass of a maintenance loop. You are not chatting.
FIND: Look at [source: failing tests / inbox / board column]. Pick at most [N = 1] item(s) to work on this run. If there is nothing to do, output "NOTHING_TO_DO" and stop.
DO: Make the smallest change that resolves the item. Touch only files related to it.
CHECK: The item is done ONLY when [tests in path/ pass AND lint is clean]. Run the check. Do not claim success without running it.
RECORD: Append one line to [STATE.md]: what you did + the check result. If you hit the same failure twice, write the lesson to [RULES.md].
STOP: After [N] item(s), or if the check fails twice on the same item.
Step 2 — A goal condition that actually stops
A goal-conditioned loop runs across turns until a condition is true. The key: a separate, faster model grades the condition after each turn — the writer doesn't get to mark its own homework. Write the condition so a grader can verify it from the run output, not from a vibe.
/goal all tests under test/auth pass and `npm run lint` exits 0 and STATE.md has a new line for every item processed
For a fuller goal that lives on disk and the loop re-reads every iteration — call it PROMPT.md, AGENTS.md, whatever — spell out done-when, never-touch, and stop-if explicitly:
# Goal
Migrate `users.password` from bcrypt to argon2id across the codebase.

# Done when
- All new password writes use argon2id (`lib/auth/hash.ts`).
- Existing bcrypt hashes rehash on next successful login.
- Test suite green: `pnpm test auth`.

# Never touch
- `db/migrations/*` already merged.
- Anything under `legacy/`.
- The session cookie format.

# Stop if
- More than 3 files outside `lib/auth/` need edits.
- A test that already passes starts failing.
Prevents: drift after ~three iterations, and the worst outcome of all — failure that looks like progress. Code gets written, tests pass, and the goal it solved isn't yours.
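The never-touch and stop-if rules are mechanical enough to check outside the model. The sketch below works over a list of changed paths, such as the output of `git diff --name-only`. The constants mirror the example goal file, and the function name is illustrative.

```python
from fnmatch import fnmatch

NEVER_TOUCH = ["db/migrations/*", "legacy/*"]  # from the example goal file
HOME_DIR = "lib/auth/"                         # where edits are expected
MAX_OUTSIDE = 3                                # "more than 3 files outside lib/auth/"


def check_scope(changed: list) -> tuple:
    """Return (should_stop, reason) for a list of changed file paths."""
    for path in changed:
        for pattern in NEVER_TOUCH:
            if fnmatch(path, pattern):
                return True, f"never-touch path edited: {path}"
    outside = [p for p in changed if not p.startswith(HOME_DIR)]
    if len(outside) > MAX_OUTSIDE:
        return True, f"{len(outside)} files outside {HOME_DIR}"
    return False, "within scope"
```

Running this in the loop runner rather than in the prompt means the stop rule holds even when the agent talks itself past it.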
Step 3 — Split the maker from the checker
Two agent definitions. The builder proposes a change; the reviewer independently grades it on a leaner model with its own rubric. This single split is what lets you trust a loop running unattended.
---
name: reviewer
model: [a fast, cheaper model]
description: Independently grades the builder's change. Never edits code.
---
You are the checker, not the author. Grade the change against:
1. Does it pass the stated CHECK condition? Run it yourself.
2. Does it touch only files relevant to the item?
3. Does it violate any rule in RULES.md?

Output exactly one of:
PASS (with the check output that proves it)
FAIL (with the single reason and the rule or test it broke)
Prevents: a reviewer living inside the maker's own context, which always agrees with itself. A different context, on a different model, breaks that.
Step 4 — The memory spine
Boring on purpose. A flat file the loop reads at the start and appends at the end. This is how a run survives a crash, a context reset, or a 03:00 process kill.
# LOOP STATE (append-only log)

## Done
- 2026-06-27 14:02 fixed test/auth/login.spec → PASS (12/12)
- 2026-06-27 14:09 refactored token refresh → PASS (lint clean)

## Open
- test/auth/session.spec → FAILED twice, escalated to human

## Lessons (mirror critical ones into RULES.md)
- Never edit generated/: it is overwritten on build.
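Reading at the start and appending at the end takes only a few lines. This sketch simplifies the file above by appending to the end of the file rather than under a specific heading. The function names and the timestamp format are assumptions for illustration.

```python
from datetime import datetime
from pathlib import Path


def read_state(path: Path) -> str:
    """Read the state file at the start of a run; a missing file means a first run."""
    return path.read_text(encoding="utf-8") if path.exists() else ""


def append_done(path: Path, what: str, check_result: str, now: datetime) -> None:
    """Append one '- <timestamp> <what> → <result>' line to the state file."""
    line = f"- {now:%Y-%m-%d %H:%M} {what} → {check_result}\n"
    with path.open("a", encoding="utf-8") as handle:
        handle.write(line)
```

Appending instead of rewriting keeps a killed process from leaving a half-written file: the worst case is one missing or partial last line.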
Step 5 — The reference loop (the whole thing)
The canonical blueprint people cite: morning triage → merged PR. Design it once, never prompt it again. Note the budget cap — it is independent of the goal and non-negotiable.
while not goal_met:
    state = read_memory("STATE.md")                  # survives restarts
    work = find_next_item(state)                     # inbox · board · failing tests
    if work is None:
        break
    branch = new_worktree()                          # isolate this unit of work
    result = builder(work, skills, connectors)
    verdict = reviewer(result, rubric, tests)        # different agent · leaner model
    if verdict.passed:
        commit(branch); open_pr(branch)
        append_memory(state, done=work)
    else:
        append_memory(state, lesson=verdict.reason)  # write the lesson back
    if spend_so_far > budget_cap:                    # the line that saves you money
        halt("budget cap hit")
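The control flow of the reference loop can be exercised without any model at all. In the sketch below the builder and reviewer are plain callables you pass in, the worktree and PR steps are omitted, and cost is a flat per-item number. All of that is simplification for illustration. The budget check also runs before each item rather than after it, so the cap is never exceeded.

```python
from dataclasses import dataclass, field


@dataclass
class Verdict:
    passed: bool
    reason: str = ""


@dataclass
class LoopState:
    done: list = field(default_factory=list)
    lessons: list = field(default_factory=list)
    spend: float = 0.0


def run_loop(queue, builder, reviewer, budget_cap, cost_per_item=1.0):
    """Run the maker-checker loop over queue until it is empty or the cap is hit."""
    state = LoopState()
    pending = list(queue)
    while pending:
        if state.spend + cost_per_item > budget_cap:
            break  # hard cap: checked every iteration, independent of the goal
        work = pending.pop(0)
        result = builder(work)
        state.spend += cost_per_item
        verdict = reviewer(work, result)  # a separate callable grades the work
        if verdict.passed:
            state.done.append(work)
        else:
            state.lessons.append(verdict.reason)  # write the lesson back
    return state
```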
Prefer the shell? The same loop in bash — fresh context each turn, state on disk, a separate verify pass, exit when the plan says done:
#!/usr/bin/env bash
# minimal loop runner: fresh context each turn, state on disk
set -euo pipefail

while true; do
  # plan + act in a fresh context
  claude -p "Read PROMPT.md, IMPLEMENTATION_PLAN.md. Do the next step. Commit on green."

  # verify in a fresh context (different subagent)
  if claude -p "/verify"; then
    echo "iter ok"
  else
    echo "verify failed, will retry"
  fi

  # exit when the spec says done
  grep -q "^STATUS: done$" IMPLEMENTATION_PLAN.md && break
  sleep 5
done
What it looks like running
A single unattended pass. The builder makes a change, the reviewer grades it independently, state gets appended, and the budget guard watches the whole time.
To keep spend predictable, limit the items per run (that is what N in the charter is for), lean on prompt caching, keep context tight, and always set the hard cap.
Smallest reference: ghuntley/how-to-ralph-wiggum (1.7K stars) — a PROMPT.md plus an IMPLEMENTATION_PLAN.md state file the loop rewrites in place. CLI starters and patterns: cobusgreyling/loop-engineering (3K stars). Production TypeScript with verifyCompletion: vercel-labs/ralph-loop-agent (805 stars). A full installable Plan → Work → Review → Release cycle: Chachamaru127/claude-code-harness (2.9K stars).
Sub-agent fan-out
claude-agent-sdk
When one goal branches into many independent sub-jobs (analyse 10 articles, fix 5 files, search 8 sources), the loop spawns parallel subagents and an orchestrator synthesizes the results. One bloated context can't do this; ten small ones can.
# claude-agent-sdk-python style fan-out
# Illustrative pseudocode: Agent.load and run_parallel are placeholder names,
# not confirmed SDK calls. Check the SDK's README for the real entry points.
from claude_agent_sdk import Agent, run_parallel

orchestrator = Agent.load(".claude/agents/orchestrator.md")
workers = [Agent.load(".claude/agents/researcher.md") for _ in range(8)]

results = run_parallel([
    w.run(source=src) for w, src in zip(workers, sources)
])
synthesis = orchestrator.run(inputs=results)
Prevents: the orchestrator drowning — one context loaded with ten jobs' worth of source material is the exact shape that triggers context rot.
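The fan-out shape itself needs nothing beyond the standard library. In this sketch `research` is a stand-in for one subagent run, and `max_parallel` is an assumed knob for rate limits. The orchestrator step sees only the short summaries, never the raw sources.

```python
import asyncio


async def research(source: str) -> str:
    """Stand-in for one subagent run; a real worker would call a model here."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return f"summary of {source}"


async def fan_out(sources: list, max_parallel: int = 4) -> list:
    """Run one worker per source, at most max_parallel at a time, preserving order."""
    gate = asyncio.Semaphore(max_parallel)

    async def bounded(source: str) -> str:
        async with gate:
            return await research(source)

    return await asyncio.gather(*(bounded(s) for s in sources))


def synthesize(summaries: list) -> str:
    """Stand-in orchestrator step over the short summaries only."""
    return "\n".join(summaries)
```

`asyncio.gather` returns results in input order, so the orchestrator can match each summary back to its source without extra bookkeeping.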
Anthropic's own multi-agent research eval measured +90.2% over a single-agent baseline. Official SDK: anthropics/claude-agent-sdk-python (7.4K stars). Heaviest public fan-out kit (60+ agent types, 314 MCP tools): ruvnet/ruflo (61K stars).
Scheduler + persistence
cron · launchd
What fires the loop when you're not in the chair: cron, launchd, systemd, a queue runner. Keep the scheduler deliberately dumber than the agent. The moment it tries to think (branch on state, decide whether to skip), it starts failing silently for days.
# run the loop every 30 min, log to disk
*/30 * * * * cd ~/my-loop && ./run.sh >> logs/$(date +\%Y-\%m-\%d).log 2>&1

<!-- launchd equivalent (plist fragment): fires at minute 0 of every hour -->
<key>StartCalendarInterval</key>
<dict>
  <key>Minute</key><integer>0</integer>
</dict>
<key>WorkingDirectory</key><string>/Users/me/my-loop</string>
<key>ProgramArguments</key>
<array><string>/bin/bash</string><string>run.sh</string></array>
Prevents: the scheduler waking up to an agent that forgot the goal. Every iteration must serialize what it did, tried, and will do next.
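One way to serialize "did, tried, will do next" is a single JSON line per iteration. The field names and function names below are assumptions for illustration. Reading the log backwards and skipping unparseable lines lets the next tick recover even if the previous process was killed mid-write.

```python
import json
from typing import Optional


def handoff_record(did: str, tried: list, next_step: str) -> str:
    """Serialize one iteration's handoff as a single JSON line."""
    return json.dumps({"did": did, "tried": tried, "next": next_step}, sort_keys=True)


def last_handoff(log_text: str) -> Optional[dict]:
    """Return the most recent parseable record, so the next tick knows where to resume."""
    for line in reversed(log_text.splitlines()):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partial lines left by a killed process
        if isinstance(record, dict):
            return record
    return None
```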
Pattern for promoting ad-hoc sessions into scheduled runs: Kanevry/session-orchestrator.
How loops fail
eight guards · three names
An unattended loop without these guards isn't leverage; it's unattended mistakes. Each failure has a one-line fix.
Three of them have names
Three failure shapes come up so often they've earned labels. Learn them and you'll spot them in your own logs.
The verify step is missing or weak
Wrong outputs pass the gate and compound across iterations. Each bad result becomes the input the next iteration builds on. (Guards 02 and 03 above.)
A single long context degrades past a threshold
Anthropic's term: context rot. Accuracy collapses somewhere around 200K tokens of accumulated history; the model quietly gets worse the longer the transcript runs.
State on disk didn't capture progress
The same iteration repeats because the agent re-plans a step it already finished. Nothing recorded that the work was done, so it does it again. (Guard 05 above.)
The measured difference
before — single-context loop · 1.48M tokens · 71% completion · 3 hidden hallucinations/run
after — prune-and-summarize + verifier subagent · 553K tokens · 91.6% completion · every figure traceable
Those numbers are from Less Context, Better Agents (arXiv 2606.10209): full-history at 71% completion versus prune-and-summarize at 91.6%, on a fraction of the tokens. For the catalogue of shortcuts agents take to fake "done" — relaxed tests, swallowed errors, fake renames, stub returns, deleting the comment instead of fixing the bug — moonrunnerkc/swarm-orchestrator names eleven of them.
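The prune-and-summarize idea can be sketched in a few lines. This is not the paper's method. It keeps the newest turns that fit a budget and folds everything older into one summary. The four-characters-per-token estimate is a rough assumption, `summarize` is a callable you supply, and the summary's own size is not counted against the budget.

```python
def estimate_tokens(text: str) -> int:
    """Rough size estimate: about four characters per token (not a real tokenizer)."""
    return max(1, len(text) // 4)


def prune_and_summarize(turns: list, budget: int, summarize) -> list:
    """Keep the newest turns that fit in budget; fold everything older into one summary."""
    kept = []
    used = 0
    for turn in reversed(turns):
        cost = estimate_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()
    older = turns[: len(turns) - len(kept)]
    return ([summarize(older)] if older else []) + kept
```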
One complete iteration
the full wiring
A minimal setup wires all seven harness files into a working loop. The wiring is one-directional: the harness defines the rules, the loop runs inside them, and the state file connects iteration N to N+1.
my-loop/
├── .claude/
│   ├── CLAUDE.md          # standing context for every session
│   ├── settings.json      # allow array + PostToolUse prettier hook
│   ├── agents/
│   │   ├── verifier.md    # Haiku, reviews diffs in fresh context
│   │   └── reviewer.md    # maker–checker grader, leaner model
│   └── skills/
│       └── db-migration-writer/
│           └── SKILL.md   # one skill, used three+ times
├── .mcp.json              # github MCP, context7 MCP
├── PROMPT.md              # goal spec (loop reads each iteration)
├── STATE.md               # memory spine (loop appends each run)
├── MEMORY.md              # cross-session preferences
├── run.sh                 # the loop runner (Plan → Act → Verify)
└── logs/                  # persistence, one file per cron tick
A single iteration walks every harness file and every loop piece in order:
1. The scheduler fires run.sh, which calls claude -p. (scheduler)
2. Claude Code loads CLAUDE.md and settings.json. (harness 1, 2)
3. The PostToolUse hook applies on every edit. (harness 3)
4. The agent reads PROMPT.md and STATE.md. (goal + memory)
5. The run appends its result to STATE.md; a lesson goes to RULES.md. (memory spine)
6. The agent updates MEMORY.md if a new preference was learned, then exits. (harness 7)

Pull any harness file and a specific loop step degrades. No CLAUDE.md, and the planner re-derives the project shape every iteration. No reviewer subagent, and the verify step runs in the main context and always passes. No STATE.md, and the same correction gets re-applied every Tuesday. Build the seven once; the loop runs indefinitely.
What to do tonight
the decision ladder
Open your folder, count the files, then take the one next step that matches what's already there.
ls -la .claude/
centminmod/my-claude-code-setup.
wshobson/agents.
SKILL.md files from anthropics/skills first.
Chachamaru127/claude-code-harness.

Then do exactly one thing: open the matching repo in a new tab and clone it. The harness is the floor; without it, every loop runs over a hole.
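Counting the files can also be scripted. The sketch below reports which of the seven pieces exist under a project root. The candidate paths are assumptions drawn from the layouts shown in this article, and the location of `hooks/` and `MEMORY.md` in particular varies by setup.

```python
from pathlib import Path

# Candidate locations per piece, relative to the project root (assumed, adjust to taste).
PIECES = {
    "CLAUDE.md": ["CLAUDE.md", ".claude/CLAUDE.md"],
    "settings.json": [".claude/settings.json"],
    "hooks/": [".claude/hooks"],
    "agents/": [".claude/agents"],
    "skills/": [".claude/skills"],
    ".mcp.json": [".mcp.json"],
    "MEMORY.md": ["MEMORY.md"],
}


def audit_harness(root: Path) -> dict:
    """Report which harness pieces exist under root."""
    return {
        name: any((root / candidate).exists() for candidate in candidates)
        for name, candidates in PIECES.items()
    }


def missing_pieces(root: Path) -> list:
    """List the pieces that are absent, in the order the article introduces them."""
    return [name for name, present in audit_harness(root).items() if not present]
```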
The 30-day rollout
earn each layer
Don't build the full system on day one. Earn each layer of autonomy before you add the next.
Run it as a plain prompt
Pick one annoying, repeatable, pass/fail task you do by hand. Paste the charter, fill the brackets, run it once while you watch.
Add a goal that stops
Give it an objectively checkable stop condition. Confirm a grader can verify "done" from the output. Run it once, watching.
Split maker from checker
Add the builder + reviewer sub-agents and a memory file. Now the loop checks its own work and remembers across runs.
Isolate and connect
Add worktrees so parallel runs don't collide, and connectors so "find the work" can reach your real inbox, board, or repo.
Schedule, cap, and let go
Put it on a schedule, set the hard budget cap, and let it run unattended — spot-checking the traces daily, not babysitting every turn.