Nanoclaw migration guide

From legacy SDK to the subprocessor architecture

A practical, action-oriented walkthrough for moving an existing nanoclaw deployment off the in-process SDK runner and onto the per-channel subprocess provider. Written for an operator who already runs nanoclaw and is comfortable with systemd, Docker, and Node.

Audience: operators. Difficulty: medium. Reversible: yes.

1. What are subprocessors?

The legacy nanoclaw runner uses @anthropic-ai/claude-agent-sdk directly. Every channel calls query() from inside the same Node.js process, hooks are JS callbacks, and state lives in memory, so a failure in one channel's turn can cascade to every other channel.

The subprocessor architecture spawns Anthropic's official claude CLI as a child process per channel, in headless mode (--print --output-format stream-json --input-format stream-json). The CLI handles auth, model dispatch, and tool routing. Nanoclaw orchestrates and pipes events back into the host.
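As a rough illustration of the headless invocation, the sketch below assembles the argument list for one spawn. It uses only the flags named in this guide (--print, the two stream-json format flags, --settings, --resume). The helper name buildCliArgs and the SpawnOptions shape are hypothetical and not taken from cli-subprocess.ts, and your CLI version may expect further flags.

```typescript
// Hypothetical helper: not the actual code in cli-subprocess.ts.
interface SpawnOptions {
  settingsPath?: string;    // path to the generated settings.json
  resumeSessionId?: string; // CLI session ID captured on an earlier turn
}

// Builds the argument list for one headless `claude` spawn,
// using only the flags mentioned in this guide.
function buildCliArgs(options: SpawnOptions = {}): string[] {
  const args = [
    "--print",
    "--output-format", "stream-json",
    "--input-format", "stream-json",
  ];
  if (options.settingsPath) {
    args.push("--settings", options.settingsPath);
  }
  if (options.resumeSessionId) {
    args.push("--resume", options.resumeSessionId);
  }
  return args;
}
```

The resulting array can be passed as the second argument to Node's child_process.spawn, with the CLI binary path as the first.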

Why bother:

Isolation per channel. Each channel runs its own subprocess. A crash in one channel does not take down the rest.
Per-channel session state. Each session lives under data/sessions/{channel}/ with its own settings, todos, and skills directory. Channels cannot leak state into one another.
Better stability under model errors. Auth failures and rate limits surface as typed events instead of unhandled rejections.
Per-session provider config. A channel can pin a model, swap providers, or add MCP servers without rebuilding the host.
Resumable sessions. The CLI session ID can be reused across turns with --resume, so long conversations survive process restarts.
Policy resilience. The CLI's headless mode is officially sanctioned by Anthropic and covered under Pro / Max subscriptions, so it survives if the OAuth-via-API path the SDK uses is restricted.

Subprocessors also enable a set of optional runtime hooks (tool-guide injection, memory stubs, compliance middleware) that you can layer on later; see section 7, Optional advanced features. None of those are required for the core migration.

2. Differences from the legacy single-process runner

Concern	Legacy SDK runner	Subprocess provider
Process model	One Node process for the whole instance, all channels share it.	One `claude` CLI subprocess per channel container, kept alive across turns.
Where hooks run	In-process JS callbacks bound to the SDK `query()` options.	On-disk hook scripts under `container/agent-runner/src/hooks/*.ts`, referenced from a generated `settings.json` via `--settings`.
Session state	SDK `MessageStream` in memory, lost on restart.	CLI session ID, resumable with `--resume <sessionId>`. Long-running CLI keeps state across turns over stream-json stdin.
Auth	`CLAUDE_CODE_OAUTH_TOKEN` env var posted to `/v1/messages`.	`~/.claude/.credentials.json` on the host, mounted read-only into the container at `/home/node/.claude/.credentials.json`. One-time `claude login` covers all channels.
MCP servers	Inline `mcpServers` option to `query()`.	Generated `settings.json` in a temp dir, passed via `--settings`. CLI starts MCP processes per spawn.
Hook protocol	JS function signatures defined by the SDK.	Stdin JSON contract from the CLI, scripts respond on stdout. Versioned, may drift across CLI releases.
Cancellation	Abort signal on the iterable.	`SIGTERM` on the child process. Stub `killAgent()` in the provider, full plumbing pending.

Three runtime paths now coexist in the codebase. They are selected by environment flags per channel:

Channel `.env`	Path used
`USE_SUBPROCESS=1`	CLI subprocess provider (this guide)
`USE_PROVIDERS=1` (no subprocess)	Middleware-wrapped SDK provider (Tier 2)
neither	Legacy inline-hooks SDK path

USE_SUBPROCESS takes precedence over USE_PROVIDERS. They are not stacked.
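The precedence rule can be sketched as a small pure function. This is an illustration only, not the contents of factory.ts: the names parseEnv and selectRuntimePath, the path labels, and the assumption that only the exact value 1 enables a flag are all hypothetical.

```typescript
type RuntimePath = "cli-subprocess" | "sdk-providers" | "legacy-sdk";

// Parses KEY=VALUE lines from a .env file, skipping blanks and # comments.
function parseEnv(text: string): Record<string, string> {
  const env: Record<string, string> = {};
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue;
    const eq = line.indexOf("=");
    if (eq <= 0) continue; // no key before "=", or no "=" at all
    env[line.slice(0, eq).trim()] = line.slice(eq + 1).trim();
  }
  return env;
}

// USE_SUBPROCESS wins over USE_PROVIDERS; the two are never stacked.
// Assumption: only the exact string "1" turns a flag on.
function selectRuntimePath(env: Record<string, string | undefined>): RuntimePath {
  if (env["USE_SUBPROCESS"] === "1") return "cli-subprocess";
  if (env["USE_PROVIDERS"] === "1") return "sdk-providers";
  return "legacy-sdk";
}
```

For example, a channel .env containing both flags resolves to the subprocess path, and deleting the USE_SUBPROCESS line drops it back to the middleware-wrapped SDK provider.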

3. Prerequisites

Node.js 22 or newer on the host. The agent-runner uses fetch(), Buffer tweaks, and other Node 22 built-ins.
Disk: at least 2 GB free. Each channel's compiled dist/, hooks, and tool guides land under data/sessions/{channel}/.
Anthropic CLI on the host PATH. Install once, run claude login as the same user that will own the container. The credentials file lands at ~/.claude/.credentials.json.
Backup of store/messages.db and every channel directory under groups/.
Docker if you run the containerized variant. The CLI subprocess path expects the credentials mount to be in place.

Read this before you start. Run the migration on one canary channel first. Do not flip the flag globally. The original rollout used personligt as the canary and stayed there for a week before expanding to other channels. Treat your most active channel as production and pick something low-traffic for the first run.

4. Migration steps

Step 1: Backup

# From the host. Adjust paths to match your install.
cd /path/to/nanoclaw
cp store/messages.db store/messages.db.bak-$(date +%Y%m%d)
tar czf groups-backup-$(date +%Y%m%d).tgz groups/

Step 2: Pull the latest code

cd /path/to/nanoclaw
git fetch origin
git checkout main
git pull origin main

If you forked from an older snapshot, rebase or cherry-pick the commits that landed the subprocess provider, the hooks under container/agent-runner/src/hooks/, and the credentials mount block in container-runner.ts. Look for cli-subprocess.ts in the providers folder as the marker.

Step 3: Install dependencies

cd /path/to/nanoclaw
npm install
cd container/agent-runner
npm install

If you skip the agent-runner install, the hook scripts will compile but fail at runtime when they reach better-sqlite3 or jose.

Step 4: Build the container image

cd /path/to/nanoclaw/container
./build.sh

The build script compiles the agent-runner with tsc, copies hook sources to /app/src/hooks/ in the image, and tags the image. entrypoint.sh writes compiled output to /tmp/dist at container start, so the mounted /app/src can stay read-only.

Step 5: One-time CLI login on the host

claude login

This writes ~/.claude/.credentials.json. Refresh tokens typically last 30 to 90 days, and access tokens refresh silently in the background. Expect to re-run claude login roughly once per quarter, or after an Anthropic-side revocation or a password change. All subprocess-enabled channels share the same credentials file.

Step 6: Verify the per-channel session layout

When the host service starts, container-runner.ts creates the per-channel session structure:

data/sessions/{channel}/
  .claude/
    settings.json
    skills/
    todos/
    ...
  agent-runner-src/
    index.ts
    ipc-mcp-stdio.ts

The settings.json is generated per channel and points at the hooks directory baked into the image. The agent-runner-src/ tree contains the per-session runtime scripts.
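To make the shape of the generated file more concrete, here is a minimal sketch of how hook registrations could be turned into a settings object. It assumes the hooks layout used by Claude Code settings files (event name, optional matcher, list of command hooks). The function buildSettings, the HookRegistration type, and the runner and directory arguments are hypothetical, and the real generator in container-runner.ts may differ.

```typescript
// Hypothetical input: one entry per hook script to register.
interface HookRegistration {
  event: string;    // e.g. "PreToolUse", "PostToolUse", "PreCompact"
  matcher?: string; // tool name to match, e.g. "Task"
  script: string;   // file name inside hooksDir
}

interface HookGroup {
  matcher?: string;
  hooks: { type: "command"; command: string }[];
}

// Groups registrations by event and emits the object to serialize as settings.json.
// `runner` is whatever executes the script (an assumption, e.g. "node").
function buildSettings(
  hooksDir: string,
  runner: string,
  registrations: HookRegistration[],
): { hooks: Record<string, HookGroup[]> } {
  const hooks: Record<string, HookGroup[]> = {};
  for (const reg of registrations) {
    const group: HookGroup = {
      hooks: [{ type: "command", command: `${runner} ${hooksDir}/${reg.script}` }],
    };
    if (reg.matcher !== undefined) group.matcher = reg.matcher;
    const list = hooks[reg.event] ?? [];
    list.push(group);
    hooks[reg.event] = list;
  }
  return { hooks };
}
```

Serializing the result with JSON.stringify(settings, null, 2) gives a file you can diff against the settings.json that the host actually generated for the channel.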

Step 7: Enable the flag on a canary channel

echo 'USE_SUBPROCESS=1' >> /path/to/nanoclaw/groups/<CANARY_CHANNEL>/.env

# If you run channels as Docker containers, kill the old container so
# the next message spawns a fresh one with the new env.
docker ps --filter name=nanoclaw-<CANARY_CHANNEL> -q | xargs -r docker kill

Leave any pre-existing USE_PROVIDERS=1 in the same file. USE_SUBPROCESS wins, but keeping the other flag means you can A/B by toggling one line.

Step 8: Restart the host service

systemctl --user restart nanoclaw.service
journalctl --user -u nanoclaw.service -f

Watch the logs as the canary channel boots its first subprocess.

Step 9: Smoke test

Send a message in the canary channel. The expected log progression:

[agent-runner] USE_SUBPROCESS=1: cli-subprocess provider loaded
[agent-runner] USE_SUBPROCESS=1: using long-running CLI subprocess path
[cli-subprocess] Spawning /usr/local/bin/claude (cwd=/workspace/group, ...)
[claude-cli] ... CLI bootstrap noise ...
[cli-subprocess] event: system/init
[cli-subprocess] event: assistant
[cli-subprocess] event: result
[agent-runner] Result #1 text=...
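The event lines in that log come from newline-delimited JSON on the CLI's stdout. The sketch below shows one way to turn raw stdout chunks into events while tolerating partial lines and non-JSON bootstrap noise. The class name StreamJsonParser and the CliEvent shape are assumptions for illustration; the real parser lives in cli-subprocess.ts and may work differently.

```typescript
// Assumed minimal event shape: every event carries a string `type`,
// and some also carry a `subtype` (for example system + init).
interface CliEvent {
  type: string;
  subtype?: string;
  [key: string]: unknown;
}

class StreamJsonParser {
  private buffer = "";

  // Feed one stdout chunk; returns every complete event it finished.
  push(chunk: string): CliEvent[] {
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    // The last element is an incomplete line (or ""), so keep it for later.
    this.buffer = lines.pop() ?? "";
    const events: CliEvent[] = [];
    for (const line of lines) {
      const trimmed = line.trim();
      if (trimmed === "") continue;
      try {
        const parsed: unknown = JSON.parse(trimmed);
        if (
          typeof parsed === "object" &&
          parsed !== null &&
          typeof (parsed as { type?: unknown }).type === "string"
        ) {
          events.push(parsed as CliEvent);
        }
      } catch {
        // Not JSON (bootstrap noise): skip the line.
      }
    }
    return events;
  }
}
```

Buffering the trailing partial line matters because a pipe can split one JSON object across two chunks; parsing chunk by chunk without a buffer would drop or corrupt such events.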

5. Verifying it works

Subprocess actually started

Look for [cli-subprocess] Spawning followed by event: system/init. If you see [agent-runner] USE_PROVIDERS=1 instead, the flag did not load. Check the channel's .env file for typos.

Roundtrip a normal message

Send a plain message in the canary channel. Expected sequence in the container logs:

[cli-subprocess] event: system/init
[cli-subprocess] event: assistant
[cli-subprocess] event: result
[agent-runner] Result #1 text=...

The reply should land in the channel just like before.

Session resume across turns

Send a follow-up message. The CLI should reuse the same session ID rather than spawning a fresh one. Look for --resume in the spawn args on turn two.

Auth-dead path

Move ~/.claude/.credentials.json aside on the host, restart the canary container, and send a message. The channel should receive a message along the lines of "Claude CLI auth is dead. Run claude login on the host." Restore the credentials and re-test before continuing.

Once these all pass, soak the canary for at least a few days under normal traffic before flipping additional channels. Roll out one channel at a time. Do not bulk-enable.

6. Common pitfalls

Agent-runner dependencies missing

You forgot npm install inside container/agent-runner/. The runtime pulls in better-sqlite3, jose, and a few others. Re-run install, rebuild the image, restart the container.

Empty trigger word in `register_group`

If you registered a group programmatically and passed an empty string as the trigger, the matcher will short-circuit and treat every message as a trigger. Set a sensible default such as the bot's name.
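The failure mode is easy to reproduce in isolation. The sketch below assumes a prefix-style matcher, which is a guess at how nanoclaw matches triggers, and both function names are hypothetical. The point is that every string starts with the empty string, so an empty trigger matches everything unless it is normalized first.

```typescript
// Assumed prefix matcher; the real matcher in nanoclaw may differ.
function isTriggered(message: string, trigger: string): boolean {
  return message.trim().toLowerCase().startsWith(trigger.toLowerCase());
}

// Replaces an empty or whitespace-only trigger with a non-empty fallback,
// such as the bot's name.
function normalizeTrigger(trigger: string | undefined, fallback: string): string {
  const cleaned = (trigger ?? "").trim();
  if (cleaned !== "") return cleaned;
  const cleanedFallback = fallback.trim();
  if (cleanedFallback === "") {
    throw new Error("fallback trigger must not be empty");
  }
  return cleanedFallback;
}
```

With these definitions, isTriggered("hello there", "") is true, while isTriggered("hello there", normalizeTrigger("", "@nano")) is false. Here @nano is only a placeholder bot name.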

CLI not on PATH inside the container

If the image was built before the claude binary was installed, the subprocess provider will fail to spawn. Set CLAUDE_CLI_PATH=/usr/local/bin/claude in the channel .env, or rebuild the image with the binary baked in.

Cold start on first message

The CLI takes roughly one to three seconds to cold start before the model call. There is not yet a pre-warm equivalent of the SDK's startup(). This is acceptable for most channels but noticeable on highly interactive ones.

MCP server lifecycle

Each spawn starts fresh MCP processes. The long-running CLI amortizes this across follow-ups in a turn, but the first turn still pays the cost. Plan accordingly if your MCP servers are heavyweight.

Cross-channel auth notifications

From a non-main channel, the auth-dead notification cannot cross channels by default, because the IPC layer rejects cross-channel sends from non-main channels. Until you enable subprocess on a main channel, the auth-dead message lands in whichever channel detected it.

7. Optional advanced features (opt-in)

Once you're on subprocessors, the architecture lets you layer extra runtime features on top. The three below are common in our internal deployments but are not part of the core migration. A vanilla nanoclaw install does not ship with these files, so treat them as opt-in. Only add them after the canary has soaked on plain subprocessors and you have a specific need.

Each of these is a separate body of code. None of them are required for subprocessors to work. Pick the ones that solve a problem you actually have.

Tool-guide injection (optional)

What it is. A PreToolUse hook on Task / Agent that scans subagent prompts for trigger words and appends the matching tool-guide markdown to the spawned subagent's system prompt. Keeps the parent agent's prompt short while still delivering relevant guidance just-in-time.

Why it's useful. Tool guides for things like Google Ads, BigQuery, or Gmail are large. Loading all of them into every prompt is wasteful. Trigger-based injection means a subagent only sees what it actually needs.

Where the code lives. Hook script at container/agent-runner/src/hooks/inject-tool-guides.ts, guide content under groups/shared/tool-guides/, manifest at tool-guides/index.json.

High-level setup.

Create groups/shared/tool-guides/ on disk and populate it with guide markdown plus an index.json mapping trigger words to guide names.
Add the inject-tool-guides.ts hook to your container/agent-runner/src/hooks/ directory.
Wire it into the generated settings.json as a PreToolUse hook on Task and Agent.
Ensure container-runner.ts syncs the directory into data/sessions/{channel}/.claude/tool-guides/ on container start.
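The trigger scan itself can be sketched in a few lines. This assumes index.json is a flat object mapping trigger words to guide names, as described in the setup steps. The function name matchGuides and the case-insensitive substring matching are illustrative choices, not the behaviour of inject-tool-guides.ts.

```typescript
// Assumed manifest shape: { "trigger word": "guide-name", ... }
type GuideIndex = Record<string, string>;

// Returns the unique guide names whose trigger appears in the prompt.
// Matching is a case-insensitive substring test (an assumption).
function matchGuides(prompt: string, index: GuideIndex): string[] {
  const haystack = prompt.toLowerCase();
  const matched = new Set<string>();
  for (const [trigger, guide] of Object.entries(index)) {
    if (trigger.trim() === "") continue; // an empty trigger would match everything
    if (haystack.includes(trigger.toLowerCase())) {
      matched.add(guide);
    }
  }
  return Array.from(matched);
}
```

Substring matching is the simplest option but can over-match short triggers. A word-boundary test would be stricter, at the cost of handling multi-word triggers separately.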

Memory-stubs runtime hook (optional)

What it is. A per-session hook that runs cosine similarity between the user's incoming message and a per-channel stubs.db of memory snippets, then injects the top matches into the system prompt. Acts as a lightweight retrieval layer for long-lived channel memory.

Why it's useful. Lets you keep memory.md short by offloading older or topic-specific entries into stubs. The runtime pulls them back in only when relevant.

Where the code lives. Per-session hook at data/sessions/{channel}/agent-runner-src/memory-stubs.ts, embedding store at groups/{channel}/memory/stubs.db.

High-level setup.

Add memory-stubs.ts to the agent-runner-src template that container-runner.ts writes per channel.
Wire it as a UserPromptSubmit hook in the generated settings.json.
Build and populate stubs.db per channel using your embedding pipeline of choice. Until the DB exists, the hook is a no-op.
Mind the path convention: container path inside, host path for any builder script that runs on the host.
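The retrieval step boils down to cosine similarity plus a top-k cut. The sketch below works on embeddings that are already in memory. The Stub shape, the function names, and the minScore threshold are assumptions, and loading rows from stubs.db or computing embeddings is left out.

```typescript
// Assumed in-memory shape of one row from stubs.db.
interface Stub {
  id: string;
  text: string;
  embedding: number[];
}

// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 for a zero vector.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Returns up to k stubs, most similar first, dropping scores below minScore.
function topMatches(query: number[], stubs: Stub[], k: number, minScore = 0): Stub[] {
  return stubs
    .map((stub) => ({ stub, score: cosineSimilarity(query, stub.embedding) }))
    .filter((entry) => entry.score >= minScore)
    .sort((left, right) => right.score - left.score)
    .slice(0, k)
    .map((entry) => entry.stub);
}
```

A linear scan like this is fine for a few thousand stubs per channel. Larger stores would want an index, which is outside the scope of this guide.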

Compliance-engine middleware (optional)

What it is. A bundle of PostToolUse and stop-hook scripts that enforce style and safety rules on the assistant's output. Examples include Swedish character validation (catches a where ä belongs), em-dash detection, sanitised bash output, and a PreCompact archive that snapshots conversations before the CLI auto-compacts them.

Why it's useful. Catches recurring style violations before they reach the user, and preserves conversation history that would otherwise be lost to compaction.

Where the code lives. Hook scripts under container/agent-runner/src/hooks/: compliance.ts, sanitize-bash.ts, read-malware-neutralizer.ts, precompact-archive.ts, taskoutput-timeout.ts, with shared helpers in _lib.ts.

High-level setup.

Drop the hook scripts into container/agent-runner/src/hooks/ and rebuild the image.
Register each as the appropriate event in the generated settings.json (PostToolUse, Stop, PreCompact as relevant).
Tune the rule set in compliance.ts to match your channel's style guide. The defaults assume Republiken's Swedish + English bilingual setup.
Verify by sending a deliberate violation and checking the container logs for the rule's log line.
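As an example of what a single rule can look like, here is a minimal em-dash detector. It is a sketch of the idea only: the Violation shape and the function name findEmDashes are hypothetical, and the actual rule set in compliance.ts is not reproduced here.

```typescript
// Hypothetical result shape for one rule hit.
interface Violation {
  rule: string;
  index: number;   // position of the offending character in the text
  excerpt: string; // a little context around it for the log line
}

const EM_DASH = "\u2014";

// Finds every em-dash in the assistant's output.
function findEmDashes(text: string): Violation[] {
  const violations: Violation[] = [];
  let index = text.indexOf(EM_DASH);
  while (index !== -1) {
    violations.push({
      rule: "em-dash",
      index,
      excerpt: text.slice(Math.max(0, index - 10), index + 11),
    });
    index = text.indexOf(EM_DASH, index + 1);
  }
  return violations;
}
```

A hook built on this could log each violation and decide whether to block or merely warn. That policy choice belongs in your own rule configuration.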

Since your installation does not have these features yet, treat them as opt-in. The reference files in section 9 point at the upstream implementations if you want to copy them over later.

8. Rollback

The architecture was designed so that flipping back is a one-line change. No code revert needed.

# Remove the flag from the channel
sed -i '/USE_SUBPROCESS/d' /path/to/nanoclaw/groups/<CANARY_CHANNEL>/.env

# Force a fresh container spawn
docker ps --filter name=nanoclaw-<CANARY_CHANNEL> -q | xargs -r docker kill

The next message in that channel falls through to USE_PROVIDERS=1 if it is still set, otherwise to the legacy SDK path. No data migration is needed. Conversations and memory stubs are unaffected.

If something corrupted the database (it should not, but just in case):

systemctl --user stop nanoclaw.service
cp /path/to/nanoclaw/store/messages.db.bak-YYYYMMDD /path/to/nanoclaw/store/messages.db
# Run the git and build commands from the repo root.
cd /path/to/nanoclaw
git checkout <PREV_COMMIT>
cd container && ./build.sh
systemctl --user start nanoclaw.service

Tip. Keep both USE_PROVIDERS=1 and USE_SUBPROCESS=1 in the canary channel's .env while you soak. Toggling one flag is faster than juggling commits, and you can A/B between paths if a regression shows up.

9. Reference files

Anchor points to read in the codebase if you need to dig deeper. Paths are relative to the nanoclaw repo root.

File	Why it matters
`container/agent-runner/POC-SUBPROCESS.md`	Original design document. Why it exists, what is and is not implemented, full migration plan.
`container/agent-runner/SUBPROCESS-CANARY.md`	Canary rollout playbook. Verification steps mirror the smoke checklist above.
`container/agent-runner/src/providers/cli-subprocess.ts`	The provider. Spawns the CLI, parses stream-json, manages session resume.
`container/agent-runner/src/hooks/`	All on-disk hook scripts. `inject-tool-guides.ts`, `compliance.ts`, `precompact-archive.ts`, `read-malware-neutralizer.ts`, `sanitize-bash.ts`, `taskoutput-timeout.ts`, plus a shared `_lib.ts`.
`src/container-runner.ts`	Host-side. Creates per-channel session dirs, syncs tool guides and rules, mounts `/home/node/.claude` and `/app/src`, mounts the credentials file.
`data/sessions/{channel}/agent-runner-src/memory-stubs.ts`	Per-session runtime hook. Cosine similarity over `stubs.db`, returns matched archive entries for prompt injection.
`container/agent-runner/src/providers/factory.ts`	Picks the provider based on `USE_SUBPROCESS` / `USE_PROVIDERS`. Read this if a flag does not seem to take effect.
