Nanoclaw migration guide

From legacy SDK to the subprocessor architecture

A practical, action-oriented walkthrough for moving an existing nanoclaw deployment off the in-process SDK runner and onto the per-channel subprocess provider. Written for an operator who already runs nanoclaw and is comfortable with systemd, Docker, and Node.

Audience: operators | Difficulty: medium | Reversible: yes

1. What are subprocessors?

The legacy nanoclaw runner uses @anthropic-ai/claude-agent-sdk directly. Every channel calls query() from inside the same Node.js process. Hooks are JS callbacks. State lives in memory. One bad turn can cascade.

The subprocessor architecture spawns Anthropic's official claude CLI as a child process per channel, in headless mode (--print --output-format stream-json --input-format stream-json). The CLI handles auth, model dispatch, and tool routing. Nanoclaw orchestrates and pipes events back into the host.
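To make the stream-json piping concrete, here is a minimal sketch of how a host might turn the CLI's stdout into events. It assumes the output is newline-delimited JSON and that each event carries a string type and an optional subtype. The names StreamJsonParser, CliEvent, and eventLabel are hypothetical and are not taken from cli-subprocess.ts.

```typescript
// Assumed event shape: only `type` and the optional `subtype` are relied on.
interface CliEvent {
  type: string;
  subtype?: string;
  [key: string]: unknown;
}

// Type guard: accepts any JSON object that has a string `type` field.
function isCliEvent(value: unknown): value is CliEvent {
  return (
    typeof value === "object" &&
    value !== null &&
    typeof (value as { type?: unknown }).type === "string"
  );
}

// Buffers stdout chunks and emits one event per complete JSON line.
class StreamJsonParser {
  private buffer = "";

  // Feed one stdout chunk; returns every event completed by this chunk.
  push(chunk: string): CliEvent[] {
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    // The last element is an unfinished line, or "" after a trailing newline.
    this.buffer = lines.pop() ?? "";
    const events: CliEvent[] = [];
    for (const line of lines) {
      const trimmed = line.trim();
      if (trimmed === "") continue;
      try {
        const parsed: unknown = JSON.parse(trimmed);
        if (isCliEvent(parsed)) events.push(parsed);
      } catch {
        // Non-JSON bootstrap noise is skipped rather than treated as fatal.
      }
    }
    return events;
  }
}

// Readable label for logging, for example "system/init".
function eventLabel(event: CliEvent): string {
  return event.subtype ? `${event.type}/${event.subtype}` : event.type;
}
```

Buffering matters because a chunk boundary can fall in the middle of a JSON line. The parser holds the partial line until the next chunk completes it.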

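Why bother: each channel gets its own CLI process instead of sharing one Node process, sessions can be resumed with --resume rather than being lost on restart, and a single claude login on the host covers every channel. Section 2 compares the two runners point by point.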

Subprocessors also enable a set of optional runtime hooks (tool-guide injection, memory-stubs, compliance middleware) that you can layer on later. See section 7, Optional advanced features. None of those are required for the core migration.

2. Differences from the legacy single-process runner

| Concern | Legacy SDK runner | Subprocess provider |
|---|---|---|
| Process model | One Node process for the whole instance, all channels share it. | One claude CLI subprocess per channel container, kept alive across turns. |
| Where hooks run | In-process JS callbacks bound to the SDK query() options. | On-disk hook scripts under container/agent-runner/src/hooks/*.ts, referenced from a generated settings.json via --settings. |
| Session state | SDK MessageStream in memory, lost on restart. | CLI session ID, resumable with --resume <sessionId>. Long-running CLI keeps state across turns over stream-json stdin. |
| Auth | CLAUDE_CODE_OAUTH_TOKEN env var posted to /v1/messages. | ~/.claude/.credentials.json on the host, mounted read-only into the container at /home/node/.claude/.credentials.json. One-time claude login covers all channels. |
| MCP servers | Inline mcpServers option to query(). | Generated settings.json in a temp dir, passed via --settings. CLI starts MCP processes per spawn. |
| Hook protocol | JS function signatures defined by the SDK. | Stdin JSON contract from the CLI, scripts respond on stdout. Versioned, may drift across CLI releases. |
| Cancellation | Abort signal on the iterable. | SIGTERM on the child process. Stub killAgent() in the provider, full plumbing pending. |
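The hook protocol entry in the comparison above (JSON in on stdin, response on stdout) can be sketched as a pure function. The field names hook_event_name, tool_name, decision, and reason are assumptions about the CLI's hook contract. Since the guide notes that the contract may drift between CLI releases, verify them against the version you run. runHook and blockedTools are hypothetical names used only for illustration.

```typescript
// Assumed subset of the hook payload; real payloads carry more fields.
interface HookInput {
  hook_event_name: string;
  tool_name?: string;
  tool_input?: Record<string, unknown>;
}

// Assumed response shape; an empty object means "no objection".
interface HookOutput {
  decision?: "block";
  reason?: string;
}

// Takes the raw stdin text and returns the text to print on stdout.
function runHook(rawStdin: string, blockedTools: ReadonlySet<string>): string {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawStdin);
  } catch {
    // Design choice in this sketch: fail open on unparseable input.
    return JSON.stringify({});
  }
  if (typeof parsed !== "object" || parsed === null) {
    return JSON.stringify({});
  }
  const input = parsed as Partial<HookInput>;
  const output: HookOutput = {};
  if (
    input.hook_event_name === "PreToolUse" &&
    typeof input.tool_name === "string" &&
    blockedTools.has(input.tool_name)
  ) {
    output.decision = "block";
    output.reason = `Tool ${input.tool_name} is disabled in this channel`;
  }
  return JSON.stringify(output);
}
```

Keeping the decision logic separate from stdin and stdout handling lets the hook be unit-tested without spawning the CLI. Whether to fail open or closed on malformed input is a policy decision the sketch does not settle for you.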

Three runtime paths now coexist in the codebase. They are selected by environment flags per channel:

| Channel .env | Path used |
|---|---|
| USE_SUBPROCESS=1 | CLI subprocess provider (this guide) |
| USE_PROVIDERS=1 (no subprocess) | Middleware-wrapped SDK provider (Tier 2) |
| neither | Legacy inline-hooks SDK path |

USE_SUBPROCESS takes precedence over USE_PROVIDERS. They are not stacked.
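The precedence rule can be written as a small selector. This is a sketch, not the contents of factory.ts. The function name, the return labels, and the assumption that a flag counts as set only when its value is exactly "1" are all illustrative.

```typescript
type RuntimePath = "cli-subprocess" | "middleware-sdk" | "legacy-sdk";

// Mirrors the precedence described above: subprocess first, then providers,
// then the legacy path. Assumes a flag is "on" only when it equals "1".
function selectRuntimePath(env: Record<string, string | undefined>): RuntimePath {
  if (env["USE_SUBPROCESS"] === "1") return "cli-subprocess";
  if (env["USE_PROVIDERS"] === "1") return "middleware-sdk";
  return "legacy-sdk";
}
```

With both flags set, the selector returns "cli-subprocess". Removing USE_SUBPROCESS drops the channel back to "middleware-sdk" without touching the other line.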

3. Prerequisites

Read this before you start. Run the migration on one canary channel first. Do not flip the flag globally. The original rollout used personligt as the canary and stayed there for a week before expanding to other channels. Treat your most active channel as production and pick something low-traffic for the first run.

4. Migration steps

Step 1: Backup

# From the host. Adjust paths to match your install.
cd /path/to/nanoclaw
cp store/messages.db store/messages.db.bak-$(date +%Y%m%d)
tar czf groups-backup-$(date +%Y%m%d).tgz groups/

Step 2: Pull the latest code

cd /path/to/nanoclaw
git fetch origin
git checkout main
git pull origin main

If you forked from an older snapshot, rebase or cherry-pick the commits that landed the subprocess provider, the hooks under container/agent-runner/src/hooks/, and the credentials mount block in container-runner.ts. Look for cli-subprocess.ts in the providers folder as the marker.

Step 3: Install dependencies

cd /path/to/nanoclaw
npm install
cd container/agent-runner
npm install

If you skip the agent-runner install, the hook scripts will compile but fail at runtime when they reach better-sqlite3 or jose.

Step 4: Build the container image

cd /path/to/nanoclaw/container
./build.sh

The build script compiles the agent-runner with tsc, copies hook sources to /app/src/hooks/ in the image, and tags the image. entrypoint.sh writes compiled output to /tmp/dist at container start, so the mounted /app/src can stay read-only.

Step 5: One-time CLI login on the host

claude login

This writes ~/.claude/.credentials.json. Refresh tokens typically last 30 to 90 days; access tokens auto-refresh silently. You will need to re-run claude login roughly once per quarter, or after an Anthropic-side revocation or a password change. All subprocess-enabled channels share the same credentials file.

Step 6: Verify the per-channel session layout

When the host service starts, container-runner.ts creates the per-channel session structure:

data/sessions/{channel}/
  .claude/
    settings.json
    skills/
    todos/
    ...
  agent-runner-src/
    index.ts
    ipc-mcp-stdio.ts

The settings.json is generated per channel and points at the hooks directory baked into the image. The agent-runner-src/ tree contains the per-session runtime scripts.
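As an illustration of what a generated settings.json might contain, the sketch below builds a hooks section that wires one on-disk script to the PreToolUse event. The hooks, matcher, type, and command layout is my understanding of the CLI's settings format. The "Task|Agent" matcher, the node command line, and the /tmp/dist/hooks path used in the usage note are assumptions, not values read from container-runner.ts.

```typescript
interface HookCommand {
  type: "command";
  command: string;
}

interface HookMatcher {
  matcher: string;
  hooks: HookCommand[];
}

interface GeneratedSettings {
  hooks: Record<string, HookMatcher[]>;
}

// Builds the settings object for one channel. `hooksDir` is wherever the
// compiled hook scripts live inside the container (assumed, not confirmed).
function buildSettings(hooksDir: string): GeneratedSettings {
  const command = (script: string): HookCommand => ({
    type: "command",
    command: `node ${hooksDir}/${script}`,
  });
  return {
    hooks: {
      PreToolUse: [
        { matcher: "Task|Agent", hooks: [command("inject-tool-guides.js")] },
      ],
    },
  };
}
```

Serializing buildSettings("/tmp/dist/hooks") with JSON.stringify(settings, null, 2) gives a file the CLI could be pointed at via --settings. Compare it with the settings.json the host actually generates before relying on the shape.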

Step 7: Enable the flag on a canary channel

echo 'USE_SUBPROCESS=1' >> /path/to/nanoclaw/groups/<CANARY_CHANNEL>/.env

# If you run channels as Docker containers, kill the old container so
# the next message spawns a fresh one with the new env.
docker ps --filter name=nanoclaw-<CANARY_CHANNEL> -q | xargs -r docker kill

Leave any pre-existing USE_PROVIDERS=1 in the same file. USE_SUBPROCESS wins, but keeping the other flag means you can A/B by toggling one line.

Step 8: Restart the host service

systemctl --user restart nanoclaw.service
journalctl --user -u nanoclaw.service -f

Watch the logs as the canary channel boots its first subprocess.

Step 9: Smoke test

Send a message in the canary channel. The expected log progression:

[agent-runner] USE_SUBPROCESS=1: cli-subprocess provider loaded
[agent-runner] USE_SUBPROCESS=1: using long-running CLI subprocess path
[cli-subprocess] Spawning /usr/local/bin/claude (cwd=/workspace/group, ...)
[claude-cli] ... CLI bootstrap noise ...
[cli-subprocess] event: system/init
[cli-subprocess] event: assistant
[cli-subprocess] event: result
[agent-runner] Result #1 text=...

5. Verifying it works

Subprocess actually started

Look for [cli-subprocess] Spawning followed by event: system/init. If you see [agent-runner] USE_PROVIDERS=1 instead, the flag did not load. Check the channel's .env file for typos.
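If you would rather script this check than eyeball it, the following sketch classifies a captured log by the markers named above. The status labels and the function name are hypothetical. The marker strings are copied from the log lines in this guide.

```typescript
type BootStatus =
  | "subprocess-ok"
  | "spawned-no-init"
  | "fell-back-to-providers"
  | "unknown";

// Classifies captured log text using the markers quoted in this guide.
function classifyBoot(logText: string): BootStatus {
  const spawnAt = logText.indexOf("[cli-subprocess] Spawning");
  if (spawnAt >= 0) {
    // system/init must appear after the spawn line to count as a clean boot.
    const initAt = logText.indexOf("event: system/init", spawnAt);
    return initAt >= 0 ? "subprocess-ok" : "spawned-no-init";
  }
  if (logText.includes("[agent-runner] USE_PROVIDERS=1")) {
    return "fell-back-to-providers";
  }
  return "unknown";
}
```

A result of "fell-back-to-providers" points at the channel's .env file. "spawned-no-init" suggests the CLI started but never completed its handshake.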

Roundtrip a normal message

Send a plain message in the canary channel. Expected sequence in the container logs:

[cli-subprocess] event: system/init
[cli-subprocess] event: assistant
[cli-subprocess] event: result
[agent-runner] Result #1 text=...

The reply should land in the channel just like before.

Session resume across turns

Send a follow-up message. The CLI should reuse the same session ID rather than spawning a fresh one. Look for --resume in the spawn args on turn two.
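For orientation, this is roughly how the spawn arguments could be assembled so that --resume appears only when a session ID is known. The flags are the ones named in this guide. buildCliArgs and its option names are hypothetical, and the real provider may order or extend the arguments differently.

```typescript
interface CliArgOptions {
  sessionId?: string;    // Session ID captured from a previous turn, if any.
  settingsPath?: string; // Path to the generated settings.json, if any.
}

// Assembles headless-mode arguments for the claude CLI.
function buildCliArgs(options: CliArgOptions = {}): string[] {
  const args = [
    "--print",
    "--output-format", "stream-json",
    "--input-format", "stream-json",
  ];
  if (options.settingsPath !== undefined && options.settingsPath !== "") {
    args.push("--settings", options.settingsPath);
  }
  if (options.sessionId !== undefined && options.sessionId !== "") {
    args.push("--resume", options.sessionId);
  }
  return args;
}
```

On a first turn, buildCliArgs() returns only the headless flags. Once a session ID has been captured, passing it as sessionId appends --resume followed by that ID.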

Auth-dead path

Move ~/.claude/.credentials.json aside on the host, restart the canary container, and send a message. The channel should receive a message along the lines of "Claude CLI auth is dead. Run claude login on the host." Restore the credentials and re-test before continuing.

Once these all pass, soak the canary for at least a few days under normal traffic before flipping additional channels. Roll out one channel at a time. Do not bulk-enable.

6. Common pitfalls

Agent-runner dependencies missing

You forgot npm install inside container/agent-runner/. The runtime pulls in better-sqlite3, jose, and a few others. Re-run install, rebuild the image, restart the container.

Empty trigger word in register_group

If you registered a group programmatically and passed an empty string as the trigger, the matcher will short-circuit and treat every message as a trigger. Set a sensible default such as the bot's name.
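The failure mode is easy to reproduce if the matcher is prefix-based, which is an assumption here since the matcher's source is not shown in this guide. In JavaScript, any string starts with the empty string. The sketch below guards against that with a fallback. resolveTrigger, isTriggered, and the "@bot" value are hypothetical.

```typescript
// Falls back to a default when the configured trigger is missing or blank.
function resolveTrigger(trigger: string | undefined, fallback: string): string {
  const cleaned = (trigger ?? "").trim();
  return cleaned === "" ? fallback : cleaned;
}

// Prefix match, case-insensitive. An empty trigger never matches, because
// "anything".startsWith("") is true and would fire on every message.
function isTriggered(message: string, trigger: string): boolean {
  if (trigger === "") return false;
  return message.trim().toLowerCase().startsWith(trigger.toLowerCase());
}
```

Calling resolveTrigger("", "@bot") yields "@bot", so a group registered with an empty trigger still requires an explicit mention.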

CLI not on PATH inside the container

If the image was built before the claude binary was installed, the subprocess provider will fail to spawn. Set CLAUDE_CLI_PATH=/usr/local/bin/claude in the channel .env, or rebuild the image with the binary baked in.

Cold start on first message

The CLI cold starts in roughly one to three seconds before the model call. There is no pre-warm equivalent for the SDK's startup() yet. Acceptable for most channels, noticeable on highly interactive ones.

MCP server lifecycle

Each spawn starts fresh MCP processes. The long-running CLI amortizes this across follow-ups in a turn, but the first turn still pays the cost. Plan accordingly if your MCP servers are heavyweight.

Cross-channel auth notifications

From a non-main channel, the auth-dead notification cannot cross channels by default. The IPC layer rejects cross-channel sends from non-main. Until you enable subprocess on a main channel, the auth-dead message lands in whichever channel detected it.

7. Optional advanced features (opt-in)

Once you're on subprocessors, the architecture lets you layer extra runtime features on top. The three below are common in our internal deployments but are not part of the core migration. A vanilla nanoclaw install does not ship with these files, so treat them as opt-in. Only add them after the canary has soaked on plain subprocessors and you have a specific need.

Each of these is a separate body of code. None of them are required for subprocessors to work. Pick the ones that solve a problem you actually have.

Tool-guide injection (optional)

What it is. A PreToolUse hook on Task / Agent that scans subagent prompts for trigger words and appends the matching tool-guide markdown to the spawned subagent's system prompt. Keeps the parent agent's prompt short while still delivering relevant guidance just-in-time.

Why it's useful. Tool guides for things like Google Ads, BigQuery, or Gmail are large. Loading all of them into every prompt is wasteful. Trigger-based injection means a subagent only sees what it actually needs.

Where the code lives. Hook script at container/agent-runner/src/hooks/inject-tool-guides.ts, guide content under groups/shared/tool-guides/, manifest at tool-guides/index.json.
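The trigger-word scan can be sketched as follows. The manifest shape (a list of entries, each with a file and a triggers array) is hypothetical, because the format of tool-guides/index.json is not shown in this guide. The matching rule, a case-insensitive substring test, is also an assumption.

```typescript
// Hypothetical manifest entry; the real index.json may look different.
interface GuideEntry {
  file: string;
  triggers: string[];
}

// Returns the guide files whose trigger words occur in the subagent prompt.
function matchGuides(prompt: string, manifest: readonly GuideEntry[]): string[] {
  const haystack = prompt.toLowerCase();
  return manifest
    .filter((entry) =>
      entry.triggers.some(
        // Empty triggers are ignored so they cannot match every prompt.
        (trigger) => trigger !== "" && haystack.includes(trigger.toLowerCase()),
      ),
    )
    .map((entry) => entry.file);
}
```

A prompt mentioning BigQuery would pull in only the BigQuery guide under this scheme. Plain substring matching can over-trigger on short words, so a real hook may want word-boundary checks.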

High-level setup.

Memory-stubs runtime hook (optional)

What it is. A per-session hook that runs cosine similarity between the user's incoming message and a per-channel stubs.db of memory snippets, then injects the top matches into the system prompt. Acts as a lightweight retrieval layer for long-lived channel memory.

Why it's useful. Lets you keep memory.md short by offloading older or topic-specific entries into stubs. The runtime pulls them back in only when relevant.

Where the code lives. Per-session hook at data/sessions/{channel}/agent-runner-src/memory-stubs.ts, embedding store at groups/{channel}/memory/stubs.db.
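The retrieval step reduces to cosine similarity plus a top-k cut. The sketch below assumes the message and the stubs have already been embedded into equal-length vectors. How embeddings are produced and how stubs.db is read are out of scope, and the Stub type, topMatches, and minScore are hypothetical names.

```typescript
interface Stub {
  id: string;
  text: string;
  embedding: number[];
}

// Cosine similarity; returns 0 for mismatched lengths or zero vectors.
function cosine(a: readonly number[], b: readonly number[]): number {
  if (a.length !== b.length || a.length === 0) return 0;
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Returns up to k stubs scoring at least minScore, best match first.
function topMatches(
  query: readonly number[],
  stubs: readonly Stub[],
  k: number,
  minScore = 0,
): Stub[] {
  return stubs
    .map((stub) => ({ stub, score: cosine(query, stub.embedding) }))
    .filter((scored) => scored.score >= minScore)
    .sort((left, right) => right.score - left.score)
    .slice(0, k)
    .map((scored) => scored.stub);
}
```

A minimum-score threshold keeps weakly related stubs out of the prompt even when fewer than k strong matches exist.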

High-level setup.

Compliance-engine middleware (optional)

What it is. A bundle of PostToolUse and stop-hook scripts that enforce style and safety rules on the assistant's output. Examples include Swedish character validation (catches "a" where "ä" belongs), em-dash detection, sanitised bash output, and a PreCompact archive that snapshots conversations before the CLI auto-compacts them.
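Of the checks listed, em-dash detection is the simplest to illustrate. The sketch below is a stand-alone example of that one rule, not an excerpt from compliance.ts. The function names and the report format are hypothetical.

```typescript
// U+2014, written as an escape so the source file itself stays clean.
const EM_DASH = "\u2014";

// Returns the zero-based offset of every em dash in the text.
function findEmDashes(text: string): number[] {
  const offsets: number[] = [];
  let at = text.indexOf(EM_DASH);
  while (at !== -1) {
    offsets.push(at);
    at = text.indexOf(EM_DASH, at + 1);
  }
  return offsets;
}

// Returns a human-readable violation message, or null when the text is clean.
function emDashReport(text: string): string | null {
  const hits = findEmDashes(text);
  if (hits.length === 0) return null;
  return `Found ${hits.length} em dash(es) at offsets ${hits.join(", ")}`;
}
```

A stop hook could run a check like this on the final assistant text and surface the report as a violation.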

Why it's useful. Catches recurring style violations before they reach the user, and preserves conversation history that would otherwise be lost to compaction.

Where the code lives. Hook scripts under container/agent-runner/src/hooks/: compliance.ts, sanitize-bash.ts, read-malware-neutralizer.ts, precompact-archive.ts, taskoutput-timeout.ts, with shared helpers in _lib.ts.

High-level setup.

The reference files in section 9 point at the upstream implementations if you want to copy any of these features over later.

8. Rollback

The architecture was designed so that flipping back is a one-line change. No code revert needed.

# Remove the flag from the channel
sed -i '/USE_SUBPROCESS/d' /path/to/nanoclaw/groups/<CANARY_CHANNEL>/.env

# Force a fresh container spawn
docker ps --filter name=nanoclaw-<CANARY_CHANNEL> -q | xargs -r docker kill

Next message in that channel falls through to USE_PROVIDERS=1 if it is still set, otherwise to the legacy SDK path. No data migration. Conversations and memory stubs are unaffected.

If something corrupted the database (it should not, but just in case):

systemctl --user stop nanoclaw.service
cp /path/to/nanoclaw/store/messages.db.bak-YYYYMMDD /path/to/nanoclaw/store/messages.db
git checkout <PREV_COMMIT>
cd container && ./build.sh
systemctl --user start nanoclaw.service

Tip. Keep both USE_PROVIDERS=1 and USE_SUBPROCESS=1 in the canary channel's .env while you soak. Toggling one flag is faster than juggling commits, and you can A/B between paths if a regression shows up.

9. Reference files

Anchor points to read in the codebase if you need to dig deeper. Paths are relative to the nanoclaw repo root.

| File | Why it matters |
|---|---|
| container/agent-runner/POC-SUBPROCESS.md | Original design document. Why it exists, what is and is not implemented, full migration plan. |
| container/agent-runner/SUBPROCESS-CANARY.md | Canary rollout playbook. Verification steps mirror the smoke checklist above. |
| container/agent-runner/src/providers/cli-subprocess.ts | The provider. Spawns the CLI, parses stream-json, manages session resume. |
| container/agent-runner/src/hooks/ | All on-disk hook scripts. inject-tool-guides.ts, compliance.ts, precompact-archive.ts, read-malware-neutralizer.ts, sanitize-bash.ts, taskoutput-timeout.ts, plus a shared _lib.ts. |
| src/container-runner.ts | Host-side. Creates per-channel session dirs, syncs tool guides and rules, mounts /home/node/.claude and /app/src, mounts the credentials file. |
| data/sessions/{channel}/agent-runner-src/memory-stubs.ts | Per-session runtime hook. Cosine similarity over stubs.db, returns matched archive entries for prompt injection. |
| container/agent-runner/src/providers/factory.ts | Picks the provider based on USE_SUBPROCESS / USE_PROVIDERS. Read this if a flag does not seem to take effect. |