Nanoclaw migration guide
A practical, action-oriented walkthrough for moving an existing nanoclaw deployment off the in-process SDK runner and onto the per-channel subprocess provider. Written for an operator who already runs nanoclaw and is comfortable with systemd, Docker, and Node.
The legacy nanoclaw runner uses @anthropic-ai/claude-agent-sdk directly. Every channel calls query() from inside the same Node.js process. Hooks are JS callbacks. State lives in memory. One bad turn can cascade.
The subprocessor architecture spawns Anthropic's official claude CLI as a child process per channel, in headless mode (--print --output-format stream-json --input-format stream-json). The CLI handles auth, model dispatch, and tool routing. Nanoclaw orchestrates and pipes events back into the host.
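As a rough illustration of the headless invocation described above, the argument list for one channel could be assembled as follows. This is a minimal sketch, not the code in cli-subprocess.ts: the helper name `buildCliArgs` and the `CliSpawnOptions` shape are hypothetical, and only flags named in this guide are used. Your CLI release may require additional flags.

```typescript
// Hypothetical helper: name and option shape are illustrative only.
interface CliSpawnOptions {
  settingsPath: string; // generated settings.json, passed via --settings
  sessionId?: string;   // known from the second turn onward
}

function buildCliArgs(opts: CliSpawnOptions): string[] {
  const args: string[] = [
    "--print",
    "--output-format", "stream-json",
    "--input-format", "stream-json",
    "--settings", opts.settingsPath,
  ];
  // Only resume when a session ID from an earlier turn is available.
  if (opts.sessionId) {
    args.push("--resume", opts.sessionId);
  }
  return args;
}

// First turn: no session yet, so no --resume.
const firstTurnArgs = buildCliArgs({ settingsPath: "/tmp/settings.json" });
// Later turn: the same flags plus --resume <sessionId>.
const laterTurnArgs = buildCliArgs({ settingsPath: "/tmp/settings.json", sessionId: "abc123" });
```

The resulting array would be handed to whatever spawns the claude binary; keeping it as an array rather than a joined string avoids shell quoting problems.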
Why bother:
- Each channel gets data/sessions/{channel}/ with its own settings, todos, and skills directory. Channels cannot leak state into one another.
- Sessions resume with --resume, so long conversations survive process restarts.
Subprocessors also enable a set of optional runtime hooks (tool-guide injection, memory-stubs, compliance middleware) that you can layer on later. See Optional advanced features. None of those are required for the core migration.
| Concern | Legacy SDK runner | Subprocess provider |
|---|---|---|
| Process model | One Node process for the whole instance, all channels share it. | One claude CLI subprocess per channel container, kept alive across turns. |
| Where hooks run | In-process JS callbacks bound to the SDK query() options. | On-disk hook scripts under container/agent-runner/src/hooks/*.ts, referenced from a generated settings.json via --settings. |
| Session state | SDK MessageStream in memory, lost on restart. | CLI session ID, resumable with --resume <sessionId>. Long-running CLI keeps state across turns over stream-json stdin. |
| Auth | CLAUDE_CODE_OAUTH_TOKEN env var posted to /v1/messages. | ~/.claude/.credentials.json on the host, mounted read-only into the container at /home/node/.claude/.credentials.json. One-time claude login covers all channels. |
| MCP servers | Inline mcpServers option to query(). | Generated settings.json in a temp dir, passed via --settings. CLI starts MCP processes per spawn. |
| Hook protocol | JS function signatures defined by the SDK. | Stdin JSON contract from the CLI, scripts respond on stdout. Versioned, may drift across CLI releases. |
| Cancellation | Abort signal on the iterable. | SIGTERM on the child process. Stub killAgent() in the provider, full plumbing pending. |
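To make the "Hook protocol" row concrete, here is a minimal sketch of a hook's decision logic as a pure function: JSON text in, response object out. The field names (`tool_name`, `tool_input`) and the response shape (`decision`, `reason`) are assumptions for illustration. The table itself warns that the contract is versioned and may drift, so check them against the CLI release you run.

```typescript
// Assumed shape of the JSON the CLI writes to a hook's stdin.
interface HookInput {
  hook_event_name?: string;
  tool_name?: string;
  tool_input?: { [key: string]: unknown };
}

// Hypothetical response shape: an empty object means "no objection".
interface HookResponse {
  decision?: "block";
  reason?: string;
}

function handleHookInput(rawStdin: string): HookResponse {
  let input: HookInput;
  try {
    input = JSON.parse(rawStdin) as HookInput;
  } catch {
    // Unparseable input: stay out of the way rather than break the turn.
    return {};
  }
  const command = input.tool_input ? input.tool_input["command"] : undefined;
  // Illustrative rule only: refuse one obviously destructive shell command.
  if (
    input.tool_name === "Bash" &&
    typeof command === "string" &&
    command.indexOf("rm -rf /") !== -1
  ) {
    return { decision: "block", reason: "refusing destructive command" };
  }
  return {};
}
```

In a real hook script the function would be wrapped by code that reads all of stdin, calls it, and writes `JSON.stringify(response)` to stdout. Keeping the decision logic pure makes it easy to unit test outside the container.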
Three runtime paths now coexist in the codebase. They are selected by environment flags per channel:
| Channel .env | Path used |
|---|---|
| USE_SUBPROCESS=1 | CLI subprocess provider (this guide) |
| USE_PROVIDERS=1 (no subprocess) | Middleware-wrapped SDK provider (Tier 2) |
| neither | Legacy inline-hooks SDK path |
USE_SUBPROCESS takes precedence over USE_PROVIDERS. They are not stacked.
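The precedence rule can be sketched as a small selection function. This is an illustration, not the contents of factory.ts: the function name, the return labels, and the assumption that a flag counts as enabled only when its value is exactly "1" are all hypothetical.

```typescript
type RuntimePath = "cli-subprocess" | "providers" | "legacy";

// Assumption: a flag is enabled only when its value is exactly "1".
function selectRuntimePath(env: { [key: string]: string | undefined }): RuntimePath {
  // USE_SUBPROCESS is checked first, so it wins when both flags are set.
  if (env["USE_SUBPROCESS"] === "1") return "cli-subprocess";
  if (env["USE_PROVIDERS"] === "1") return "providers";
  return "legacy";
}

const bothFlagsSet = selectRuntimePath({ USE_SUBPROCESS: "1", USE_PROVIDERS: "1" });
// bothFlagsSet is "cli-subprocess": the flags are not stacked.
```

Because the checks are ordered rather than combined, removing USE_SUBPROCESS from a channel that still has USE_PROVIDERS=1 drops it to the middleware-wrapped path, not to the legacy one.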
- Node 22, for fetch(), Buffer tweaks, and other Node 22 built-ins.
- Disk space for per-channel session data: dist/, hooks, and tool guides land under data/sessions/{channel}/.
- A working CLI login: run claude login as the same user that will own the container. The credentials file lands at ~/.claude/.credentials.json.
- A backup of store/messages.db and every channel directory under groups/.
Read this before you start. Run the migration on one canary channel first. Do not flip the flag globally. The original rollout used personligt as the canary and stayed there for a week before expanding to other channels. Treat your most active channel as production and pick something low-traffic for the first run.
# From the host. Adjust paths to match your install.
cd /path/to/nanoclaw
cp store/messages.db store/messages.db.bak-$(date +%Y%m%d)
tar czf groups-backup-$(date +%Y%m%d).tgz groups/
cd /path/to/nanoclaw
git fetch origin
git checkout main
git pull origin main
If you forked from an older snapshot, rebase or cherry-pick the commits that landed the subprocess provider, the hooks under container/agent-runner/src/hooks/, and the credentials mount block in container-runner.ts. Look for cli-subprocess.ts in the providers folder as the marker.
cd /path/to/nanoclaw
npm install
cd container/agent-runner
npm install
If you skip the agent-runner install, the hook scripts will compile but fail at runtime when they reach better-sqlite3 or jose.
cd /path/to/nanoclaw/container
./build.sh
The build script compiles the agent-runner with tsc, copies hook sources to /app/src/hooks/ in the image, and tags the image. entrypoint.sh writes compiled output to /tmp/dist at container start, so the mounted /app/src can stay read-only.
claude login
This writes ~/.claude/.credentials.json. Refresh tokens typically last 30 to 90 days; access tokens auto-refresh silently. Expect to re-run claude login roughly once per quarter, or after an Anthropic-side revocation or a password change. All subprocess-enabled channels share the same credentials file.
When the host service starts, container-runner.ts creates the per-channel session structure:
data/sessions/{channel}/
.claude/
settings.json
skills/
todos/
...
agent-runner-src/
index.ts
ipc-mcp-stdio.ts
The settings.json is generated per channel and points at the hooks directory baked into the image. The agent-runner-src/ tree contains the per-session runtime scripts.
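For orientation, a generated settings.json that registers hooks might be built along these lines. This is a sketch under stated assumptions: the `buildSettings` helper is hypothetical, the nested `hooks` layout is my understanding of the CLI's settings format rather than something this guide specifies, and the `node <dir>/<script>` command string is an invented example. Compare it with the file that container-runner.ts actually writes.

```typescript
interface HookRegistration {
  eventName: string; // e.g. "PreToolUse"
  matcher: string;   // e.g. "Task"
  script: string;    // file name inside the hooks directory
}

interface HookCommand { type: "command"; command: string; }
interface HookMatcherEntry { matcher: string; hooks: HookCommand[]; }
interface SettingsFile { hooks: { [eventName: string]: HookMatcherEntry[] }; }

function buildSettings(hooksDir: string, regs: HookRegistration[]): SettingsFile {
  const settings: SettingsFile = { hooks: {} };
  for (const reg of regs) {
    const entry: HookMatcherEntry = {
      matcher: reg.matcher,
      // Assumed command format; the real generator may invoke hooks differently.
      hooks: [{ type: "command", command: "node " + hooksDir + "/" + reg.script }],
    };
    const list = settings.hooks[reg.eventName] || [];
    list.push(entry);
    settings.hooks[reg.eventName] = list;
  }
  return settings;
}

const exampleSettings = buildSettings("/tmp/dist/hooks", [
  { eventName: "PreToolUse", matcher: "Task", script: "inject-tool-guides.js" },
  { eventName: "PreToolUse", matcher: "Agent", script: "inject-tool-guides.js" },
]);
```

Serialising `exampleSettings` with `JSON.stringify(exampleSettings, null, 2)` gives a file of the kind passed to the CLI via --settings.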
echo 'USE_SUBPROCESS=1' >> /path/to/nanoclaw/groups/<CANARY_CHANNEL>/.env
# If you run channels as Docker containers, kill the old container so
# the next message spawns a fresh one with the new env.
docker ps --filter name=nanoclaw-<CANARY_CHANNEL> -q | xargs -r docker kill
Leave any pre-existing USE_PROVIDERS=1 in the same file. USE_SUBPROCESS wins, but keeping the other flag means you can A/B by toggling one line.
systemctl --user restart nanoclaw.service
journalctl --user -u nanoclaw.service -f
Watch the logs as the canary channel boots its first subprocess.
Send a message in the canary channel. The expected log progression:
[agent-runner] USE_SUBPROCESS=1: cli-subprocess provider loaded
[agent-runner] USE_SUBPROCESS=1: using long-running CLI subprocess path
[cli-subprocess] Spawning /usr/local/bin/claude (cwd=/workspace/group, ...)
[claude-cli] ... CLI bootstrap noise ...
[cli-subprocess] event: system/init
[cli-subprocess] event: assistant
[cli-subprocess] event: result
[agent-runner] Result #1 text=...
Look for [cli-subprocess] Spawning followed by event: system/init. If you see [agent-runner] USE_PROVIDERS=1 instead, the flag did not load. Check the channel's .env file for typos.
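The event lines in that log come from parsing the CLI's stream-json output. A minimal sketch of that parsing step, assuming one JSON object per stdout line with a `type`, an optional `subtype`, and an optional `session_id` (the field names and the label format are inferred from the log lines above, not taken from cli-subprocess.ts):

```typescript
// Assumed event shape; verify against the CLI release you run.
interface StreamEvent {
  type: string;
  subtype?: string;
  session_id?: string;
}

// Takes complete lines of stdout text and returns the parseable events.
function parseStreamLines(text: string): StreamEvent[] {
  const events: StreamEvent[] = [];
  for (const line of text.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue;
    try {
      const parsed = JSON.parse(trimmed);
      if (parsed && typeof parsed.type === "string") {
        events.push(parsed as StreamEvent);
      }
    } catch {
      // Non-JSON lines (for example CLI bootstrap noise) are skipped.
    }
  }
  return events;
}

// Builds a label such as "system/init" or "assistant".
function eventLabel(ev: StreamEvent): string {
  return ev.subtype ? ev.type + "/" + ev.subtype : ev.type;
}
```

A real reader also has to buffer partial lines, because a chunk from the child's stdout can end in the middle of a JSON object. The sketch assumes it is handed whole lines.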
Send a plain message in the canary channel. Expected sequence in the container logs:
[cli-subprocess] event: system/init
[cli-subprocess] event: assistant
[cli-subprocess] event: result
[agent-runner] Result #1 text=...
The reply should land in the channel just like before.
Send a follow-up message. The CLI should reuse the same session ID rather than spawning a fresh one. Look for --resume in the spawn args on turn two.
Move ~/.claude/.credentials.json aside on the host, restart the canary container, send a message. The channel should receive a message along the lines of Claude CLI auth is dead. Run claude login on the host. Restore the credentials and re-test before continuing.
Once these all pass, soak the canary for at least a few days under normal traffic before flipping additional channels. Roll out one channel at a time. Do not bulk-enable.
Hook scripts that compile but fail at runtime usually mean npm install was skipped inside container/agent-runner/. The runtime pulls in better-sqlite3, jose, and a few others. Re-run the install, rebuild the image, and restart the container.
register_group. If you registered a group programmatically and passed an empty string as the trigger, the matcher will short-circuit and treat every message as a trigger. Set a sensible default such as the bot's name.
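The short-circuit is easy to reproduce. The sketch below uses a hypothetical prefix matcher (not nanoclaw's actual matcher) to show why an empty trigger matches everything, and one possible guard that falls back to a default.

```typescript
// Hypothetical matcher: a message triggers the bot when it starts with the
// trigger word. Every string starts with "", so an empty trigger always matches.
function naiveIsTriggered(message: string, trigger: string): boolean {
  return message.toLowerCase().indexOf(trigger.toLowerCase()) === 0;
}

// Guarded variant: substitute a default instead of accepting an empty trigger.
function isTriggered(message: string, trigger: string, fallback: string): boolean {
  const effective = trigger.trim() === "" ? fallback : trigger;
  return naiveIsTriggered(message, effective);
}

const unguardedResult = naiveIsTriggered("good morning everyone", ""); // true
const guardedResult = isTriggered("good morning everyone", "", "@bot"); // false
```

Validating the trigger at registration time, rather than at match time, would surface the mistake earlier. The guard above is only the smallest possible fix.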
If the image was built before the claude binary was installed, the subprocess provider will fail to spawn. Set CLAUDE_CLI_PATH=/usr/local/bin/claude in the channel .env, or rebuild the image with the binary baked in.
The CLI cold starts in roughly one to three seconds before the model call. There is no pre-warm equivalent for the SDK's startup() yet. Acceptable for most channels, noticeable on highly interactive ones.
Each spawn starts fresh MCP processes. The long-running CLI amortizes this across follow-ups in a turn, but the first turn still pays the cost. Plan accordingly if your MCP servers are heavyweight.
From a non-main channel, the auth-dead notification cannot cross channels by default. The IPC layer rejects cross-channel sends from non-main. Until you enable subprocess on a main channel, the auth-dead message lands in whichever channel detected it.
Once you're on subprocessors, the architecture lets you layer extra runtime features on top. The three below are common in our internal deployments but are not part of the core migration. A vanilla nanoclaw install does not ship with these files, so treat them as opt-in. Only add them after the canary has soaked on plain subprocessors and you have a specific need.
Each of these is a separate body of code. None of them are required for subprocessors to work. Pick the ones that solve a problem you actually have.
What it is. A PreToolUse hook on Task / Agent that scans subagent prompts for trigger words and appends the matching tool-guide markdown to the spawned subagent's system prompt. Keeps the parent agent's prompt short while still delivering relevant guidance just-in-time.
Why it's useful. Tool guides for things like Google Ads, BigQuery, or Gmail are large. Loading all of them into every prompt is wasteful. Trigger-based injection means a subagent only sees what it actually needs.
Where the code lives. Hook script at container/agent-runner/src/hooks/inject-tool-guides.ts, guide content under groups/shared/tool-guides/, manifest at tool-guides/index.json.
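The trigger lookup can be pictured as below. This is a sketch, not inject-tool-guides.ts: it assumes the manifest is a flat object mapping trigger words to guide names, and it uses plain case-insensitive substring matching, both of which are guesses.

```typescript
// Assumed manifest shape: trigger word -> guide name.
type GuideIndex = { [trigger: string]: string };

// Returns the distinct guide names whose trigger words appear in the prompt.
function matchGuides(prompt: string, index: GuideIndex): string[] {
  const haystack = prompt.toLowerCase();
  const matched: string[] = [];
  for (const trigger of Object.keys(index)) {
    const guide = index[trigger];
    if (guide === undefined) continue;
    const found = haystack.indexOf(trigger.toLowerCase()) !== -1;
    if (found && matched.indexOf(guide) === -1) {
      matched.push(guide);
    }
  }
  return matched;
}

const sampleIndex: GuideIndex = { bigquery: "bigquery", bq: "bigquery", gmail: "gmail" };
const sampleMatches = matchGuides("Pull the BigQuery export via bq", sampleIndex);
// sampleMatches contains "bigquery" once, even though two triggers matched.
```

Substring matching is deliberately crude: a short trigger such as "ads" would also fire on "loads". Word-boundary matching is a reasonable refinement if false positives show up.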
High-level setup.
1. Create groups/shared/tool-guides/ on disk and populate it with guide markdown plus an index.json mapping trigger words to guide names.
2. Add the inject-tool-guides.ts hook to your container/agent-runner/src/hooks/ directory.
3. Register it in settings.json as a PreToolUse hook on Task and Agent.
4. container-runner.ts syncs the directory into data/sessions/{channel}/.claude/tool-guides/ on container start.
What it is. A per-session hook that runs cosine similarity between the user's incoming message and a per-channel stubs.db of memory snippets, then injects the top matches into the system prompt. Acts as a lightweight retrieval layer for long-lived channel memory.
Why it's useful. Lets you keep memory.md short by offloading older or topic-specific entries into stubs. The runtime pulls them back in only when relevant.
Where the code lives. Per-session hook at data/sessions/{channel}/agent-runner-src/memory-stubs.ts, embedding store at groups/{channel}/memory/stubs.db.
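The retrieval step reduces to a cosine-similarity ranking. The sketch below works on in-memory arrays and leaves out both the embedding model and the SQLite access; the `Stub` shape and function names are hypothetical.

```typescript
interface Stub {
  text: string;
  embedding: number[];
}

// Cosine similarity of two equal-length vectors; 0 for mismatched or zero vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) return 0;
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Returns the k stubs most similar to the query embedding, best first.
function topStubs(query: number[], stubs: Stub[], k: number): Stub[] {
  return stubs
    .map((stub) => ({ stub: stub, score: cosineSimilarity(query, stub.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((scored) => scored.stub);
}
```

A production version would likely also apply a minimum-score cutoff so that weak matches are not injected just because k slots are available.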
High-level setup.
1. Add memory-stubs.ts to the agent-runner-src template that container-runner.ts writes per channel.
2. Register it as a UserPromptSubmit hook in the generated settings.json.
3. Build stubs.db per channel using your embedding pipeline of choice. Until the DB exists, the hook is a no-op.
What it is. A bundle of PostToolUse and stop-hook scripts that enforce style and safety rules on the assistant's output. Examples include Swedish character validation (catches a where ä belongs), em-dash detection, sanitised bash output, and a PreCompact archive that snapshots conversations before the CLI auto-compacts them.
Why it's useful. Catches recurring style violations before they reach the user, and preserves conversation history that would otherwise be lost to compaction.
Where the code lives. Hook scripts under container/agent-runner/src/hooks/: compliance.ts, sanitize-bash.ts, read-malware-neutralizer.ts, precompact-archive.ts, taskoutput-timeout.ts, with shared helpers in _lib.ts.
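As an illustration of the simplest rule in that bundle, em-dash detection can be written as a scan that reports every offending position. This is a sketch of the idea only. The `StyleViolation` shape is hypothetical and compliance.ts may work differently.

```typescript
interface StyleViolation {
  rule: string;
  index: number; // character offset of the violation in the text
}

// Reports the offset of every em-dash (U+2014) in the text.
function findEmDashes(text: string): StyleViolation[] {
  const violations: StyleViolation[] = [];
  let from = 0;
  while (true) {
    const at = text.indexOf("\u2014", from);
    if (at === -1) break;
    violations.push({ rule: "em-dash", index: at });
    from = at + 1;
  }
  return violations;
}

const sampleViolations = findEmDashes("one \u2014 two");
// sampleViolations holds a single entry at index 4.
```

Returning positions rather than a boolean lets the calling hook quote the offending passage back to the model when it asks for a rewrite.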
High-level setup.
1. Copy the hook scripts into container/agent-runner/src/hooks/ and rebuild the image.
2. Register them in settings.json (PostToolUse, Stop, PreCompact as relevant).
3. Adjust compliance.ts to match your channel's style guide. The defaults assume Republiken's Swedish + English bilingual setup.
Since your installation does not have these features yet, treat them as opt-in. The reference files listed at the end of this guide point at the upstream implementations if you want to copy them over later.
The architecture was designed so that flipping back is a one-line change. No code revert needed.
# Remove the flag from the channel
sed -i '/USE_SUBPROCESS/d' /path/to/nanoclaw/groups/<CANARY_CHANNEL>/.env
# Force a fresh container spawn
docker ps --filter name=nanoclaw-<CANARY_CHANNEL> -q | xargs -r docker kill
Next message in that channel falls through to USE_PROVIDERS=1 if it is still set, otherwise to the legacy SDK path. No data migration. Conversations and memory stubs are unaffected.
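If you script rollbacks from Node rather than the shell, the sed step above can be mirrored with a small pure function. A minimal sketch (the function name is hypothetical) that, like the sed expression, drops every line mentioning the flag:

```typescript
// Removes every line that contains the flag name, matching sed '/FLAG/d'.
function removeFlag(envText: string, flag: string): string {
  return envText
    .split("\n")
    .filter((line) => line.indexOf(flag) === -1)
    .join("\n");
}

const envBefore = "USE_PROVIDERS=1\nUSE_SUBPROCESS=1\n";
const envAfter = removeFlag(envBefore, "USE_SUBPROCESS");
// envAfter is "USE_PROVIDERS=1\n", so the channel falls back to the providers path.
```

Like the sed command, this also removes commented-out lines that mention the flag, which is usually what you want during a rollback.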
If something corrupted the database (it should not, but just in case):
systemctl --user stop nanoclaw.service
cp /path/to/nanoclaw/store/messages.db.bak-YYYYMMDD /path/to/nanoclaw/store/messages.db
git checkout <PREV_COMMIT>
cd container && ./build.sh
systemctl --user start nanoclaw.service
Tip. Keep both USE_PROVIDERS=1 and USE_SUBPROCESS=1 in the canary channel's .env while you soak. Toggling one flag is faster than juggling commits, and you can A/B between paths if a regression shows up.
Anchor points to read in the codebase if you need to dig deeper. Paths are relative to the nanoclaw repo root.
| File | Why it matters |
|---|---|
| container/agent-runner/POC-SUBPROCESS.md | Original design document. Why it exists, what is and is not implemented, full migration plan. |
| container/agent-runner/SUBPROCESS-CANARY.md | Canary rollout playbook. Verification steps mirror the smoke checklist above. |
| container/agent-runner/src/providers/cli-subprocess.ts | The provider. Spawns the CLI, parses stream-json, manages session resume. |
| container/agent-runner/src/hooks/ | All on-disk hook scripts. inject-tool-guides.ts, compliance.ts, precompact-archive.ts, read-malware-neutralizer.ts, sanitize-bash.ts, taskoutput-timeout.ts, plus a shared _lib.ts. |
| src/container-runner.ts | Host-side. Creates per-channel session dirs, syncs tool guides and rules, mounts /home/node/.claude and /app/src, mounts the credentials file. |
| data/sessions/{channel}/agent-runner-src/memory-stubs.ts | Per-session runtime hook. Cosine similarity over stubs.db, returns matched archive entries for prompt injection. |
| container/agent-runner/src/providers/factory.ts | Picks the provider based on USE_SUBPROCESS / USE_PROVIDERS. Read this if a flag does not seem to take effect. |