name: gbrain-mcp-recovery
description: Diagnose and fix gbrain MCP startup failures (PGLite lock contention, WASM init failures, and engine migration).

gbrain MCP Recovery

gbrain MCP uses PGLite, an embedded single-writer database. If a previous bun run src/cli.ts serve process is still alive, it holds the PGLite write lock. Any new gbrain process then fails during connectEngine() with "Timed out waiting for PGLite lock", which makes Hermes hang for up to 120 seconds on discover_mcp_tools.

Diagnosis Tools (before killing)

# Check if mcp Python SDK is installed in Hermes venv
/home/ubuntu/.hermes/hermes-agent/venv/bin/python -c "from mcp import StdioServerParameters; print('OK')"
# If this fails: pip install mcp into Hermes venv
/home/ubuntu/.hermes/hermes-agent/venv/bin/pip install mcp

Symptoms

hermes startup hangs at discover_mcp_tools
~/.hermes/logs/mcp-stderr.log contains: GBrain: Timed out waiting for PGLite lock.
hermes mcp list shows gbrain as enabled but startup is slow or hangs
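
Lock contention and a WASM init failure leave different signatures in the stderr log, and the fixes differ, so it can help to tell them apart before acting. A minimal sketch, assuming the log strings quoted in this document appear verbatim; `classify_mcp_stderr` and its return labels are hypothetical names, not part of gbrain or Hermes:

```python
def classify_mcp_stderr(log_text: str) -> str:
    """Classify a gbrain MCP stderr log by the signatures quoted in this document.

    Hypothetical helper: the labels returned here are illustrative only.
    """
    # WASM init failure: killing the old server will not help (see Pitfalls).
    if "PGLite failed" in log_text or "Aborted()" in log_text:
        return "wasm-failure"
    # Lock contention: a stale serve process still holds the write lock.
    if "Timed out waiting for PGLite lock" in log_text:
        return "lock-contention"
    return "unknown"
```

The WASM check runs first on purpose: if both signatures are present, killing the existing process is the riskier action, so the more cautious label wins.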

Fix (automatic)

Run the recovery script:

~/.hermes/skills/infrastructure/gbrain-mcp-recovery/scripts/recover.sh

This script:
1. Finds any stale bun run src/cli.ts serve processes
2. Kills them (SIGTERM, then SIGKILL if needed)
3. Verifies no processes remain
4. Restarts the current Hermes session's MCP servers
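
Steps 1 and 3 amount to filtering the process list for the serve command line. The sketch below is not the contents of recover.sh (which this document does not show); it is an illustration that assumes input in the shape of `ps -eo pid,args` and reuses the pattern from the manual fix. `find_stale_gbrain_pids` is a hypothetical name:

```python
import re

# Mirrors the pattern the manual fix passes to pkill -f.
STALE_PATTERN = re.compile(r"bun.*src/cli\.ts.*serve")

def find_stale_gbrain_pids(ps_output: str) -> list[int]:
    """Return PIDs whose command line matches the gbrain serve pattern.

    Expects text shaped like `ps -eo pid,args` output: a header line,
    then one "PID COMMAND..." line per process.
    """
    pids = []
    for line in ps_output.splitlines()[1:]:  # skip the header line
        parts = line.strip().split(None, 1)
        if len(parts) == 2 and parts[0].isdigit() and STALE_PATTERN.search(parts[1]):
            pids.append(int(parts[0]))
    return pids
```

An empty list after the kill step corresponds to the "verifies no processes remain" check.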

Manual Fix

If the script isn't available:

# Find and kill stale gbrain processes
pkill -f "bun.*src/cli.ts.*serve" 2>/dev/null
sleep 1
# Verify clean
ps aux | grep "bun.*cli.ts" | grep -v grep
# Should be empty

Switching to PostgreSQL Engine (when PGLite WASM is unfixable)

When Bun's WASM runtime is fundamentally broken on the host kernel (not just lock contention), switch gbrain to a real PostgreSQL:

# 1. Install PG15+ (gbrain 0.24.0 requires PG15 for NULLS NOT DISTINCT)
sudo apt install -y postgresql-16 postgresql-16-pgvector

# 2. Ensure PG16 is running on port 5432 (stop PG14 if it conflicts)
sudo systemctl stop postgresql@14-main
sudo systemctl start postgresql@16-main

# 3. Create database and superuser
sudo -u postgres psql -c "CREATE USER gbrain WITH SUPERUSER PASSWORD 'gbrain2026';"
sudo -u postgres psql -c "CREATE DATABASE gbrain OWNER gbrain;"
sudo -u postgres psql -c "ALTER USER gbrain BYPASSRLS;"

# 4. Update gbrain config
cat > ~/.gbrain/config.json << 'EOF'
{"engine": "postgres", "database_url": "postgresql://gbrain:gbrain2026@localhost:5432/gbrain"}
EOF

# 5. Init
gbrain init --url "postgresql://gbrain:gbrain2026@localhost:5432/gbrain"

# 6. Copy config to profile-specific HOME dirs used by sync scripts
for prof in chief core forge; do
  mkdir -p ~/.hermes/profiles/$prof/home/.gbrain
  cp ~/.gbrain/config.json ~/.hermes/profiles/$prof/home/.gbrain/config.json
done
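
Step 6 can also be done from Python, which may be convenient when the sync scripts themselves are Python. A minimal sketch under the directory layout shown above; `copy_config_to_profiles` is a hypothetical helper, and the restrictive file mode is an added precaution rather than something the original steps do:

```python
import json
from pathlib import Path

def copy_config_to_profiles(hermes_home: Path, config: dict,
                            profiles=("chief", "core", "forge")) -> list[Path]:
    """Write config.json into each profile's home/.gbrain directory."""
    written = []
    for prof in profiles:
        target_dir = hermes_home / "profiles" / prof / "home" / ".gbrain"
        target_dir.mkdir(parents=True, exist_ok=True)  # same effect as mkdir -p
        target = target_dir / "config.json"
        target.write_text(json.dumps(config) + "\n")
        target.chmod(0o600)  # assumption: the file embeds the database password
        written.append(target)
    return written
```

Called as `copy_config_to_profiles(Path.home() / ".hermes", config)`, it writes the same three files as the shell loop.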

gbrain put file bug (PostgreSQL engine)

gbrain put <slug> <file> stores the page but does NOT chunk the content (chunks: 0, compiled_truth empty). Use --content or --stdin instead:

# BROKEN: chunks=0, content lost
gbrain put my-page /tmp/page.md

# WORKS: chunks created, content stored
gbrain put my-page --stdin < /tmp/page.md

# ALSO WORKS: --content parameter (most reliable from scripts)
gbrain put my-page --content "$(cat /tmp/page.md)"

Sync scripts that call gbrain put <slug> <file> must be patched to use --content or --stdin piping.

gbrain put --stdin via Python subprocess (broken)

Calling gbrain put <slug> --stdin from Python via subprocess.run(stdin=f) fails. gbrain is a bash wrapper (exec bun run src/cli.ts "$@"), and stdin does not propagate correctly through the exec chain in all subprocess configurations. The command exits with code 1 and "Page not found: <slug>", even though put is expected to create a page that does not exist yet.

Fix: Always use --content <markdown_string> instead of --stdin when calling gbrain from Python:

# BROKEN: returns "Page not found" even for new pages
r = subprocess.run([gbrain, "put", slug, "--stdin"], stdin=f)

# WORKS: creates/updates page correctly
r = subprocess.run([gbrain, "put", slug, "--content", page_content])
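
Because the failure mode above is an exit code plus a misleading message, a wrapper that checks the return code keeps sync scripts from losing pages silently. A sketch, assuming only the `put <slug> --content` form shown above; `put_page` and its injectable `run` parameter are hypothetical:

```python
import subprocess

def put_page(slug: str, content: str, gbrain: str = "gbrain",
             run=subprocess.run) -> None:
    """Create or update a page by passing the markdown inline via --content."""
    result = run([gbrain, "put", slug, "--content", content],
                 capture_output=True, text=True)
    if result.returncode != 0:
        # Surface stderr so a "Page not found" failure is not silently ignored.
        raise RuntimeError(f"gbrain put {slug} failed: {result.stderr.strip()}")
```

Very large pages passed as a single argument may hit the operating system's argument-length limit, so this form is best suited to ordinary wiki-sized content.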

gbrain put with Chinese characters in slug (broken)

gbrain put fails with "Page not found: wiki/中文-slug" when the slug mixes certain Chinese characters with ASCII, even though put should create the missing page and other slugs work fine. The failure is inconsistent: some Chinese characters work on their own, while specific mixed patterns fail.

Fix: Use pure ASCII slugs only. Strip Chinese characters from the slug (they're stored in the frontmatter title field regardless):

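import hashlib
import re

def slugify(title, obj_token):
    # Replace every character that is not an ASCII letter or digit with a dash
    safe = re.sub(r'[^a-zA-Z0-9]', '-', title)
    # Collapse dash runs, trim edge dashes, lowercase, cap at 40 characters
    safe = re.sub(r'-+', '-', safe).strip('-').lower()[:40]
    if not safe:
        safe = "page"  # title had no ASCII letters or digits
    # Short hash of the object token distinguishes pages with similar titles
    h = hashlib.md5(obj_token.encode()).hexdigest()[:6]
    return f"wiki/{safe}-{h}"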
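
A short usage example may make the behaviour concrete. The function is restated so the block runs on its own; the titles and tokens are made up for illustration:

```python
import hashlib
import re

def slugify(title, obj_token):
    safe = re.sub(r'[^a-zA-Z0-9]', '-', title)
    safe = re.sub(r'-+', '-', safe).strip('-').lower()[:40]
    if not safe:
        safe = "page"
    h = hashlib.md5(obj_token.encode()).hexdigest()[:6]
    return f"wiki/{safe}-{h}"

# A purely Chinese title has no ASCII characters left, so it falls back to "page".
chinese_only = slugify("中文", "example-token-a")
# A mixed title keeps only its ASCII part.
mixed = slugify("中文 Design Doc", "example-token-b")

print(chinese_only)  # wiki/page-<6 hex digits>
print(mixed)         # wiki/design-doc-<6 hex digits>
```

Since every purely Chinese title collapses to the same "page" prefix, the hash suffix derived from the object token is what keeps such slugs apart.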

KillMode hardcoded in Hermes source

Hermes hardcodes KillMode=mixed in hermes_cli/gateway.py at lines 1768 and 1803. Patching only the generated service file does not last: gateway restarts or updates regenerate it from this source and revert the change. Patch the source as well:

sed -i 's/KillMode=mixed/KillMode=control-group/g' \
  ~/.hermes/hermes-agent/hermes_cli/gateway.py
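
Since an update can bring the hardcoded value back, it may be worth checking periodically whether the patch is still in place. A sketch of the same substitution as a pure function that also reports how many occurrences it changed; `patch_killmode` is a hypothetical helper:

```python
import re

def patch_killmode(text: str) -> tuple[str, int]:
    """Replace KillMode=mixed with KillMode=control-group.

    Returns (patched_text, replacements). A count above zero means the
    input still contained the unpatched setting.
    """
    return re.subn(r"KillMode=mixed\b", "KillMode=control-group", text)
```

Running it over the contents of gateway.py and getting a count of 0 indicates the source is already patched.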

Permanent Fix: Switch to PostgreSQL Engine

When PGLite WASM is fundamentally broken (Bun version + kernel incompatibility), the permanent fix is switching gbrain to use a real PostgreSQL instance:

# 1. Install PostgreSQL 16+pgvector (gbrain 0.24.0 needs PG15+ for NULLS NOT DISTINCT)
sudo apt install -y postgresql-16 postgresql-16-pgvector

# 2. Create database
sudo -u postgres psql -c "CREATE USER gbrain WITH SUPERUSER PASSWORD 'gbrain2026';"
sudo -u postgres psql -c "CREATE DATABASE gbrain OWNER gbrain;"
sudo -u postgres psql -c "ALTER USER gbrain BYPASSRLS;"

# 3. Update config
cat > ~/.gbrain/config.json << 'EOF'
{"engine": "postgres", "database_url": "postgresql://gbrain:gbrain2026@localhost:5432/gbrain"}
EOF

# 4. Initialize
gbrain init --url "postgresql://gbrain:gbrain2026@localhost:5432/gbrain"

# 5. Update gbrain CLI wrapper to use correct bun binary if needed
# /usr/local/bin/gbrain — replace baseline binary path with AVX2 binary
sudo sed -i 's|bun-linux-x64-baseline|bun-linux-x64|g' /usr/local/bin/gbrain
sudo sed -i 's|bun-linux-x64-baseline|bun-linux-x64|g' /usr/local/bin/gbrain-mcp

Pitfalls:
- Sync scripts (feishu_wiki_sync.py, feishu_calendar_sync.py) hardcode HOME to profile-specific paths. Copy config.json to each profile's .gbrain/ directory.
- gbrain put <file> has a bug in the PG engine; use gbrain put <slug> --stdin < file instead.
- PostgreSQL 14 is insufficient: gbrain 0.24.0 uses NULLS NOT DISTINCT (PG15+).

Verification

gbrain search "test"   # Should return results or "No results" — NOT WASM error
gbrain stats           # Should show page counts
hermes mcp list        # gbrain should show ✓ enabled
tail ~/.hermes/logs/mcp-stderr.log  # No "Aborted()" or "PGLite failed"

Pitfalls

mcp Python package missing from Hermes venv: StdioServerParameters is not defined or NameError: name 'StdioServerParameters' is not defined in tools/mcp_tool.py means the mcp SDK package is installed in system Python but NOT in Hermes's venv. Always check:

/home/ubuntu/.hermes/hermes-agent/venv/bin/python3 -c "from mcp import StdioServerParameters; print('OK')"
# If this raises ModuleNotFoundError, fix with:
/home/ubuntu/.hermes/hermes-agent/venv/bin/pip install mcp

This is the most common cause of "gbrain (stdio) — failed" with no PGLite errors in stderr logs. The Hermes CLI runs from its own venv, not the system Python. The try/except around the import swallows the error silently — _MCP_AVAILABLE becomes False and MCP test/connect functions crash with NameError at runtime.
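
One way to avoid the silent failure is to test for the package explicitly and name the interpreter in the error. This is a sketch of that idea, not the code in tools/mcp_tool.py; `module_available` and `require_module` are hypothetical helpers:

```python
import importlib.util
import sys

def module_available(name: str) -> bool:
    """Report whether a top-level module is importable by this interpreter."""
    return importlib.util.find_spec(name) is not None

def require_module(name: str) -> None:
    """Fail loudly, naming the interpreter, instead of deferring to a NameError."""
    if not module_available(name):
        raise RuntimeError(
            f"'{name}' is not installed for {sys.executable}; "
            f"install it with: {sys.executable} -m pip install {name}"
        )
```

Calling `require_module("mcp")` under the Hermes venv interpreter would report the venv path directly, which makes the system-Python versus venv mix-up visible.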

PGLite does NOT support concurrent writers. Only one gbrain MCP process can run at a time.
If you need to run multiple Hermes profiles with gbrain MCP, run them sequentially, not in parallel.
If you switch profiles, the previous profile's MCP process may not auto-terminate. Kill it manually before starting another.
Recovery can permanently kill the server: If PGLite WASM fails to initialize on new processes (e.g., Bun version incompatibility with host kernel), the recovery script will kill the old working server but a new one can't restart. The gbrain serve command exits immediately with PGLite failed to initialize its WASM runtime. Original error: Aborted(). The Hermes gateway does NOT auto-restart in this case. Before running recovery, verify a fresh gbrain serve works. If it doesn't, do not kill the existing process — fix the WASM issue first. See references/pglite-wasm-startup-failure.md.
Hardcoded Bun paths: /usr/local/bin/gbrain and /usr/local/bin/gbrain-mcp have absolute paths to the Bun binary. When switching Bun installations (e.g. baseline ↔ non-baseline), update both scripts with sudo sed -i or they'll keep calling the old binary.
PostgreSQL 15+ required: gbrain 0.24.0 uses NULLS NOT DISTINCT syntax. PG14 will fail with "syntax error at or near NULLS". Install PG16 from PGDG repo. See references/postgres-engine-migration.md for the full recipe.
cron tick false positive: cron_monitor.sh uses a 10s timeout for hermes cron tick. Cold-start CLI init can exceed 10s in low-traffic periods, causing spurious gateway restarts even when the ticker is healthy. See references/cron-tick-false-positive.md.
gbrain init forces PGLite when run bare: Running gbrain init without --url always goes to initPGLite(), ignoring any config.json that says "engine": "postgres". Use gbrain init --url "postgresql://..." to force the PostgreSQL engine, or run gbrain init --non-interactive with the DATABASE_URL environment variable set.
gbrain put <file> drops content in PostgreSQL engine: gbrain put <slug> <file> creates the page but stores 0 chunks — content is lost. Always use gbrain put <slug> --stdin < <file> as workaround. See references/postgres-engine-gotchas.md for full details including sync-script fixes.
Sync scripts need per-profile gbrain configs: feishu_wiki_sync.py and feishu_calendar_sync.py run gbrain under profile-specific $HOME. Each profile home needs its own .gbrain/config.json. See references/postgres-engine-gotchas.md.
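
To guard against the bare gbrain init pitfall in scripts, the init command can be derived from config.json rather than typed by hand. A sketch that assumes the two config keys shown earlier in this document (`engine`, `database_url`); `init_command` is a hypothetical helper:

```python
import json

def init_command(config_text: str) -> list[str]:
    """Build a gbrain init command that honors the engine named in config.json."""
    config = json.loads(config_text)
    if config.get("engine") == "postgres":
        # Bare `gbrain init` would go to PGLite, so pass the URL explicitly.
        return ["gbrain", "init", "--url", config["database_url"]]
    return ["gbrain", "init"]
```

The returned list can be handed to `subprocess.run` unchanged.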

Production Leak: Gateway Restart → Subprocess Accumulation → OOM

When the gateway is restarted (e.g. by cron_monitor.sh after a cron tick timeout), the old gateway's gbrain subprocess (bun run src/cli.ts serve) is NOT killed if systemd uses KillMode=mixed (SIGTERM goes only to the main PID, not to the other processes in the cgroup). Each leaked process costs ~270MB RSS. After 10+ cycles (~5 hours), total memory use exceeds 3.6GB on small VMs with no swap, causing OOM and a complete service outage.

Deeper root: MCP keepalive failure → reconnect → orphan accumulation — see references/keepalive-reconnect-orphan-chain.md for the full code-level trace. The short version: _wait_for_lifecycle_event() sends list_tools() every 180s; when it fails, _run_stdio() exits and its finally block (L1243-1259) adds the old bun PID to _orphan_stdio_pids but does NOT kill it; the reconnect loop spawns a new gbrain; the old one accumulates.

Symptoms:
- Server becomes sluggish after 1-2 hours of user inactivity
- cron_monitor.log shows "tick timed out" every 30 minutes
- gateway.log shutdown diagnostics show a growing list of orphaned bun run ... cli.ts serve processes
- ps aux | grep "bun.*cli.ts.*serve" | wc -l reports more than 2 (should be ≤2)
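
The "10+ cycles, about 5 hours" figure follows from simple arithmetic on the numbers above. A worked sketch, where the 270MB leak and the 30-minute restart interval come from this document, and the 900MB baseline for everything else on the VM is an illustrative assumption, not a measured value:

```python
def hours_until_exhaustion(ram_mb: int, baseline_mb: int,
                           leak_mb: int = 270, interval_min: int = 30) -> float:
    """Hours of restart cycles before leaked processes fill the remaining RAM."""
    headroom = ram_mb - baseline_mb
    cycles = -(-headroom // leak_mb)  # ceiling division
    return cycles * interval_min / 60

# Assumed: ~3600MB usable RAM, 900MB baseline (hypothetical figure).
print(hours_until_exhaustion(3600, 900))  # 5.0
```

With 2700MB of headroom, ten leaked processes at 270MB each use it up, and ten 30-minute cycles take five hours.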

Immediate mitigation (stop the bleeding):

# Disable the restart loop
crontab -l | sed 's|^\(\*/30.*cron_monitor.sh\)|# DISABLED: \1|' | crontab -
# Clean up orphans
pkill -f "bun.*cli.ts.*serve"

2026-05-14: Root code fix applied to hermes-agent/tools/mcp_tool.py. The orphan-not-killed bug in _run_stdio() finally block (L1243-1259) is now fixed: instead of just adding surviving PIDs to _orphan_stdio_pids, the finally block actively sends SIGTERM → sleep 0.5s → SIGKILL to any child processes that didn't exit with the SDK teardown. Additionally, the reconnect loop in run() method now calls _kill_orphaned_mcp_children() before spawning a new subprocess, and _KEEPALIVE_INTERVAL was reduced from 180s to 60s for faster dead-connection detection. See references/keepalive-reconnect-orphan-chain.md for the updated code trace.
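
The SIGTERM, short wait, SIGKILL sequence described above can be sketched with the standard library. This is an illustration of the pattern only, not the actual code in mcp_tool.py; `stop_child` is a hypothetical name, and the 0.5s default mirrors the sleep mentioned in the fix:

```python
import subprocess

def stop_child(proc: subprocess.Popen, grace_seconds: float = 0.5) -> int:
    """Send SIGTERM, wait briefly, then SIGKILL if the child is still alive."""
    if proc.poll() is not None:   # already exited, nothing to do
        return proc.returncode
    proc.terminate()              # SIGTERM on POSIX
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()               # SIGKILL on POSIX
        return proc.wait()        # reap so no zombie entry is left behind
```

Waiting on the child after the kill matters as much as the signal itself: an unreaped child lingers as a zombie entry even though it no longer holds memory.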

Ongoing monitoring: Even with the root fix, the defense-in-depth layers below are worth keeping as a safety net against edge cases (e.g., bun process trees that escape a single SIGKILL, or kernel-level process state wedges).

Defense in depth:

Layer	Fix	Effect
1	cron_monitor restart limit (max 3/30min)	Breaks the restart loop
2	cron_monitor zombie detection (kill >3 gbrain procs)	Cleans before restart
3	systemd `KillMode=control-group`	All cgroup children die with gateway
4	swap (4GB)	OOM buffer, prevents instant collapse

# 1. Fix KillMode in BOTH places (source code + generated service file)
#    Service file alone is NOT enough — it gets regenerated from gateway.py template
#    on every gateway restart/update. Must patch the source:
sed -i 's/KillMode=mixed/KillMode=control-group/g' ~/.hermes/hermes-agent/hermes_cli/gateway.py
#    Then edit the generated service file so it takes effect immediately:
sed -i 's/KillMode=mixed/KillMode=control-group/g' ~/.config/systemd/user/hermes-gateway.service
systemctl --user daemon-reload

# 2. Add swap
sudo fallocate -l 4G /swapfile && sudo chmod 600 /swapfile
sudo mkswap /swapfile && sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab

# 3. Enhanced cron_monitor (restart limits + zombie detection)
# See references/cron-monitor-hardening.md for the full script

# 4. Re-enable cron_monitor if disabled
crontab -l | sed 's|^# DISABLED.*\*/30|*/30|' | crontab -
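
Layer 1 of the table (at most 3 restarts per 30 minutes) is a sliding-window limit. The hardened script itself lives in references/cron-monitor-hardening.md and is not reproduced here; the following is only a sketch of the limiting logic, with `allow_restart` as a hypothetical name:

```python
def allow_restart(history: list[float], now: float,
                  max_restarts: int = 3, window_seconds: float = 1800.0) -> bool:
    """Permit a restart only if fewer than max_restarts happened in the window.

    `history` holds timestamps of earlier restarts and is updated in place.
    """
    # Drop restarts that have aged out of the window.
    history[:] = [t for t in history if now - t < window_seconds]
    if len(history) >= max_restarts:
        return False
    history.append(now)
    return True
```

A refused restart is not recorded, so the limiter reopens as soon as the oldest accepted restart leaves the window.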

Knowledge Graph Maintenance (PostgreSQL engine)

Once PostgreSQL is running, gbrain needs maintenance jobs to keep the knowledge graph healthy.

Initial Setup

# 1. Check current health
gbrain doctor

# 2. If Minions is half-migrated (schema behind, no preferences.json):
gbrain apply-migrations --yes

# 3. Install autopilot/worker (select linux-systemd when prompted)
#    Or set up Minions worker as systemd user service:
cat > ~/.config/systemd/user/gbrain-minions.service << 'SERVICEEOF'
[Unit]
Description=GBrain Minions Worker — process knowledge graph jobs
After=hermes-memory-gateway.service

[Service]
Type=simple
ExecStart=/usr/local/bin/gbrain jobs work --queue default
Restart=on-failure
RestartSec=30

[Install]
WantedBy=default.target
SERVICEEOF
systemctl --user daemon-reload
systemctl --user enable --now gbrain-minions.service

The systemd worker auto-restarts on failure and survives gateway restarts.

Maintenance Jobs

Submit via MCP or CLI. The Minions worker processes them asynchronously:

Job	Frequency	Effect
`embed`	Once, then weekly	Generate vector embeddings for all chunks
`backlinks`	Once, then weekly	Compute entity link relationships between pages
`extract`	Once	Extract timeline references from filesystem content
`autopilot-cycle`	Weekly	Self-maintaining brain daemon cycle

CLI commands for ad-hoc runs:

# Generate embeddings for stale pages
gbrain embed --stale

# Or submit to Minions queue via CLI
gbrain jobs submit embed
gbrain jobs submit backlinks

# Check queue
gbrain jobs list
gbrain jobs stats
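
The frequencies in the job table can be turned into a small "what is due" check. A sketch under the assumption that last-run timestamps are tracked somewhere by the caller; `JOB_INTERVALS` and `due_jobs` are hypothetical names, and whether every job name can be passed to gbrain jobs submit is not confirmed by this document (only embed and backlinks are shown):

```python
from datetime import datetime, timedelta

# Repeat interval per job, taken from the job table; None means run once only.
JOB_INTERVALS = {
    "embed": timedelta(weeks=1),
    "backlinks": timedelta(weeks=1),
    "extract": None,
    "autopilot-cycle": timedelta(weeks=1),
}

def due_jobs(last_run: dict, now: datetime) -> list[str]:
    """Return job names that have never run or whose interval has elapsed."""
    due = []
    for name, interval in JOB_INTERVALS.items():
        last = last_run.get(name)
        if last is None or (interval is not None and now - last >= interval):
            due.append(name)
    return due
```

On a fresh install with no recorded runs, all four jobs are reported as due, which matches the "once, then weekly" wording.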

Pitfall: zombie_cleanup.sh keeps oldest PID

The existing zombie_cleanup.sh (every 2h cron) keeps the oldest (smallest PID) gbrain process and kills newer ones. After a fresh gbrain starts serving MCP traffic, the old process becomes the orphan — but the script keeps it and kills the healthy new one. This is wrong.

Fix after migration cleanup: If only 1-2 healthy processes remain, disable or patch the zombie cleanup script. With the May-14 code fix in mcp_tool.py, orphan accumulation should no longer occur. The systemd Minions worker does not depend on gbrain MCP processes.
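
If the script is patched rather than disabled, the selection rule needs to be inverted so the most recently started process survives. A sketch of that rule, assuming the newest process is the one serving MCP traffic as described above; `pids_to_kill` is a hypothetical helper and the start times would come from the process table:

```python
def pids_to_kill(procs: list[tuple[int, float]], keep: int = 1) -> list[int]:
    """Given (pid, start_time) pairs, return the PIDs to kill, sparing the newest.

    Sorting by start time rather than PID avoids misjudging age after
    PID wraparound.
    """
    newest_first = sorted(procs, key=lambda p: p[1], reverse=True)
    return [pid for pid, _ in newest_first[keep:]]
```

With a single process, or none, the function returns an empty list, so a healthy system is left untouched.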

Verification

gbrain doctor
# Expected: Health score 80+ (embed 35/35, dead-links 10/10)
# Links and timeline scores remain low until structured content is added manually

gbrain jobs stats    # Queue healthy
systemctl --user status gbrain-minions.service  # Active (running)

Related

PGLite → PostgreSQL Migration: references/postgresql-migration.md (complete guide when WASM is unfixable)
Gateway Restart Leak: references/gateway-restart-leak.md (diagnostic recipe for memory leaks)
PGLite WASM Failure: references/pglite-wasm-startup-failure.md (Bun 1.3.13 + kernel 5.15.0-176 bug)
Cron monitor hardening recipe: references/cron-monitor-hardening.md
MCP Keepalive → Reconnect → Orphan Chain: references/keepalive-reconnect-orphan-chain.md (code-level trace from keepalive failure through _run_stdio() orphan accumulation, the root mechanism behind zombie gbrain processes). Includes exact code locations (mcp_tool.py L1243-1259 for the orphan-not-killed bug), constant values (_KEEPALIVE_INTERVAL=180s before the 2026-05-14 fix reduced it to 60s, _MAX_RECONNECT_RETRIES=5), and the 6-minute silent-window phenomenon during reconnect.
