name: gbrain-mcp-recovery
description: Diagnose and fix gbrain MCP startup failures (PGLite lock contention, WASM init failures, and engine migration).


gbrain MCP Recovery

gbrain MCP uses PGLite, an embedded single-writer database. If a previous bun run src/cli.ts serve process is still alive, it holds the PGLite write lock. Any new gbrain process then fails during connectEngine() with "Timed out waiting for PGLite lock", which causes Hermes to hang for up to 120 seconds on discover_mcp_tools.

Diagnosis Tools (before killing)

# Check if mcp Python SDK is installed in Hermes venv
/home/ubuntu/.hermes/hermes-agent/venv/bin/python -c "from mcp import StdioServerParameters; print('OK')"
# If this fails: pip install mcp into Hermes venv
/home/ubuntu/.hermes/hermes-agent/venv/bin/pip install mcp

Symptoms

Fix (automatic)

Run the recovery script:

~/.hermes/skills/infrastructure/gbrain-mcp-recovery/scripts/recover.sh

This script:

1. Finds any stale bun run src/cli.ts serve processes
2. Kills them (SIGTERM, then SIGKILL if needed)
3. Verifies no processes remain
4. Restarts the current Hermes session's MCP servers
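The first three steps can be sketched in Python for hosts where the script is missing. This is a minimal sketch, not the contents of recover.sh: the helper names find_stale_pids and kill_stale, the 2-second grace period, and the ps -eo pid,args input format are assumptions made for illustration. Only the process pattern comes from this document. Step 4 is omitted because the restart command is not given here.

```python
import os
import re
import signal
import time

# Same pattern the manual fix passes to pkill -f
STALE_PATTERN = re.compile(r"bun.*src/cli\.ts.*serve")


def find_stale_pids(ps_output: str) -> list[int]:
    """Parse `ps -eo pid,args` output and return PIDs whose command matches."""
    pids = []
    for line in ps_output.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # header row or malformed line
        if STALE_PATTERN.search(parts[1]):
            pids.append(int(parts[0]))
    return pids


def pid_alive(pid: int) -> bool:
    """Signal 0 probes for existence without delivering a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but owned by another user
    return True


def kill_stale(pids: list[int], grace_seconds: float = 2.0) -> list[int]:
    """SIGTERM every PID, wait briefly, then SIGKILL the survivors.

    Returns the PIDs that needed SIGKILL.
    """
    for pid in pids:
        try:
            os.kill(pid, signal.SIGTERM)
        except ProcessLookupError:
            pass  # already gone
    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline and any(pid_alive(p) for p in pids):
        time.sleep(0.1)
    survivors = [p for p in pids if pid_alive(p)]
    for pid in survivors:
        try:
            os.kill(pid, signal.SIGKILL)
        except ProcessLookupError:
            pass
    return survivors

# Typical use (not executed here):
#   import subprocess
#   out = subprocess.run(["ps", "-eo", "pid,args"],
#                        capture_output=True, text=True).stdout
#   kill_stale(find_stale_pids(out))
```

Keeping the parsing separate from the signalling makes it possible to inspect the PID list before anything is killed.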

Manual Fix

If the script isn't available:

# Find and kill stale gbrain processes
pkill -f "bun.*src/cli.ts.*serve" 2>/dev/null
sleep 1
# Verify clean
ps aux | grep "bun.*cli.ts" | grep -v grep
# Should be empty

Switching to PostgreSQL Engine (when PGLite WASM is unfixable)

When Bun's WASM runtime is fundamentally broken on the host kernel (not just lock contention), switch gbrain to a real PostgreSQL:

# 1. Install PG15+ (gbrain 0.24.0 requires PG15 for NULLS NOT DISTINCT)
sudo apt install -y postgresql-16 postgresql-16-pgvector

# 2. Ensure PG16 is running on port 5432 (stop PG14 if it conflicts)
sudo systemctl stop postgresql@14-main
sudo systemctl start postgresql@16-main

# 3. Create database and superuser
sudo -u postgres psql -c "CREATE USER gbrain WITH SUPERUSER PASSWORD 'gbrain2026';"
sudo -u postgres psql -c "CREATE DATABASE gbrain OWNER gbrain;"
sudo -u postgres psql -c "ALTER USER gbrain BYPASSRLS;"

# 4. Update gbrain config
cat > ~/.gbrain/config.json << 'EOF'
{"engine": "postgres", "database_url": "postgresql://gbrain:gbrain2026@localhost:5432/gbrain"}
EOF

# 5. Init
gbrain init --url "postgresql://gbrain:gbrain2026@localhost:5432/gbrain"

# 6. Copy config to profile-specific HOME dirs used by sync scripts
for prof in chief core forge; do
  mkdir -p ~/.hermes/profiles/$prof/home/.gbrain
  cp ~/.gbrain/config.json ~/.hermes/profiles/$prof/home/.gbrain/config.json
done
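After step 6 it can be useful to confirm that every profile copy still matches the main config, since a later edit to ~/.gbrain/config.json does not propagate automatically. The following is a minimal sketch of such a check. The function name stale_profile_configs is hypothetical; the profile names and directory layout are taken from the loop above.

```python
from pathlib import Path

# Profile names used by the copy loop in step 6
PROFILES = ("chief", "core", "forge")


def stale_profile_configs(home: Path, profiles=PROFILES) -> list[str]:
    """Return profiles whose .gbrain/config.json is missing or differs
    byte-for-byte from the main ~/.gbrain/config.json."""
    main = (home / ".gbrain" / "config.json").read_bytes()
    stale = []
    for prof in profiles:
        cfg = (home / ".hermes" / "profiles" / prof
               / "home" / ".gbrain" / "config.json")
        if not cfg.is_file() or cfg.read_bytes() != main:
            stale.append(prof)
    return stale

# Typical use (not executed here):
#   print(stale_profile_configs(Path.home()))
```

An empty list means all profile copies are in sync; any name returned needs the cp from step 6 repeated.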

gbrain put file bug (PostgreSQL engine)

gbrain put <slug> <file> stores the page but does NOT chunk the content (chunks: 0, compiled_truth empty). Use --content or --stdin instead:

# BROKEN: chunks=0, content lost
gbrain put my-page /tmp/page.md

# WORKS: chunks created, content stored
gbrain put my-page --stdin < /tmp/page.md

# ALSO WORKS: --content parameter (most reliable from scripts)
gbrain put my-page --content "$(cat /tmp/page.md)"

Sync scripts that call gbrain put <slug> <file> must be patched to use --content or --stdin piping.

gbrain put --stdin via Python subprocess (broken)

gbrain put <slug> --stdin fails when called from Python via subprocess.run(stdin=f). gbrain is a bash wrapper (exec bun run src/cli.ts "$@"), and stdin doesn't propagate correctly through the exec chain in all subprocess configurations. The command exits with code 1 and "Page not found: <slug>", even though put is expected to create a page that doesn't exist yet.

Fix: Always use --content <markdown_string> instead of --stdin when calling gbrain from Python:

# BROKEN: returns "Page not found" even for new pages
r = subprocess.run([gbrain, "put", slug, "--stdin"], stdin=f)

# WORKS: creates/updates page correctly
r = subprocess.run([gbrain, "put", slug, "--content", page_content])
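The working call can be wrapped in a small helper so that sync scripts share one code path. This is a sketch under stated assumptions: the helper names build_put_argv and put_page are hypothetical, and the size guard assumes Linux's usual 128 KiB limit on a single command-line argument, which matters because --content passes the whole page as one argv entry.

```python
import subprocess

# Assumption: Linux's usual per-argument limit (MAX_ARG_STRLEN), in bytes
MAX_ARG_BYTES = 128 * 1024


def build_put_argv(gbrain: str, slug: str, content: str) -> list[str]:
    """Build the argv for `gbrain put <slug> --content <markdown>`."""
    if len(content.encode("utf-8")) >= MAX_ARG_BYTES:
        raise ValueError("page too large to pass as a single argument")
    return [gbrain, "put", slug, "--content", content]


def put_page(gbrain: str, slug: str, content: str) -> subprocess.CompletedProcess:
    """Run gbrain put with --content; the caller inspects returncode/stderr."""
    return subprocess.run(
        build_put_argv(gbrain, slug, content),
        capture_output=True,
        text=True,
        check=False,
    )

# Typical use (not executed here):
#   r = put_page("/usr/local/bin/gbrain", "wiki/my-page", open("/tmp/page.md").read())
#   if r.returncode != 0:
#       print(r.stderr)
```

Passing the argv as a list avoids shell quoting problems with markdown that contains quotes or dollar signs.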

gbrain put with Chinese characters in slug (broken)

gbrain put fails with "Page not found: wiki/中文-slug" when the slug contains certain Chinese character + ASCII combinations, even though the page doesn't exist and other slugs work fine. The error is inconsistent — some Chinese characters work alone, others with specific mixed patterns fail.

Fix: Use pure ASCII slugs only. Strip Chinese characters from the slug (they're stored in the frontmatter title field regardless):

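import hashlib
import re

def slugify(title, obj_token):
    # Replace every character outside ASCII letters/digits (including Chinese) with '-'
    safe = re.sub(r'[^a-zA-Z0-9]', '-', title)
    # Collapse hyphen runs, trim, lowercase, cap at 40 chars, then re-trim after the cut
    safe = re.sub(r'-+', '-', safe).strip('-').lower()[:40].rstrip('-')
    if not safe:
        safe = "page"
    # A short hash of the object token keeps slugs unique when titles collide
    h = hashlib.md5(obj_token.encode()).hexdigest()[:6]
    return f"wiki/{safe}-{h}"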
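Before patching a sync script, it may help to audit which existing slugs would trigger this bug. The sketch below flags any slug that contains non-ASCII characters. The function name non_ascii_slugs is hypothetical, and how the list of slugs is obtained is left to the caller.

```python
def non_ascii_slugs(slugs: list[str]) -> dict[str, str]:
    """Map each slug that contains non-ASCII characters to those characters."""
    report = {}
    for slug in slugs:
        bad = "".join(ch for ch in slug if not ch.isascii())
        if bad:
            report[slug] = bad
    return report
```

For example, a list holding wiki/design-doc-a1b2c3 and wiki/中文-slug reports only the second entry, with 中文 as the offending characters.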

KillMode hardcoded in Hermes source

Hermes hermes_cli/gateway.py hardcodes KillMode=mixed at lines 1768 and 1803. A patch to the service file alone is reverted by the next gateway restart or update, so the source must be patched as well:

sed -i 's/KillMode=mixed/KillMode=control-group/g' \
  ~/.hermes/hermes-agent/hermes_cli/gateway.py

Permanent Fix: Switch to PostgreSQL Engine

When PGLite WASM is fundamentally broken (Bun version + kernel incompatibility), the permanent fix is switching gbrain to use a real PostgreSQL instance:

# 1. Install PostgreSQL 16+pgvector (gbrain 0.24.0 needs PG15+ for NULLS NOT DISTINCT)
sudo apt install -y postgresql-16 postgresql-16-pgvector

# 2. Create database
sudo -u postgres psql -c "CREATE USER gbrain WITH SUPERUSER PASSWORD 'gbrain2026';"
sudo -u postgres psql -c "CREATE DATABASE gbrain OWNER gbrain;"
sudo -u postgres psql -c "ALTER USER gbrain BYPASSRLS;"

# 3. Update config
cat > ~/.gbrain/config.json << 'EOF'
{"engine": "postgres", "database_url": "postgresql://gbrain:gbrain2026@localhost:5432/gbrain"}
EOF

# 4. Initialize
gbrain init --url "postgresql://gbrain:gbrain2026@localhost:5432/gbrain"

# 5. Update gbrain CLI wrapper to use correct bun binary if needed
# /usr/local/bin/gbrain — replace baseline binary path with AVX2 binary
sudo sed -i 's|bun-linux-x64-baseline|bun-linux-x64|g' /usr/local/bin/gbrain
sudo sed -i 's|bun-linux-x64-baseline|bun-linux-x64|g' /usr/local/bin/gbrain-mcp

Pitfalls:

- Sync scripts (feishu_wiki_sync.py, feishu_calendar_sync.py) hardcode HOME to profile-specific paths. Copy config.json to each profile's .gbrain/ directory.
- gbrain put <file> has a bug in the PG engine. Use gbrain put <slug> --stdin < file instead.
- PostgreSQL 14 is insufficient: gbrain 0.24.0 uses NULLS NOT DISTINCT (PG15+).
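The PG15+ requirement can be checked before running gbrain init. A minimal sketch, assuming the server's numeric version is read with psql -At -c "SHOW server_version_num" (for PostgreSQL 10 and later this value is major × 10000 + minor). The function names are hypothetical.

```python
def pg_major(server_version_num: int) -> int:
    """Major version from server_version_num (valid for PostgreSQL 10+)."""
    return server_version_num // 10000


def supports_nulls_not_distinct(server_version_num: int) -> bool:
    """NULLS NOT DISTINCT was introduced in PostgreSQL 15."""
    return pg_major(server_version_num) >= 15

# Typical use (not executed here):
#   import subprocess
#   out = subprocess.run(
#       ["psql", "-At", "-c", "SHOW server_version_num"],
#       capture_output=True, text=True, check=True,
#   ).stdout
#   print(supports_nulls_not_distinct(int(out.strip())))
```

A PostgreSQL 14.11 server reports 140011 and fails the check; a 16.2 server reports 160002 and passes.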

Verification

gbrain search "test"   # Should return results or "No results" — NOT WASM error
gbrain stats           # Should show page counts
hermes mcp list        # gbrain should show ✓ enabled
tail ~/.hermes/logs/mcp-stderr.log  # No "Aborted()" or "PGLite failed"
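The log check in the last line can be automated. The sketch below scans log text for the two markers named above. The function name find_wasm_errors is hypothetical, and the markers are matched as plain substrings.

```python
# Error markers named in the verification step
WASM_ERROR_MARKERS = ("Aborted()", "PGLite failed")


def find_wasm_errors(log_text: str, markers=WASM_ERROR_MARKERS) -> list[str]:
    """Return the log lines that contain any of the error markers."""
    return [
        line
        for line in log_text.splitlines()
        if any(marker in line for marker in markers)
    ]

# Typical use (not executed here):
#   from pathlib import Path
#   log = Path("~/.hermes/logs/mcp-stderr.log").expanduser()
#   print(find_wasm_errors(log.read_text(errors="replace")))
```

An empty result after the engine switch suggests the WASM failure path is no longer being hit.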

Pitfalls

/home/ubuntu/.hermes/hermes-agent/venv/bin/python3 -c "from mcp import StdioServerParameters; print('OK')"
# ^ if ModuleNotFoundError → fix: /home/ubuntu/.hermes/hermes-agent/venv/bin/pip install mcp

This is the most common cause of "gbrain (stdio) — failed" with no PGLite errors in stderr logs. The Hermes CLI runs from its own venv, not the system Python. The try/except around the import swallows the error silently — _MCP_AVAILABLE becomes False and MCP test/connect functions crash with NameError at runtime.

Production Leak: Gateway Restart → Subprocess Accumulation → OOM

When the gateway is restarted (e.g. by cron_monitor.sh after a cron tick timeout), the old gateway's gbrain subprocess (bun run src/cli.ts serve) is NOT killed if systemd uses KillMode=mixed (SIGTERM is delivered only to the main PID, not to the other processes in the cgroup). Each leak costs ~270MB RSS. After 10+ cycles (~5 hours), total memory exceeds 3.6GB on small VMs with no swap, causing OOM and a complete service outage.
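The arithmetic behind these figures can be made explicit. In the sketch below, the 270MB per leak and the 30-minute restart interval come from this document; the 900MB baseline for everything else on the VM is an assumption chosen so that the totals line up with the 3.6GB figure.

```python
LEAK_MB = 270         # RSS held by each orphaned gbrain process (from the text)
RESTART_MINUTES = 30  # interval between restart cycles (from the text)


def leaked_mb(restarts: int) -> int:
    """Memory held by orphans after a number of restart cycles."""
    return restarts * LEAK_MB


def hours_elapsed(restarts: int) -> float:
    """Wall-clock time covered by that many restart cycles."""
    return restarts * RESTART_MINUTES / 60


def restarts_until_exceeded(budget_mb: int, baseline_mb: int) -> int:
    """Smallest cycle count at which baseline plus leaks exceeds the budget."""
    free = max(0, budget_mb - baseline_mb)
    return free // LEAK_MB + 1
```

Ten cycles leak 2700MB over 5 hours. With the assumed 900MB baseline, that brings usage to exactly 3600MB, and the eleventh cycle pushes it over.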

Deeper root: MCP keepalive failure → reconnect → orphan accumulation — see references/keepalive-reconnect-orphan-chain.md for the full code-level trace. The short version: _wait_for_lifecycle_event() sends list_tools() every 180s; when it fails, _run_stdio() exits and its finally block (L1243-1259) adds the old bun PID to _orphan_stdio_pids but does NOT kill it; the reconnect loop spawns a new gbrain; the old one accumulates.

Symptoms:

- Server becomes sluggish after 1-2 hours of user inactivity
- cron_monitor.log shows "tick timed out" every 30 minutes
- gateway.log shutdown diagnostics show a growing list of orphaned bun run ... cli.ts serve processes
- ps aux | grep "bun.*cli.ts.*serve" | wc -l reports more than 2 (should be ≤2)

Immediate mitigation (stop the bleeding):

# Disable the restart loop
crontab -l | sed 's|^\(\*/30.*cron_monitor.sh\)|# DISABLED: \1|' | crontab -
# Clean up orphans
pkill -f "bun.*cli.ts.*serve"

2026-05-14: Root code fix applied to hermes-agent/tools/mcp_tool.py. The orphan-not-killed bug in _run_stdio() finally block (L1243-1259) is now fixed: instead of just adding surviving PIDs to _orphan_stdio_pids, the finally block actively sends SIGTERM → sleep 0.5s → SIGKILL to any child processes that didn't exit with the SDK teardown. Additionally, the reconnect loop in run() method now calls _kill_orphaned_mcp_children() before spawning a new subprocess, and _KEEPALIVE_INTERVAL was reduced from 180s to 60s for faster dead-connection detection. See references/keepalive-reconnect-orphan-chain.md for the updated code trace.
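The SIGTERM, wait, SIGKILL sequence described above can be illustrated with the standard library. This is a sketch of the general technique only, not the code in mcp_tool.py: it operates on a subprocess.Popen handle rather than on PIDs left over after SDK teardown, and the function name stop_child is hypothetical. The 0.5-second grace period mirrors the value mentioned in the text.

```python
import subprocess


def stop_child(proc: subprocess.Popen, grace_seconds: float = 0.5) -> int:
    """Terminate a child process, escalating to SIGKILL if it lingers.

    Returns the child's exit code once it has been reaped.
    """
    if proc.poll() is None:          # still running
        proc.terminate()             # SIGTERM on POSIX
        try:
            proc.wait(timeout=grace_seconds)
        except subprocess.TimeoutExpired:
            proc.kill()              # SIGKILL on POSIX
            proc.wait()              # reap so no zombie is left behind
    return proc.returncode

# Typical use (not executed here):
#   proc = subprocess.Popen(["bun", "run", "src/cli.ts", "serve"])
#   try:
#       ...  # talk to the server
#   finally:
#       stop_child(proc)
```

Calling the helper from a finally block is what guarantees the child does not outlive a failed keepalive or reconnect.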

Ongoing monitoring: Even with the root fix, the defense-in-depth layers below are worth keeping as a safety net against edge cases (e.g., bun process trees that escape a single SIGKILL, or kernel-level process state wedges).

Defense in depth:

| Layer | Fix | Effect |
|-------|-----|--------|
| 1 | cron_monitor restart limit (max 3/30min) | Breaks the restart loop |
| 2 | cron_monitor zombie detection (kill >3 gbrain procs) | Cleans before restart |
| 3 | systemd KillMode=control-group | All cgroup children die with gateway |
| 4 | swap (4GB) | OOM buffer, prevents instant collapse |

# 1. Fix KillMode in BOTH places (source code + generated service file)
#    Service file alone is NOT enough — it gets regenerated from gateway.py template
#    on every gateway restart/update. Must patch the source:
sed -i 's/KillMode=mixed/KillMode=control-group/g' ~/.hermes/hermes-agent/hermes_cli/gateway.py
#    Then edit the generated service file so it takes effect immediately:
sed -i 's/KillMode=mixed/KillMode=control-group/g' ~/.config/systemd/user/hermes-gateway.service
systemctl --user daemon-reload

# 2. Add swap
sudo fallocate -l 4G /swapfile && sudo chmod 600 /swapfile
sudo mkswap /swapfile && sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab

# 3. Enhanced cron_monitor (restart limits + zombie detection)
# See references/cron-monitor-hardening.md for the full script

# 4. Re-enable cron_monitor if disabled
crontab -l | sed 's|^# DISABLED.*\*/30|*/30|' | crontab -
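
Layer 1 (at most 3 restarts per 30 minutes) amounts to a sliding-window counter. The sketch below shows the idea in Python. It is not the cron_monitor script, which lives in references/cron-monitor-hardening.md, and the class name RestartLimiter is hypothetical. Timestamps are passed in explicitly so the logic can be tested without waiting.

```python
from collections import deque


class RestartLimiter:
    """Allow at most max_restarts within any window_seconds span."""

    def __init__(self, max_restarts: int = 3, window_seconds: float = 1800):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._stamps = deque()  # timestamps of restarts still inside the window

    def allow(self, now: float) -> bool:
        """Record and permit a restart at `now`, or refuse if over the limit."""
        # Drop restarts that have aged out of the window
        while self._stamps and now - self._stamps[0] >= self.window_seconds:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_restarts:
            return False
        self._stamps.append(now)
        return True
```

With the defaults, restarts at 0, 60, and 120 seconds are allowed, a fourth at 180 seconds is refused, and one at 1800 seconds is allowed again because the first has aged out.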

Knowledge Graph Maintenance (PostgreSQL engine)

Once PostgreSQL is running, gbrain needs maintenance jobs to keep the knowledge graph healthy.

Initial Setup

# 1. Check current health
gbrain doctor

# 2. If Minions is half-migrated (schema behind, no preferences.json):
gbrain apply-migrations --yes

# 3. Install autopilot/worker (select linux-systemd when prompted)
#    Or set up Minions worker as systemd user service:
cat > ~/.config/systemd/user/gbrain-minions.service << 'SERVICEEOF'
[Unit]
Description=GBrain Minions Worker — process knowledge graph jobs
After=hermes-memory-gateway.service

[Service]
Type=simple
ExecStart=/usr/local/bin/gbrain jobs work --queue default
Restart=on-failure
RestartSec=30

[Install]
WantedBy=default.target
SERVICEEOF
systemctl --user daemon-reload
systemctl --user enable --now gbrain-minions.service

The systemd worker auto-restarts on failure and survives gateway restarts.

Maintenance Jobs

Submit via MCP or CLI. The Minions worker processes them asynchronously:

| Job | Frequency | Effect |
|-----|-----------|--------|
| embed | Once, then weekly | Generate vector embeddings for all chunks |
| backlinks | Once, then weekly | Compute entity link relationships between pages |
| extract | Once | Extract timeline references from filesystem content |
| autopilot-cycle | Weekly | Self-maintaining brain daemon cycle |

CLI commands for ad-hoc runs:

# Generate embeddings for stale pages
gbrain embed --stale

# Or submit to Minions queue via CLI
gbrain jobs submit embed
gbrain jobs submit backlinks

# Check queue
gbrain jobs list
gbrain jobs stats

Pitfall: zombie_cleanup.sh keeps oldest PID

The existing zombie_cleanup.sh (every 2h cron) keeps the oldest (smallest PID) gbrain process and kills newer ones. After a fresh gbrain starts serving MCP traffic, the old process becomes the orphan — but the script keeps it and kills the healthy new one. This is wrong.

Fix after migration cleanup: If only 1-2 healthy processes remain, disable or patch the zombie cleanup script. With the May-14 code fix in mcp_tool.py, orphan accumulation should no longer occur. The systemd Minions worker does not depend on gbrain MCP processes.
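If the script is patched rather than disabled, the selection rule could keep the most recently started process instead of the smallest PID. The sketch below assumes, following the description above, that the newest process is the healthy one. It also assumes process ages come from ps -o etimes (elapsed seconds since start). The function name select_orphans is hypothetical.

```python
def select_orphans(procs: dict[int, int], keep: int = 1) -> list[int]:
    """Choose which gbrain PIDs to kill.

    procs maps PID -> elapsed seconds since the process started.
    The `keep` most recently started processes survive; the rest are
    returned in ascending PID order.
    """
    youngest_first = sorted(procs, key=lambda pid: procs[pid])
    return sorted(youngest_first[keep:])
```

Sorting by age rather than by PID also avoids a second problem: PIDs wrap around, so the smallest PID is not necessarily the oldest process. Given {31000: 9000, 120: 30}, the function keeps PID 120, which started 30 seconds ago, and returns [31000].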

Verification

gbrain doctor
# Expected: Health score 80+ (embed 35/35, dead-links 10/10)
# Links and timeline scores remain low until structured content is added manually

gbrain jobs stats    # Queue healthy
systemctl --user status gbrain-minions.service  # Active (running)

Related