name: systematic-debugging description: "4-phase root cause debugging: understand bugs before fixing." version: 1.4.0 author: Hermes Agent (adapted from obra/superpowers) license: MIT platforms: [linux, macos, windows] metadata: hermes: tags: [debugging, troubleshooting, problem-solving, root-cause, investigation] related_skills: [test-driven-development, writing-plans, subagent-driven-development] evaluations: path: evaluations/golden.jsonl description: Golden eval cases validating systematic debugging phases and root cause investigation.

Systematic Debugging

Overview

Random fixes waste time and create new bugs. Quick patches mask underlying issues.

Core principle: ALWAYS find root cause before attempting fixes. Symptom fixes are failure.

Violating the letter of this process is violating the spirit of debugging.

The Iron Law

NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST

If you haven't completed Phase 1, you cannot propose fixes.

When to Use

Use for ANY technical issue:
- Test failures
- Bugs in production
- Unexpected behavior
- Performance problems
- Build failures
- Integration issues

Use this ESPECIALLY when:
- Under time pressure (emergencies make guessing tempting)
- "Just one quick fix" seems obvious
- You've already tried multiple fixes
- Previous fix didn't work
- You don't fully understand the issue

Don't skip when:
- Issue seems simple (simple bugs have root causes too)
- You're in a hurry (rushing guarantees rework)
- Someone wants it fixed NOW (systematic is faster than thrashing)

The Four Phases

You MUST complete each phase before proceeding to the next.

Phase 1: Root Cause Investigation

BEFORE attempting ANY fix:

1. Read Error Messages Carefully

Don't skip past errors or warnings
They often contain the exact solution
Read stack traces completely
Note line numbers, file paths, error codes

Action: Use read_file on the relevant source files. Use search_files to find the error string in the codebase.

2. Reproduce Consistently

Can you trigger it reliably?
What are the exact steps?
Does it happen every time?
If not reproducible → gather more data, don't guess

Action: Use the terminal tool to run the failing test or trigger the bug:

# Run specific failing test
pytest tests/test_module.py::test_name -v

# Run with verbose output
pytest tests/test_module.py -v --tb=long
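When a failure is intermittent, a measured failure rate is more useful than an impression. The sketch below is one way to get that number. `reproduction_rate` is a hypothetical helper (not a Hermes tool), and it assumes the command reports failure through a non-zero exit code, as pytest does.

```python
import subprocess
import sys


def reproduction_rate(cmd, runs=10):
    """Run cmd `runs` times and return the fraction of runs that failed."""
    failures = 0
    for _ in range(runs):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:  # non-zero exit code counts as a failure
            failures += 1
    return failures / runs


# Example call (the test ID is a placeholder for the test that is failing):
# reproduction_rate([sys.executable, "-m", "pytest",
#                    "tests/test_module.py::test_name", "-q"], runs=20)
```

A rate of 1.0 means the bug is deterministic. Anything between 0 and 1 points toward timing, ordering, or shared state, and calls for more data before forming a hypothesis.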

3. Check Recent Changes

What changed that could cause this?
Git diff, recent commits
New dependencies, config changes

Action:

# Recent commits
git log --oneline -10

# Uncommitted changes
git diff

# Changes in specific file
git log -p --follow src/problematic_file.py | head -100

3.5 Check Container vs Host Code Divergence

WHEN a service runs in Docker and the front-end shows unexpected behavior:

The running container may have different code than what's on the host filesystem. An image's contents are fixed at docker build time, so host changes made after the build are NOT reflected in the running container unless the files are bind-mounted or the image is rebuilt and the container recreated.

Diagnostic pattern:

# Step 1 — Check if container uses API-based front-end vs static JSON
docker exec <container> cat /app/public/index.html | grep -E "fetch\(|loadData" | head -5

# Step 2 — Compare with host source
grep -E "fetch\(|loadData" ./public/index.html

# Step 3 — Pull container file for full diff
docker cp <container>:/app/public/index.html /tmp/container-version.html
diff /tmp/container-version.html ./public/index.html

What to look for:
- Different API endpoints being called (e.g. latest.json vs api/stats + api/images)
- Different authentication patterns (localStorage token usage vs none)
- Different file size (index.html in container may be updated while host version is stale)

Also verify with nginx access logs — they show the REAL HTTP status codes the browser receives:

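# Filter on the status field ($9 in nginx's default combined log format),
# so a "200" inside a byte count or timestamp does not hide a failing request
tail -100 /var/log/nginx/access.log | grep gallery | awk '$9 !~ /^(200|304)$/'
# Look for 401, 403, 500, 502, 503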

If nginx shows 401 for API calls that the front-end makes, the issue is missing auth credentials in the fetch calls, not a server-side outage.
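To see at a glance which endpoints are failing and how often, the access log can be tallied by status and path. This is a minimal sketch that assumes nginx's default combined log format. The `failing_requests` helper and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Assumes the default "combined" format: ... "METHOD /path HTTP/x" STATUS BYTES ...
LINE_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) ')


def failing_requests(lines, ok_statuses=("200", "304")):
    """Count (status, path) pairs for requests whose status is not in ok_statuses."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match and match["status"] not in ok_statuses:
            counts[(match["status"], match["path"])] += 1
    return counts


# Made-up sample lines, for illustration only.
sample = [
    '10.0.0.1 - - [23/May/2026:10:00:00 +0000] "GET /gallery/ HTTP/1.1" 200 512 "-" "Mozilla"',
    '10.0.0.1 - - [23/May/2026:10:00:01 +0000] "GET /gallery/api/stats HTTP/1.1" 401 27 "-" "Mozilla"',
    '10.0.0.1 - - [23/May/2026:10:00:01 +0000] "GET /gallery/api/images HTTP/1.1" 401 27 "-" "Mozilla"',
    '10.0.0.1 - - [23/May/2026:10:00:05 +0000] "GET /gallery/api/stats HTTP/1.1" 401 27 "-" "Mozilla"',
]

for (status, path), count in failing_requests(sample).most_common():
    print(status, path, count)
```

A page that loads with 200 while every API path under it returns 401 matches the missing-credentials pattern described above.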

4. Gather Evidence in Multi-Component Systems

WHEN system has multiple components (API → service → database, CI → build → deploy):

BEFORE proposing fixes, add diagnostic instrumentation:

For EACH component boundary:
- Log what data enters the component
- Log what data exits the component
- Verify environment/config propagation
- Check state at each layer

Run once to gather evidence showing WHERE it breaks. THEN analyze evidence to identify the failing component. THEN investigate that specific component.
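One lightweight way to instrument a boundary in Python code is a logging decorator. The sketch below is an illustration, not part of Hermes. `log_boundary` and the two-layer pipeline are hypothetical names chosen to show how the log reveals which layer produced the bad value.

```python
import functools
import logging

logger = logging.getLogger("boundary")


def log_boundary(component):
    """Log what enters and exits a component so the failing layer is visible."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            logger.info("%s IN  args=%r kwargs=%r", component, args, kwargs)
            try:
                result = func(*args, **kwargs)
            except Exception:
                logger.exception("%s RAISED", component)
                raise
            logger.info("%s OUT result=%r", component, result)
            return result
        return wrapper
    return decorator


# Hypothetical two-layer pipeline, used only to show the decorator in action.
@log_boundary("service")
def load_user(user_id):
    return {"id": user_id, "name": None}  # the bad value originates here


@log_boundary("api")
def get_display_name(user_id):
    return load_user(user_id)["name"].upper()  # the symptom shows up here


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    try:
        get_display_name(7)
    except AttributeError:
        pass  # the log already shows "service OUT" carrying name=None
```

The exception is raised in the api layer, but the log shows the service layer returning `name=None`, so the investigation moves to the service.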

5. Trace Data Flow

WHEN error is deep in the call stack:

Where does the bad value originate?
What called this function with the bad value?
Keep tracing upstream until you find the source
Fix at the source, not at the symptom

Action: Use search_files to trace references:

# Find where the function is called
search_files("function_name\\(", path="src/", file_glob="*.py")

# Find where the variable is set
search_files("variable_name\\s*=", path="src/", file_glob="*.py")

6. Trace Process Cascades (Multi-Process Systems)

WHEN failure involves multiple processes (parent/child, systemd services, daemons):

A single failure in one process can cascade into service-manager-level restarts, creating the illusion of a larger crash. Separate symptom from cause:

Step 1 — Map the process tree:

journalctl --user -u <service> --since "5 minutes ago"
ps auxf | grep -E "<relevant_processes>" | grep -v grep

Look for:
- Orphan/zombie processes (status Z or child without parent entry in tree)
- Multiple instances of the same process (leak)
- Processes running longer than expected
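For long process lists, the first two checks can be automated. This is a minimal sketch that assumes the text comes from `ps -A -o pid,ppid,stat,comm`. `scan_process_table` is a hypothetical helper and the sample table is made up.

```python
from collections import Counter


def scan_process_table(ps_output):
    """Return (zombie_pids, duplicated_commands) from `ps -A -o pid,ppid,stat,comm` text."""
    zombies = []
    commands = Counter()
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        pid, ppid, stat, comm = line.split(None, 3)
        if stat.startswith("Z"):  # Z marks a zombie (exited but not yet reaped)
            zombies.append(int(pid))
        else:
            commands[comm] += 1
    duplicates = {name: n for name, n in commands.items() if n > 1}
    return zombies, duplicates


# Made-up sample table, for illustration only.
sample = """\
  PID  PPID STAT COMMAND
  100     1 Ss   hermes
  210   100 S    bun
  211   100 S    bun
  212   100 Z    bun
"""

print(scan_process_table(sample))
```

Duplicates are only suspicious for processes that should be singletons, so interpret the second value against what the service is expected to spawn.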

Step 2 — Check systemd cgroup kill behavior:

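The signals sent on stop depend on KillMode. With KillMode=control-group (the default), SIGTERM goes to ALL processes in the cgroup, not just the main process. With KillMode=mixed, SIGTERM goes only to the main process, and the later SIGKILL goes to every process still left in the cgroup. In both modes, if processes don't exit within TimeoutStopSec, systemd escalates to SIGKILL. This makes normal shutdowns look like crashes (Failed with result 'timeout').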

systemctl --user cat <service> | grep -E "KillMode|TimeoutStopSec"

Step 3 — Trace subprocess lifecycle:

For MCP or stdio-based services with a wrapper chain (e.g. bash → bun/python):

Main process → bash wrapper → child process (e.g. bun, node, python)

When the main process exits:
- Does the wrapper forward the signal?
- Does the child survive as an orphan?
- Is there a cleanup mechanism (finally blocks, SIGTERM handlers, PID tracking)?

Check the source code:

# Find orphan/zombie cleanup patterns
search_files("finally|SIGTERM|_kill_orphaned|_orphan", path="src/", file_glob="*.py")
# Find subprocess PID tracking
search_files("_stdio_pids|_snapshot_child", path="src/", file_glob="*.py")

Step 4 — Check subprocess reconnection behavior:

When connection to a subprocess fails (e.g. MCP keepalive timeout):
- Does the framework try to reconnect automatically?
- Does reconnection spawn a NEW subprocess without cleaning the old one?
- What's the cleanup sequence before spawning?

search_files("reconnect|keepalive|_MAX_RECONNECT|_run_stdio", path="src/", file_glob="*.py")

Step 5 — Correlate timestamps across logs:

When you see SIGTERM or Stopping in the journal, check nearby timestamps for earlier warnings:

journalctl --user -u <service> --since "<failure_time - 10min>" --until "<failure_time>"

A keepalive failed or reconnect warning 5-10 minutes before a SIGTERM is a strong signal that the cascade started with the subprocess, not the main service.
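The correlation can be done mechanically once log lines are parsed into (timestamp, message) pairs. This is a sketch under that assumption. `warnings_before`, its keyword list, and the sample events are illustrative, not taken from a real journal.

```python
from datetime import datetime, timedelta


def warnings_before(events, trigger="SIGTERM",
                    keywords=("keepalive", "reconnect"), window_minutes=10):
    """Return events matching keywords in the window before the first trigger event."""
    stop_time = next((ts for ts, msg in events if trigger in msg), None)
    if stop_time is None:
        return []
    start = stop_time - timedelta(minutes=window_minutes)
    return [
        (ts, msg) for ts, msg in events
        if start <= ts < stop_time and any(k in msg.lower() for k in keywords)
    ]


# Made-up journal excerpt, for illustration only.
events = [
    (datetime(2026, 5, 14, 9, 40), "connected to MCP server"),
    (datetime(2026, 5, 14, 9, 52), "keepalive failed, scheduling reconnect"),
    (datetime(2026, 5, 14, 9, 59), "Received SIGTERM"),
]

for ts, msg in warnings_before(events):
    print(ts.isoformat(), msg)
```

An empty result does not clear the subprocess. It only means the chosen keywords did not appear in the window, so widen the window or the keyword list before ruling it out.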

Step 6 — Know who sent the signal:

Check the journal in verbose mode to find who initiated the stop:

journalctl --user -u <service> -o verbose --since "<time>" --until "<time+1s>"

Look for JOB_TYPE=stop on the systemd message (external systemctl stop) vs exit-code (self-initiated exit). Also check which process _PID refers to: the systemd manager itself (PID 1 for system units, the systemd --user instance for user units) or another process.

Step 7 — Check subprocess stderr logs:

MCP servers redirect stderr to a shared log file. Check for startup messages or crashes:

ls -la ~/.hermes/logs/mcp-stderr.log*
zcat -f ~/.hermes/logs/mcp-stderr.log* | grep -iE "error|crash|panic|traceback"

Empty or clean stderr logs → the failure is not in the child process itself but in the pipe/connection between parent and child.

Phase 1 Completion Checklist

[ ] Error messages fully read and understood
[ ] Issue reproduced consistently
[ ] Recent changes identified and reviewed
[ ] Evidence gathered (logs, state, data flow)
[ ] Problem isolated to specific component/code
[ ] Root cause hypothesis formed
[ ] For multi-process: process tree mapped, cgroup behavior checked, subprocess lifecycle traced

STOP: Do not proceed to Phase 2 until you understand WHY it's happening.

Phase 2: Pattern Analysis

Find the pattern before fixing:

1. Find Working Examples

Locate similar working code in the same codebase
What works that's similar to what's broken?

Action: Use search_files to find comparable patterns:

search_files("similar_pattern", path="src/", file_glob="*.py")

2. Compare Against References

If implementing a pattern, read the reference implementation COMPLETELY
Don't skim — read every line
Understand the pattern fully before applying

3. Identify Differences

What's different between working and broken?
List every difference, however small
Don't assume "that can't matter"
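When the working and broken cases differ in configuration, listing the differences by hand is error-prone. The sketch below compares two flat dicts and reports every key that differs, including type-only differences. `config_diff` and the example settings are hypothetical.

```python
def config_diff(working, broken):
    """List every key whose value differs, including keys present on only one side."""
    missing = object()  # sentinel, so an explicit None value is still compared
    diffs = {}
    for key in sorted(set(working) | set(broken)):
        w = working.get(key, missing)
        b = broken.get(key, missing)
        if w != b:
            diffs[key] = (
                "<absent>" if w is missing else w,
                "<absent>" if b is missing else b,
            )
    return diffs


# Hypothetical settings, for illustration only.
working = {"timeout": 30, "retries": 3, "base_url": "https://api.example.com"}
broken = {"timeout": "30", "retries": 3}

for key, (w, b) in config_diff(working, broken).items():
    print(f"{key}: working={w!r} broken={b!r}")
```

The `timeout` entry differs only in type (30 vs "30"), which is the kind of difference that is easy to dismiss and still worth writing down.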

4. Understand Dependencies

What other components does this need?
What settings, config, environment?
What assumptions does it make?

6. Check Front-end API Auth Patterns

WHEN debugging 401 errors from front-end to API in SPA applications:

A common pattern: the front-end successfully logs in (stores token in localStorage/sessionStorage/cookie) but subsequent API calls do not include the auth token in request headers.

Diagnostic:

# Check nginx access logs for the actual HTTP status codes
tail -100 /var/log/nginx/access.log | grep "api/" | grep "401"

What to look for:
- Login call returns 200 → token stored client-side
- Subsequent fetch() calls to API endpoints return 401
- The fetch() calls lack Authorization: Bearer <token> header
- The front-end stores the token in localStorage but never reads it back for API calls

Fix pattern:

// Add a fetch wrapper that injects the auth token
async function fetchWithAuth(url, options = {}) {
  const token = localStorage.getItem('auth_token_key');
  // Keep any headers the caller passed and add the token on top
  const headers = new Headers(options.headers);
  if (token) headers.set('Authorization', 'Bearer ' + token);
  return fetch(url, { ...options, headers });
}

// Replace all authenticated API calls:
// fetch('api/endpoint') → fetchWithAuth('api/endpoint')

Also handle the 'stale token trap': After container rebuilds or password changes, the stored token becomes invalid. If the SPA bootstrap hides the login form whenever a token exists (if (localStorage.getItem('auth_token_key')) { showMainView(); }), the user is trapped with 401 errors and no way to log in again. Fix: every authenticated API call should check for 401, clear the stored token, and show the login form:

// `sr` is the Response returned by the authenticated fetch call
if (sr.status === 401) {
  localStorage.removeItem('auth_token_key');  // drop the stale token
  showLoginForm();
  return;
}

See docker-infra-security SKILL.md pitfall #13 for full diagnosis and rebuild-time prevention.

This also applies to cookies: if login sets a cookie, check that the cookie is actually being sent with API requests (cross-origin issues, SameSite/Secure attributes, etc.).

See references/cos-gallery-frontend-auth-debug-2026-05-23.md for a real-world case.

7. Scope Creep Trap — Working Production Systems

CRITICAL: When fixing a single known bug in a WORKING production system, ONLY fix that bug.

The most dangerous pattern in production debugging is scope creep:
1. User reports bug X (e.g. "30-day cutoff hides old images")
2. You investigate and find bug X
3. You also notice "minor" issues Y, Z (no HEALTHCHECK, no CSRF, old backups)
4. You fix X + Y + Z +... "while you're in there"
5. One of those extras breaks the user's workflow → system now WORSE than before

This is NOT "being thorough" — it's "reintroducing risk into a working system."

Action	Risk
Fix the reported bug only	Low — tested path, known scope
Fix the bug + "minor improvements"	High — each extra change is a new failure surface
Use CSRF on a private/internal service	Severe — blocks existing trusted clients
Change auth flow (add password form to token-only app)	Severe — breaks existing user workflow
Add uncaughtException/graceful shutdown to stable service	Medium — may change process lifecycle behavior

Rules:
1. Every change beyond the asked scope needs user authorization. If you'd feel embarrassed asking "can I also add CSRF?", don't do it silently.
2. "Working" is the default state. The burden of proof is on ANY change, not just the reported bug.
3. Production is not a sandbox. Don't use live systems to test improvements.
4. Rollback plan before any change. If you can't describe exactly how to undo it, don't make it.
5. No monitoring scripts or crons without asking. Never create health checks, keepalives, or automated monitors unless the user explicitly requests them. They become "invisible debt" that persists after code rollback and erodes user trust.
6. Rollback is not just git revert. Git only covers tracked files. External artifacts that survive rollback:
   - Standalone scripts (~/.hermes/scripts/*.sh)
   - Cron jobs (registered via cronjob tool)
   - Docker volumes, systemd services, nginx configs
   Enumerate AND clean these explicitly when rolling back.

The primary skill here is restraint: the user's workflow was working. Changing it without consent is not "improvement" — it's breaking someone else's working system.

See references/cos-gallery-scope-creep-2026-05-24.md for a real-world case.

8. Check Credential Pool Exhaustion

WHEN debugging 401 / authentication errors in multi-profile or multi-credential systems:

A credential that was valid at one point may be permanently burned by the credential pool mechanism, even after the underlying key recovers.

Common scenario:
- Multiple profiles/services share the same API key via env var (e.g. DEEPSEEK_API_KEY)
- One profile encounters a transient 401 (DeepSeek rate limit, temporary outage, network glitch)
- The credential pool marks the credential as last_status: "exhausted" with last_error_reset_at: null
- The credential is never retried, so the profile is permanently broken even though the key is valid

Diagnostic checklist:

Check auth.json credential pool entries and look for last_status: "exhausted":
python3 -c "import json, os; d=json.load(open(os.path.expanduser('~/.hermes/auth.json'))); e=d['credential_pool']['deepseek'][0]; print(e['last_status'])"
Verify the key actually works by making a direct API call with the stored access_token
If the API returns 200 but last_status: "exhausted", the credential is permanently burned — needs manual reset
Check last_error_reset_at — if null, the credential has no automatic recovery schedule

Fix: Reset the credential status in the profile's auth.json:
- Set "last_status" from "exhausted" to null
- Clear last_error_code, last_error_reason, last_error_message
- The credential pool will re-try the credential on next use
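The reset can be scripted so it is applied the same way to every affected profile. The sketch below assumes the auth.json layout implied by the diagnostic command above (a `credential_pool` mapping of provider name to a list of entries) and assumes that "clearing" a field means setting it to null. `reset_exhausted` and `reset_auth_file` are hypothetical helpers, not part of Hermes.

```python
import json
from pathlib import Path

ERROR_FIELDS = ("last_error_code", "last_error_reason", "last_error_message")


def reset_exhausted(auth):
    """Clear the exhausted state on every pool entry in place; return how many were reset."""
    reset = 0
    for entries in auth.get("credential_pool", {}).values():
        for entry in entries:
            if entry.get("last_status") == "exhausted":
                entry["last_status"] = None
                for field in ERROR_FIELDS:
                    entry[field] = None  # assumption: null means "cleared"
                reset += 1
    return reset


def reset_auth_file(path):
    """Back up auth.json, reset exhausted entries, and write the result back."""
    path = Path(path).expanduser()
    original = path.read_text()
    auth = json.loads(original)
    count = reset_exhausted(auth)
    if count:
        path.with_name(path.name + ".bak").write_text(original)  # keep a rollback copy
        path.write_text(json.dumps(auth, indent=2))
    return count
```

Run it only after confirming with a direct API call that the key works. Resetting a credential that is really invalid just delays the next failure. The `.bak` copy is the rollback path.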

Multiple profiles sharing the same key: Each profile has its own auth.json (~/.hermes/profiles/<name>/auth.json), so exhaustion is tracked per profile. One profile being exhausted does not change the state of the others, and resetting one does not repair the others. Check each separately, even when they all source the key from env:DEEPSEEK_API_KEY.

See references/credential-pool-exhaustion-deepseek-401.md for a real-world case.

Phase 3: Hypothesis and Testing

Scientific method:

1. Form a Single Hypothesis

State clearly: "I think X is the root cause because Y"
Write it down
Be specific, not vague

2. Test Minimally

Make the SMALLEST possible change to test the hypothesis
One variable at a time
Don't fix multiple things at once

3. Verify Before Continuing

Did it work? → Phase 4
Didn't work? → Form NEW hypothesis
DON'T add more fixes on top

4. When You Don't Know

Say "I don't understand X"
Don't pretend to know
Ask the user for help
Research more

Phase 4: Implementation

Fix the root cause, not the symptom:

1. Create Failing Test Case

Simplest possible reproduction
Automated test if possible
MUST have before fixing
Use the test-driven-development skill

2. Implement Single Fix

Address the root cause identified
ONE change at a time
No "while I'm here" improvements
No bundled refactoring

3. Verify Fix

# Run the specific regression test
pytest tests/test_module.py::test_regression -v

# Run full suite — no regressions
pytest tests/ -q

4. If Fix Doesn't Work — The Rule of Three

STOP.
Count: How many fixes have you tried?
If < 3: Return to Phase 1, re-analyze with new information
If ≥ 3: STOP and question the architecture (step 5 below)
DON'T attempt Fix #4 without architectural discussion

5. If 3+ Fixes Failed: Question Architecture

Pattern indicating an architectural problem:
- Each fix reveals new shared state/coupling in a different place
- Fixes require "massive refactoring" to implement
- Each fix creates new symptoms elsewhere

STOP and question fundamentals:
- Is this pattern fundamentally sound?
- Are we "sticking with it through sheer inertia"?
- Should we refactor the architecture vs. continue fixing symptoms?

Discuss with the user before attempting more fixes.

This is NOT a failed hypothesis — this is a wrong architecture.

6. Subprocess Lifecycle Hardening (for Multi-Process Fixes)

When the root cause is a subprocess lifecycle issue (orphan processes, zombie accumulation, unclean shutdown), the fix pattern is:

Layer 1 — Active cleanup on teardown:

In the finally block of a subprocess manager:
1. Track spawned PIDs before each spawn (_snapshot_child_pids())
2. On teardown (exit, exception, cancellation): actively kill survivors
3. Use SIGTERM → wait 0.5-2s → SIGKILL escalation
4. Don't defer cleanup to a later sweep, because orphans accumulate fast in reconnect loops

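# Pattern: active kill in finally block (POSIX only)
# new_pids (PIDs spawned by this manager) and run_session() are placeholders.
import os, signal, time

def pid_exists(pid):
    try:
        os.kill(pid, 0)  # signal 0 only checks that the process exists
        return True
    except ProcessLookupError:
        return False

try:
    run_session()
finally:
    survivors = [pid for pid in new_pids if pid_exists(pid)]
    for pid in survivors:
        os.kill(pid, signal.SIGTERM)  # ask politely first
    time.sleep(0.5)  # grace period
    for pid in survivors:
        if pid_exists(pid):
            os.kill(pid, signal.SIGKILL)  # force-kill whatever is left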

Layer 2 — Pre-spawn sweep:

Before spawning a new subprocess (in a reconnect loop), do a defensive sweep of any orphan tracking structures:

# Sweep before new spawn
_kill_orphaned_mcp_children()  # or equivalent cleanup
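One way to make the sweep impossible to forget is to put it inside the reconnect path itself. The sketch below is a simplified illustration using only the standard library. `StdioChild` is a hypothetical class, not the Hermes implementation, and it only tracks the direct child, so processes spawned behind a bash wrapper still need the PID tracking described in Layer 1.

```python
import subprocess
import sys


class StdioChild:
    """Owns at most one child process; reconnecting never leaves the old one behind."""

    def __init__(self, cmd):
        self.cmd = cmd
        self.proc = None

    def stop(self, grace=2.0):
        """Terminate the current child, escalating to kill after `grace` seconds."""
        if self.proc is None or self.proc.poll() is not None:
            return  # nothing running
        self.proc.terminate()
        try:
            self.proc.wait(timeout=grace)
        except subprocess.TimeoutExpired:
            self.proc.kill()
            self.proc.wait()  # reap it so it does not linger as a zombie

    def reconnect(self):
        """Sweep the old child first, then spawn a replacement."""
        self.stop()
        self.proc = subprocess.Popen(
            self.cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE
        )
        return self.proc
```

Because `reconnect()` always calls `stop()` first, a reconnect loop cannot accumulate one orphan per attempt, whatever triggered the reconnect.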

Layer 3 — Shorten detection interval:

Reduce keepalive/heartbeat intervals to catch failures faster. Faster detection means fewer accumulated orphans and shorter reconnect windows.

Layer 4 — Log what you killed:

Always log when force-killing a survivor:

logger.warning("Force-killed MCP subprocess %d (%s) on teardown", pid, name)

This prevents the orphan from appearing as a mysterious zombie with no trail.

Red Flags — STOP and Follow Process

If you catch yourself thinking:
- "Quick fix for now, investigate later"
- "Just try changing X and see if it works"
- "Add multiple changes, run tests"
- "Skip the test, I'll manually verify"
- "It's probably X, let me fix that"
- "I don't fully understand but this might work"
- "Pattern says X but I'll adapt it differently"
- "Here are the main problems: [lists fixes without investigation]"
- Proposing solutions before tracing data flow
- "One more fix attempt" (when already tried 2+)
- Each fix reveals a new problem in a different place

ALL of these mean: STOP. Return to Phase 1.

If 3+ fixes failed: Question the architecture (Phase 4 step 5).

Common Rationalizations

Excuse	Reality
"Issue is simple, don't need process"	Simple issues have root causes too. Process is fast for simple bugs.
"Emergency, no time for process"	Systematic debugging is FASTER than guess-and-check thrashing.
"Just try this first, then investigate"	First fix sets the pattern. Do it right from the start.
"I'll write test after confirming fix works"	Untested fixes don't stick. Test first proves it.
"Multiple fixes at once saves time"	Can't isolate what worked. Causes new bugs.
"Reference too long, I'll adapt the pattern"	Partial understanding guarantees bugs. Read it completely.
"I see the problem, let me fix it"	Seeing symptoms ≠ understanding root cause.
"One more fix attempt" (after 2+ failures)	3+ failures = architectural problem. Question the pattern, don't fix again.
"The child process must have crashed" (in process cascade)	Verify: check parent+child stderr logs separately. A clean child log + keepalive failure = pipe issue, not child crash.

Quick Reference

Phase	Key Activities	Success Criteria
1. Root Cause	Read errors, reproduce, check changes, gather evidence, trace data flow, map process tree	Understand WHAT and WHY
2. Pattern	Find working examples, compare, identify differences	Know what's different
3. Hypothesis	Form theory, test minimally, one variable at a time	Confirmed or new hypothesis
4. Implementation	Create regression test, fix root cause, verify; harden subprocess lifecycle for multi-process bugs	Bug resolved, all tests pass

Hermes Agent Integration

Investigation Tools

Use these Hermes tools during Phase 1:

search_files — Find error strings, trace function calls, locate patterns
read_file — Read source code with line numbers for precise analysis
terminal — Run tests, check git history, reproduce bugs
web_search/web_extract — Research error messages, library docs

With delegate_task

For complex multi-component debugging, dispatch investigation subagents:

delegate_task(
    goal="Investigate why [specific test/behavior] fails",
    context="""
    Follow systematic-debugging skill:
    1. Read the error message carefully
    2. Reproduce the issue
    3. Trace the data flow to find root cause
    4. Report findings — do NOT fix yet

    Error: [paste full error]
    File: [path to failing code]
    Test command: [exact command]
    """,
    toolsets=['terminal', 'file']
)

With test-driven-development

When fixing bugs:
1. Write a test that reproduces the bug (RED)
2. Debug systematically to find root cause
3. Fix the root cause (GREEN)
4. The test proves the fix and prevents regression

Real-World Impact

From debugging sessions:
- Systematic approach: 15-30 minutes to fix
- Random fixes approach: 2-3 hours of thrashing
- First-time fix rate: 95% vs 40%
- New bugs introduced: Near zero vs common

Real-world cases documented under references/:
- gbrain-zombie (2026-05-14): MCP keepalive failure → orphan accumulation → systemd SIGTERM cascade. Three-layer subprocess lifecycle fix.
- credential-pool-exhaustion (2026-05-18): 4 profiles sharing DeepSeek key; 2 permanently burned by transient 401 because last_error_reset_at=null. Fix: reset exhausted credential state in auth.json.
- cos-gallery-frontend-auth (2026-05-23): Docker container had different index.html than host source. Front-end called API endpoints without auth headers → 401 blank page. Fix: added fetchWithAuth() wrapper injecting localStorage token.

No shortcuts. No guessing. Systematic always wins.