Aigile Playbook Architecture
Engineering Playbook · Architecture Reference

Runtime Architecture: Failure Recovery + Return-to-Source

Completing the loop — what happens when agents fail, and how failures return to source
Legend: Claude Code path · Copilot path · Failure / blocked · Recovery / return-to-source · Human gate required · ObservabilityAgent / Observability Plane
Full Continuum with Failure Recovery Loops
```mermaid
flowchart TD
    DEV(["👤 Developer"]) -->|"/plan_to_build"| CC_PLAN["Claude Code<br/>Planning Session<br/>(local, interactive)"]
    CC_PLAN -->|"writes"| SPEC["specs/plan.md<br/>(shared artifact)"]
    SPEC -->|"execute plan"| CC_BUILD["Claude Code<br/>Build Session<br/>(subagents/teams)"]
    CC_BUILD -->|"fix-validate<br/>cycle ×2"| CC_VAL{"Validation<br/>pass?"}
    CC_VAL -->|"✅ pass"| LOCAL_FILES["Local files<br/>changed"]
    CC_VAL -->|"❌ fail after<br/>2 cycles"| DEV_GATE1["⚠ Human Gate<br/>escalate to developer<br/>cycle exhausted"]
    DEV_GATE1 -->|"developer<br/>diagnoses"| CC_BUILD
    LOCAL_FILES -->|"git commit<br/>open PR"| PR["Pull Request<br/>(GitHub)"]
    PR -->|"@copilot review<br/>or issue assign"| CP_AGENT["Copilot Coding Agent<br/>(GitHub Actions,<br/>async, cloud)"]
    CP_AGENT -->|"self-review<br/>before PR update"| CP_SELF{"Self-review<br/>pass?"}
    CP_SELF -->|"❌ iterates<br/>and improves"| CP_SELF
    CP_SELF -->|"✅ done<br/>iterated"| PR_UPDATED["PR updated<br/>by Copilot"]
    PR_UPDATED -->|"human must<br/>approve CI run"| HUMAN_CI["⚠ Human Gate<br/>approve CI<br/>(required by GitHub)"]
    HUMAN_CI -->|"approved"| CI["CI Pipeline<br/>tests + lint<br/>+ security scan"]
    CI -->|"✅ pass"| REVIEW["Code Review<br/>(human)"]
    REVIEW -->|"✅ approved"| MERGE["Merge to main"]
    CI -->|"❌ fail"| CI_FAIL["CI Failure<br/>PR comments<br/>failure logs"]
    CI_FAIL -->|"Option A<br/>same session"| CP_RETRY["Copilot retries<br/>(unassign/reassign<br/>or @copilot comment)"]
    CP_RETRY --> CP_AGENT
    CI_FAIL -->|"Option B<br/>cross-platform<br/>⚡ MISSING SEAM"| SEAM["❌ No native<br/>return-to-source<br/>mechanism"]
    SEAM -->|"today: manual"| DEV_GATE2["⚠ Human Gate<br/>developer reads<br/>CI logs manually"]
    DEV_GATE2 -->|"invokes Claude Code<br/>with CI failure context"| CC_FIX["Claude Code<br/>Fix Session<br/>(local, targeted)"]
    CC_FIX -->|"pushes fix<br/>to same PR branch"| PR
    CI_FAIL -->|"Option C<br/>ObservabilityAgent<br/>⚡ FUTURE"| RP_TRIGGER["Observability Plane<br/>detects CI failure<br/>from OTel trace"]
    RP_TRIGGER -->|"emits fix event<br/>with full trace context"| CC_FIX
    MERGE -->|"OTel trace<br/>complete"| RP["Observability Plane<br/>cost + outcome<br/>per workflow"]
    CC_BUILD -->|"metrics + events"| RP
    CP_AGENT -->|"session logs<br/>(no traces)"| RP
    style DEV fill:#f3f0fa,stroke:#6450b4,color:#1a1a1a
    style CC_PLAN fill:#f7ece8,stroke:#c84b2f,color:#1a1a1a
    style SPEC fill:#fdf6e3,stroke:#b08a2e,color:#1a1a1a
    style CC_BUILD fill:#f7ece8,stroke:#c84b2f,color:#1a1a1a
    style CC_VAL fill:#f7ece8,stroke:#c84b2f,color:#1a1a1a
    style LOCAL_FILES fill:#f7ece8,stroke:#c84b2f,color:#1a1a1a
    style PR fill:#eef7f1,stroke:#2d7a4f,color:#1a1a1a
    style CP_AGENT fill:#edf2fb,stroke:#2a5fa5,color:#1a1a1a
    style CP_SELF fill:#edf2fb,stroke:#2a5fa5,color:#1a1a1a
    style PR_UPDATED fill:#edf2fb,stroke:#2a5fa5,color:#1a1a1a
    style CI fill:#eef7f1,stroke:#2d7a4f,color:#1a1a1a
    style REVIEW fill:#eef7f1,stroke:#2d7a4f,color:#1a1a1a
    style MERGE fill:#eef7f1,stroke:#2d7a4f,color:#1a1a1a
    style CI_FAIL fill:#fdf0ec,stroke:#c84b2f,color:#1a1a1a
    style CP_RETRY fill:#edf2fb,stroke:#2a5fa5,color:#1a1a1a
    style SEAM fill:#fdf0ec,stroke:#c84b2f,color:#b05a18
    style DEV_GATE1 fill:#fdf6e3,stroke:#b08a2e,color:#1a1a1a
    style DEV_GATE2 fill:#fdf6e3,stroke:#b08a2e,color:#1a1a1a
    style HUMAN_CI fill:#fdf6e3,stroke:#b08a2e,color:#1a1a1a
    style CC_FIX fill:#f7ece8,stroke:#c84b2f,color:#1a1a1a
    style RP fill:#eef7f1,stroke:#2d7a4f,color:#1a1a1a
    style RP_TRIGGER fill:#eef7f1,stroke:#2d7a4f,color:#1a1a1a
```

Claude Code — self-healing

Your build.md already handles this: at most two fix-validate cycles per task before escalating to the human. Failure is contained within the agent loop; the developer only sees it once both cycles are exhausted.
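The escalation rule can be sketched as a small loop. This is a hypothetical shape, with `build` and `validate` callables standing in for the build.md agents; only the two-cycle cap comes from the playbook:

```python
def fix_validate(build, validate, task, max_cycles=2):
    """Build, then validate; on failure, fix and re-validate up to
    max_cycles times (build.md's cap) before escalating to the human."""
    build(task)
    report = validate(task)
    cycles = 0
    while not report["pass"] and cycles < max_cycles:
        cycles += 1
        build(task, feedback=report)  # builder agent fixes using validator feedback
        report = validate(task)
    if report["pass"]:
        return {"status": "pass", "fix_cycles": cycles}
    # Both cycles exhausted: surface to the developer (human gate).
    return {"status": "escalate", "fix_cycles": cycles}
```

The point of the cap is that the happy path never reaches the developer; only the `escalate` result does.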

Copilot — partial self-heal

Copilot now self-reviews before opening a PR (one pass). But CI on the resulting PR requires a human to approve the workflow run (GitHub's policy, not a Copilot limitation), and there is no automatic re-trigger on CI failure.

The missing seam

When Copilot's CI fails, there is no native path back to Claude Code. Today, the developer manually reads the CI logs, invokes Claude Code, and pushes a fix to the same branch. This is the gap: a human becomes the message queue between two platforms.

Observability Plane closes the loop

ObservabilityAgent emits a full OTel trace for every run. When CI fails, Observability Plane has the causal chain: which plan task, which agent, which tool call, what it produced. It can emit a structured fix event back to Claude Code with context — no human copy-paste required.
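A minimal sketch of what such a fix event could carry. The field names are assumptions for illustration, not a published schema:

```python
from dataclasses import dataclass, field, asdict

@dataclass
class FixEvent:
    """Structured fix event emitted by the Observability Plane on CI
    failure. Field names are illustrative, not a defined schema."""
    trace_id: str                     # OTel trace that produced the failing code
    plan_task_id: str                 # task in the plan spec the code came from
    agent: str                        # agent that wrote the code
    failing_tests: list = field(default_factory=list)
    error_logs: str = ""
    pr_branch: str = ""
    cost_usd: float = 0.0             # cost of the failed run

event = FixEvent(
    trace_id="0af7651916cd43dd8448eb211c80319c",
    plan_task_id="task-3",
    agent="builder",
    failing_tests=["test_auth_token_refresh"],
    pr_branch="copilot/add-auth",
)
payload = asdict(event)  # ready to hand to a Claude Code fix session
```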
Failure Taxonomy and Recovery Ownership
| Failure type | Where it surfaces | Self-heal? | Return path today | Return path with Observability Plane |
| --- | --- | --- | --- | --- |
| Claude Code validation fail (≤2 cycles) | Terminal, agent report | ✅ automatic | Builder agent fixes, validator re-runs | Same — OTel trace captures each retry cycle |
| Claude Code validation fail (>2 cycles) | Terminal, escalated to developer | ❌ exhausted | Human reads report, diagnoses, re-invokes Claude Code | Observability Plane flags cycle exhaustion, surfaces trace context to developer |
| Copilot self-review fail (pre-PR) | Copilot internal loop | ✅ automatic | Copilot iterates before PR opens | OTel event captures self-review cycles (if ObservabilityAgent wraps it) |
| CI fail on Copilot PR | GitHub Actions, PR checks | ⚠ partial | Human approves re-run, or @copilot comment triggers new session | Observability Plane detects CI failure from trace, emits structured fix context to Claude Code session |
| CI fail — cross-platform return | GitHub PR, CI logs | ❌ no native path | Human manually reads logs, invokes Claude Code locally | ObservabilityAgent receives structured failure event, starts targeted fix session with full context injected |
| Copilot session timeout (>1 hr) | GitHub issue comment | ❌ manual retry | Unassign + reassign issue to Copilot | Observability Plane detects timeout gap in trace, can trigger reassignment via GitHub MCP |
| Claude Agent Teams deadlock | Terminal, stuck teammates | ❌ manual | Developer messages teammate directly via Shift+Down | Observability Plane detects idle teammate span, surfaces to developer with last known state |
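Several rows of the taxonomy come down to pattern-matching on the trace. A minimal sketch of cycle-exhaustion detection, where spans are plain dicts with an illustrative shape rather than real OTel SDK objects:

```python
def detect_exhaustion(spans, max_cycles=2):
    """Flag tasks whose fix-validate retries hit the cap without a pass.
    `spans` is a list of dicts with illustrative keys, not OTel objects."""
    by_task = {}
    for span in spans:
        if span.get("name") == "fix_validate_cycle":
            by_task.setdefault(span["task_id"], []).append(span)
    flagged = []
    for task_id, cycles in by_task.items():
        if len(cycles) >= max_cycles and not any(c["pass"] for c in cycles):
            flagged.append(task_id)  # surface to developer with trace context
    return flagged
```

The timeout and deadlock rows would follow the same pattern: look for a gap or an idle span in the trace, then act on it.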
Return-to-Source: The Missing Seam as a Designed Flow
```mermaid
sequenceDiagram
    participant CP as Copilot Coding Agent<br/>(GitHub Actions)
    participant GH as GitHub PR + CI
    participant DEV as Developer
    participant RP as Observability Plane<br/>(Rust sidecar)
    participant CC as Claude Code<br/>(local session)
    Note over CP,GH: Copilot opens PR after self-review
    CP->>GH: opens PR with code changes
    DEV->>GH: approves CI workflow run
    GH->>GH: CI runs tests
    GH-->>RP: webhook: CI failed<br/>test names + error logs
    Note over GH: ❌ CI fails
    rect rgb(253, 240, 236)
        Note over GH,DEV: TODAY — manual seam
        GH->>DEV: PR notification<br/>"CI failed"
        DEV->>DEV: reads CI logs manually
        DEV->>CC: invokes Claude Code<br/>"fix auth test failure"<br/>(no structured context)
    end
    rect rgb(238, 247, 241)
        Note over RP,CC: WITH OBSERVABILITY PLANE — designed flow
        RP->>RP: correlates CI failure<br/>with original plan task trace<br/>(which agent, which file,<br/>which tool call produced it)
        RP->>CC: emits structured fix event<br/>with injected context:<br/>- failing test names<br/>- originating plan task ID<br/>- agent that wrote the code<br/>- full tool call trace<br/>- cost of failed run
        CC->>CC: starts targeted fix session<br/>with full context pre-loaded<br/>(no re-exploration needed)
        CC->>GH: pushes fix to same PR branch
        GH->>GH: CI re-runs
        GH-->>RP: webhook: CI passed
        RP->>RP: closes trace span<br/>records: total cost,<br/>cycles to pass,<br/>plan → merge duration
    end
    Note over DEV: developer reviews PR<br/>only when CI is green
```
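The correlation step in the designed flow ("which agent, which file, which tool call produced it") can be sketched as a lookup from the CI failure back to the tool-call span that last wrote a failing file. Webhook and span shapes here are simplified assumptions, not GitHub's or OTel's actual payloads:

```python
def correlate_failure(webhook, trace_spans):
    """Map a CI-failure webhook to the tool-call span that last wrote
    one of the failing files. Shapes are illustrative only."""
    failing_files = set(webhook.get("failing_files", []))
    # Walk spans newest-first; the last write to a failing file is the culprit.
    for span in reversed(trace_spans):
        if span.get("tool") == "write_file" and span.get("path") in failing_files:
            return {
                "plan_task_id": span["task_id"],
                "agent": span["agent"],
                "path": span["path"],
                "trace_id": span["trace_id"],
            }
    return None  # no correlation: fall back to the manual seam
```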

Why the seam exists today

Claude Code and Copilot have no shared state. When CI fails on a Copilot PR, the failure context lives in GitHub Actions logs. Claude Code, running locally, has no knowledge of those logs unless a human copies them in. The developer becomes the message queue — reading one platform, typing into another.

What Observability Plane adds

Because ObservabilityAgent emits an OTel trace for every run, Observability Plane knows the causal chain: this CI failure comes from code written by the builder agent in task-3 of specs/add-auth.md, which made these specific file changes. That context can be injected directly into a new Claude Code session — no re-exploration, no manual copy-paste.
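One way the injected context could become a fix-session prompt. The prompt wording and context fields are assumptions; the only claim from the text is that the causal chain arrives pre-loaded:

```python
def render_fix_prompt(ctx):
    """Turn correlated failure context into a prompt for a targeted fix
    session, so the agent starts with the causal chain pre-loaded."""
    lines = [
        f"CI failed on branch {ctx['pr_branch']}.",
        f"Originating plan task: {ctx['plan_task_id']} ({ctx['spec']})",
        f"Code written by: {ctx['agent']} agent",
        "Failing tests:",
        *[f"  - {t}" for t in ctx["failing_tests"]],
        "Fix the failures and push to the same PR branch.",
    ]
    return "\n".join(lines)

ctx = {
    "pr_branch": "copilot/add-auth",
    "plan_task_id": "task-3",
    "spec": "specs/add-auth.md",
    "agent": "builder",
    "failing_tests": ["test_auth_token_refresh"],
}
prompt = render_fix_prompt(ctx)  # handed to a fresh Claude Code session
```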

The broader pattern

This is the core value proposition of cross-platform observability. The issue isn't that either platform fails — all agents fail sometimes. The issue is that today there is no continuous trace that spans the full workflow. Without that trace, failure recovery is manual. With it, recovery can be automated, context-preserving, and measurable.

The Three Unsealed Seams

What Completing the Loop Looks Like