Aigile Playbook Architecture
Engineering Playbook · Architecture Reference

Runtime Architecture: Failure Recovery + Return-to-Source

Completing the loop — what happens when agents fail, and how failures return to source
Legend: Claude Code path · Copilot path · Failure / blocked · Recovery / return-to-source · Human gate required · ObservabilityAgent / Observability Plane
Full Continuum with Failure Recovery Loops
```mermaid
flowchart TD
    DEV(["👤 Developer"]) -->|"/plan_to_build"| CC_PLAN["Claude Code<br/>Planning Session<br/>(local, interactive)"]
    CC_PLAN -->|"writes"| SPEC["specs/plan.md<br/>(shared artifact)"]
    SPEC -->|"execute plan"| CC_BUILD["Claude Code<br/>Build Session<br/>(subagents/teams)"]
    CC_BUILD -->|"fix-validate<br/>cycle ×2"| CC_VAL{"Validation<br/>pass?"}
    CC_VAL -->|"✅ pass"| LOCAL_FILES["Local files<br/>changed"]
    CC_VAL -->|"❌ fail after<br/>2 cycles"| DEV_GATE1["⚠ Human Gate<br/>escalate to developer<br/>cycle exhausted"]
    DEV_GATE1 -->|"developer<br/>diagnoses"| CC_BUILD
    LOCAL_FILES -->|"git commit<br/>open PR"| PR["Pull Request<br/>(GitHub)"]
    PR -->|"@copilot review<br/>or issue assign"| CP_AGENT["Copilot Coding Agent<br/>(GitHub Actions,<br/>async, cloud)"]
    CP_AGENT -->|"self-review<br/>before PR update"| CP_SELF{"Self-review<br/>pass?"}
    CP_SELF -->|"❌ iterates<br/>and improves"| CP_SELF
    CP_SELF -->|"✅ done<br/>iterated"| PR_UPDATED["PR updated<br/>by Copilot"]
    PR_UPDATED -->|"human must<br/>approve CI run"| HUMAN_CI["⚠ Human Gate<br/>approve CI<br/>(required by GitHub)"]
    HUMAN_CI -->|"approved"| CI["CI Pipeline<br/>tests + lint<br/>+ security scan"]
    CI -->|"✅ pass"| REVIEW["Code Review<br/>(human)"]
    REVIEW -->|"✅ approved"| MERGE["Merge to main"]
    CI -->|"❌ fail"| CI_FAIL["CI Failure<br/>PR comments<br/>failure logs"]
    CI_FAIL -->|"Option A<br/>same session"| CP_RETRY["Copilot retries<br/>(unassign/reassign<br/>or @copilot comment)"]
    CP_RETRY --> CP_AGENT
    CI_FAIL -->|"Option B<br/>cross-platform<br/>⚡ MISSING SEAM"| SEAM["❌ No native<br/>return-to-source<br/>mechanism"]
    SEAM -->|"today: manual"| DEV_GATE2["⚠ Human Gate<br/>developer reads<br/>CI logs manually"]
    DEV_GATE2 -->|"invokes Claude Code<br/>with CI failure context"| CC_FIX["Claude Code<br/>Fix Session<br/>(local, targeted)"]
    CC_FIX -->|"pushes fix<br/>to same PR branch"| PR
    CI_FAIL -->|"Option C<br/>ObservabilityAgent<br/>⚡ FUTURE"| RP_TRIGGER["Observability Plane<br/>detects CI failure<br/>from OTel trace"]
    RP_TRIGGER -->|"emits fix event<br/>with full trace context"| CC_FIX
    MERGE -->|"OTel trace<br/>complete"| RP["Observability Plane<br/>cost + outcome<br/>per workflow"]
    CC_BUILD -->|"metrics + events"| RP
    CP_AGENT -->|"session logs<br/>(no traces)"| RP
    style DEV fill:#f3f0fa,stroke:#6450b4,color:#1a1a1a
    style CC_PLAN fill:#f7ece8,stroke:#c84b2f,color:#1a1a1a
    style SPEC fill:#fdf6e3,stroke:#b08a2e,color:#1a1a1a
    style CC_BUILD fill:#f7ece8,stroke:#c84b2f,color:#1a1a1a
    style CC_VAL fill:#f7ece8,stroke:#c84b2f,color:#1a1a1a
    style LOCAL_FILES fill:#f7ece8,stroke:#c84b2f,color:#1a1a1a
    style PR fill:#eef7f1,stroke:#2d7a4f,color:#1a1a1a
    style CP_AGENT fill:#edf2fb,stroke:#2a5fa5,color:#1a1a1a
    style CP_SELF fill:#edf2fb,stroke:#2a5fa5,color:#1a1a1a
    style PR_UPDATED fill:#edf2fb,stroke:#2a5fa5,color:#1a1a1a
    style CI fill:#eef7f1,stroke:#2d7a4f,color:#1a1a1a
    style REVIEW fill:#eef7f1,stroke:#2d7a4f,color:#1a1a1a
    style MERGE fill:#eef7f1,stroke:#2d7a4f,color:#1a1a1a
    style CI_FAIL fill:#fdf0ec,stroke:#c84b2f,color:#1a1a1a
    style CP_RETRY fill:#edf2fb,stroke:#2a5fa5,color:#1a1a1a
    style SEAM fill:#fdf0ec,stroke:#c84b2f,color:#b05a18
    style DEV_GATE1 fill:#fdf6e3,stroke:#b08a2e,color:#1a1a1a
    style DEV_GATE2 fill:#fdf6e3,stroke:#b08a2e,color:#1a1a1a
    style HUMAN_CI fill:#fdf6e3,stroke:#b08a2e,color:#1a1a1a
    style CC_FIX fill:#f7ece8,stroke:#c84b2f,color:#1a1a1a
    style RP fill:#eef7f1,stroke:#2d7a4f,color:#1a1a1a
    style RP_TRIGGER fill:#eef7f1,stroke:#2d7a4f,color:#1a1a1a
```

Claude Code — self-healing

Your build.md already handles this: at most two fix-validate cycles per task before escalating to the human. Failure is contained within the agent loop; the developer only sees it once both cycles are exhausted.
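The escalation rule can be sketched as a small loop. This is a hypothetical shape, with `build` and `validate` callables standing in for the build.md agents; only the two-cycle cap comes from the playbook:

```python
def fix_validate(build, validate, task, max_cycles=2):
    """Build, then validate; on failure, fix and re-validate up to
    max_cycles times (build.md's cap) before escalating to the human."""
    build(task)
    report = validate(task)
    cycles = 0
    while not report["pass"] and cycles < max_cycles:
        cycles += 1
        build(task, feedback=report)  # builder agent fixes using validator feedback
        report = validate(task)
    if report["pass"]:
        return {"status": "pass", "fix_cycles": cycles}
    # Both cycles exhausted: surface to the developer (human gate).
    return {"status": "escalate", "fix_cycles": cycles}
```

The point of the cap is that the happy path never reaches the developer; only the `escalate` result does.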

Copilot — partial self-heal

Copilot now self-reviews before opening a PR (one pass). But CI on the resulting PR requires a human to approve the workflow run (GitHub's policy, not a Copilot limitation), and there is no automatic re-trigger on CI failure.

The missing seam

When Copilot's CI fails, there is no native path back to Claude Code. Today, the developer manually reads the CI logs, invokes Claude Code, and pushes a fix to the same branch. This is the gap: a human becomes the message queue between two platforms.

Observability Plane closes the loop

ObservabilityAgent emits a full OTel trace for every run. When CI fails, Observability Plane has the causal chain: which plan task, which agent, which tool call, what it produced. It can emit a structured fix event back to Claude Code with context — no human copy-paste required.
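A minimal sketch of what such a fix event could carry. The field names are assumptions for illustration, not a published schema:

```python
from dataclasses import dataclass, field, asdict

@dataclass
class FixEvent:
    """Structured fix event emitted by the Observability Plane on CI
    failure. Field names are illustrative, not a defined schema."""
    trace_id: str                     # OTel trace that produced the failing code
    plan_task_id: str                 # task in the plan spec the code came from
    agent: str                        # agent that wrote the code
    failing_tests: list = field(default_factory=list)
    error_logs: str = ""
    pr_branch: str = ""
    cost_usd: float = 0.0             # cost of the failed run

event = FixEvent(
    trace_id="0af7651916cd43dd8448eb211c80319c",
    plan_task_id="task-3",
    agent="builder",
    failing_tests=["test_auth_token_refresh"],
    pr_branch="copilot/add-auth",
)
payload = asdict(event)  # ready to hand to a Claude Code fix session
```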
Failure Taxonomy and Recovery Ownership
| Failure type | Where it surfaces | Self-heal? | Return path today | Return path with Observability Plane |
| --- | --- | --- | --- | --- |
| Claude Code validation fail (≤2 cycles) | Terminal, agent report | ✅ automatic | Builder agent fixes, validator re-runs | Same — OTel trace captures each retry cycle |
| Claude Code validation fail (>2 cycles) | Terminal, escalated to developer | ❌ exhausted | Human reads report, diagnoses, re-invokes Claude Code | Observability Plane flags cycle exhaustion, surfaces trace context to developer |
| Copilot self-review fail (pre-PR) | Copilot internal loop | ✅ automatic | Copilot iterates before PR opens | OTel event captures self-review cycles (if ObservabilityAgent wraps it) |
| CI fail on Copilot PR | GitHub Actions, PR checks | ⚠ partial | Human approves re-run, or @copilot comment triggers new session | Observability Plane detects CI failure from trace, emits structured fix context to Claude Code session |
| CI fail — cross-platform return | GitHub PR, CI logs | ❌ no native path | Human manually reads logs, invokes Claude Code locally | ObservabilityAgent receives structured failure event, starts targeted fix session with full context injected |
| Copilot session timeout (>1 hr) | GitHub issue comment | ❌ manual retry | Unassign + reassign issue to Copilot | Observability Plane detects timeout gap in trace, can trigger reassignment via GitHub MCP |
| Claude Agent Teams deadlock | Terminal, stuck teammates | ❌ manual | Developer messages teammate directly via Shift+Down | Observability Plane detects idle teammate span, surfaces to developer with last known state |
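Several rows of the taxonomy come down to pattern-matching on the trace. A minimal sketch of cycle-exhaustion detection, where spans are plain dicts with an illustrative shape rather than real OTel SDK objects:

```python
def detect_exhaustion(spans, max_cycles=2):
    """Flag tasks whose fix-validate retries hit the cap without a pass.
    `spans` is a list of dicts with illustrative keys, not OTel objects."""
    by_task = {}
    for span in spans:
        if span.get("name") == "fix_validate_cycle":
            by_task.setdefault(span["task_id"], []).append(span)
    flagged = []
    for task_id, cycles in by_task.items():
        if len(cycles) >= max_cycles and not any(c["pass"] for c in cycles):
            flagged.append(task_id)  # surface to developer with trace context
    return flagged
```

The timeout and deadlock rows would follow the same pattern: look for a gap or an idle span in the trace, then act on it.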
Return-to-Source: The Missing Seam as a Designed Flow
```mermaid
sequenceDiagram
    participant CP as Copilot Coding Agent<br/>(GitHub Actions)
    participant GH as GitHub PR + CI
    participant DEV as Developer
    participant RP as Observability Plane<br/>(Rust sidecar)
    participant CC as Claude Code<br/>(local session)
    Note over CP,GH: Copilot opens PR after self-review
    CP->>GH: opens PR with code changes
    DEV->>GH: approves CI workflow run
    GH->>GH: CI runs tests
    GH-->>RP: webhook: CI failed<br/>test names + error logs
    Note over GH: ❌ CI fails
    rect rgb(253, 240, 236)
        Note over GH,DEV: TODAY — manual seam
        GH->>DEV: PR notification<br/>"CI failed"
        DEV->>DEV: reads CI logs manually
        DEV->>CC: invokes Claude Code<br/>"fix auth test failure"<br/>(no structured context)
    end
    rect rgb(238, 247, 241)
        Note over RP,CC: WITH OBSERVABILITY PLANE — designed flow
        RP->>RP: correlates CI failure<br/>with original plan task trace<br/>(which agent, which file,<br/>which tool call produced it)
        RP->>CC: emits structured fix event<br/>with injected context:<br/>- failing test names<br/>- originating plan task ID<br/>- agent that wrote the code<br/>- full tool call trace<br/>- cost of failed run
        CC->>CC: starts targeted fix session<br/>with full context pre-loaded<br/>(no re-exploration needed)
        CC->>GH: pushes fix to same PR branch
        GH->>GH: CI re-runs
        GH-->>RP: webhook: CI passed
        RP->>RP: closes trace span<br/>records: total cost,<br/>cycles to pass,<br/>plan → merge duration
    end
    Note over DEV: developer reviews PR<br/>only when CI is green
```
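The correlation step in the designed flow ("which agent, which file, which tool call produced it") can be sketched as a lookup from the CI failure back to the tool-call span that last wrote a failing file. Webhook and span shapes here are simplified assumptions, not GitHub's or OTel's actual payloads:

```python
def correlate_failure(webhook, trace_spans):
    """Map a CI-failure webhook to the tool-call span that last wrote
    one of the failing files. Shapes are illustrative only."""
    failing_files = set(webhook.get("failing_files", []))
    # Walk spans newest-first; the last write to a failing file is the culprit.
    for span in reversed(trace_spans):
        if span.get("tool") == "write_file" and span.get("path") in failing_files:
            return {
                "plan_task_id": span["task_id"],
                "agent": span["agent"],
                "path": span["path"],
                "trace_id": span["trace_id"],
            }
    return None  # no correlation: fall back to the manual seam
```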

Why the seam exists today

Claude Code and Copilot have no shared state. When CI fails on a Copilot PR, the failure context lives in GitHub Actions logs. Claude Code, running locally, has no knowledge of those logs unless a human copies them in. The developer becomes the message queue — reading one platform, typing into another.

What Observability Plane adds

Because ObservabilityAgent emits an OTel trace for every run, Observability Plane knows the causal chain: this CI failure comes from code written by the builder agent in task-3 of specs/add-auth.md, which made these specific file changes. That context can be injected directly into a new Claude Code session — no re-exploration, no manual copy-paste.
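One way the injected context could become a fix-session prompt. The prompt wording and context fields are assumptions; the only claim from the text is that the causal chain arrives pre-loaded:

```python
def render_fix_prompt(ctx):
    """Turn correlated failure context into a prompt for a targeted fix
    session, so the agent starts with the causal chain pre-loaded."""
    lines = [
        f"CI failed on branch {ctx['pr_branch']}.",
        f"Originating plan task: {ctx['plan_task_id']} ({ctx['spec']})",
        f"Code written by: {ctx['agent']} agent",
        "Failing tests:",
        *[f"  - {t}" for t in ctx["failing_tests"]],
        "Fix the failures and push to the same PR branch.",
    ]
    return "\n".join(lines)

ctx = {
    "pr_branch": "copilot/add-auth",
    "plan_task_id": "task-3",
    "spec": "specs/add-auth.md",
    "agent": "builder",
    "failing_tests": ["test_auth_token_refresh"],
}
prompt = render_fix_prompt(ctx)  # handed to a fresh Claude Code session
```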

The broader pattern

This is the core value proposition of cross-platform observability. The issue isn't that either platform fails — all agents fail sometimes. The issue is that today there is no continuous trace that spans the full workflow. Without that trace, failure recovery is manual. With it, recovery can be automated, context-preserving, and measurable.

The Three Unsealed Seams

What Completing the Loop Looks Like