adk-investigate:investigate-incident

Source

plugins/adk-investigate/skills/investigate-incident/SKILL.md

`investigate-incident` — multi-source correlator

Multi-source production-incident triage. Combines Datadog (logs/metrics/traces/monitors), recent deploys (/adk-investigate:investigate-deploy), and (optionally) a Slack scrape into a single incident report with confidence-stated hypothesis and prioritized next actions. Read-only; never auto-rolls-back.

When to use

“checkout 500s since 13:00” / “users see errors on Y”
“the dashboard is broken” / “stats are zero”
“alert from 10m ago” / “investigate the firing monitor”
“why is <service> slow / failing / down?”
A bare URL: Slack permalink (alert message in #datadog-alerts-* or chatter in #incident), Datadog incident / monitor / dashboard / log-explorer / APM / RUM URL, PagerDuty / OpsGenie / Statuspage URL, GitHub issue.
Multiple URLs (“look at this alert and this DD monitor and tell me what’s going on”) — the skill fans them out via /adk-core:context-gather and reconciles the entities.

When NOT to use

Routine metric query → /adk-investigate:investigate-datadog.
The actual code fix → /adk-code:code-bugfix (chained AFTER this skill).
Product-analytics anomaly (DAU drop, funnel dip) → /adk-investigate:investigate-mixpanel.
Full RCA / post-mortem document → /adk-investigate:investigate-rca (composite that calls this skill + audit log + git blame).
Pure deploy timeline → /adk-investigate:investigate-deploy.
Pure experiment audit → /adk-investigate:investigate-statsig --use audit-log.

Common prompts (auto-route triggers)

Prompt pattern	Action
“checkout broken” / “users see 500s on `<endpoint>`”	full triage
“alert from `<X>` ago” / “monitor `<name>` is firing”	full triage
“why is `<service>` slow / failing / down?”	full triage
“what’s happening with `<service>`?”	full triage (broader scope)
Bare DD incident / monitor / dashboard / log-query URL	context-gather → full triage anchored to it
Slack permalink to an alert in `#datadog-alerts-*`	context-gather (fetches the alert payload) → triage anchored to its service + fired-at
Slack permalink to an `#incident` thread	context-gather (fetches parent + replies) → triage anchored to the symptom in the parent message
PagerDuty / OpsGenie alert URL	context-gather → triage anchored to the alert’s service + fired-at
GitHub issue URL (“users report …”)	context-gather → triage anchored to the symptom in the issue body
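The URL rows in the table above imply a classification step before context-gather runs. The following is a minimal sketch of such a step, not the skill's actual routing logic: the function name `classify_url`, the returned labels, and the host patterns are all assumptions made for illustration.

```python
from urllib.parse import urlparse


def classify_url(url: str) -> str:
    """Return a coarse source label for a URL, or 'unknown' if nothing matches.

    Hypothetical helper: host patterns and labels are illustrative assumptions.
    """
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    path = parsed.path

    if host.endswith("slack.com") and "/archives/" in path:
        return "slack-permalink"
    if "datadoghq" in host:
        return "datadog"
    if host.endswith("pagerduty.com"):
        return "pagerduty"
    if host.endswith("opsgenie.com"):
        return "opsgenie"
    if host == "github.com" and "/issues/" in path:
        return "github-issue"
    return "unknown"
```

Anything labelled `unknown` would be treated as free-form symptom text rather than fetched.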

Inputs

Input	Required	Default / notes
`<symptom-or-url>`	yes	Free-form symptom text, OR one or more URLs (Slack permalink, DD incident / monitor / dashboard / log-query, PagerDuty / OpsGenie, GitHub issue). When URLs are passed, Phase 0 calls `/adk-core:context-gather` to extract symptom + service + timestamp before triage runs.
`--service`	no	resolved from symptom via `repos.md` / `datadog.md.service_aliases`
`--window`	no	`last 2h` (or `±30min` if `--symptom-time` set)
`--slack-channel`	no	`slack.md.incident_channel` (chatter); `slack.md.alert_channels.<service>` is also scraped when a value exists for the resolved service
`--symptom-time`	no	parsed from prompt or “now”
`-i` / `--interactive`	no	mutually exclusive with `--auto`
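The skill is prompt-driven, so it has no real command-line parser. Purely to illustrate the input contract in the table above, here is a sketch using Python's `argparse`. The program name and the `build_parser` function are assumptions. Defaults are left unset because the table resolves them from `repos.md`, `datadog.md`, and `slack.md` at run time.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser mirroring the Inputs table (hypothetical, not shipped)."""
    parser = argparse.ArgumentParser(prog="investigate-incident")
    # One free-form symptom, or one or more URLs.
    parser.add_argument("symptom_or_url", nargs="+")
    parser.add_argument("--service")        # else resolved from the symptom
    parser.add_argument("--window")         # else "last 2h" or ±30min
    parser.add_argument("--slack-channel")  # else slack.md.incident_channel
    parser.add_argument("--symptom-time")   # else parsed from prompt or "now"

    # -i / --interactive and --auto cannot be combined.
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("-i", "--interactive", action="store_true")
    mode.add_argument("--auto", action="store_true")
    return parser
```

Passing both `-i` and `--auto` makes `argparse` exit with a usage error, which matches the "mutually exclusive" note in the table.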

Workflow

Phase 0 — prompt expand
  0a. If input contains URL(s), run /adk-core:context-gather to fetch them
      (Slack permalink, DD incident/monitor/dashboard/log-query, PD/OpsGenie, GH issue).
      Extract symptom + service + symptom-time + window-hint from the linked content.
      DD links use the Datadog MCP directly; Slack uses the workspace connector.
  0b. Resolve service from extracted-or-given symptom (e.g. "checkout broken" -> service:checkout-api).
  0c. Resolve window (link-extracted hint > --window > ±30min around --symptom-time > "last 2h").
  0d. Pick Slack channels (default chatter = slack.md.incident_channel; alerts = slack.md.alert_channels.<service> if set).
  0e. Resolve repo(s) for the service from repos.md (a service may map to multiple repos).

Phase 1 — preflight
  Datadog MCP reachable.
  Slack workspace MCP reachable (only required if --slack-channel set or default channel exists).
  gh CLI available (for /adk-investigate:investigate-deploy).
  bin/adk-info --check info repos datadog slack github.

Phase 2 — define window
  If user named a moment ("at 13:02", "10m ago"), set window = [moment-30m, moment+30m].
  Else default window = last 2h.

Phase 3 — Datadog passes (parallel via incident-investigator subagent):
  - Logs: errors / warnings in service in window (aggregate + top samples).
  - Metrics: error rate, p99, throughput; compare vs last-24h baseline.
  - Traces: top errored / slow traces; common spans.
  - Monitors: which fired; severity; tags.

Phase 4 — Deploy timeline (sequential or parallel):
  - /adk-investigate:investigate-deploy <repo> --window <window> --symptom-time <T>
  - For each repo mapped to the service.

Phase 5 — Optional Slack scrape:
  - If slack-workspace connector reachable, scrape up to two channels:
      (a) chatter — --slack-channel or slack.md.incident_channel
      (b) alerts — slack.md.alert_channels.<service> if the service has an entry (e.g. storefront-bff -> #datadog-alerts-bff)
  - Pull last N messages + threads mentioning service / symptom; quote ≤15 words per message; link out.

Phase 6 — Correlate (the multi-source protocol):
  - Deploy in last 30min before symptom + log error class new in same window -> likely regression (medium-high confidence).
  - 4 monitors from one service all triggered ±5min -> that service's recent change.
  - Errors only on certain hosts / pods -> bad node / partial rollout.
  - (RCA only) Statsig audit log entry near symptom -> flag flip likely cause.
  - At least TWO independent signals required before naming root cause.

Phase 7 — Root-cause hypothesis: one paragraph. State confidence (low/med/high) per references/confidence-language.md.

Phase 8 — Prioritized next actions per references/next-action-priorities.md (rollback > flag-off > restart > investigate-which-PR > escalate).

Phase 9 — Emit incident.md. Optional handoff to /adk-code:code-bugfix.

See references/workflow.md for the per-phase detail.
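The window precedence in step 0c (link-extracted hint, then `--window`, then ±30min around `--symptom-time`, then the last 2h) can be sketched as a small function. This is an illustration under assumptions: the name `resolve_window`, its parameters, and the use of `(start, end)` datetime tuples are hypothetical, and parsing of strings such as "10m ago" is out of scope here.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

Window = Tuple[datetime, datetime]


def resolve_window(
    now: datetime,
    link_hint: Optional[Window] = None,
    explicit: Optional[Window] = None,
    symptom_time: Optional[datetime] = None,
) -> Window:
    """Pick the investigation window using the precedence from step 0c."""
    if link_hint is not None:       # 1. hint extracted from a linked alert/incident
        return link_hint
    if explicit is not None:        # 2. --window
        return explicit
    if symptom_time is not None:    # 3. ±30min around --symptom-time
        half = timedelta(minutes=30)
        return (symptom_time - half, symptom_time + half)
    return (now - timedelta(hours=2), now)  # 4. default: last 2h


# Usage: a symptom at 13:02 UTC with no other hints.
now = datetime(2024, 1, 1, 14, 0, tzinfo=timezone.utc)
start, end = resolve_window(now, symptom_time=now.replace(hour=13, minute=2))
```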

Persona

You are a Principal Engineer running an incident triage. You DON’T jump to conclusions. You correlate at least 2 signals before naming a root cause. You state confidence honestly. You suggest the lowest-blast-radius next action (rollback > flag-off > restart > investigate). You include the DD UI / GH PR / Slack thread links so the operator can verify. You never auto-rollback.

See references/persona.md. The agents/incident-investigator.md agent (in this plugin) is spawned to coordinate the parallel reads.
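The lowest-blast-radius ordering named in the persona (and spelled out in full as rollback > flag-off > restart > investigate-which-PR > escalate) amounts to a fixed ranking over candidate actions. The sketch below shows one way to apply it. The `prioritize` function and the plain-string representation of actions are assumptions, and `references/next-action-priorities.md` remains the authority on when each action is applicable.

```python
from typing import Iterable, List

# Ordering as stated in the skill: earlier entries have lower blast radius.
PRIORITY = ["rollback", "flag-off", "restart", "investigate-which-PR", "escalate"]


def prioritize(actions: Iterable[str]) -> List[str]:
    """Deduplicate candidate actions and sort them by the documented priority."""
    rank = {name: index for index, name in enumerate(PRIORITY)}
    candidates = set(actions)
    unknown = sorted(candidates - rank.keys())
    if unknown:
        raise ValueError(f"unrecognized action(s): {unknown}")
    return sorted(candidates, key=rank.__getitem__)
```

The function only orders suggestions; executing any of them, rollback in particular, stays with the operator.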

Constitution

Must do:

Always pull at least Datadog + deploy timeline. Two sources minimum before naming a root cause.
State confidence (low / medium / high) on the root-cause hypothesis. Anchor each level to evidence per confidence-language.md.
Include DD UI / GitHub PR / Slack thread links for every claim.
Suggest rollback / flag-off as the first option if applicable (lowest blast radius, per next-action-priorities.md).
Hand off to /adk-code:code-bugfix ONLY after the symptom is confirmed AND a code change is the right next action.

Must not do:

Single-source diagnosis — naming a root cause from one signal alone.
High-confidence root cause without correlation evidence.
Recommend rollback without checking what’s in the deploy diff.
Forget to surface Slack discussion — the team often already knows the cause.
Auto-trigger a rollback. Always ask the operator first, even under --auto.
Name an individual as the root cause. Name the system gap.
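The two-source rule and the link-per-claim rule above can be expressed as a simple gate. The sketch below is illustrative only: the `Signal` type, the source labels, and the choice to treat "independent" as "coming from distinct sources" are assumptions, and `references/multi-source-protocol.md` defines the actual protocol.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Signal:
    source: str   # hypothetical labels, e.g. "datadog-logs", "deploy", "slack"
    summary: str
    link: str     # DD UI / GitHub PR / Slack thread URL backing the claim


def name_root_cause(hypothesis: str, signals: List[Signal]) -> Optional[str]:
    """Return the hypothesis only if it is backed by >= 2 independent, linked signals."""
    linked = [s for s in signals if s.link]          # every claim needs a link
    independent_sources = {s.source for s in linked}
    if len(independent_sources) < 2:
        return None  # single-source diagnosis: do not name a root cause
    return hypothesis
```

Returning `None` models the "keep investigating" outcome. Assigning a confidence level is left out because the evidence anchors live in `confidence-language.md`.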

Anti-patterns

See references/anti-patterns.md. Highlights:

“It must be the recent deploy” without log/metric correlation.
Naming a root cause without confidence.
Pasting raw incident chatter from Slack without summarizing.

Output

.temp/task-<slug>/investigation/incident.md with sections: Symptom + window, Datadog evidence, Deploy timeline, Slack discussion summary (if scraped), Statsig audit log (if relevant), Correlation analysis, Root-cause hypothesis (with confidence), Next actions (prioritized). See references/output-format.md.
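As a rough illustration of the section list above, the following sketch assembles a report skeleton and omits the two conditional sections when they have no content. The `render_incident` function, the top-level title, and the heading levels are assumptions. `references/output-format.md` is the canonical shape.

```python
from typing import Dict, List, Tuple

# (section title, required?) in the order listed by the skill.
SECTIONS: List[Tuple[str, bool]] = [
    ("Symptom + window", True),
    ("Datadog evidence", True),
    ("Deploy timeline", True),
    ("Slack discussion summary", False),  # only if scraped
    ("Statsig audit log", False),         # only if relevant
    ("Correlation analysis", True),
    ("Root-cause hypothesis", True),      # must state confidence
    ("Next actions", True),               # prioritized
]


def render_incident(content: Dict[str, str]) -> str:
    """Render incident.md text; raise if a required section is missing."""
    parts = ["# Incident report"]  # title is an assumption
    for title, required in SECTIONS:
        body = content.get(title)
        if body is None:
            if required:
                raise ValueError(f"missing required section: {title}")
            continue  # optional section not produced in this run
        parts.append(f"## {title}\n\n{body}")
    return "\n\n".join(parts) + "\n"
```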

References shipped with this skill

File	Purpose
`references/persona.md`	The multi-source correlator persona
`references/workflow.md`	Detailed Phase 0–9 stages
`references/modes.md`	Mode contract (`--auto` / `-i`; no `--fix`)
`references/interaction-contract.md`	Canonical interaction contract
`references/anti-patterns.md`	What to avoid
`references/examples.md`	3-4 worked examples
`references/output-format.md`	Canonical incident.md shape
`references/artifact-format.md`	`.temp/task-<slug>/` layout
`references/validator.md`	Per-phase gates
`references/how-it-works.md`	Mermaid: phase flow + correlation matrix
`references/clarifying-questions.md`	Questions under `-i`; defaults under `--auto`
`references/multi-source-protocol.md`	DD + deploys mandatory, Slack when reachable; correlate ≥2 signals
`references/confidence-language.md`	low / medium / high anchored to evidence
`references/next-action-priorities.md`	rollback > flag-off > restart > investigate-which-PR > escalate

Additional links

The skill (via the incident-investigator agent) may WebFetch:

The implicated PR’s diff when a leading deploy is named.
The repo’s runbooks (per ~/.config/adk/docs.md.runbook_path).
The DD incident page if list_incidents returns a matching open incident.