Incident Response
Detect, triage, and resolve incidents automatically with trigger skills and agent coordination
When something breaks in production, speed matters. This recipe sets up an automated incident response pipeline using trigger skills, agent coordination, and structured documentation — so your team can focus on fixing the problem instead of managing the process.
The Pattern
DETECT → Webhook trigger fires on external alert
TRIAGE → Agent creates urgent task, notifies team
RESOLVE → Agent surfaces context and tracks progress
POSTMORTEM → Agent compiles incident timeline and learnings
Step 1: Set Up Detection
Configure the Webhook Connector
- Go to Settings > Connectors
- Create a Generic Webhook connector (or use Sentry, PagerDuty, etc.)
- Copy the generated webhook URL
- Configure your monitoring tool to send alerts to this URL
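Before pointing a real monitoring tool at the connector, it can help to post a test alert yourself. The following is a minimal Python sketch, not a documented client: the URL is a placeholder for the one you copied, and the payload fields (`title`, `severity`, `source`) are assumptions, so match them to whatever your monitoring tool actually sends.

```python
import json
import urllib.request

# Placeholder: paste the webhook URL copied from the connector settings.
WEBHOOK_URL = "https://example.invalid/webhooks/your-connector-id"


def build_alert_request(url: str, title: str, severity: str, source: str) -> urllib.request.Request:
    """Build a JSON POST request carrying a test alert (field names are assumed)."""
    payload = {"title": title, "severity": severity, "source": source}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_alert(request: urllib.request.Request, timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling `send_alert(build_alert_request(WEBHOOK_URL, "Payment API timeouts", "critical", "test-script"))` sends one test alert. Building the request separately from sending it lets you inspect the payload first.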
Create the Trigger Skill
Go to Settings > Skills and create a trigger skill:
- Name: Incident Alert Triage
- Event type: webhook_received
- Conditions: (optional, filter by severity) severity = critical
- Tools: create_task, post_message, search_tasks, search_messages, create_document
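The optional condition acts as a filter on the incoming payload, so the skill runs only for critical alerts. The platform's actual matching rules are not described here; the Python sketch below only illustrates the idea, assuming a JSON payload with a top-level `severity` field and case-insensitive comparison.

```python
def matches_condition(payload: dict, field: str = "severity", expected: str = "critical") -> bool:
    """Return True when payload[field] equals the expected value (case-insensitive)."""
    value = payload.get(field)
    return isinstance(value, str) and value.strip().lower() == expected.lower()


alerts = [
    {"title": "Payment API timeouts", "severity": "critical"},
    {"title": "Disk usage at 70%", "severity": "warning"},
    {"title": "Malformed alert"},  # no severity field at all
]

# Only alerts that pass the filter would reach the trigger skill.
triggering = [a["title"] for a in alerts if matches_condition(a)]
print(triggering)
```

If your monitoring tool nests severity deeper in the payload or uses different labels (for example `P1`), adjust the condition accordingly.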
Write Trigger Instructions
An incident alert has been received. Take these actions immediately:
1. CREATE URGENT TASK
- Title: "INCIDENT: [extract title from webhook payload]"
- State: in_progress
- Add label: "incident"
- Include the full alert details in the task description
2. NOTIFY THE TEAM
- Post in #engineering:
"🚨 INCIDENT DETECTED: [title]
Severity: [severity from payload]
Source: [monitoring tool name]
Task: [link to created task]
Please acknowledge if you're investigating."
3. SEARCH FOR CONTEXT
- Search recent messages for related keywords
- Search tasks for related open issues
- If similar past incidents exist, link them in the task description
4. SUGGEST INCIDENT CHANNEL (if critical)
- For critical severity, suggest creating a dedicated
incident channel for coordination
Attach this skill to your Ops or PM agent.
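The agent fills in the notification template from the natural-language instructions, so no code is required for this step. To make the expected output concrete, here is a small Python sketch of the same template; the payload keys and the `task_url` argument are hypothetical.

```python
def format_incident_alert(payload: dict, task_url: str) -> str:
    """Render the #engineering notification from an alert payload (keys are assumed)."""
    lines = [
        f"🚨 INCIDENT DETECTED: {payload.get('title', 'Unknown alert')}",
        f"Severity: {payload.get('severity', 'unknown')}",
        f"Source: {payload.get('source', 'unknown')}",
        f"Task: {task_url}",
        "Please acknowledge if you're investigating.",
    ]
    return "\n".join(lines)


example = format_incident_alert(
    {"title": "Payment API timeouts", "severity": "critical", "source": "Sentry"},
    "https://example.invalid/tasks/123",  # placeholder task link
)
print(example)
```

Comparing a rendered example like this against what the agent actually posts is a quick way to check that your instructions are specific enough.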
Step 2: During the Incident
Real-Time Updates
Team members post updates in the incident task's thread. Ask the agent to record the current status on the task itself:
@Ops update the incident task with:
- Status: investigating root cause
- Affected: payment processing API
- Impact: ~500 users seeing timeout errors
- ETA: investigating, no ETA yet
Searching for Context
@Ops search for any recent deployments or changes
that might be related to the payment API timeout.
Check messages in #engineering and #deployments
from the last 24 hours.
Tracking Timeline
@Ops add a timeline entry to the incident task:
"14:32 — Root cause identified: database connection pool exhausted
14:45 — Fix deployed: increased pool size from 10 to 25
14:48 — Monitoring shows recovery, error rate dropping"
Step 3: Resolution
When the incident is resolved:
@Ops mark the incident task as completed and update
the description with:
- Resolution: [what fixed it]
- Duration: [how long it lasted]
- Impact: [how many users/requests affected]
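The Duration field can be derived from the timestamps recorded during the incident. As a rough Python illustration, assuming same-day `HH:MM` entries in chronological order (shown here as hypothetical `(time, note)` tuples taken from the timeline example in Step 2):

```python
from datetime import datetime, timedelta


def elapsed(entries: list[tuple[str, str]]) -> timedelta:
    """Time between the first and last entry, assuming same-day HH:MM timestamps in order."""
    first = datetime.strptime(entries[0][0], "%H:%M")
    last = datetime.strptime(entries[-1][0], "%H:%M")
    return last - first


timeline = [
    ("14:32", "Root cause identified: database connection pool exhausted"),
    ("14:45", "Fix deployed: increased pool size from 10 to 25"),
    ("14:48", "Monitoring shows recovery, error rate dropping"),
]

minutes = int(elapsed(timeline).total_seconds() // 60)
print(f"{minutes} minutes from first to last timeline entry")
```

This measures only the span covered by the recorded entries. For the real incident duration, the first entry should be the detection time, which the sample timeline does not include.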
Step 4: Postmortem
After the dust settles, generate a postmortem document:
@Writer create a document titled
"Postmortem: [Incident Title] — [Date]" using
the incident task and its thread. Include:
1. SUMMARY
- What happened (1 paragraph)
- Duration and impact
- Severity level
2. TIMELINE
- Detection time
- Key investigation steps (from thread)
- Root cause identification
- Fix deployment
- Full recovery
3. ROOT CAUSE
- Technical explanation
- Why it wasn't caught earlier
4. IMPACT
- Users affected
- Revenue impact (if applicable)
- SLA implications
5. ACTION ITEMS
- Create tasks for each preventive measure
- Assign owners and due dates
- Tag as "postmortem-action"
6. LESSONS LEARNED
- What went well
- What could improve
- Process changes needed
Auto-Create Follow-Up Tasks
@PM from the postmortem action items, create tasks
in the "Infrastructure" task group:
- Each with label "postmortem-action"
- Assign to relevant team members
- Due within 2 weeks
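To picture what the agent should produce from this request, the sketch below builds the equivalent task records in Python. The record shape (`title`, `group`, `labels`, `due_date`) is an assumption for illustration, not the platform's task schema, and the action item shown is made up.

```python
from datetime import date, timedelta


def build_follow_up_tasks(action_items: list[str], created: date) -> list[dict]:
    """Turn action-item titles into task records due two weeks after creation."""
    due = created + timedelta(weeks=2)
    return [
        {
            "title": item,
            "group": "Infrastructure",
            "labels": ["postmortem-action"],
            "due_date": due.isoformat(),
        }
        for item in action_items
    ]


tasks = build_follow_up_tasks(
    ["Add alert on connection pool saturation"],  # hypothetical action item
    created=date(2024, 1, 1),
)
print(tasks[0]["due_date"])
```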
Full Automation Setup
Mature teams can automate most of the flow with three skills:
Detection → Triage (Trigger Skill)
Fires on webhook_received, creates task, posts alert.
Postmortem Reminder (Automation Skill)
Schedule: 0 10 * * 1 (Monday 10 AM)
Instructions: Check for incidents resolved in the past week that don't have a postmortem document. Post reminders in #engineering.
Action Item Tracking (Automation Skill)
Schedule: 0 9 * * 3 (Wednesday 9 AM)
Instructions: Search for open tasks labeled "postmortem-action". If any are overdue, post a reminder in #engineering.
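The schedules use the standard five-field cron layout: minute, hour, day of month, month, day of week. When adapting them, a small helper can confirm that an expression says what you intend. The Python sketch below handles only plain numeric day-of-week values and assumes the common convention that 0 is Sunday; which time zone the platform applies is not covered here.

```python
FIELD_NAMES = ("minute", "hour", "day_of_month", "month", "day_of_week")
WEEKDAYS = ("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")


def describe_cron(expression: str) -> dict:
    """Split a five-field cron expression into named fields."""
    fields = expression.split()
    if len(fields) != len(FIELD_NAMES):
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    return dict(zip(FIELD_NAMES, fields))


def weekday_name(expression: str) -> str:
    """Name the weekday for a plain numeric day-of-week field (0 or 7 = Sunday)."""
    dow = describe_cron(expression)["day_of_week"]
    return WEEKDAYS[int(dow) % 7]


fields = describe_cron("0 9 * * 3")
print(fields["hour"], weekday_name("0 9 * * 3"))
```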
Variations
Customer-Reported Incidents
Replace the webhook trigger with a message trigger:
- Event: message_created in the #support channel
- Conditions: Message contains keywords like "bug", "broken", "error", "not working"
- Action: Create triage task, search for known issues, suggest response
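How the platform evaluates a "contains keywords" condition is not specified here. If you want to test candidate keywords against real support messages first, a Python sketch like the following is one way to do it; it assumes case-insensitive, whole-word matching.

```python
import re

KEYWORDS = ("bug", "broken", "error", "not working")

# Word boundaries keep "bug" from matching inside words such as "debug".
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def looks_like_incident(message: str) -> bool:
    """Return True if the message contains any of the trigger keywords."""
    return _PATTERN.search(message) is not None


print(looks_like_incident("Checkout is BROKEN again"))
print(looks_like_incident("How do I debug my script?"))
```

Whole-word matching reduces false positives but also skips plurals such as "errors"; add those variants to the keyword list if you want them to trigger.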
Security Incidents
Add security-specific steps:
- Immediately restrict access if credentials are compromised
- Search for related access logs
- Create a security incident document with compliance requirements
- Notify legal/compliance team
Scheduled Health Checks
Use an automation skill to proactively check for issues:
- Schedule: 0 */4 * * * (every 4 hours)
- Instructions: Check monitoring endpoints, review error rates, flag anomalies before they become incidents
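"Flag anomalies" works best when the instructions name a concrete threshold. As an illustration only, the Python sketch below flags endpoints whose error rate exceeds a limit; the 5% default, the endpoint names, and the sample rates are all made up.

```python
def flag_anomalies(error_rates: dict[str, float], threshold: float = 0.05) -> list[str]:
    """Return endpoint names whose error rate exceeds the threshold, worst first."""
    flagged = [name for name, rate in error_rates.items() if rate > threshold]
    return sorted(flagged, key=lambda name: error_rates[name], reverse=True)


# Hypothetical error rates (fraction of failed requests) per endpoint.
sample = {"payments": 0.12, "search": 0.01, "auth": 0.07}
print(flag_anomalies(sample))
```

Stating the same rule in the skill instructions (for example, "flag any endpoint with an error rate above 5%") gives the agent an unambiguous criterion.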