Patterns & Recipes

Incident Response

Detect, triage, and resolve incidents automatically with trigger skills and agent coordination

5 min read · Advanced

When something breaks in production, speed matters. This recipe sets up an automated incident response pipeline using trigger skills, agent coordination, and structured documentation — so your team can focus on fixing the problem instead of managing the process.

The Pattern

DETECT    → Webhook trigger fires on external alert
TRIAGE    → Agent creates urgent task, notifies team
RESOLVE   → Agent surfaces context and tracks progress
POSTMORTEM → Agent compiles incident timeline and learnings

Step 1: Set Up Detection

Configure the Webhook Connector

Go to Settings > Connectors
Create a Generic Webhook connector (or use Sentry, PagerDuty, etc.)
Copy the generated webhook URL
Configure your monitoring tool to send alerts to this URL
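
Before pointing a real monitoring tool at the connector, it can help to send a test alert by hand. The following is a minimal sketch, assuming the connector accepts a JSON POST body. The field names (title, severity, source) and the example URL are placeholders for illustration, not a documented payload schema.

```python
import json
import urllib.request


def build_alert(title: str, severity: str, source: str) -> bytes:
    """Encode a test alert as a JSON body. Field names are illustrative."""
    payload = {"title": title, "severity": severity, "source": source}
    return json.dumps(payload).encode("utf-8")


def build_request(webhook_url: str, body: bytes) -> urllib.request.Request:
    """Prepare (but do not send) a JSON POST request for the webhook URL."""
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# To actually send the alert, replace the placeholder URL with the one
# copied from Settings > Connectors and uncomment the lines below:
#
# body = build_alert("Payment API timeouts", "critical", "test-script")
# request = build_request("https://example.invalid/your-webhook-url", body)
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(response.status)
```

If the trigger skill is already attached, a successful test should produce the same task and notification a real alert would.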

Create the Trigger Skill

Go to Settings > Skills and create a trigger skill:

Name: Incident Alert Triage
Event type: webhook_received
Conditions (optional, filter by severity): severity = critical
Tools: create_task, post_message, search_tasks, search_messages, create_document
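
To reason about which alerts a condition such as severity = critical would let through, you can model the filter locally. This is only a sketch of the idea: the function name and the case-insensitive comparison are assumptions, and the product evaluates conditions itself in whatever way it implements them.

```python
def matches_conditions(payload: dict, conditions: dict) -> bool:
    """Return True when every condition field equals the payload value.

    Comparison is case-insensitive here, which is an assumption of this
    sketch rather than documented behavior.
    """
    for field, expected in conditions.items():
        actual = payload.get(field)
        if actual is None or str(actual).lower() != str(expected).lower():
            return False
    return True


# Hypothetical sample payloads from a monitoring tool.
alerts = [
    {"title": "Payment API timeouts", "severity": "critical"},
    {"title": "Disk usage at 70%", "severity": "warning"},
    {"title": "Payload without a severity field"},
]
passing = [a["title"] for a in alerts if matches_conditions(a, {"severity": "critical"})]
print(passing)
```

Running sample payloads from your monitoring tool through a check like this shows whether the field name and values you filter on actually appear in the alerts it sends.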

Write Trigger Instructions

An incident alert has been received. Take these actions immediately:

1. CREATE URGENT TASK
   - Title: "INCIDENT: [extract title from webhook payload]"
   - State: in_progress
   - Add label: "incident"
   - Include the full alert details in the task description

2. NOTIFY THE TEAM
   - Post in #engineering:
     "🚨 INCIDENT DETECTED: [title]
     Severity: [severity from payload]
     Source: [monitoring tool name]
     Task: [link to created task]
     Please acknowledge if you're investigating."

3. SEARCH FOR CONTEXT
   - Search recent messages for related keywords
   - Search tasks for related open issues
   - If similar past incidents exist, link them in the task description

4. SUGGEST AN INCIDENT CHANNEL (if critical)
   - For critical severity, suggest creating a dedicated
     incident channel for coordination

Attach this trigger skill to your Ops or PM agent.
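
The task title and notification described in the instructions above can be sketched as plain string formatting. The payload keys (title, severity, source) and the fallback values are assumptions for illustration. The agent builds these from the actual webhook payload, which may be shaped differently.

```python
def format_incident_title(payload: dict) -> str:
    """Build the task title, falling back when the payload has no title."""
    return f"INCIDENT: {payload.get('title', 'Untitled alert')}"


def format_alert_message(payload: dict, task_link: str) -> str:
    """Build the team notification from an alert payload and a task link."""
    title = payload.get("title", "Untitled alert")
    severity = payload.get("severity", "unknown")
    source = payload.get("source", "unknown")
    return (
        f"🚨 INCIDENT DETECTED: {title}\n"
        f"Severity: {severity}\n"
        f"Source: {source}\n"
        f"Task: {task_link}\n"
        "Please acknowledge if you're investigating."
    )
```

Explicit fallbacks matter because monitoring tools differ in which fields they send. A message that reads "Severity: unknown" is easier to act on than one with a blank.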

Step 2: During the Incident

Real-Time Updates

Team members post updates in the incident task's thread. Ask the agent to record them on the task:

@Ops update the incident task with:
- Status: investigating root cause
- Affected: payment processing API
- Impact: ~500 users seeing timeout errors
- ETA: investigating, no ETA yet

Searching for Context

@Ops search for any recent deployments or changes 
that might be related to the payment API timeout.
Check messages in #engineering and #deployments 
from the last 24 hours.

Tracking Timeline

@Ops add a timeline entry to the incident task:
"14:32 — Root cause identified: database connection pool exhausted
14:45 — Fix deployed: increased pool size from 10 to 25
14:48 — Monitoring shows recovery, error rate dropping"

Step 3: Resolution

When the incident is resolved:

@Ops mark the incident task as completed and update 
the description with:
- Resolution: [what fixed it]
- Duration: [how long it lasted]
- Impact: [how many users/requests affected]
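
If timeline entries follow the "HH:MM" prefix format used in Step 2, the span they cover can be computed rather than estimated. This is a minimal sketch that assumes every entry starts with a 24-hour HH:MM timestamp and that the incident does not cross midnight. Note that it measures from the first logged entry, so add the detection time as an entry if you want the full incident duration.

```python
from datetime import datetime


def timeline_span_minutes(entries: list[str]) -> int:
    """Minutes between the earliest and latest HH:MM-prefixed entries."""
    if not entries:
        raise ValueError("at least one timeline entry is required")
    # Only the first five characters (HH:MM) of each entry are parsed.
    times = [datetime.strptime(entry[:5], "%H:%M") for entry in entries]
    return int((max(times) - min(times)).total_seconds() // 60)
```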

Step 4: Postmortem

After the dust settles, generate a postmortem document:

@Writer create a document titled 
"Postmortem: [Incident Title] — [Date]" using 
the incident task and its thread. Include:

1. SUMMARY
   - What happened (1 paragraph)
   - Duration and impact
   - Severity level

2. TIMELINE
   - Detection time
   - Key investigation steps (from thread)
   - Root cause identification
   - Fix deployment
   - Full recovery

3. ROOT CAUSE
   - Technical explanation
   - Why it wasn't caught earlier

4. IMPACT
   - Users affected
   - Revenue impact (if applicable)
   - SLA implications

5. ACTION ITEMS
   - Create tasks for each preventive measure
   - Assign owners and due dates
   - Tag as "postmortem-action"

6. LESSONS LEARNED
   - What went well
   - What could improve
   - Process changes needed

Auto-Create Follow-Up Tasks

@PM from the postmortem action items, create tasks 
in the "Infrastructure" task group:
- Each with label "postmortem-action"
- Assign to relevant team members
- Due within 2 weeks

Full Automation Setup

For mature teams, the entire flow can be automated:

Detection → Triage (Trigger Skill)

Fires on webhook_received, creates task, posts alert.

Postmortem Reminder (Automation Skill)

Schedule: 0 9 * * 1 (Monday 9 AM)
Instructions: Check for incidents resolved in the past week that don't have a postmortem document. Post reminders in #engineering.
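
Cron schedules are easy to get off by an hour or a day, so it can be worth checking an expression against a known date before saving it. The sketch below handles only the subset of five-field cron syntax used in this recipe (plain numbers, `*`, and `*/n`). It is a simplified checker written for illustration, not the scheduler the product uses, and it ignores time zones.

```python
from datetime import datetime


def field_matches(spec: str, value: int, start: int) -> bool:
    """Match one cron field; `start` is the lowest legal value of the field."""
    if spec == "*":
        return True
    if spec.startswith("*/"):
        return (value - start) % int(spec[2:]) == 0
    return value == int(spec)


def cron_matches(expr: str, when: datetime) -> bool:
    """True if `when` falls on a minute selected by the cron expression."""
    minute, hour, day, month, weekday = expr.split()
    cron_weekday = when.isoweekday() % 7  # cron uses 0 = Sunday, 1 = Monday
    return (
        field_matches(minute, when.minute, 0)
        and field_matches(hour, when.hour, 0)
        and field_matches(day, when.day, 1)
        and field_matches(month, when.month, 1)
        and field_matches(weekday, cron_weekday, 0)
    )


# 1 January 2024 was a Monday.
print(cron_matches("0 9 * * 1", datetime(2024, 1, 1, 9, 0)))
```

The second field is the hour in 24-hour time and the fifth is the day of the week, so a Monday 9 AM schedule has 9 in the second field and 1 in the fifth.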

Action Item Tracking (Automation Skill)

Schedule: 0 9 * * 3 (Wednesday 9 AM)
Instructions: Search for open tasks labeled "postmortem-action". If any are overdue, post a reminder in #engineering.
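
The overdue check in these instructions amounts to a filter on label, state, and due date. Here is a sketch over plain dictionaries. The keys (title, labels, state, due) and the "completed" state value are assumptions made for the example, not the product's task schema.

```python
from datetime import date


def overdue_actions(tasks: list[dict], today: date) -> list[dict]:
    """Open tasks labeled "postmortem-action" whose due date has passed."""
    return [
        task
        for task in tasks
        if "postmortem-action" in task["labels"]
        and task["state"] != "completed"
        and task["due"] < today
    ]


def reminder_text(tasks: list[dict]) -> str:
    """One-line-per-task reminder suitable for posting in a channel."""
    lines = [f"- {task['title']} (due {task['due'].isoformat()})" for task in tasks]
    return "Overdue postmortem actions:\n" + "\n".join(lines)
```

A task due today is not treated as overdue here. Change the comparison to `<=` if your team counts the due date itself.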

Variations

Customer-Reported Incidents

Replace the webhook trigger with a message trigger:
- Event: message_created in #support channel
- Conditions: Message contains keywords like "bug", "broken", "error", "not working"
- Action: Create triage task, search for known issues, suggest response
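
Keyword conditions can match more than intended: a plain substring test for "bug" also fires on "debugging". The sketch below shows whole-word, case-insensitive matching as one way to think about the filter. How the product itself matches keywords is not specified here, so treat this as an illustration for tuning your keyword list.

```python
import re

KEYWORDS = ["bug", "broken", "error", "not working"]


def matched_keywords(message: str, keywords: list[str] = KEYWORDS) -> list[str]:
    """Return the keywords that appear in the message as whole words."""
    found = []
    for keyword in keywords:
        # \b anchors keep "bug" from matching inside "debugging".
        pattern = rf"\b{re.escape(keyword)}\b"
        if re.search(pattern, message, flags=re.IGNORECASE):
            found.append(keyword)
    return found
```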

Security Incidents

Add security-specific steps:
- Immediately restrict access if credentials compromised
- Search for related access logs
- Create a security incident document with compliance requirements
- Notify legal/compliance team

Scheduled Health Checks

Use an automation skill to proactively check for issues:
- Schedule: 0 */4 * * * (every 4 hours)
- Instructions: Check monitoring endpoints, review error rates, flag anomalies before they become incidents
