Incidents & SLA

This page covers what happens during a Driftstack outage: how we detect, communicate about, and resolve incidents, and which credit policy applies if we miss our SLA.

Status page

status.driftstack.dev is the single source of truth during an incident. The same data the status page renders is also published via GET /v1/status (overall + per-component status) and GET /v1/status/incidents (recent / live incidents). To get notified by email when an incident is filed or resolved, subscribe via POST /v1/status/subscribe — see /docs/status-subscriptions.
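For scripted checks, the two read endpoints can be polled with nothing beyond the standard library. The snippet below is a minimal sketch, not an official client: the base URL `https://api.driftstack.dev` is an assumption (this page only names the paths), and `status_url` and `fetch_json` are hypothetical helper names. The response is returned as parsed JSON without assuming any particular shape.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen

# Assumption: the API host. This page documents only the paths.
BASE_URL = "https://api.driftstack.dev"


def status_url(path: str, base_url: str = BASE_URL) -> str:
    """Build the absolute URL for a status endpoint path such as /v1/status."""
    return urljoin(base_url, path)


def fetch_json(path: str, timeout: float = 10.0):
    """GET a public status endpoint and return the decoded JSON body."""
    with urlopen(status_url(path), timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Overall + per-component status, then recent / live incidents.
    print(fetch_json("/v1/status"))
    print(fetch_json("/v1/status/incidents"))
```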

Severity ladder

Severity	Definition	First update	Update cadence
Critical	Core API down across all customers, or data-loss risk.	≤ 15 min	Every 30 min until resolved.
Major	API degraded (>5% error rate) OR a critical surface (auth, sessions) unavailable for a subset of customers.	≤ 30 min	Every 60 min.
Minor	Single non-critical surface (dashboard, an SDK build pipeline) degraded.	≤ 60 min	At resolution.
Maintenance	Planned change with potential impact. Always announced ≥48h in advance.	Pre-announced	Start + end.
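
If you track update deadlines in your own tooling, the ladder above can be encoded as a small lookup table. This is an illustrative sketch only: the minute values come from the table, while `SeverityPolicy`, `POLICIES`, and `next_update_due` are hypothetical names. Maintenance is left out because its updates are pre-announced rather than timed from detection.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    first_update_min: int        # deadline for the first public update
    cadence_min: Optional[int]   # minutes between later updates; None = only at resolution


# Values copied from the severity ladder; the structure itself is illustrative.
POLICIES = {
    "critical": SeverityPolicy(first_update_min=15, cadence_min=30),
    "major": SeverityPolicy(first_update_min=30, cadence_min=60),
    "minor": SeverityPolicy(first_update_min=60, cadence_min=None),
}


def next_update_due(severity: str, opened_at: datetime,
                    last_update_at: Optional[datetime] = None) -> Optional[datetime]:
    """Return when the next public update is due, or None if only a resolution note is owed."""
    policy = POLICIES[severity]
    if last_update_at is None:
        return opened_at + timedelta(minutes=policy.first_update_min)
    if policy.cadence_min is None:
        return None
    return last_update_at + timedelta(minutes=policy.cadence_min)
```

For example, a Critical incident opened at 14:23 UTC owes its first update by 14:38 UTC, and each later update is due 30 minutes after the previous one.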

Detection

Three signals trigger an incident:

V-295b health probes: a poller hits /v1/health and the per-region API endpoints every 60 seconds. Three consecutive failures auto-create a Critical incident.
Customer reports: emails to [email protected] and Slack channel monitoring. We acknowledge within 30 min during EU business hours.
Internal alerting: Sentry + cost-monitoring thresholds page on-call. Anything that warrants customer comms gets escalated to a public status-page entry.
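
The "three consecutive failures" rule for health probes amounts to a streak counter that resets on any success. The following is a sketch of that rule only, not Driftstack's actual prober: `first_trigger_index` is a hypothetical name, and probe results are modelled as plain booleans (True = probe succeeded).

```python
from typing import Iterable, Optional

FAILURE_THRESHOLD = 3  # from the text: three consecutive failures


def first_trigger_index(results: Iterable[bool],
                        threshold: int = FAILURE_THRESHOLD) -> Optional[int]:
    """Return the index of the probe that completes `threshold` consecutive failures, or None."""
    streak = 0
    for i, ok in enumerate(results):
        # Any successful probe resets the streak; a failure extends it.
        streak = 0 if ok else streak + 1
        if streak >= threshold:
            return i
    return None
```

Under this model, two failures followed by a success never trigger; assuming probes run exactly 60 seconds apart, the third failed probe lands about two minutes after the first.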

Customer communications during an incident

Status page entry filed with severity + title + affected components.
Email fan-out to confirmed /v1/status/subscribe subscribers.
Progress updates on the cadence above. Each update appends to the incident's timeline (visible via GET /v1/status/incidents).
Resolution marks the incident resolved and triggers a final email to subscribers with a root-cause summary + remediation steps.
Postmortem for Critical / Major incidents published within 7 business days on the public status page as a permanent entry under the resolved incident. Minor incidents get an inline summary on the resolved status entry.
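
To estimate the latest publication date for a postmortem, the 7-business-day window can be counted forward from a start date. This sketch makes two assumptions that this page does not state: the count starts on the day after the incident is resolved, and business days are Monday to Friday with no holiday calendar. `add_business_days` is a hypothetical helper.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Return the date `days` business days after `start` (Mon-Fri only, holidays ignored)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current
```

Under those assumptions, an incident resolved on Monday 2026-05-11 would have its postmortem due by Wednesday 2026-05-20.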

Note: incident.created / incident.updated / incident.resolved are admin-audit / internal SSE event types — they are not yet in SubscribableWebhookEventTypeSchema, so they can't be the target of a POST /v1/webhooks subscription. Email subscription is the customer-facing notification path today.

SLA + credit policy

Driftstack's SLA covers the API control plane (session create + lifecycle endpoints). The dashboard, SDK distribution, and any features marked roadmap in /docs/api-versioning are best-effort and carry no SLA commitment.

Tier-by-tier SLA targets, the windowing methodology, the credit bands, and the dispute process all live in /docs/sla-policy — that is the authoritative reference. Tier identifiers used there match the AccountTier enum exactly.

Reading SLA data from the API

GET /v1/status/sla

→ {
  "data": [
    {
      "target": "api.driftstack.dev",
      "uptimePct": 99.99,
      "totalProbes": 43200,
      "okCount": 43196,
      "failCount": 4,
      "lastProbeAt": "2026-05-11T13:00:00Z",
      "lastFailureAt": "2026-05-09T14:23:00Z",
      "windowStart": "2026-04-11T13:00:00Z",
      "windowEnd": "2026-05-11T13:00:00Z"
    }
  ]
}

No authentication is required; the status surface is public. The window always covers the trailing 30 days, and its length is not configurable. Field names are camelCase (the SLA report serialises its internal model directly).
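
The counters in the response are enough to sanity-check the reported percentage yourself. The sketch below reuses the sample payload from this section and recomputes the derived values. `parse_ts` and `check_entry` are hypothetical helpers, and rounding to two decimals is an assumption about how uptimePct is derived, not documented behaviour.

```python
import json
from datetime import datetime

# Sample payload copied from the response shown in this section.
SAMPLE = """
{
  "data": [
    {
      "target": "api.driftstack.dev",
      "uptimePct": 99.99,
      "totalProbes": 43200,
      "okCount": 43196,
      "failCount": 4,
      "lastProbeAt": "2026-05-11T13:00:00Z",
      "lastFailureAt": "2026-05-09T14:23:00Z",
      "windowStart": "2026-04-11T13:00:00Z",
      "windowEnd": "2026-05-11T13:00:00Z"
    }
  ]
}
"""


def parse_ts(value: str) -> datetime:
    """Parse an ISO-8601 timestamp with a trailing 'Z' into an aware datetime."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def check_entry(entry: dict) -> dict:
    """Recompute derived values from the raw counters of one SLA entry."""
    window = parse_ts(entry["windowEnd"]) - parse_ts(entry["windowStart"])
    return {
        "counts_add_up": entry["okCount"] + entry["failCount"] == entry["totalProbes"],
        # Assumption: uptimePct is okCount / totalProbes, rounded to 2 decimals.
        "uptime_pct": round(100.0 * entry["okCount"] / entry["totalProbes"], 2),
        "window_days": window.days,
    }


report = check_entry(json.loads(SAMPLE)["data"][0])
print(report)
```

In the sample, 43,200 probes over 30 days works out to one probe per minute (30 × 1,440), which matches a 60-second polling interval.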

Postmortems

Public postmortems for Critical + Major incidents live on the public status page, attached to the resolved incident entry. Each follows the same template: timeline, root cause, what we changed to prevent recurrence. Postmortems are blameless and detailed enough to be useful — we'd rather over-share than under-share.

Reporting a problem

Acute outage: if the status page doesn't already show the incident, email [email protected]. That goes straight to on-call.
Non-acute bug / weird behaviour: [email protected] with a session id + timestamp.
Security: [email protected] — PGP available on the page. See also /docs/api-security-headers for the response-header reference reviewers ask about.