Incident response
A short playbook for when something is on fire.
Solo-dev caveat: there is no on-call rotation today. Incidents reach the owner via Sentry → email/Slack. If you’re an agent or contractor reading this during an incident, the first step is always to contact the owner.
Severity quick-reference
| Sev | Definition | Examples | Response |
|---|---|---|---|
| 1 | Site is down or money is at risk | API returns 5xx broadly; checkout broken; subscription auto-renew misfiring; PII leaking | Drop everything; revert/roll back first, diagnose second |
| 2 | Major feature broken for ≥1 org | Forms can’t be signed; bookings can’t be created; Spotter is hard-down | Within hours; same-day fix or partial workaround |
| 3 | Single feature degraded, workaround exists | Search slow; embeddings out of date; PostHog ingest delayed | Within a day or two |
| 4 | Cosmetic / single user / non-blocking | RTL spacing off in one component | Backlog |
Detection
| Channel | What it catches |
|---|---|
| Sentry (fitkit-backend, fitkit-frontend) | Server exceptions, web client errors. Source maps uploaded via SENTRY_AUTH_TOKEN. |
| PostHog | Funnel drops, event volume anomalies. Server-side payment-stage events: checkout_started, form_created, webhook_received, etc. |
| Pino logs (Railway log stream) | Structured API logs with request IDs |
| Bull-board (/admin/bull-board) | Stuck/failed background jobs |
| User report (in-app feedback → LINEAR_EMAIL via Resend) | Anything the above misses |
First-five-minutes checklist
- Confirm scope. One user? One org? All orgs? Check Sentry’s “Users affected” count and event rate.
- Recent deploy? Railway → Deployments → top of the list. If yes and the regression is clearly correlated, roll back first (Railway one-click). Diagnose after.
- Recent migration? Check libs/db/drizzle/meta/_journal.json for new entries. If a migration ran and the failure references a column / table, see migrations.md.
- External dependency? Cardcom, Morning, Clerk, R2, Neon/Railway: check their status pages. A common failure mode is an upstream timeout cascading to our 5xx.
- Tell the owner. Even if you’ve already mitigated.
Common failure modes
API boots, then dies with “relation/column does not exist”
Migration didn’t apply (often the monotonic-when bug). Run pnpm db:migrate against prod. If the migration is genuinely missing, fix the journal and redeploy.
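Before editing the journal by hand, it can help to confirm which entries are out of order. The sketch below assumes the "monotonic-when bug" means a journal entry whose `when` timestamp is not greater than an earlier entry's, so the migrator can treat it as already applied. The function name is hypothetical, and only the `idx`, `when`, and `tag` fields of `_journal.json` are read.

```typescript
// Subset of the fields found in libs/db/drizzle/meta/_journal.json.
interface JournalEntry {
  idx: number;
  when: number; // millisecond timestamp recorded when the migration was generated
  tag: string;
}

interface Journal {
  entries: JournalEntry[];
}

// Hypothetical helper: returns every entry whose `when` is not strictly
// greater than the highest `when` seen before it in journal order.
function findNonMonotonicEntries(journal: Journal): JournalEntry[] {
  const offenders: JournalEntry[] = [];
  let highest = Number.NEGATIVE_INFINITY;
  for (const entry of journal.entries) {
    if (entry.when <= highest) {
      offenders.push(entry);
    } else {
      highest = entry.when;
    }
  }
  return offenders;
}
```

Feed it the parsed contents of the journal file; an empty result means the ordering is not the problem and the migration is more likely missing or unapplied.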
CORS errors on web
ALLOWED_ORIGINS doesn’t include the calling origin. Update Railway env, redeploy API.
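A quick way to see why an origin is rejected is to replay the comparison locally. This is a minimal sketch that assumes ALLOWED_ORIGINS is a comma-separated list matched exactly against the request's Origin header; the real parsing in the API may differ, and `checkOrigin` is a hypothetical name.

```typescript
type OriginCheck =
  | { allowed: true }
  | { allowed: false; hint: string };

// Hypothetical diagnostic: exact-match lookup plus a hint for the most
// common near-miss (an env entry saved with a trailing slash).
function checkOrigin(origin: string, rawAllowedOrigins: string): OriginCheck {
  const entries = rawAllowedOrigins
    .split(",")
    .map((entry) => entry.trim())
    .filter((entry) => entry !== "");

  if (entries.includes(origin)) {
    return { allowed: true };
  }

  // Browsers send the Origin header without a trailing slash.
  const nearMiss = entries.find((entry) => entry.replace(/\/+$/, "") === origin);
  if (nearMiss !== undefined) {
    return { allowed: false, hint: `entry "${nearMiss}" has a trailing slash` };
  }
  return { allowed: false, hint: `"${origin}" is not listed` };
}
```

Paste the current Railway value as the second argument and the origin from the browser's failed request as the first.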
Clerk session not honored
Webhook for user create/update is failing (apps/api/src/webhooks/clerk*). Symptom: new sign-ups don’t appear in our users table. Check CLERK_WEBHOOK_SECRET matches the Clerk dashboard’s current value, and that Clerk’s webhook endpoint is pointed at ${API_URL}/webhooks/clerk.
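To confirm a secret mismatch without waiting for the next sign-up, a captured delivery can be re-verified offline. Clerk delivers webhooks through Svix, which signs `id.timestamp.body` with HMAC-SHA256. The sketch below reproduces that check with Node's crypto module. The function names are hypothetical, it skips the timestamp-tolerance check, and it is a diagnostic aid rather than a replacement for whatever verification the handler already uses.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Svix signs the string `${id}.${timestamp}.${body}` with HMAC-SHA256.
// The key is the base64 payload that follows the "whsec_" prefix of the secret.
function computeSvixSignature(secret: string, id: string, timestamp: string, body: string): string {
  const key = Buffer.from(secret.replace(/^whsec_/, ""), "base64");
  return createHmac("sha256", key).update(`${id}.${timestamp}.${body}`).digest("base64");
}

// The svix-signature header holds space-separated "v1,<base64>" entries;
// a delivery is valid if any one of them matches.
function matchesSvixHeader(
  secret: string,
  id: string,
  timestamp: string,
  body: string,
  signatureHeader: string,
): boolean {
  const expected = Buffer.from(computeSvixSignature(secret, id, timestamp, body), "base64");
  return signatureHeader.split(" ").some((part) => {
    const [version, signature] = part.split(",");
    if (version !== "v1" || signature === undefined) return false;
    const candidate = Buffer.from(signature, "base64");
    return candidate.length === expected.length && timingSafeEqual(candidate, expected);
  });
}
```

Pass the svix-id, svix-timestamp, and svix-signature headers plus the raw (unparsed) request body. If the check fails with the value of CLERK_WEBHOOK_SECRET but passes with the secret shown in the Clerk dashboard, the env var is stale.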
Payment provider webhook idempotency failure
A provider may retry the same webhook; our handlers must be idempotent. If you see duplicate charges/refunds in the DB, check apps/api/src/payments/ for the idempotency key and confirm __drizzle_migrations shows the migration that added that index. See features/payments/behavior.md.
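The property to look for in the handler is "claim the event key first, apply the side effect only if the claim succeeded". The sketch below illustrates that shape with an in-memory set standing in for a unique index in Postgres. Every name in it (`ProcessedEvents`, `handlePaymentWebhook`, the event fields) is hypothetical and not taken from apps/api/src/payments/.

```typescript
interface PaymentEvent {
  provider: string;
  eventId: string;
  amount: number;
}

// Stand-in for a table with a unique index on the idempotency key.
// In a real handler the claim would be an insert that does nothing on conflict.
class ProcessedEvents {
  private readonly seen = new Set<string>();

  // Returns true only for the first delivery of a given key.
  claim(key: string): boolean {
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }
}

function handlePaymentWebhook(
  event: PaymentEvent,
  store: ProcessedEvents,
  applyCharge: (event: PaymentEvent) => void,
): "applied" | "duplicate" {
  const key = `${event.provider}:${event.eventId}`;
  if (!store.claim(key)) {
    return "duplicate"; // retry of an event we already handled
  }
  applyCharge(event);
  return "applied";
}
```

If the production handler applies the charge before (or without) claiming the key, or the unique index is absent, a provider retry will produce exactly the duplicate rows described above.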
Spotter agent returns 5xx
Common causes (in order of frequency):
- ANTHROPIC_API_KEY rate-limit or auth: check the response detail in Sentry.
- Per-org daily $ cap hit: the return is structured, not a 5xx; if it IS a 5xx, the cap-enforcement path threw. Look at apps/api/src/ai/agent/.
- Postgres timeout on tool execution: verify the DB pool isn’t saturated.
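The difference between "cap hit" and "cap enforcement threw" is easier to spot with the two outcomes side by side. This is an illustrative sketch only: the type, the function name, and the use of integer cents are assumptions, not the code in apps/api/src/ai/agent/.

```typescript
type CapCheck =
  | { ok: true; remainingCents: number }
  | { ok: false; code: "DAILY_CAP_REACHED"; capCents: number; spentCents: number };

// Hypothetical cap check. Hitting the cap is an expected outcome and is
// returned as data; only invalid inputs throw, which is what would surface
// as a 5xx if nothing upstream catches it.
function checkDailyCap(spentCents: number, capCents: number, estimatedCents: number): CapCheck {
  const inputs = [spentCents, capCents, estimatedCents];
  if (inputs.some((value) => !Number.isFinite(value) || value < 0)) {
    throw new TypeError("cap inputs must be finite, non-negative numbers");
  }
  if (spentCents + estimatedCents > capCents) {
    return { ok: false, code: "DAILY_CAP_REACHED", capCents, spentCents };
  }
  return { ok: true, remainingCents: capCents - spentCents - estimatedCents };
}
```

When triaging, a Sentry event whose stack trace originates inside the cap check points at bad inputs (for example, a missing usage row read as NaN) rather than at an org that has simply run out of budget.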
Forms PDF generation fails
Puppeteer / @sparticuz/chromium issue. Re-deploy the API (Chromium’s tmp files can corrupt). If it persists, fall back to “PDF unavailable; raw HTML signing record stored” while you diagnose. Forms must still be signable; PDF is a downstream artifact.
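The fallback described above amounts to treating the PDF as optional and the HTML signing record as mandatory. A minimal sketch, assuming the renderer is injected as a function; `buildSigningArtifact`, `SigningArtifact`, and the `report` callback are hypothetical names:

```typescript
type SigningArtifact =
  | { kind: "pdf"; bytes: Uint8Array }
  | { kind: "html-only"; html: string; notice: string };

// Tries to render the PDF; on any renderer failure, keeps the raw HTML
// record so the form can still be signed.
async function buildSigningArtifact(
  html: string,
  renderPdf: (html: string) => Promise<Uint8Array>,
  report?: (error: unknown) => void,
): Promise<SigningArtifact> {
  try {
    return { kind: "pdf", bytes: await renderPdf(html) };
  } catch (error) {
    report?.(error); // e.g. forward to the error tracker
    return {
      kind: "html-only",
      html,
      notice: "PDF unavailable; raw HTML signing record stored",
    };
  }
}
```

The caller persists whichever artifact comes back, so a Chromium failure degrades the download rather than blocking the signature.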
Bull jobs stuck
/admin/bull-board UI shows queues. Pause a queue, drain failed jobs, resume. If a job is poisoning the queue, move it to a “graveyard” queue with bullmq and unblock the rest.
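Moving a poison job is a copy-then-remove: add it to the graveyard queue first, and remove the original only after the copy succeeds. The sketch below is written against two small interfaces that mirror the subset of BullMQ's Queue and Job API it needs (`getJob`, `add`, `remove`), so it can be read and tested without a Redis connection. `moveToGraveyard` and the interface names are hypothetical.

```typescript
// Structural subset of a BullMQ Job.
interface JobLike<T> {
  name: string;
  data: T;
  remove(): Promise<void>;
}

// Structural subsets of a BullMQ Queue.
interface SourceQueue<T> {
  getJob(jobId: string): Promise<JobLike<T> | undefined>;
}

interface TargetQueue<T> {
  add(name: string, data: T): Promise<unknown>;
}

// Returns false if the job no longer exists in the source queue.
async function moveToGraveyard<T>(
  source: SourceQueue<T>,
  graveyard: TargetQueue<T>,
  jobId: string,
): Promise<boolean> {
  const job = await source.getJob(jobId);
  if (job === undefined) return false;

  // Copy first: if this throws, the original job is left untouched.
  await graveyard.add(job.name, job.data);
  await job.remove();
  return true;
}
```

With real queues, pass the live queue as `source` and a separately named queue with no worker attached as `graveyard`. Removal can fail while a worker still holds the job, so pausing the queue first (as above) is the safer order.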
Post-incident
- Note the date in docs/decisions/ if the incident changes how we work (e.g., the migration outage prompted the _journal.json monotonic-when discipline in CLAUDE.md).
- Add a regression test under the relevant apps/api/src/<module>/__tests__/ or apps/web/e2e/specs/.
- Update this runbook with the new failure mode.
What we don’t have yet (gaps)
- No PagerDuty / Opsgenie. Sentry-only.
- No structured incident retro template (see TODO above).
- No status page.
- No automatic rollback on health-check failure. Railway rolls only on manual click.
- Audit logging (FIT-20) is not implemented: incidents involving “who did what” can only be reconstructed from Sentry breadcrumbs + Pino logs + __drizzle_migrations.created_at. That’s a known gap and the blocker on truly-confident root cause for any data-mutation incident.