Incident response

A short playbook for when something is on fire.

Solo-dev caveat: there is no on-call rotation today. Incidents reach the owner via Sentry → email/Slack. If you’re an agent or contractor reading this during an incident, the first step is always to contact the owner.

Severity quick-reference

Sev	Definition	Examples	Response
1	Site is down or money is at risk	API returns 5xx broadly; checkout broken; subscription auto-renew misfiring; PII leaking	Drop everything; revert/roll back first, diagnose second
2	Major feature broken for ≥1 org	Forms can’t be signed; bookings can’t be created; Spotter is hard-down	Within hours; same-day fix or partial workaround
3	Single feature degraded, workaround exists	Search slow; embeddings out of date; PostHog ingest delayed	Within a day or two
4	Cosmetic / single user / non-blocking	RTL spacing off in one component	Backlog

Detection

Channel	What it catches
Sentry (`fitkit-backend`, `fitkit-frontend`)	Server exceptions, web client errors. Source maps uploaded via `SENTRY_AUTH_TOKEN`.
PostHog	Funnel drops, event volume anomalies. Server-side payment-stage events: `checkout_started`, `form_created`, `webhook_received`, etc.
Pino logs (Railway log stream)	Structured API logs with request IDs
Bull-board (`/admin/bull-board`)	Stuck/failed background jobs
User report (in-app feedback → `LINEAR_EMAIL` via Resend)	Anything the above misses

First-five-minutes checklist

Confirm scope. One user? One org? All orgs? Check Sentry’s “Users affected” count and event rate.
Recent deploy? Railway → Deployments → top of the list. If yes and the regression is clearly correlated, roll back first (Railway one-click). Diagnose after.
Recent migration? Check libs/db/drizzle/meta/_journal.json for new entries. If a migration ran and the failure references a column / table, see migrations.md.
External dependency? Cardcom, Morning, Clerk, R2, Neon/Railway — check their status pages. A common failure mode is upstream timeout cascading to our 5xx.
Tell the owner. Even if you’ve already mitigated.

Common failure modes

API boots, then dies with “relation/column does not exist”

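Migration didn’t apply. This is often the monotonic-when bug: a _journal.json entry whose when timestamp is not later than that of an already-applied migration, so the migrator skips it. Run pnpm db:migrate against prod. If the migration is genuinely missing, fix the journal and redeploy.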
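
One way to spot the monotonic-when problem is to scan the journal for entries whose `when` is not strictly greater than every entry before it. The following is a minimal sketch, assuming the usual drizzle-kit journal shape (an `entries` array with `idx`, `when`, and `tag`). The function name `findNonMonotonicEntries` is hypothetical and not part of the repo.

```typescript
// Fields read from libs/db/drizzle/meta/_journal.json.
// Assumption: each entry has a numeric `when` (epoch millis) and a `tag`.
interface JournalEntry {
  idx: number;
  when: number;
  tag: string;
}

interface Journal {
  entries: JournalEntry[];
}

// Returns the tags of entries whose `when` is not strictly greater than
// the largest `when` seen before them (walking in idx order).
function findNonMonotonicEntries(journal: Journal): string[] {
  const sorted = [...journal.entries].sort((a, b) => a.idx - b.idx);
  const offenders: string[] = [];
  let maxWhen = Number.NEGATIVE_INFINITY;
  for (const entry of sorted) {
    if (entry.when <= maxWhen) {
      offenders.push(entry.tag);
    } else {
      maxWhen = entry.when;
    }
  }
  return offenders;
}
```

Any tag it returns is worth comparing against the rows in __drizzle_migrations before deciding whether the journal needs fixing.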

CORS errors on web

ALLOWED_ORIGINS doesn’t include the calling origin. Update Railway env, redeploy API.
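
When checking the env value, it helps to reproduce the comparison by hand. The sketch below assumes ALLOWED_ORIGINS is a comma-separated list of exact origins. The API's real parsing may differ, and both function names are hypothetical.

```typescript
// Assumption: ALLOWED_ORIGINS looks like "https://a.example,https://b.example".
function parseAllowedOrigins(raw: string | undefined): string[] {
  return (raw ?? "")
    .split(",")
    .map((o) => o.trim().replace(/\/+$/, "")) // drop whitespace and trailing slashes
    .filter((o) => o.length > 0);
}

// Exact-match check of a request's Origin header against the parsed list.
function isOriginAllowed(origin: string, raw: string | undefined): boolean {
  return parseAllowedOrigins(raw).includes(origin.replace(/\/+$/, ""));
}
```

Typical mismatches this surfaces are a wrong scheme, a missing subdomain, or a stray space or trailing slash in the env value.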

Clerk session not honored

Webhook for user create/update is failing (apps/api/src/webhooks/clerk*). Symptom: new sign-ups don’t appear in our users table. Check CLERK_WEBHOOK_SECRET matches the Clerk dashboard’s current value, and that Clerk’s webhook endpoint is pointed at ${API_URL}/webhooks/clerk.

Payment provider webhook idempotency failure

A provider may retry the same webhook, so our handlers must be idempotent. If you see duplicate charges/refunds in the DB, check apps/api/src/payments/ for the idempotency key handling and confirm that __drizzle_migrations lists the migration that added the index on that key. See features/payments/behavior.md.
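
For orientation, the idempotency pattern can be sketched as follows. This is an illustration under stated assumptions, not the code in apps/api/src/payments/. The event shape, `ProcessedEvents`, and `handlePaymentWebhook` are hypothetical, and the in-memory set stands in for a database index on the idempotency key.

```typescript
// Hypothetical event shape; real provider payloads differ.
interface WebhookEvent {
  idempotencyKey: string;
  amountCents: number;
}

// In-memory stand-in for a table with a unique index on the key.
class ProcessedEvents {
  private seen = new Set<string>();

  // Returns true only the first time a key is claimed.
  claim(key: string): boolean {
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }
}

type HandleResult = "applied" | "duplicate";

function handlePaymentWebhook(
  event: WebhookEvent,
  processed: ProcessedEvents,
  applyCharge: (amountCents: number) => void,
): HandleResult {
  // Claim the key before any side effect, so a provider retry is a no-op.
  if (!processed.claim(event.idempotencyKey)) return "duplicate";
  applyCharge(event.amountCents);
  return "applied";
}
```

In a real handler the claim and the side effect would normally share one database transaction, so that a crash between them cannot leave a claimed key with no charge recorded.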

Spotter agent returns 5xx

Common causes (in order of frequency):

ANTHROPIC_API_KEY rate-limit or auth: check the response detail in Sentry.
Per-org daily $ cap hit: the expected response is a structured error, not a 5xx. If it is a 5xx, the cap-enforcement path itself threw. Look at apps/api/src/ai/agent/.
Postgres timeout on tool execution: verify the DB pool isn’t saturated.
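
To make the second cause concrete: a cap hit is meant to be a normal, structured outcome rather than an exception. The sketch below shows that idea only. The type, the `DAILY_CAP_REACHED` code, and `checkDailyCap` are hypothetical names, not the contents of apps/api/src/ai/agent/.

```typescript
// Hypothetical result type: hitting the cap is data, not a thrown error.
type CapCheck =
  | { ok: true; remainingUsd: number }
  | { ok: false; code: "DAILY_CAP_REACHED"; capUsd: number; spentUsd: number };

function checkDailyCap(
  spentUsd: number,
  capUsd: number,
  estimatedCostUsd: number,
): CapCheck {
  const projected = spentUsd + estimatedCostUsd;
  if (projected > capUsd) {
    // Callers turn this into a structured, non-5xx response.
    return { ok: false, code: "DAILY_CAP_REACHED", capUsd, spentUsd };
  }
  return { ok: true, remainingUsd: capUsd - projected };
}
```

If Sentry shows an exception coming from this kind of path, the likely culprit is something around the check (loading the org's spend, for example) rather than the comparison itself.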

Forms PDF generation fails

Puppeteer / @sparticuz/chromium issue. Redeploy the API first, since Chromium’s tmp files can become corrupted. If it persists, fall back to “PDF unavailable; raw HTML signing record stored” while you diagnose. Forms must still be signable; PDF is a downstream artifact.
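
The fallback described above amounts to never letting a PDF failure fail the signing itself. A minimal sketch, assuming a hypothetical `renderPdf` callback in place of the Puppeteer-based renderer and a hypothetical `SigningRecord` shape:

```typescript
interface SigningRecord {
  formId: string;
  html: string; // raw HTML signing record, always stored
  pdf: Uint8Array | null; // null means "PDF unavailable"
  pdfError?: string;
}

// `renderPdf` is a stand-in for whatever actually drives Chromium.
async function signForm(
  formId: string,
  html: string,
  renderPdf: (html: string) => Promise<Uint8Array>,
): Promise<SigningRecord> {
  try {
    const pdf = await renderPdf(html);
    return { formId, html, pdf };
  } catch (err) {
    // Signing must succeed even when the downstream PDF artifact fails.
    const pdfError = err instanceof Error ? err.message : String(err);
    return { formId, html, pdf: null, pdfError };
  }
}
```

Records stored with `pdf: null` can be re-rendered from the saved HTML once Chromium is healthy again.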

Bull jobs stuck

/admin/bull-board UI shows queues. Pause a queue, drain failed jobs, resume. If a job is poisoning the queue, move it to a “graveyard” queue with bullmq and unblock the rest.
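
Moving a poison job to a graveyard queue could look roughly like the sketch below. The `JobLike` and `QueueLike` interfaces are hypothetical, modeled on the parts of BullMQ's Queue and Job used here (`getJob`, `add`, `remove`), and `moveToGraveyard` is not an existing helper in the repo.

```typescript
// Minimal structural types so the logic can be exercised without Redis.
interface JobLike {
  id?: string;
  name: string;
  data: unknown;
  failedReason?: string;
  remove(): Promise<void>;
}

interface QueueLike {
  getJob(jobId: string): Promise<JobLike | undefined>;
  add(name: string, data: unknown): Promise<unknown>;
}

// Copies the job into the graveyard queue, then removes it from the source.
// Returns false when the job id is not found.
async function moveToGraveyard(
  source: QueueLike,
  graveyard: QueueLike,
  jobId: string,
): Promise<boolean> {
  const job = await source.getJob(jobId);
  if (!job) return false;
  // Copy first, remove second: a crash in between leaves a duplicate,
  // not a lost job.
  await graveyard.add(job.name, {
    original: job.data,
    failedReason: job.failedReason ?? null,
    movedAt: new Date().toISOString(),
  });
  await job.remove();
  return true;
}
```

Removing a job that a worker currently holds may fail, so pausing the queue first, as described above, is the safer order.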

Post-incident

Note the date in docs/decisions/ if the incident changes how we work (e.g., the migration outage prompted the _journal.json monotonic-when discipline in CLAUDE.md).
Add a regression test under the relevant apps/api/src/<module>/__tests__/ or apps/web/e2e/specs/.
Update this runbook with the new failure mode.

What we don’t have yet (gaps)

No PagerDuty / Opsgenie. Sentry-only.
No structured incident retro template.
No status page.
No automatic rollback on health-check failure. Railway rolls back only on a manual click.
Audit logging (FIT-20) is not implemented, so incidents involving “who did what” can only be reconstructed from Sentry breadcrumbs + Pino logs + __drizzle_migrations.created_at. That’s a known gap, and it blocks confident root-cause analysis for any data-mutation incident.