Living documentation — last reviewed 2026-05-28

Incident response

A short playbook for when something is on fire.

Solo-dev caveat: there is no on-call rotation today. Incidents reach the owner via Sentry → email/Slack. If you’re an agent or contractor reading this during an incident, the first step is always to contact the owner.

Severity quick-reference

| Sev | Definition | Examples | Response |
|-----|------------|----------|----------|
| 1 | Site is down or money is at risk | API returns 5xx broadly; checkout broken; subscription auto-renew misfiring; PII leaking | Drop everything; revert/roll back first, diagnose second |
| 2 | Major feature broken for ≥1 org | Forms can’t be signed; bookings can’t be created; Spotter is hard-down | Within hours; same-day fix or partial workaround |
| 3 | Single feature degraded, workaround exists | Search slow; embeddings out of date; PostHog ingest delayed | Within a day or two |
| 4 | Cosmetic / single user / non-blocking | RTL spacing off in one component | Backlog |

Detection

| Channel | What it catches |
|---------|-----------------|
| Sentry (fitkit-backend, fitkit-frontend) | Server exceptions, web client errors. Source maps uploaded via SENTRY_AUTH_TOKEN. |
| PostHog | Funnel drops, event volume anomalies. Server-side payment-stage events: checkout_started, form_created, webhook_received, etc. |
| Pino logs (Railway log stream) | Structured API logs with request IDs |
| Bull-board (/admin/bull-board) | Stuck/failed background jobs |
| User report (in-app feedback → LINEAR_EMAIL via Resend) | Anything the above misses |

First-five-minutes checklist

  1. Confirm scope. One user? One org? All orgs? Check Sentry’s “Users affected” count and event rate.
  2. Recent deploy? Railway → Deployments → top of the list. If yes and the regression is clearly correlated, roll back first (Railway one-click). Diagnose after.
  3. Recent migration? Check libs/db/drizzle/meta/_journal.json for new entries. If a migration ran and the failure references a column / table, see migrations.md.
  4. External dependency? Cardcom, Morning, Clerk, R2, Neon/Railway — check their status pages. A common failure mode is upstream timeout cascading to our 5xx.
  5. Tell the owner. Even if you’ve already mitigated.

Common failure modes

API boots, then dies with “relation/column does not exist”

Migration didn’t apply (often the monotonic-when bug). Run pnpm db:migrate against prod. If the migration is genuinely missing, fix the journal and redeploy.
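
A minimal sketch of a check for the monotonic-when problem, assuming it means a journal entry whose `when` timestamp is not greater than the entry before it, and assuming the journal follows the usual drizzle-kit layout (an `entries` array whose items carry `idx`, `when`, and `tag`). The helper name `findNonMonotonicEntries` and the sample data are hypothetical; in practice you would pass in the parsed contents of libs/db/drizzle/meta/_journal.json.

```typescript
// Shape of one journal entry, assumed from the usual drizzle-kit layout.
interface JournalEntry {
  idx: number;
  when: number; // millisecond timestamp
  tag: string; // migration file name without extension
}

interface Journal {
  entries: JournalEntry[];
}

// Hypothetical helper (not part of drizzle): returns the tags of entries
// whose `when` is not strictly greater than the previous entry's `when`.
function findNonMonotonicEntries(journal: Journal): string[] {
  const offenders: string[] = [];
  // Compare in idx order so a shuffled file does not produce false positives.
  const ordered = [...journal.entries].sort((a, b) => a.idx - b.idx);
  for (let i = 1; i < ordered.length; i++) {
    if (ordered[i].when <= ordered[i - 1].when) {
      offenders.push(ordered[i].tag);
    }
  }
  return offenders;
}

// Illustrative data only: the third entry is stamped earlier than the second.
const sampleJournal: Journal = {
  entries: [
    { idx: 0, when: 1700000000000, tag: "0000_init" },
    { idx: 1, when: 1700000500000, tag: "0001_add_orgs" },
    { idx: 2, when: 1700000400000, tag: "0002_add_index" },
  ],
};

const offendingTags = findNonMonotonicEntries(sampleJournal);
console.log(offendingTags); // contains only "0002_add_index"
```

An empty result means the timestamps are in order; any tag returned is a candidate for the "fix the journal" step.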

CORS errors on web

ALLOWED_ORIGINS doesn’t include the calling origin. Update Railway env, redeploy API.
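
To confirm the diagnosis before redeploying, the check can be reproduced locally. This is a sketch only: it assumes ALLOWED_ORIGINS is a comma-separated list matched exactly against the request's Origin header, which the runbook does not state. The function names and example origins are hypothetical.

```typescript
// Split a comma-separated env value into normalized origins.
// Whitespace and trailing slashes are stripped; empty items are dropped.
function parseAllowedOrigins(raw: string): string[] {
  return raw
    .split(",")
    .map((origin) => origin.trim().replace(/\/+$/, ""))
    .filter((origin) => origin.length > 0);
}

// True when the calling origin appears in the allow-list (exact match).
function isOriginAllowed(raw: string, origin: string): boolean {
  const normalized = origin.trim().replace(/\/+$/, "");
  return parseAllowedOrigins(raw).includes(normalized);
}

// Hypothetical values, for illustration only.
const exampleAllowedOrigins = "https://app.example.com, https://admin.example.com/";

console.log(isOriginAllowed(exampleAllowedOrigins, "https://admin.example.com")); // true
console.log(isOriginAllowed(exampleAllowedOrigins, "https://preview.example.com")); // false
```

Note that scheme and port are part of the origin, so http://app.example.com and https://app.example.com:8443 would both be rejected by an exact-match list like this one.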

Clerk session not honored

Webhook for user create/update is failing (apps/api/src/webhooks/clerk*). Symptom: new sign-ups don’t appear in our users table. Check CLERK_WEBHOOK_SECRET matches the Clerk dashboard’s current value, and that Clerk’s webhook endpoint is pointed at ${API_URL}/webhooks/clerk.

Payment provider webhook idempotency failure

A provider may retry the same webhook; our handlers must be idempotent. If you see duplicate charges/refunds in the DB, check apps/api/src/payments/ for the idempotency key and confirm __drizzle_migrations shows the migration that added that index. See features/payments/behavior.md.
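
The idea behind an idempotent handler can be sketched as follows. This is not the code in apps/api/src/payments/; it is an illustration in which an in-memory Set stands in for the unique index on the idempotency key, and the class name, method name, and event ID are all hypothetical.

```typescript
type WebhookResult = "applied" | "duplicate";

// In-memory stand-in for a table with a unique index on the idempotency key.
class IdempotentWebhookProcessor {
  private readonly seenKeys = new Set<string>();

  // Runs `applyEffect` at most once per key. The key is recorded only after
  // the effect succeeds, so a delivery whose effect throws can be retried.
  process(idempotencyKey: string, applyEffect: () => void): WebhookResult {
    if (this.seenKeys.has(idempotencyKey)) {
      return "duplicate";
    }
    applyEffect();
    this.seenKeys.add(idempotencyKey);
    return "applied";
  }
}

// Simulate the provider delivering the same event twice.
const processor = new IdempotentWebhookProcessor();
let chargesRecorded = 0;
const recordCharge = () => {
  chargesRecorded += 1;
};

const firstDelivery = processor.process("evt_123", recordCharge);
const retriedDelivery = processor.process("evt_123", recordCharge);

console.log(firstDelivery, retriedDelivery, chargesRecorded); // applied duplicate 1
```

In a real handler the key insert and the charge or refund write would normally share one database transaction, with the unique index rejecting the second insert. If that index is missing, nothing stops the duplicate write, which is why the migration check above matters.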

Spotter agent returns 5xx

Common causes (in order of frequency):

  1. ANTHROPIC_API_KEY rate-limit or auth — check the response detail in Sentry.
  2. Per-org daily $ cap hit. A cap hit normally returns a structured response, not a 5xx; if you do see a 5xx, the cap-enforcement path itself threw. Look at apps/api/src/ai/agent/.
  3. Postgres timeout on tool execution — verify DB pool isn’t saturated.
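
For cause 3, a rough saturation check might look like the sketch below. The runbook does not say which Postgres driver the API uses; this assumes a node-postgres style pool, which exposes `totalCount`, `idleCount`, and `waitingCount`. The `PoolStats` interface, the function name, and the numbers are hypothetical.

```typescript
// Subset of counters that a node-postgres style Pool exposes (assumed driver).
interface PoolStats {
  totalCount: number; // clients currently held by the pool
  idleCount: number; // clients not checked out
  waitingCount: number; // requests queued for a client
}

// Treat the pool as saturated when it is at its limit, nothing is idle,
// and at least one caller is waiting for a connection.
function isPoolSaturated(stats: PoolStats, maxConnections: number): boolean {
  const atLimit = stats.totalCount >= maxConnections;
  const noneIdle = stats.idleCount === 0;
  return atLimit && noneIdle && stats.waitingCount > 0;
}

// Illustrative numbers only.
const healthyPool: PoolStats = { totalCount: 4, idleCount: 2, waitingCount: 0 };
const saturatedPool: PoolStats = { totalCount: 10, idleCount: 0, waitingCount: 7 };

console.log(isPoolSaturated(healthyPool, 10)); // false
console.log(isPoolSaturated(saturatedPool, 10)); // true
```

A growing `waitingCount` alongside tool-execution timeouts points at the pool rather than at the query itself.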

Forms PDF generation fails

Puppeteer / @sparticuz/chromium issue. Redeploy the API first (Chromium’s tmp files can become corrupted). If the failure persists, fall back to “PDF unavailable; raw HTML signing record stored” while you diagnose. Forms must still be signable; the PDF is a downstream artifact.

Bull jobs stuck

The /admin/bull-board UI shows the queues. Pause the affected queue, drain its failed jobs, then resume it. If a single job is poisoning the queue, move it to a “graveyard” queue with bullmq to unblock the rest.
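
The graveyard move can be sketched as below. To stay runnable without Redis, the code is written against two small structural interfaces shaped after the subset of BullMQ's `Queue` (`getJob`, `add`) and `Job` (`name`, `data`, `remove`) that it needs. The interfaces and the `moveToGraveyard` helper are assumptions for illustration, not existing code in this repo.

```typescript
// Minimal structural view of a job: what BullMQ's Job exposes that we need.
interface JobLike {
  name: string;
  data: unknown;
  remove(): Promise<void>;
}

// Minimal structural view of a queue: what BullMQ's Queue exposes that we need.
interface QueueLike {
  getJob(jobId: string): Promise<JobLike | undefined>;
  add(name: string, data: unknown): Promise<unknown>;
}

// Hypothetical helper: copies the job into the graveyard queue, then removes
// it from the source queue. Returns false when the job ID is not found.
async function moveToGraveyard(
  source: QueueLike,
  graveyard: QueueLike,
  jobId: string,
): Promise<boolean> {
  const job = await source.getJob(jobId);
  if (!job) {
    return false;
  }
  // Copy before removing: if removal fails we are left with a duplicate
  // to clean up, not a lost payload.
  await graveyard.add(job.name, job.data);
  await job.remove();
  return true;
}
```

With real BullMQ queues this would be called as `moveToGraveyard(sourceQueue, graveyardQueue, jobId)` while the source queue is paused. Keeping the payload in a separate queue preserves it for later diagnosis without blocking healthy jobs.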

Post-incident

  1. Note the date in docs/decisions/ if the incident changes how we work (e.g., the migration outage prompted the _journal.json monotonic-when discipline in CLAUDE.md).
  2. Add a regression test under the relevant apps/api/src/<module>/__tests__/ or apps/web/e2e/specs/.
  3. Update this runbook with the new failure mode.

What we don’t have yet (gaps)

  • No PagerDuty / Opsgenie. Sentry-only.
  • No structured incident retro template.
  • No status page.
  • No automatic rollback on health-check failure. Railway rolls back only on a manual click.
  • Audit logging (FIT-20) is not implemented, so incidents involving “who did what” can only be reconstructed from Sentry breadcrumbs + Pino logs + __drizzle_migrations.created_at. This is a known gap, and it blocks confident root-cause analysis for any data-mutation incident.