Living documentation — last reviewed 2026-05-28

ADR-0009: Background jobs via BullMQ on Redis

Status: Accepted
Date: ~2026-01 (estimate)
Context owner: Owner

Context

Several flows can’t happen in the request/response cycle:

  • Sending invitation emails after a member signs up.
  • Generating PDFs for signed compliance forms (Puppeteer renders take seconds).
  • Reconciling payment provider statuses on webhook receipt.
  • Cron-driven workflows: pre-charge reminders, no-show sweeps, AI storage snapshots, billing retry, embeddings enrichment.
  • Periodic maintenance for the Spotter agent (cache eviction, daily usage rollups).

We need durable queues with retries, scheduled jobs (cron-equivalent), and observability.

Decision

Use BullMQ on Redis.

  • BullModule.forRootAsync registers a global connection in apps/api/src/app/app.module.ts.
  • Per-module queues are registered via BullModule.registerQueue({ name: '...' }).
  • The bull-board UI is mounted at /admin/bull-board for operational visibility (apps/api/src/bull-board/). Access gated to platform admins.
  • CRONS_ENABLED env flag gates @Cron decorators globally — kept off in dev to spare the Neon free-tier compute.
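
The wiring described in the first two bullets might look roughly like the following configuration sketch, assuming the @nestjs/bullmq package. The REDIS_HOST and REDIS_PORT variable names, the NotificationsModule class, and the choice of queue are illustrative assumptions; the real app.module.ts is not reproduced here.

```typescript
import { Module } from '@nestjs/common';
import { BullModule } from '@nestjs/bullmq';

// Root module: registers one shared Redis connection for every queue.
// REDIS_HOST / REDIS_PORT are assumed variable names, not taken from the codebase.
@Module({
  imports: [
    BullModule.forRootAsync({
      useFactory: () => ({
        connection: {
          host: process.env.REDIS_HOST ?? 'localhost',
          port: Number(process.env.REDIS_PORT ?? 6379),
        },
      }),
    }),
  ],
})
export class AppModule {}

// Feature module: registers its own queue, named per the domain.
// NotificationsModule is a hypothetical module used only for illustration.
@Module({
  imports: [BullModule.registerQueue({ name: 'notifications.send' })],
})
export class NotificationsModule {}
```

Keeping the connection in the root module and the queue names in feature modules means a queue's owner is visible from the module that registers it.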

Redis also serves as the Socket.IO adapter (multi-instance WebSocket presence) and as the Spotter agent’s caching/rate-limit store. One Redis, multiple use cases.

Consequences

Positive

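  • Durable: jobs are persisted in Redis, so work interrupted by a worker crash is picked up again instead of being lost.
  • Retries with exponential backoff are built in and configured per job (the attempts and backoff options).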
  • Scheduling (delayed, repeated, cron) is one library, one mental model.
  • bull-board gives an operator-friendly view of stuck queues, failed jobs.
  • Reusing Redis for sockets + cache reduces infrastructure count.
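
To make the retry bullet concrete, here is a small sketch of the delays an exponential backoff produces. It assumes the formula BullMQ documents for its built-in exponential strategy, 2^(attempts made - 1) × delay, and uses illustrative numbers (three attempts, a 1000 ms base delay) that are not taken from this codebase. The helper names are hypothetical.

```typescript
// Job options in the shape BullMQ accepts for retries.
// The concrete numbers below are assumptions for illustration.
interface RetryOptions {
  attempts: number; // total tries, including the first run
  backoff: { type: 'exponential'; delay: number }; // base delay in ms
}

const defaultRetry: RetryOptions = {
  attempts: 3,
  backoff: { type: 'exponential', delay: 1000 },
};

// Delay before the retry that follows failure number `attemptsMade` (1-based).
function exponentialDelayMs(attemptsMade: number, baseDelayMs: number): number {
  return Math.round(Math.pow(2, attemptsMade - 1) * baseDelayMs);
}

// Delays between consecutive tries; no delay follows the final attempt,
// because the job moves to the failed state at that point.
function retrySchedule(opts: RetryOptions): number[] {
  const delays: number[] = [];
  for (let failed = 1; failed < opts.attempts; failed++) {
    delays.push(exponentialDelayMs(failed, opts.backoff.delay));
  }
  return delays;
}
```

With these assumed options, retrySchedule(defaultRetry) yields two waits, 1000 ms and then 2000 ms, before the third and final attempt.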

Negative

  • Single Redis = single point of failure. Mitigation: use a managed Redis with a sensible SLA.
  • Coupling: a Redis outage takes out jobs, sockets, and the Spotter rate-limiter at once.
  • Job code must be idempotent. BullMQ retries on transient failures, so a job whose side effect succeeded but whose DB write didn’t commit will run again; assuming each job runs exactly once is the wrong default.
  • Cron jobs default to off (CRONS_ENABLED=false) so dev environments don’t burn DB compute. Operational gotcha when troubleshooting: make dev-up doesn’t enable crons.
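
This document does not show how the CRONS_ENABLED gate is implemented. One way such a gate could work is sketched below: the flag is parsed once and each cron handler is wrapped so that it becomes a no-op when the flag is off. The cronsEnabled and gateCron helpers are hypothetical names, not functions from the codebase.

```typescript
type EnvVars = Record<string, string | undefined>;

// Crons stay off unless the flag is explicitly "true" (case-insensitive),
// which matches the documented default of CRONS_ENABLED=false.
function cronsEnabled(env: EnvVars): boolean {
  return (env.CRONS_ENABLED ?? 'false').trim().toLowerCase() === 'true';
}

// Wraps a cron handler so it is skipped when crons are disabled.
// The returned function reports whether the handler actually ran.
function gateCron(env: EnvVars, handler: () => void): () => boolean {
  return () => {
    if (!cronsEnabled(env)) return false; // skipped
    handler();
    return true; // ran
  };
}
```

Returning whether the handler ran makes the "why did my cron not fire in dev" case easy to log: a skipped run is distinguishable from a run that did nothing.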

Discipline

  • Every job must be idempotent. A second run with the same input must not double-charge, double-send, or double-allocate.
  • Don’t queue ad-hoc. New queues are registered in the relevant module, named per the domain (forms.pdf-render, notifications.send, etc.).
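  • Long-running jobs must keep their job lock alive. BullMQ has no visibility timeout as such; the closest equivalent is the worker’s lock duration (lockDuration). The worker renews the lock automatically while the event loop is free, but a handler that blocks for longer than the lock duration is treated as stalled and may run again.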
  • Failures are observable. A job that fails three times moves to failed; bull-board surfaces it.
  • Don’t let CI run scheduled cron jobs by surprise. CI sets CRONS_ENABLED=true only because some tests exercise the cron handler directly via /testing/* endpoints.
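
The idempotency rule above can be illustrated with a minimal sketch, assuming each job carries a stable identifier from which an idempotency key can be derived. The in-memory ledger stands in for a database table with a unique constraint on that key; ChargeJobData, bookingId, InMemoryLedger, and handleChargeJob are all hypothetical names used for illustration.

```typescript
// Hypothetical job payload; bookingId is the stable input that identifies the charge.
interface ChargeJobData {
  bookingId: string;
  amountCents: number;
}

// Stand-in for a table with a unique constraint on the idempotency key.
class InMemoryLedger {
  private readonly applied: Record<string, number> = {};

  // Records the charge only if the key is new; returns whether it was applied.
  applyOnce(key: string, amountCents: number): boolean {
    if (Object.prototype.hasOwnProperty.call(this.applied, key)) return false;
    this.applied[key] = amountCents;
    return true;
  }

  totalCents(): number {
    return Object.keys(this.applied).reduce((sum, k) => sum + this.applied[k], 0);
  }
}

// A retried job with the same input hits the same key and becomes a no-op.
function handleChargeJob(ledger: InMemoryLedger, data: ChargeJobData): 'charged' | 'skipped' {
  const key = `charge:${data.bookingId}`;
  return ledger.applyOnce(key, data.amountCents) ? 'charged' : 'skipped';
}
```

Running the handler twice with the same payload charges once and skips the second time, so the ledger total stays at the original amount. In a real handler the check and the write would need to happen atomically, for example inside one database transaction.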