ADR-0009: Background jobs via BullMQ on Redis

Status: Accepted
Date: ~2026-01 (estimate)
Context owner: Owner

Context

Several flows can’t happen in the request/response cycle:

Sending invitation emails after a member signs up.
Generating PDFs for signed compliance forms (Puppeteer renders take seconds).
Reconciling payment provider statuses on webhook receipt.
Cron-driven workflows: pre-charge reminders, no-show sweeps, AI storage snapshots, billing retry, embeddings enrichment.
Periodic maintenance for the Spotter agent (cache eviction, daily usage rollups).

We need durable queues with retries, scheduled jobs (cron-equivalent), and observability.

Decision

Use BullMQ on Redis.

BullModule.forRootAsync registers a global connection in apps/api/src/app/app.module.ts.
Per-module queues are registered via BullModule.registerQueue({ name: '...' }).
The bull-board UI is mounted at /admin/bull-board for operational visibility (apps/api/src/bull-board/). Access gated to platform admins.
The CRONS_ENABLED env flag gates @Cron decorators globally; it is kept off in dev to spare the Neon free-tier compute.
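
The registration described above might look roughly like the following. This is a sketch, not the contents of app.module.ts: it assumes the @nestjs/bullmq and @nestjs/config packages, and the REDIS_HOST / REDIS_PORT variable names and the FormsModule class are hypothetical. Only the queue name forms.pdf-render comes from this ADR.

```typescript
// Sketch only. Assumes @nestjs/bullmq and @nestjs/config are installed.
// REDIS_HOST, REDIS_PORT and FormsModule are hypothetical names.
import { Module } from '@nestjs/common';
import { ConfigModule, ConfigService } from '@nestjs/config';
import { BullModule } from '@nestjs/bullmq';

// Root module: one shared Redis connection used by every queue.
@Module({
  imports: [
    ConfigModule.forRoot(),
    BullModule.forRootAsync({
      imports: [ConfigModule],
      inject: [ConfigService],
      useFactory: (config: ConfigService) => ({
        connection: {
          host: config.get<string>('REDIS_HOST', 'localhost'),
          port: Number(config.get<string>('REDIS_PORT', '6379')),
        },
      }),
    }),
  ],
})
export class AppModule {}

// Feature module: registers its own queue, named per the domain.
@Module({
  imports: [BullModule.registerQueue({ name: 'forms.pdf-render' })],
})
export class FormsModule {}
```

Keeping the connection in the root module and the queue names in feature modules means a new queue is a one-line change in the module that owns the work.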

Redis also serves as the Socket.IO adapter (multi-instance WebSocket presence) and as the Spotter agent’s caching/rate-limit store. One Redis, multiple use cases.

Consequences

Positive

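Durable: job state lives in Redis, so a job held by a crashed worker is detected as stalled and picked up again.
Retries with exponential backoff are built in (opt-in per job via the attempts and backoff options).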
Scheduling (delayed, repeated, cron) is one library, one mental model.
bull-board gives an operator-friendly view of stuck queues and failed jobs.
Reusing Redis for sockets + cache reduces infrastructure count.
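
To make the retry behaviour concrete, here is a small sketch of an exponential schedule. The options object uses the attempts and backoff fields that BullMQ job options accept, but the numbers are illustrative, and the helper functions are hypothetical: they show the general doubling pattern, not BullMQ's internal code, which may differ by version (for example by adding jitter).

```typescript
// Job options in the shape BullMQ accepts; the values are illustrative.
const retryOptions = {
  attempts: 3,
  backoff: { type: 'exponential' as const, delay: 1000 },
};

// Wait before the retry that follows failed attempt `attemptsMade` (1-based).
function exponentialDelay(baseMs: number, attemptsMade: number): number {
  return Math.round(baseMs * 2 ** (attemptsMade - 1));
}

// All waits for a job that fails every attempt.
// With N attempts there are N - 1 retries, hence N - 1 waits.
function retrySchedule(opts: typeof retryOptions): number[] {
  const delays: number[] = [];
  for (let failed = 1; failed < opts.attempts; failed++) {
    delays.push(exponentialDelay(opts.backoff.delay, failed));
  }
  return delays;
}

// Three attempts give two waits: 1000 ms, then 2000 ms.
console.log(retrySchedule(retryOptions));
```

After the last attempt fails there is no further wait: the job lands in the failed set.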

Negative

Single Redis = single point of failure. Mitigation: use a managed Redis with a sensible SLA.
Coupling: a Redis outage takes out jobs, sockets, and the Spotter rate-limiter at once.
Job code must be idempotent: BullMQ retries on transient failures, so a job can run again after its side effect happened but before its DB write committed. Handlers cannot assume exactly-once execution.
Cron jobs default to off (CRONS_ENABLED=false) so dev environments don’t burn DB compute. Operational gotcha when troubleshooting: make dev-up doesn’t enable crons.
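
The default-off behaviour can be pictured with a small guard. This ADR does not say how the gate is implemented, so the following is only one plausible shape: cronsEnabled and gateCron are hypothetical helpers, and the environment is passed in as a plain object to keep the sketch self-contained.

```typescript
type Env = Record<string, string | undefined>;

// Crons run only when the flag is exactly "true"; unset or any other value means off.
function cronsEnabled(env: Env): boolean {
  return env.CRONS_ENABLED === 'true';
}

// Wraps a cron handler so that it becomes a no-op while the flag is off.
function gateCron<T>(env: Env, handler: () => T): () => T | undefined {
  return () => (cronsEnabled(env) ? handler() : undefined);
}

// Usage: a hypothetical no-show sweep that stays silent in dev.
const devEnv: Env = { CRONS_ENABLED: 'false' };
const noShowSweep = gateCron(devEnv, () => 'swept');
console.log(noShowSweep()); // undefined, because the flag is off
```

In a real service the env argument would be process.env, and the wrapped function would be the body of the @Cron-decorated method. When a scheduled job appears to do nothing locally, checking this flag is the first step.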

Discipline

Every job must be idempotent. A second run with the same input must not double-charge, double-send, or double-allocate.
Don’t queue ad-hoc. New queues are registered in the relevant module, named per the domain (forms.pdf-render, notifications.send, etc.).
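Long-running jobs must keep their job lock alive. BullMQ has no visibility timeout; the worker renews each active job's lock (lockDuration) automatically, but a handler that blocks the event loop past that window is treated as stalled and may run again.
Failures are observable. A job that fails all three of its attempts moves to the failed set, where bull-board surfaces it.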
Don’t let CI run scheduled cron jobs by surprise. CI sets CRONS_ENABLED=true only because some tests exercise the cron handler directly via /testing/* endpoints.
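
The idempotency rule at the top of this list can be sketched as a claim on a key derived from the job's input. Everything here is hypothetical: ProcessedKeys is an in-memory stand-in for a durable store (such as a table with a unique constraint), and the ChargeJob fields are made up for illustration.

```typescript
// In-memory stand-in for a durable store with a unique key.
class ProcessedKeys {
  private readonly seen = new Set<string>();

  // Returns true only for the first caller of a given key.
  claim(key: string): boolean {
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }
}

interface ChargeJob {
  bookingId: string;
  amountCents: number;
}

// Handler: the same input charges at most once, however often it is retried.
function handleCharge(
  store: ProcessedKeys,
  ledger: ChargeJob[], // stand-in for the real side effect
  job: ChargeJob,
): 'charged' | 'skipped' {
  const key = `charge:${job.bookingId}`;
  if (!store.claim(key)) return 'skipped';
  ledger.push(job);
  return 'charged';
}

// Usage: a retry of the same job is a no-op.
const demoStore = new ProcessedKeys();
const demoLedger: ChargeJob[] = [];
const demoJob: ChargeJob = { bookingId: 'b-1', amountCents: 2500 };
console.log(handleCharge(demoStore, demoLedger, demoJob)); // charged
console.log(handleCharge(demoStore, demoLedger, demoJob)); // skipped
```

In production the claim and the side effect need to commit together (for example in one database transaction, or via the payment provider's idempotency key). Otherwise a crash between the two leaves a claimed key with no effect behind it.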

Related

architecture/observability.md: bull-board access
features/notifications/README.md
features/payments/behavior.md: webhook idempotency