Living documentation — last reviewed 2026-05-28

Spotter Agent — QA Plan

This plan exercises each tool router, every confirmation/disambiguation/undo path, the cost cap, the prompt cache, and observability. Prompts assume Hebrew or English locale unless noted.

Setup

Org on tier pro (cap $5/day). Owner role.
Seed data: 5 programs across delivery modes (schedule, feed, coaching), 30 members, 100 exercises in library, 10 workouts, 5 class types.

Read-only prompts

Prompt	Expected tools	Notes
“who am I?”	`read.get_current_context`	Returns org + caller identity.
“what programs do we run?”	`read.programs_list`	Lists active programs.
“find Saar”	`read.search_anything` or `read.members_search`	Returns matches.
“find a squat variation”	`read.exercises_search`	Hybrid search results.
“any new comments in coaching?”	`read.workout_comments_in_program`	Resolves programId via `programs_list` first.
“what did Dani say on Monday’s WOD?”	`read.workout_comments_for_assignment`	Resolves assignmentId first.
“show me the program with id ”	`read.lookup_by_id`	Type=`program`.

Disambiguation flow

Step	Expected
Have two members named “Saar”. Prompt: “give Saar a workout for tomorrow”.	Agent emits `read.ask_user_to_pick { kind: 'member', candidates: [...] }`; UI shows the picker card.
Click candidate 1	SSE `tool_completed` with `output: { pickedId, pickedLabel, kind }`. Loop resumes; agent proceeds with the assignment.
Click an id not in candidates	`400 / invalid_pick`.
Submit pick twice for the same `toolUseId`	Second call returns `tool_already_resolved`.
Type a free-text reply instead of clicking	Replay-integrity violation logged; auto-repaired synthetic tool_result; loop continues.
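
The pick-validation rows above can be sketched as a small pure function. This is a minimal illustration, not the service's actual code: the `PendingPick` shape, the `resolvePick` name, and the order of the two checks are assumptions. Only the error codes, the 400 status for `invalid_pick`, and the output fields come from the table.

```typescript
// Hypothetical shapes for illustration; only the error codes, the 400 status,
// and the output fields (pickedId, pickedLabel, kind) come from the plan.
interface Candidate {
  id: string;
  label: string;
}

interface PendingPick {
  toolUseId: string;
  kind: string; // e.g. 'member'
  candidates: Candidate[];
  resolved: boolean;
}

type PickResult =
  | { ok: true; output: { pickedId: string; pickedLabel: string; kind: string } }
  | { ok: false; code: "invalid_pick" | "tool_already_resolved"; status?: number };

function resolvePick(pending: PendingPick, pickedId: string): PickResult {
  // A second submission for the same toolUseId is refused (check order is assumed).
  if (pending.resolved) {
    return { ok: false, code: "tool_already_resolved" };
  }
  // The picked id must be one of the candidates the agent offered.
  const match = pending.candidates.find((c) => c.id === pickedId);
  if (!match) {
    return { ok: false, code: "invalid_pick", status: 400 };
  }
  pending.resolved = true;
  return {
    ok: true,
    output: { pickedId: match.id, pickedLabel: match.label, kind: pending.kind },
  };
}
```

A QA script could drive the three click cases (unknown id, valid id, repeated id) through one pending pick and compare the codes against the table.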

Confirmation flow

Destructive (single)

Prompt	Expected
“delete the workout titled ‘Murph 2.0’”	Agent resolves the id, emits `workouts.delete`; orchestrator suspends with `confirmation_pending { confirm: 'destructive' }`. UI shows Approve/Reject.
Approve	Tool runs; `tool_completed { ok: true }`. Workout soft-deleted. Audit row written with `metadata.agent: true`.
Reject	Status flips to `rejected_by_user`; agent receives the error `tool_result` and acknowledges.

Always-confirm

Prompt	Expected
“publish all draft sessions for next week”	Agent enumerates sessions, emits `class_sessions.bulk_publish` with `confirm: 'always'` → confirmation card.
Approve	Bulk-publish executes; resource ids logged.

Multi-tool turn

Step	Expected
Prompt that triggers two destructive tools in one assistant message	Both persisted as `pending`; UI shows two confirmation cards.
Approve first	Tool runs; remaining pending count > 0 → SSE `done` with usage 0; loop waits.
Approve second	Tool runs; merge phase appends one unified tool_result; loop resumes.
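
The wait-then-merge behaviour in the table above can be sketched as follows. The types, the `afterConfirmation` name, and the `succeeded` status value are hypothetical. Reading "one unified tool_result" as a single user message carrying one `tool_result` block per tool call is an interpretation, not something the plan states.

```typescript
// Hypothetical model of a multi-tool turn: the loop resumes only once every
// tool call from the assistant message has left the 'pending' state.
type ExecStatus = "pending" | "succeeded" | "failed" | "rejected_by_user";

interface ToolExecution {
  toolUseId: string;
  status: ExecStatus;
  resultText?: string;
}

interface ToolResultBlock {
  type: "tool_result";
  tool_use_id: string;
  content: string;
  is_error: boolean;
}

type NextStep =
  | { next: "wait"; pendingCount: number }
  | { next: "resume"; message: { role: "user"; content: ToolResultBlock[] } };

function afterConfirmation(executions: ToolExecution[]): NextStep {
  const pendingCount = executions.filter((e) => e.status === "pending").length;
  if (pendingCount > 0) {
    // Other confirmation cards are still open: end this stream and wait.
    return { next: "wait", pendingCount };
  }
  // Merge phase: one user message with a tool_result block per tool call.
  const content = executions.map(
    (e): ToolResultBlock => ({
      type: "tool_result",
      tool_use_id: e.toolUseId,
      content: e.resultText ?? "",
      is_error: e.status !== "succeeded",
    }),
  );
  return { next: "resume", message: { role: "user", content } };
}
```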

Undo

Prompt	Expected
“create a workout titled ‘Test’”	Non-destructive write executes inline. UI shows Undo toast for 15s. `inverseAvailable: true` on the card.
Click Undo within 15s	`POST .../undo/:toolUseId` invokes `workouts.delete` with `{ id: <newWorkoutId> }`. Audit rows for both.
Click Undo on a tool with no declared inverse	`422 { code: 'no_inverse' }`.
Click Undo on a `failed` tool	`422 { code: 'not_succeeded' }`.
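
One way to picture the undo rules above is a registry lookup that either yields the inverse call or one of the two 422 codes. Everything here is an assumed shape for illustration: the `ToolDef` and `Execution` types, the `planUndo` and `inverseAvailable` helpers, the `example.no_inverse` tool, and the order in which the two failure cases are checked.

```typescript
// Hypothetical registry sketch: a tool may declare an inverse that maps its
// output to the call that undoes it.
interface InverseCall {
  tool: string;
  input: Record<string, unknown>;
}

interface ToolDef {
  name: string;
  inverse?: (output: Record<string, unknown>) => InverseCall;
}

interface Execution {
  toolUseId: string;
  tool: string;
  status: "succeeded" | "failed" | "pending";
  output: Record<string, unknown>;
}

type UndoPlan =
  | { ok: true; call: InverseCall }
  | { ok: false; status: 422; code: "no_inverse" | "not_succeeded" };

const registry = new Map<string, ToolDef>([
  [
    "workouts.create",
    {
      name: "workouts.create",
      // Undoing a create deletes the workout that was just created.
      inverse: (out) => ({ tool: "workouts.delete", input: { id: out["id"] } }),
    },
  ],
  // Made-up tool name, used only to exercise the no_inverse branch.
  ["example.no_inverse", { name: "example.no_inverse" }],
]);

function inverseAvailable(tool: string, tools: Map<string, ToolDef>): boolean {
  return tools.get(tool)?.inverse !== undefined;
}

function planUndo(exec: Execution, tools: Map<string, ToolDef>): UndoPlan {
  // Assumed order: status first, then the inverse lookup.
  if (exec.status !== "succeeded") {
    return { ok: false, status: 422, code: "not_succeeded" };
  }
  const inverse = tools.get(exec.tool)?.inverse;
  if (!inverse) {
    return { ok: false, status: 422, code: "no_inverse" };
  }
  return { ok: true, call: inverse(exec.output) };
}
```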

Workout build (structured)

Prompt	Expected sequence
“7 ROUNDS FOR TIME: 200m run, 5 HSPU, 10 KB swing, 15 air squat”	1) `read.exercises_resolve_batch({ names: [...] })`. 2) `workouts.create({ mode: 'structured', title: '7 Rounds For Time', scoring: 'time' })`. 3) `workouts.set_sections(...)` with 1 section `shape: 'rounds', config: { rounds: 7 }` and 4 movements; each rep-counted movement carries `prescription.reps`.
Fran	`shape: 'rep_scheme', config: { repsScheme: [21,15,9] }`; movements have `load.kind = 'absolute', value: 43, unit: 'kg'`; no per-movement reps.
“5x5 back squat at 80% 1RM”	`shape: 'linear'`; one movement `prescription = { sets: 5, reps: { kind: 'fixed', value: 5 }, load: { kind: 'percent_1rm', value: 80 } }`.
Same prompt on a `lite` tier org (no `workout_builder` feature)	Agent emits one `workouts.create` with `mode: 'freeform'` and the body in `description`. No `set_sections` call.
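
To make the expected payload shapes easier to assert on, the sketch below models the three section shapes from the table as a discriminated union and adds a small checker. The field names inside `prescription` and `config` follow the table. The `Movement.name` field, the `lb` unit, the empty prescription on the run, and the `sectionProblems` helper are hypothetical, and the real `workouts.set_sections` schema may differ.

```typescript
// Hypothetical typing of the section payloads named in the table above.
type Reps = { kind: "fixed"; value: number };
type Load =
  | { kind: "absolute"; value: number; unit: "kg" | "lb" }
  | { kind: "percent_1rm"; value: number };

interface Prescription {
  sets?: number;
  reps?: Reps;
  load?: Load;
}

interface Movement {
  name: string; // illustrative; a real payload likely references an exercise id
  prescription: Prescription;
}

type Section =
  | { shape: "rounds"; config: { rounds: number }; movements: Movement[] }
  | { shape: "rep_scheme"; config: { repsScheme: number[] }; movements: Movement[] }
  | { shape: "linear"; movements: Movement[] };

// Returns human-readable problems; an empty array means the section passes.
function sectionProblems(section: Section): string[] {
  const problems: string[] = [];
  if (section.shape === "rounds" && section.config.rounds < 1) {
    problems.push("rounds must be at least 1");
  }
  if (section.shape === "rep_scheme") {
    // Reps come from config.repsScheme, so movements must not repeat them.
    for (const m of section.movements) {
      if (m.prescription.reps) problems.push(`${m.name}: unexpected per-movement reps`);
    }
  }
  return problems;
}

const sevenRounds: Section = {
  shape: "rounds",
  config: { rounds: 7 },
  movements: [
    { name: "run", prescription: {} }, // distance-based, no reps
    { name: "HSPU", prescription: { reps: { kind: "fixed", value: 5 } } },
    { name: "KB swing", prescription: { reps: { kind: "fixed", value: 10 } } },
    { name: "air squat", prescription: { reps: { kind: "fixed", value: 15 } } },
  ],
};

const fiveByFive: Section = {
  shape: "linear",
  movements: [
    {
      name: "back squat",
      prescription: {
        sets: 5,
        reps: { kind: "fixed", value: 5 },
        load: { kind: "percent_1rm", value: 80 },
      },
    },
  ],
};
```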

Programs / scheduling routing

Prompt	Expected
“schedule open gym tomorrow 13:00–14:00”	Agent recognizes time window + “schedule” intent. Sequence: `read.programs_list` → `class_types.list_for_program({ programId })` → `class_sessions.create({ classTypeId, startsAt, endsAt, status: 'published' })`.
“assign Murph to everyone in CrossFit next Friday” on a `schedule` program	Agent re-routes from `assignments.assign_personal` to `class_sessions.create` (per the system prompt’s “schedule vs assign” rules).
“assign Murph to Dani next Friday” on a `coaching` program	`read.exercises_resolve_batch` (if needed) → `workouts.create` → `assignments.assign_personal({ userIds: [...], date: ..., workoutId })`.
“publish today’s workout to the feed track ‘Daily WOD’”	`workouts.create` + assignments scoped to the feed program.

Analytics

Prompt	Expected tool
“what’s our MRR?”	`analytics.revenue_summary`
“show last 12 months revenue”	`analytics.revenue_trend`
“who’s at risk?”	`analytics.at_risk_members`
“what should I look at right now?”	`analytics.org_insights` (delegates to `InsightsService`)
“how busy were classes last 30 days?”	`analytics.class_utilization`
“how many active members?”	`analytics.members_summary`

Tasks

Prompt	Expected
“create a task: ‘follow up with Dani’ due tomorrow”	`tasks.create`; Undo affordance present.
“what’s overdue?”	`tasks.list` with overdue filter.
“delete task X”	`tasks.delete` (destructive — confirmation card).
“mark task X done”	`tasks.complete`.

Program templates

Prompt	Expected
“apply the ‘6-week strength’ template to ‘Daily WOD’ starting next Monday”	`program_templates.list` → `program_templates.apply` (confirm: always → confirmation card).
“build a template from the last 4 weeks of Daily WOD”	`program_templates.from_history` (confirm: always).

Forms

Prompt	Expected
“who hasn’t signed the new תקנון?”	`forms.list_pending_for_org`.
“what’s Saar’s compliance status?”	`forms.compliance_status_for_member`.

Per-org daily $ cap

Step	Expected
Send turns until org spend reaches the tier cap (e.g. $5 for pro)	Next send returns SSE `error { code: 'agent_budget_exceeded' }` with a localized upgrade prompt. UI surfaces the upsell modal.
`GET /agent/usage` while over cap	`{ remainingUsdMicros: 0, percentUsed: 1 }`.
UTC midnight passes	Cap resets; next send succeeds.
Tier set to `-1` (unmetered)	`preCheck` returns `ok: true` immediately.
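
The cap checks above reduce to a little arithmetic in USD micros. The sketch below is an assumption-laden illustration: the `preCheck` signature, the `usage` and `utcDayKey` helpers, and treating `percentUsed` as a fraction clamped to 1 are guesses based on the expected values in the table, not the real service code.

```typescript
// Hypothetical budget helpers; amounts are tracked in millionths of a dollar.
const MICROS_PER_USD = 1_000_000;
const UNMETERED = -1;

type PreCheck = { ok: true } | { ok: false; code: "agent_budget_exceeded" };

// Spend is bucketed per UTC calendar day, so the cap resets at UTC midnight.
function utcDayKey(now: Date): string {
  return now.toISOString().slice(0, 10); // toISOString is always UTC
}

function preCheck(capUsd: number, spentUsdMicros: number): PreCheck {
  if (capUsd === UNMETERED) return { ok: true }; // unmetered tier short-circuits
  if (spentUsdMicros >= capUsd * MICROS_PER_USD) {
    return { ok: false, code: "agent_budget_exceeded" };
  }
  return { ok: true };
}

// Intended for metered tiers only (capUsd > 0).
function usage(
  capUsd: number,
  spentUsdMicros: number,
): { remainingUsdMicros: number; percentUsed: number } {
  const capMicros = capUsd * MICROS_PER_USD;
  return {
    remainingUsdMicros: Math.max(0, capMicros - spentUsdMicros),
    percentUsed: Math.min(1, spentUsdMicros / capMicros),
  };
}
```

For the pro tier in this plan, `usage(5, 6_000_000)` evaluates to `{ remainingUsdMicros: 0, percentUsed: 1 }`, matching the over-cap row.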

Role-aware upsell

Step	Expected
Member role hits the composer	Send → 403 `Agent chat is staff-only in this release.`
Owner at 80% usage	Composer footer shows soft upsell badge.
Owner at 100% usage	Composer is blocked; modal opens on send attempt.
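
The three rows above amount to a small decision function on the client. This sketch uses hypothetical names (`composerState`, the state labels) and assumes the thresholds are inclusive (80% or more, 100% or more), which the plan does not spell out.

```typescript
// Hypothetical mapping from caller role and usage fraction to composer behaviour.
type Role = "owner" | "member";
type ComposerState = "forbidden" | "open" | "soft_upsell" | "blocked";

function composerState(role: Role, percentUsed: number): ComposerState {
  if (role === "member") return "forbidden"; // send returns 403 (staff-only)
  if (percentUsed >= 1) return "blocked"; // modal opens on send attempt
  if (percentUsed >= 0.8) return "soft_upsell"; // badge in the composer footer
  return "open";
}
```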

Prompt caching

Step	Expected
First turn of a brand-new conversation	`cache_creation_tokens` > 0 on the static system + org blocks. `cache_hit_ratio` low (~0.2).
Second turn within the same conversation, ~1 min later	`cache_read_tokens > 0`; ratio rises.
Steady-state (5+ turns)	`cacheHitRatio` between 0.85–0.95 as logged in `agent.turn.completed`.
Quiet period > 1 hour	Cache TTL expires; the next turn re-creates the cache (125% cost spike), then steady state resumes.
Inspect `withCachedTail` is applied	The trailing message’s last content block has `cache_control: { type: 'ephemeral' }`.
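
For the last row, a tester needs to know what "applied" looks like on the outgoing request. The sketch below is a hypothetical reimplementation of a `withCachedTail`-style helper, not the project's function. It also includes one plausible definition of the hit ratio. The formula (cached reads over all input tokens) and the `TokenUsage` field names are assumptions, and the logged `cacheHitRatio` may be computed differently.

```typescript
// Hypothetical message shapes, reduced to what the check needs.
interface ContentBlock {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
}

interface ChatMessage {
  role: "user" | "assistant";
  content: ContentBlock[];
}

// Returns a copy in which the last content block of the trailing message
// carries the ephemeral cache marker; the input array is left untouched.
function withCachedTail(messages: ChatMessage[]): ChatMessage[] {
  const last = messages[messages.length - 1];
  if (!last || last.content.length === 0) return messages;
  const lastIndex = last.content.length - 1;
  const content = last.content.map((block, i) =>
    i === lastIndex ? { ...block, cache_control: { type: "ephemeral" as const } } : block,
  );
  return [...messages.slice(0, -1), { ...last, content }];
}

interface TokenUsage {
  cacheReadTokens: number;
  cacheCreationTokens: number;
  uncachedInputTokens: number;
}

// Assumed definition: share of input tokens that were served from cache.
function cacheHitRatio(u: TokenUsage): number {
  const total = u.cacheReadTokens + u.cacheCreationTokens + u.uncachedInputTokens;
  return total === 0 ? 0 : u.cacheReadTokens / total;
}
```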

Observability — PostHog / Sentry

Step	Expected
Send a normal turn	PostHog events: `agent.turn.started` + `agent.turn.completed` with `cacheHitRatio`, `costUsdMicros`, etc. No raw user text in props.
Tool succeeds	`agent.tool.executed` with `outcome: 'success'`.
Tool fails	`agent.tool.executed` with `outcome: 'failure'` + `errorCode`.
Destructive confirmation gets emitted	`agent.tool.confirmation_pending`.
Picker gets emitted	`agent.tool.disambiguation_pending`.
Provider 529 from Anthropic	`agent.turn.failed` with `errorCode: 'provider_overloaded'`. Sentry breadcrumb `agent` with `level: error`. SSE error carries `traceId` for QA.
Replay violation (synthesized by forcing an orphan `tool_use`)	`agent.replay.violation` with `violations: N`, `repaired: true`.
Daily 03:00 cron	`agent.storage.snapshot` with row + bytes per table.

Replay integrity

Step	Expected
Force a conversation to have a pending `tool_use` whose `tool_result` is missing	Pre-loop `validateReplay` returns `{ ok: false }`; `validateAndRepairReplay` injects synthetic error `tool_result`. Loop continues without aborting.
Same conversation after repair	Next user turn proceeds normally; replay integrity green.
Force unrecoverable violation (`ReplayIntegrityError`)	SSE `error { code: 'replay_integrity_violation' }` with copy “conversation got into an inconsistent state and was reset”.
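
The repair path in the first row can be sketched as a scan for `tool_use` blocks with no matching `tool_result`, followed by injection of a synthetic error result. This is a simplified illustration with hypothetical names (`repairOrphanToolUses`, `RepairOutcome`), not the behaviour of `validateAndRepairReplay` itself. It places the synthetic results in a new user message directly after the offending assistant message, which is one possible placement.

```typescript
// Hypothetical content model, limited to the blocks relevant to replay checks.
type Block =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string }
  | { type: "tool_result"; tool_use_id: string; content: string; is_error?: boolean };

interface ReplayMessage {
  role: "user" | "assistant";
  content: Block[];
}

interface RepairOutcome {
  messages: ReplayMessage[];
  violations: number;
  repaired: boolean;
}

function repairOrphanToolUses(messages: ReplayMessage[]): RepairOutcome {
  // Collect every tool_use id that already has a result somewhere in history.
  const answered = new Set<string>();
  for (const m of messages) {
    for (const b of m.content) {
      if (b.type === "tool_result") answered.add(b.tool_use_id);
    }
  }

  const out: ReplayMessage[] = [];
  let violations = 0;
  for (const m of messages) {
    out.push(m);
    if (m.role !== "assistant") continue;
    const synthetic: Block[] = [];
    for (const b of m.content) {
      if (b.type === "tool_use" && !answered.has(b.id)) {
        synthetic.push({
          type: "tool_result",
          tool_use_id: b.id,
          content: "Tool call was never resolved.",
          is_error: true,
        });
      }
    }
    if (synthetic.length > 0) {
      violations += synthetic.length;
      // Inject the synthetic error results right after the assistant message.
      out.push({ role: "user", content: synthetic });
    }
  }
  return { messages: out, violations, repaired: violations > 0 };
}
```

Running the function a second time on its own output should report zero violations, which mirrors the "replay integrity green" expectation for the next turn.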

Compaction

Step	Expected
Conversation < 24 messages past last anchor	`compactIfNeeded` is a no-op.
Conversation >= 24 messages	Haiku summarizer runs within a 3s budget; persists `system_note` with `pageContext.summarizedThroughMessageId`. Subsequent `listMessagesForReplay` returns the system_note + only messages after it.
Summarizer times out	Logged warning; conversation continues with full history this turn; retry on next turn.
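
The threshold and replay rules above can be modelled without the summarizer itself. In this sketch the `StoredMessage` shape and the helper names are hypothetical, `summarizedThroughMessageId` is flattened onto the message instead of living under `pageContext`, and "messages after it" is read as "messages after the one the note summarizes through". The 3s summarizer budget is not modelled.

```typescript
// Hypothetical stored-message model for the compaction checks.
const COMPACT_AFTER = 24;

interface StoredMessage {
  id: string;
  kind: "user" | "assistant" | "system_note";
  // Stands in for pageContext.summarizedThroughMessageId on a system_note.
  summarizedThroughMessageId?: string;
}

function splitAtAnchor(messages: StoredMessage[]): { note?: StoredMessage; tail: StoredMessage[] } {
  let note: StoredMessage | undefined;
  for (const m of messages) {
    if (m.kind === "system_note") note = m; // the latest note is the anchor
  }
  const turns = messages.filter((m) => m.kind !== "system_note");
  if (!note) return { tail: turns };
  const anchorId = note.summarizedThroughMessageId;
  const through = turns.findIndex((m) => m.id === anchorId);
  // If the anchor id is not found, through is -1 and the full history is kept.
  return { note, tail: turns.slice(through + 1) };
}

// True once 24 or more messages have accumulated past the last anchor.
function needsCompaction(messages: StoredMessage[]): boolean {
  return splitAtAnchor(messages).tail.length >= COMPACT_AFTER;
}

// What replay would see: the summary note (if any) plus the messages after it.
function messagesForReplay(messages: StoredMessage[]): StoredMessage[] {
  const { note, tail } = splitAtAnchor(messages);
  return note ? [note, ...tail] : tail;
}
```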

Title generation

Step	Expected
First user message of a new conversation	Haiku-generated title persisted within 1.5s. UI updates.
Anthropic key not configured	No title; logged warning; conversation continues.
Second user message	Title generation is skipped (already set).

Conversation list & detail

Step	Expected
`GET /agent/conversations`	Returns caller’s conversations sorted by `last_message_at desc`.
`GET /agent/conversations/:id` for someone else’s conv in the same org	403 — `ConversationsService.getOwn` enforces ownership.
`GET /agent/conversations/:id` returns `messages[]` with `toolCalls[]` per assistant message	Each tool call has `inverseAvailable` filled from the registry.

Negative tests

`POST /agent/messages` as a member → 403.
`POST /agent/confirm/<bad-id>` → 200 SSE with `tool_execution_not_found`.
`POST /agent/confirm/<resolved-id>` → `tool_already_resolved`.
`POST /agent/pick/<id>` with id not in candidates → `invalid_pick`.
Send a 20_001-char message → 400 (Zod max).
Abort the SSE mid-stream → server cleanly aborts the Anthropic stream; assistant row reflects partial state.
`ANTHROPIC_API_KEY` unset → `agent_disabled` on send.

Performance

p95 time-to-first-token < 1500ms on a warm cache.
p95 turn latency on a structured workout build (resolve_batch + create + set_sections): < 8s.
Steady-state cache-hit ratio ≥ 0.85 (logged on every `agent.turn.completed`).

Localization

Send a prompt in Hebrew → agent replies in Hebrew. The static system prompt instructs locale parity.
Send a prompt in Russian on a ru-locale org → agent replies in Russian.
Switch locale mid-conversation → agent follows the new language from the next turn.
