Spotter Agent — QA Plan
This plan exercises each tool router, every confirmation/disambiguation/undo path, the cost cap, the prompt cache, and observability. Prompts assume Hebrew or English locale unless noted.
Setup
- Org on tier pro (cap $5/day). Owner role.
- Seed data: 5 programs across delivery modes (schedule, feed, coaching), 30 members, 100 exercises in library, 10 workouts, 5 class types.
Read-only prompts
| Prompt | Expected tools | Notes |
|---|---|---|
| “who am I?” | read.get_current_context | Returns org + caller identity. |
| “what programs do we run?” | read.programs_list | Lists active programs. |
| “find Saar” | read.search_anything or read.members_search | Returns matches. |
| “find a squat variation” | read.exercises_search | Hybrid search results. |
| “any new comments in coaching?” | read.workout_comments_in_program | Resolves programId via programs_list first. |
| “what did Dani say on Monday’s WOD?” | read.workout_comments_for_assignment | Resolves assignmentId first. |
| “show me the program with id | read.lookup_by_id | Type=program. |
Disambiguation flow
| Step | Expected |
|---|---|
| Have two members named “Saar”. Prompt: “give Saar a workout for tomorrow”. | Agent emits read.ask_user_to_pick { kind: 'member', candidates: [...] }; UI shows the picker card. |
| Click candidate 1 | SSE tool_completed with output: { pickedId, pickedLabel, kind }. Loop resumes; agent proceeds with the assignment. |
| Click an id not in candidates | 400 / invalid_pick. |
| Submit pick twice for the same toolUseId | Second call returns tool_already_resolved. |
| Type a free-text reply instead of clicking | Replay-integrity violation logged; auto-repaired synthetic tool_result; loop continues. |
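
The pick rules in the table above can be sketched as a small resolver. This is a minimal illustration, not the service's actual code: the `PendingPick` shape and the `resolvePick` name are hypothetical, and only the output fields and the two error codes come from this plan.

```typescript
// Hypothetical sketch of pick resolution; types and names are assumptions.
type Candidate = { id: string; label: string };

interface PendingPick {
  toolUseId: string;
  kind: string;
  candidates: Candidate[];
  resolved: boolean;
}

type PickResult =
  | { ok: true; output: { pickedId: string; pickedLabel: string; kind: string } }
  | { ok: false; code: "invalid_pick" | "tool_already_resolved" };

function resolvePick(pending: PendingPick, pickedId: string): PickResult {
  // A second submission for the same toolUseId must not resume the loop again.
  if (pending.resolved) return { ok: false, code: "tool_already_resolved" };

  // Only ids the agent actually offered are acceptable.
  const match = pending.candidates.find((c) => c.id === pickedId);
  if (!match) return { ok: false, code: "invalid_pick" };

  pending.resolved = true;
  return {
    ok: true,
    output: { pickedId: match.id, pickedLabel: match.label, kind: pending.kind },
  };
}
```

A rejected pick leaves the picker unresolved, so a tester can still click a valid candidate after triggering invalid_pick.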
Confirmation flow
Destructive (single)
| Prompt | Expected |
|---|---|
| “delete the workout titled ‘Murph 2.0’” | Agent resolves the id, emits workouts.delete; orchestrator suspends with confirmation_pending { confirm: 'destructive' }. UI shows Approve/Reject. |
| Approve | Tool runs; tool_completed { ok: true }. Workout soft-deleted. Audit row written with metadata.agent: true. |
| Reject | Status flips rejected_by_user; agent receives the error tool_result and acknowledges. |
Always-confirm
| Prompt | Expected |
|---|---|
| “publish all draft sessions for next week” | Agent enumerates sessions, emits class_sessions.bulk_publish with confirm: 'always' → confirmation card. |
| Approve | Bulk-publish executes; resource ids logged. |
Multi-tool turn
| Step | Expected |
|---|---|
| Prompt that triggers two destructive tools in one assistant message | Both persisted as pending; UI shows two confirmation cards. |
| Approve first | Tool runs; remaining pending count > 0 → SSE done with usage 0; loop waits. |
| Approve second | Tool runs; merge phase appends one unified tool_result; loop resumes. |
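
The wait-then-merge behavior above can be sketched as follows. This is a hedged illustration: `ToolExecution`, `NextStep`, and `afterDecision` are hypothetical names, and reading "one unified tool_result" as a single user message carrying one tool_result block per tool call is an assumption.

```typescript
// Hypothetical shapes; only the status names and the tool_result block fields
// are taken from this plan and the Anthropic Messages API.
type Status = "pending" | "succeeded" | "failed" | "rejected_by_user";

interface ToolExecution {
  toolUseId: string;
  status: Status;
  output: string; // serialized result or error text; empty while pending
}

interface ToolResultBlock {
  type: "tool_result";
  tool_use_id: string;
  is_error: boolean;
  content: string;
}

type NextStep =
  | { action: "wait"; pendingCount: number }
  | { action: "resume"; message: { role: "user"; content: ToolResultBlock[] } };

function afterDecision(executions: ToolExecution[]): NextStep {
  // While any card is still undecided, the loop must not resume.
  const pendingCount = executions.filter((e) => e.status === "pending").length;
  if (pendingCount > 0) return { action: "wait", pendingCount };

  // Merge phase: every tool call is answered in one message.
  const content = executions.map((e): ToolResultBlock => ({
    type: "tool_result",
    tool_use_id: e.toolUseId,
    is_error: e.status !== "succeeded",
    content: e.output,
  }));
  return { action: "resume", message: { role: "user", content } };
}
```

In this sketch a rejected tool still produces a tool_result (flagged as an error), which keeps every tool call answered before the loop resumes.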
Undo
| Prompt | Expected |
|---|---|
| “create a workout titled ‘Test’” | Non-destructive write executes inline. UI shows Undo toast for 15s. inverseAvailable: true on the card. |
| Click Undo within 15s | POST .../undo/:toolUseId invokes workouts.delete with { id: <newWorkoutId> }. Audit rows for both. |
| Click Undo on a tool with no declared inverse | 422 { code: 'no_inverse' }. |
| Click Undo on a failed tool | 422 { code: 'not_succeeded' }. |
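
The undo outcomes above can be sketched as a lookup in an inverse registry. The registry shape, the `planUndo` name, and the order of the two checks are assumptions for illustration; the tool names and error codes are the ones listed in the table.

```typescript
// Hypothetical sketch; the real registry and handler may differ.
interface Execution {
  toolUseId: string;
  toolName: string;
  status: "succeeded" | "failed";
  output?: { id?: string };
}

interface InverseCall {
  tool: string;
  input: Record<string, unknown>;
}

type InverseFactory = (e: Execution) => InverseCall;

// Only tools that declare an inverse appear here.
const inverseRegistry: Record<string, InverseFactory | undefined> = {
  "workouts.create": (e) => ({ tool: "workouts.delete", input: { id: e.output?.id } }),
};

type UndoPlan =
  | { status: 200; call: InverseCall }
  | { status: 422; code: "no_inverse" | "not_succeeded" };

function planUndo(e: Execution): UndoPlan {
  const inverse = inverseRegistry[e.toolName];
  if (!inverse) return { status: 422, code: "no_inverse" };
  // A failed tool changed nothing, so there is nothing to invert.
  if (e.status !== "succeeded") return { status: 422, code: "not_succeeded" };
  return { status: 200, call: inverse(e) };
}
```

The same registry lookup could also drive inverseAvailable on the card, which is one reason to keep the inverse declaration next to the tool definition.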
Workout build (structured)
| Prompt | Expected sequence |
|---|---|
| “7 ROUNDS FOR TIME: 200m run, 5 HSPU, 10 KB swing, 15 air squat” | 1) read.exercises_resolve_batch({ names: [...] }). 2) workouts.create({ mode: 'structured', title: '7 Rounds For Time', scoring: 'time' }). 3) workouts.set_sections(...) with one section (shape: 'rounds', config: { rounds: 7 }) and 4 movements; each rep-counted movement carries prescription.reps. |
| Fran | shape: 'rep_scheme', config: { repsScheme: [21,15,9] }; movements have load.kind = 'absolute', value: 43, unit: 'kg'; no per-movement reps. |
| “5x5 back squat at 80% 1RM” | shape: 'linear'; one movement prescription = { sets: 5, reps: { kind: 'fixed', value: 5 }, load: { kind: 'percent_1rm', value: 80 } }. |
| Same prompt on a lite-tier org (no workout_builder feature) | Agent emits one workouts.create with mode: 'freeform' and the body in description. No set_sections call. |
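
To make the expected section payloads concrete, here is a rough sketch of the shapes a tester might assert against. The type names, the `distanceM` field, and the `sectionViolations` helper are assumptions; only the shape, config, and prescription fields quoted in the table come from this plan.

```typescript
// Hypothetical payload types; fields not quoted in the plan (e.g. distanceM) are assumptions.
type Reps = { kind: "fixed"; value: number };
type Load =
  | { kind: "absolute"; value: number; unit: "kg" | "lb" }
  | { kind: "percent_1rm"; value: number };

interface Prescription { sets?: number; reps?: Reps; load?: Load; distanceM?: number }
interface Movement { name: string; prescription: Prescription }

type Section =
  | { shape: "rounds"; config: { rounds: number }; movements: Movement[] }
  | { shape: "rep_scheme"; config: { repsScheme: number[] }; movements: Movement[] }
  | { shape: "linear"; movements: Movement[] };

// Returns human-readable violations of the expectations in the table.
function sectionViolations(section: Section): string[] {
  const out: string[] = [];
  for (const m of section.movements) {
    const hasReps = m.prescription.reps !== undefined;
    if (section.shape === "rep_scheme" && hasReps) {
      out.push(`${m.name}: reps belong in config.repsScheme`);
    }
    if (section.shape === "rounds" && !hasReps && m.prescription.distanceM === undefined) {
      out.push(`${m.name}: missing prescription.reps`);
    }
  }
  return out;
}

const sevenRounds: Section = {
  shape: "rounds",
  config: { rounds: 7 },
  movements: [
    { name: "Run", prescription: { distanceM: 200 } },
    { name: "HSPU", prescription: { reps: { kind: "fixed", value: 5 } } },
    { name: "KB swing", prescription: { reps: { kind: "fixed", value: 10 } } },
    { name: "Air squat", prescription: { reps: { kind: "fixed", value: 15 } } },
  ],
};

const fiveByFive: Section = {
  shape: "linear",
  movements: [{
    name: "Back squat",
    prescription: { sets: 5, reps: { kind: "fixed", value: 5 }, load: { kind: "percent_1rm", value: 80 } },
  }],
};
```

Running `sectionViolations` over the arguments captured from workouts.set_sections gives a quick pass/fail signal for each row.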
Programs / scheduling routing
| Prompt | Expected |
|---|---|
| “schedule open gym tomorrow 13:00–14:00” | Agent recognizes time window + “schedule” intent. Sequence: read.programs_list → class_types.list_for_program({ programId }) → class_sessions.create({ classTypeId, startsAt, endsAt, status: 'published' }). |
| “assign Murph to everyone in CrossFit next Friday” on a schedule program | Agent re-routes from assignments.assign_personal to class_sessions.create (per the system prompt’s “schedule vs assign” rules). |
| “assign Murph to Dani next Friday” on a coaching program | read.exercises_resolve_batch (if needed) → workouts.create → assignments.assign_personal({ userIds: [...], date: ..., workoutId }). |
| “publish today’s workout to the feed track ‘Daily WOD’” | workouts.create + assignments scoped to the feed program. |
Analytics
| Prompt | Expected tool |
|---|---|
| “what’s our MRR?” | analytics.revenue_summary |
| “show last 12 months revenue” | analytics.revenue_trend |
| “who’s at risk?” | analytics.at_risk_members |
| “what should I look at right now?” | analytics.org_insights (delegates to InsightsService) |
| “how busy were classes last 30 days?” | analytics.class_utilization |
| “how many active members?” | analytics.members_summary |
Tasks
| Prompt | Expected |
|---|---|
| “create a task: ‘follow up with Dani’ due tomorrow” | tasks.create; Undo affordance present. |
| “what’s overdue?” | tasks.list with overdue filter. |
| “delete task X” | tasks.delete (destructive, so a confirmation card appears). |
| “mark task X done” | tasks.complete. |
Program templates
| Prompt | Expected |
|---|---|
| “apply the ‘6-week strength’ template to ‘Daily WOD’ starting next Monday” | program_templates.list → program_templates.apply (confirm: always → confirmation card). |
| “build a template from the last 4 weeks of Daily WOD” | program_templates.from_history (confirm: always). |
Forms
| Prompt | Expected |
|---|---|
| “who hasn’t signed the new תקנון?” | forms.list_pending_for_org. |
| “what’s Saar’s compliance status?” | forms.compliance_status_for_member. |
Per-org daily $ cap
| Step | Expected |
|---|---|
| Send turns until org spend reaches the tier cap (e.g. $5 for pro) | Next send returns SSE error { code: 'agent_budget_exceeded' } with a localized upgrade prompt. UI surfaces the upsell modal. |
| GET /agent/usage while over cap | { remainingUsdMicros: 0, percentUsed: 1 }. |
| UTC midnight passes | Cap resets; next send succeeds. |
| Tier set to -1 (unmetered) | preCheck returns ok: true immediately. |
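
The cap arithmetic behind these checks can be sketched as below. This is a minimal illustration under stated assumptions: spend is tracked in USD micros per UTC day, `usageView` and `utcDayKey` are hypothetical helpers, and only `preCheck`, the `-1` sentinel, the error code, and the two response fields appear in this plan.

```typescript
// Hypothetical sketch of the daily cap check; storage and function signatures are assumptions.
const UNMETERED = -1;
const MICROS_PER_USD = 1_000_000;

type PreCheck = { ok: true } | { ok: false; code: "agent_budget_exceeded" };

function preCheck(capUsdMicros: number, spentUsdMicros: number): PreCheck {
  // Unmetered tiers skip the spend lookup entirely.
  if (capUsdMicros === UNMETERED) return { ok: true };
  return spentUsdMicros >= capUsdMicros
    ? { ok: false, code: "agent_budget_exceeded" }
    : { ok: true };
}

// Usage payload for a metered cap; values are clamped so they never go negative or above 1.
function usageView(capUsdMicros: number, spentUsdMicros: number) {
  return {
    remainingUsdMicros: Math.max(0, capUsdMicros - spentUsdMicros),
    percentUsed: Math.min(1, spentUsdMicros / capUsdMicros),
  };
}

// Spend is assumed to be bucketed per UTC calendar day, so the key changes at UTC midnight.
function utcDayKey(now: Date): string {
  return now.toISOString().slice(0, 10);
}

const proCap = 5 * MICROS_PER_USD; // $5/day on the pro tier
```

Because the bucket key is derived from UTC, a tester in Israel should expect the reset at 02:00 or 03:00 local time depending on daylight saving.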
Role-aware upsell
| Step | Expected |
|---|---|
| Member role hits the composer | Send → 403 Agent chat is staff-only in this release. |
| Owner at 80% usage | Composer footer shows soft upsell badge. |
| Owner at 100% usage | Composer is blocked; modal opens on send attempt. |
Prompt caching
| Step | Expected |
|---|---|
| First turn of a brand-new conversation | cache_creation_tokens > 0 on the static system + org blocks. cacheHitRatio low (~0.2). |
| Second turn within the same conversation, ~1 min later | cache_read_tokens > 0; ratio rises. |
| Steady-state (5+ turns) | cacheHitRatio between 0.85–0.95 as logged in agent.turn.completed. |
| Quiet period > 1 hour | Cache TTL expires; next turn re-creates (125% cost spike) then steady-state resumes. |
| Inspect that withCachedTail is applied | The trailing message’s last content block has cache_control: { type: 'ephemeral' }. |
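
For reference while checking these rows, here is a hedged sketch of the two pieces involved. The ratio formula (cache reads divided by all input tokens) is an assumption about how cacheHitRatio is defined, and this `withCachedTail` is an illustrative reimplementation based only on the behavior described above, not the actual helper.

```typescript
// Hypothetical sketch; the real metric definition and helper may differ.
interface TurnUsage {
  inputTokens: number; // uncached input tokens
  cacheReadTokens: number;
  cacheCreationTokens: number;
}

// Assumed definition: share of all input tokens that were served from cache.
function cacheHitRatio(u: TurnUsage): number {
  const total = u.inputTokens + u.cacheReadTokens + u.cacheCreationTokens;
  return total === 0 ? 0 : u.cacheReadTokens / total;
}

interface Block {
  type: string;
  text?: string;
  cache_control?: { type: "ephemeral" };
}

interface Message {
  role: "user" | "assistant";
  content: Block[];
}

// Returns a copy in which the last content block of the trailing message
// carries an ephemeral cache breakpoint; the input is not mutated.
function withCachedTail(messages: Message[]): Message[] {
  if (messages.length === 0) return messages;
  const last = messages[messages.length - 1];
  if (last.content.length === 0) return messages;
  const content = last.content.map((b, i) =>
    i === last.content.length - 1
      ? { ...b, cache_control: { type: "ephemeral" as const } }
      : b,
  );
  return [...messages.slice(0, -1), { ...last, content }];
}
```

Under this assumed definition, a turn with 900 cached-read tokens out of 1,000 total input tokens reports a ratio of 0.9, which sits inside the steady-state band in the table.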
Observability — PostHog / Sentry
| Step | Expected |
|---|---|
| Send a normal turn | PostHog events: agent.turn.started + agent.turn.completed with cacheHitRatio, costUsdMicros, etc. No raw user text in props. |
| Tool succeeds | agent.tool.executed with outcome: 'success'. |
| Tool fails | agent.tool.executed with outcome: 'failure' + errorCode. |
| Destructive confirmation gets emitted | agent.tool.confirmation_pending. |
| Picker gets emitted | agent.tool.disambiguation_pending. |
| Provider 529 from Anthropic | agent.turn.failed with errorCode: 'provider_overloaded'. Sentry breadcrumb agent with level: error. SSE error carries traceId for QA. |
| Replay violation (synthesized by forcing an orphan tool_use) | agent.replay.violation with violations: N, repaired: true. |
| Daily 03:00 cron | agent.storage.snapshot with row + bytes per table. |
Replay integrity
| Step | Expected |
|---|---|
| Force a conversation to have a pending tool_use whose tool_result is missing | Pre-loop validateReplay returns { ok: false }; validateAndRepairReplay injects synthetic error tool_result. Loop continues without aborting. |
| Same conversation after repair | Next user turn proceeds normally; replay integrity green. |
| Force unrecoverable violation (ReplayIntegrityError) | SSE error { code: 'replay_integrity_violation' } with copy “conversation got into an inconsistent state and was reset”. |
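
The repair path can be illustrated with a simplified stand-in. This sketch is not validateAndRepairReplay itself: `repairReplay` and its types are hypothetical, and it assumes an orphan is any tool_use that the immediately following user message does not answer.

```typescript
// Hypothetical sketch of orphan detection and repair; the real validator may check more.
type Block =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string }
  | { type: "tool_result"; tool_use_id: string; is_error: boolean; content: string };

interface Message {
  role: "user" | "assistant";
  content: Block[];
}

function toolUseIds(m: Message): string[] {
  const ids: string[] = [];
  for (const b of m.content) if (b.type === "tool_use") ids.push(b.id);
  return ids;
}

function answers(m: Message | undefined, id: string): boolean {
  if (!m || m.role !== "user") return false;
  return m.content.some((b) => b.type === "tool_result" && b.tool_use_id === id);
}

function repairReplay(messages: Message[]): { messages: Message[]; violations: number; repaired: boolean } {
  const out: Message[] = [];
  let violations = 0;
  for (let i = 0; i < messages.length; i++) {
    const msg = messages[i];
    out.push(msg);
    if (msg.role !== "assistant") continue;
    const next: Message | undefined = messages[i + 1];
    const missing = toolUseIds(msg).filter((id) => !answers(next, id));
    if (missing.length === 0) continue;
    violations += missing.length;
    const synthetic = missing.map((id): Block => ({
      type: "tool_result",
      tool_use_id: id,
      is_error: true,
      content: "Tool result missing; synthesized during replay repair.",
    }));
    if (next && next.role === "user") {
      // Synthetic results go first so the existing user content follows them.
      out.push({ role: "user", content: [...synthetic, ...next.content] });
      i++; // the original next message has been replaced by the merged one
    } else {
      out.push({ role: "user", content: synthetic });
    }
  }
  return { messages: out, violations, repaired: violations > 0 };
}
```

The free-text-reply case from the disambiguation flow is the same situation: a user text message follows an unanswered tool_use, and the repair prepends the synthetic error result to it.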
Compaction
| Step | Expected |
|---|---|
| Conversation < 24 messages past last anchor | compactIfNeeded is a no-op. |
| Conversation >= 24 messages | Haiku summarizer runs within a 3s budget; persists system_note with pageContext.summarizedThroughMessageId. Subsequent listMessagesForReplay returns the system_note + only messages after it. |
| Summarizer times out | Logged warning; conversation continues with full history this turn; retry on next turn. |
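
The threshold and the replay window described above can be sketched like this. It is an illustration under assumptions: messages are stored chronologically with the system_note appended after the messages it summarizes, `summarizedThroughMessageId` is shown as a top-level field for brevity (the plan places it under pageContext), and the helper names are hypothetical.

```typescript
// Hypothetical sketch; storage layout and helper names are assumptions.
interface Msg {
  id: string;
  role: "user" | "assistant" | "system_note";
  summarizedThroughMessageId?: string;
}

const COMPACT_THRESHOLD = 24;

function lastAnchorIndex(messages: Msg[]): number {
  for (let i = messages.length - 1; i >= 0; i--) {
    if (messages[i].role === "system_note") return i;
  }
  return -1;
}

// Messages that are not yet covered by the latest summary.
function messagesPastAnchor(messages: Msg[]): Msg[] {
  const anchor = lastAnchorIndex(messages);
  if (anchor === -1) return messages;
  const through = messages[anchor].summarizedThroughMessageId;
  const cut = messages.findIndex((m) => m.id === through);
  return messages.filter((m, i) => i > cut && m.role !== "system_note");
}

function shouldCompact(messages: Msg[]): boolean {
  return messagesPastAnchor(messages).length >= COMPACT_THRESHOLD;
}

// What a replay would send: the latest summary note plus everything after it.
function replayWindow(messages: Msg[]): Msg[] {
  const anchor = lastAnchorIndex(messages);
  if (anchor === -1) return messages;
  return [messages[anchor], ...messagesPastAnchor(messages)];
}
```

When the summarizer times out no system_note is written, so in this sketch `shouldCompact` stays true and the next turn retries, matching the last row of the table.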
Title generation
| Step | Expected |
|---|---|
| First user message of a new conversation | Haiku-generated title persisted within 1.5s. UI updates. |
| Anthropic key not configured | No title; logged warning; conversation continues. |
| Second user message | Title generation is skipped (already set). |
Conversation list & detail
| Step | Expected |
|---|---|
| GET /agent/conversations | Returns caller’s conversations sorted by last_message_at desc. |
| GET /agent/conversations/:id for someone else’s conv in the same org | 403; ConversationsService.getOwn enforces ownership. |
| GET /agent/conversations/:id returns messages[] with toolCalls[] per assistant message | Each tool call has inverseAvailable filled from the registry. |
Negative tests
- POST /agent/messages as a member → 403.
- POST /agent/confirm/<bad-id> → 200 SSE with tool_execution_not_found.
- POST /agent/confirm/<resolved-id> → tool_already_resolved.
- POST /agent/pick/<id> with id not in candidates → invalid_pick.
- Send a 20_001-char message → 400 (Zod max).
- Abort the SSE mid-stream → server cleanly aborts the Anthropic stream; assistant row reflects partial state.
- ANTHROPIC_API_KEY unset → agent_disabled on send.
Performance
- p95 time-to-first-token < 1500ms on a warm cache.
- p95 turn latency on a structured workout build (resolve_batch + create + set_sections): < 8s.
- Steady-state cache-hit ratio ≥ 0.85 (logged on every agent.turn.completed).
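
When checking the p95 targets above against collected samples, a nearest-rank percentile is one simple option. The `percentile` helper below is a hypothetical QA utility, not part of the product, and the plan does not say which percentile method the dashboards use.

```typescript
// Nearest-rank percentile over latency samples in milliseconds (hypothetical QA helper).
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b);
  // Nearest rank: the smallest value with at least p% of samples at or below it.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// Checks a sample set against a p95 budget such as 1500 ms for time-to-first-token.
function meetsP95(samples: number[], budgetMs: number): boolean {
  return percentile(samples, 95) < budgetMs;
}
```

With fewer than 20 samples the nearest-rank p95 is simply the maximum, so small runs are dominated by a single slow turn; collecting a larger sample gives a more meaningful reading.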
Localization
- Send a prompt in Hebrew → agent replies in Hebrew. The static system prompt instructs locale parity.
- Send a prompt in Russian on a ru-locale org → agent replies in Russian.
- Switch locale mid-conversation → agent follows the new language from the next turn.