Living documentation — last reviewed 2026-05-28

Spotter Agent — QA Plan

This plan exercises each tool router, every confirmation/disambiguation/undo path, the cost cap, the prompt cache, and observability. Prompts assume Hebrew or English locale unless noted.

Setup

  • Org on tier pro (cap $5/day). Owner role.
  • Seed data: 5 programs across delivery modes (schedule, feed, coaching), 30 members, 100 exercises in library, 10 workouts, 5 class types.

Read-only prompts

Prompt | Expected tools | Notes
“who am I?” | read.get_current_context | Returns org + caller identity.
“what programs do we run?” | read.programs_list | Lists active programs.
“find Saar” | read.search_anything or read.members_search | Returns matches.
“find a squat variation” | read.exercises_search | Hybrid search results.
“any new comments in coaching?” | read.workout_comments_in_program | Resolves programId via programs_list first.
“what did Dani say on Monday’s WOD?” | read.workout_comments_for_assignment | Resolves assignmentId first.
show me the program with id | read.lookup_by_id | Type=program.

Disambiguation flow

Step | Expected
Have two members named “Saar”. Prompt: “give Saar a workout for tomorrow”. | Agent emits read.ask_user_to_pick { kind: 'member', candidates: [...] }; UI shows the picker card.
Click candidate 1 | SSE tool_completed with output: { pickedId, pickedLabel, kind }. Loop resumes; agent proceeds with the assignment.
Click an id not in candidates | 400 / invalid_pick.
Submit pick twice for the same toolUseId | Second call returns tool_already_resolved.
Type a free-text reply instead of clicking | Replay-integrity violation is logged; a synthetic tool_result is injected as auto-repair; loop continues.
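
The pick-handling rows above can be read as a small state check. The following is a minimal sketch, assuming a hypothetical in-memory `PendingPick` record and a hypothetical `resolvePick` helper; only the error codes and the `{ pickedId, pickedLabel, kind }` output come from this plan.

```typescript
// Hypothetical shapes for illustration; the real persistence layer is not shown in this plan.
type Candidate = { id: string; label: string };
type PendingPick = {
  toolUseId: string;
  kind: string;
  candidates: Candidate[];
  resolved: boolean;
};
type PickResult =
  | { ok: true; output: { pickedId: string; pickedLabel: string; kind: string } }
  | { ok: false; code: "invalid_pick" | "tool_already_resolved"; status?: number };

function resolvePick(pending: PendingPick, pickedId: string): PickResult {
  // A second submit for the same toolUseId must not run the loop again.
  if (pending.resolved) {
    return { ok: false, code: "tool_already_resolved" };
  }
  // Only ids that were offered in the picker card are accepted.
  const match = pending.candidates.find((c) => c.id === pickedId);
  if (match === undefined) {
    return { ok: false, code: "invalid_pick", status: 400 };
  }
  pending.resolved = true;
  return {
    ok: true,
    output: { pickedId: match.id, pickedLabel: match.label, kind: pending.kind },
  };
}
```

Checking `resolved` before validating the id is an assumption; the plan only states the two outcomes, not their precedence.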

Confirmation flow

Destructive (single)

Prompt | Expected
“delete the workout titled ‘Murph 2.0’” | Agent resolves the id, emits workouts.delete; orchestrator suspends with confirmation_pending { confirm: 'destructive' }. UI shows Approve/Reject.
Approve | Tool runs; tool_completed { ok: true }. Workout soft-deleted. Audit row written with metadata.agent: true.
Reject | Status flips to rejected_by_user; agent receives the error tool_result and acknowledges.

Always-confirm

Prompt | Expected
“publish all draft sessions for next week” | Agent enumerates sessions, emits class_sessions.bulk_publish with confirm: 'always' → confirmation card.
Approve | Bulk-publish executes; resource ids logged.

Multi-tool turn

Step | Expected
Prompt that triggers two destructive tools in one assistant message | Both persisted as pending; UI shows two confirmation cards.
Approve first | Tool runs; remaining pending count > 0 → SSE done with usage 0; loop waits.
Approve second | Tool runs; merge phase appends one unified tool_result; loop resumes.
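
The wait-then-merge behaviour in the multi-tool steps can be sketched as follows. This is an illustration under stated assumptions: `ToolExecution`, `afterConfirmation`, and the status names other than `rejected_by_user` are hypothetical, and the "unified tool_result" is modelled here as one list carrying a result per tool call.

```typescript
// Hypothetical status names; only rejected_by_user appears in this plan.
type ToolStatus = "pending" | "succeeded" | "failed" | "rejected_by_user";
type ToolExecution = { toolUseId: string; status: ToolStatus; output?: unknown };
type MergedResult = { toolUseId: string; isError: boolean; content: unknown };
type NextStep =
  | { kind: "wait"; remaining: number }
  | { kind: "resume"; toolResults: MergedResult[] };

function afterConfirmation(executions: ToolExecution[]): NextStep {
  const remaining = executions.filter((e) => e.status === "pending").length;
  // While any sibling tool is still pending, the loop must not resume.
  if (remaining > 0) {
    return { kind: "wait", remaining };
  }
  // All tools are resolved: merge every outcome into a single payload.
  return {
    kind: "resume",
    toolResults: executions.map((e) => ({
      toolUseId: e.toolUseId,
      isError: e.status !== "succeeded",
      content: e.output ?? null,
    })),
  };
}
```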

Undo

Prompt | Expected
“create a workout titled ‘Test’” | Non-destructive write executes inline. UI shows Undo toast for 15s. inverseAvailable: true on the card.
Click Undo within 15s | POST .../undo/:toolUseId invokes workouts.delete with { id: <newWorkoutId> }. Audit rows for both.
Click Undo on a tool with no declared inverse | 422 { code: 'no_inverse' }.
Click Undo on a failed tool | 422 { code: 'not_succeeded' }.
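
One way to picture the undo checks is a registry lookup that maps a tool to its inverse. The sketch below is an assumption-laden illustration: `inverseRegistry`, `ToolRecord`, and `planUndo` are hypothetical names, and the order of the two 422 checks is a guess, since the plan does not say which wins when both apply.

```typescript
// Hypothetical record of a past tool call.
type ToolRecord = {
  toolUseId: string;
  toolName: string;
  succeeded: boolean;
  output: Record<string, unknown>;
};
type Inverse = {
  toolName: string;
  buildInput: (output: Record<string, unknown>) => Record<string, unknown>;
};
type UndoPlan =
  | { ok: true; toolName: string; input: Record<string, unknown> }
  | { ok: false; status: 422; code: "no_inverse" | "not_succeeded" };

// Only the workouts.create -> workouts.delete pair is taken from the plan.
const inverseRegistry = new Map<string, Inverse>([
  ["workouts.create", { toolName: "workouts.delete", buildInput: (out) => ({ id: out["id"] }) }],
]);

function planUndo(record: ToolRecord): UndoPlan {
  // A tool that never succeeded has nothing to undo (check order is assumed).
  if (!record.succeeded) {
    return { ok: false, status: 422, code: "not_succeeded" };
  }
  const inverse = inverseRegistry.get(record.toolName);
  if (inverse === undefined) {
    return { ok: false, status: 422, code: "no_inverse" };
  }
  return { ok: true, toolName: inverse.toolName, input: inverse.buildInput(record.output) };
}
```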

Workout build (structured)

Prompt | Expected sequence
“7 ROUNDS FOR TIME: 200m run, 5 HSPU, 10 KB swing, 15 air squat” | 1) read.exercises_resolve_batch({ names: [...] }). 2) workouts.create({ mode: 'structured', title: '7 Rounds For Time', scoring: 'time' }). 3) workouts.set_sections(...) with one section (shape: 'rounds', config: { rounds: 7 }) and 4 movements; each rep-counted movement carries prescription.reps.
Fran | shape: 'rep_scheme', config: { repsScheme: [21,15,9] }; movements have load.kind = 'absolute', value: 43, unit: 'kg'; no per-movement reps.
“5x5 back squat at 80% 1RM” | shape: 'linear'; one movement with prescription = { sets: 5, reps: { kind: 'fixed', value: 5 }, load: { kind: 'percent_1rm', value: 80 } }.
Same prompt on a lite tier org (no workout_builder feature) | Agent emits one workouts.create with mode: 'freeform' and the body in description. No set_sections call.
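
When checking set_sections payloads by hand, it can help to write the expected shapes down as types. The sketch below is not the real schema: the `Section` union, the `name` and `repCounted` fields, and `checkSection` are assumptions for illustration, and only `shape`, `config`, `prescription.reps`, and `load` mirror the table above.

```typescript
type Reps = { kind: "fixed"; value: number };
type Load =
  | { kind: "absolute"; value: number; unit: "kg" | "lb" }
  | { kind: "percent_1rm"; value: number };
type Prescription = { sets?: number; reps?: Reps; load?: Load };
// name and repCounted are hypothetical helper fields for this check.
type Movement = { name: string; repCounted: boolean; prescription: Prescription };
type Section =
  | { shape: "rounds"; config: { rounds: number }; movements: Movement[] }
  | { shape: "rep_scheme"; config: { repsScheme: number[] }; movements: Movement[] }
  | { shape: "linear"; movements: Movement[] };

// Returns human-readable problems; an empty list means the section passes.
function checkSection(section: Section): string[] {
  const problems: string[] = [];
  for (const m of section.movements) {
    const hasReps = m.prescription.reps !== undefined;
    if (section.shape === "rep_scheme" && hasReps) {
      problems.push(`${m.name}: reps belong in config.repsScheme`);
    }
    if (section.shape === "rounds" && m.repCounted && !hasReps) {
      problems.push(`${m.name}: missing prescription.reps`);
    }
  }
  return problems;
}

const sevenRounds: Section = {
  shape: "rounds",
  config: { rounds: 7 },
  movements: [
    { name: "200m run", repCounted: false, prescription: {} },
    { name: "HSPU", repCounted: true, prescription: { reps: { kind: "fixed", value: 5 } } },
    { name: "KB swing", repCounted: true, prescription: { reps: { kind: "fixed", value: 10 } } },
    { name: "air squat", repCounted: true, prescription: { reps: { kind: "fixed", value: 15 } } },
  ],
};
```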

Programs / scheduling routing

Prompt | Expected
“schedule open gym tomorrow 13:00–14:00” | Agent recognizes time window + “schedule” intent. Sequence: read.programs_list → class_types.list_for_program({ programId }) → class_sessions.create({ classTypeId, startsAt, endsAt, status: 'published' }).
“assign Murph to everyone in CrossFit next Friday” on a schedule program | Agent re-routes from assignments.assign_personal to class_sessions.create (per the system prompt’s “schedule vs assign” rules).
“assign Murph to Dani next Friday” on a coaching program | read.exercises_resolve_batch (if needed) → workouts.create → assignments.assign_personal({ userIds: [...], date: ..., workoutId }).
“publish today’s workout to the feed track ‘Daily WOD’” | workouts.create + assignments scoped to the feed program.

Analytics

Prompt | Expected tool
“what’s our MRR?” | analytics.revenue_summary
“show last 12 months revenue” | analytics.revenue_trend
“who’s at risk?” | analytics.at_risk_members
“what should I look at right now?” | analytics.org_insights (delegates to InsightsService)
“how busy were classes last 30 days?” | analytics.class_utilization
“how many active members?” | analytics.members_summary

Tasks

Prompt | Expected
“create a task: ‘follow up with Dani’ due tomorrow” | tasks.create; Undo affordance present.
“what’s overdue?” | tasks.list with overdue filter.
“delete task X” | tasks.delete (destructive; confirmation card).
“mark task X done” | tasks.complete.

Program templates

Prompt | Expected
“apply the ‘6-week strength’ template to ‘Daily WOD’ starting next Monday” | program_templates.list → program_templates.apply (confirm: always → confirmation card).
“build a template from the last 4 weeks of Daily WOD” | program_templates.from_history (confirm: always).

Forms

Prompt | Expected
“who hasn’t signed the new תקנון?” | forms.list_pending_for_org.
“what’s Saar’s compliance status?” | forms.compliance_status_for_member.

Per-org daily $ cap

Step | Expected
Send turns until org spend reaches the tier cap (e.g. $5 for pro) | Next send returns SSE error { code: 'agent_budget_exceeded' } with a localized upgrade prompt. UI surfaces the upsell modal.
GET /agent/usage while over cap | { remainingUsdMicros: 0, percentUsed: 1 }.
UTC midnight passes | Cap resets; next send succeeds.
Tier cap set to -1 (unmetered) | preCheck returns ok: true immediately.
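
The cap behaviour above comes down to a few lines of arithmetic in USD micros ($5 is 5,000,000 micros). The sketch below assumes hypothetical signatures for `preCheck`, `usageSnapshot`, and `utcDayKey`; the real service may track spend differently.

```typescript
const UNMETERED = -1;

type PreCheck = { ok: true } | { ok: false; code: "agent_budget_exceeded" };

// Hypothetical signature: cap and spend are both passed in as USD micros.
function preCheck(capUsdMicros: number, spentUsdMicros: number): PreCheck {
  // A cap of -1 short-circuits before any spend comparison.
  if (capUsdMicros === UNMETERED) return { ok: true };
  return spentUsdMicros >= capUsdMicros
    ? { ok: false, code: "agent_budget_exceeded" }
    : { ok: true };
}

// Shape of the usage payload for a metered tier (cap must be positive).
function usageSnapshot(capUsdMicros: number, spentUsdMicros: number) {
  return {
    remainingUsdMicros: Math.max(0, capUsdMicros - spentUsdMicros),
    percentUsed: Math.min(1, spentUsdMicros / capUsdMicros),
  };
}

// Spend is bucketed by UTC calendar day, so the key changes at UTC midnight.
function utcDayKey(now: Date): string {
  return now.toISOString().slice(0, 10);
}
```

Clamping `percentUsed` to 1 matches the expected `{ remainingUsdMicros: 0, percentUsed: 1 }` payload when spend overshoots the cap slightly.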

Role-aware upsell

Step | Expected
Member role hits the composer | Send → 403 “Agent chat is staff-only in this release.”
Owner at 80% usage | Composer footer shows soft upsell badge.
Owner at 100% usage | Composer is blocked; modal opens on send attempt.
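
The three rows above can be expressed as a single decision function. This is a sketch only: `composerState` is a hypothetical name, the role list is reduced to the two roles named in this plan, and treating 80% as an inclusive threshold is an assumption.

```typescript
// Only the two roles mentioned in this plan; other staff roles are omitted.
type Role = "owner" | "member";
type ComposerState =
  | { kind: "forbidden"; status: 403 }
  | { kind: "ok" }
  | { kind: "soft_upsell" }
  | { kind: "blocked" };

// percentUsed is a fraction in [0, 1], as in the usage payload.
function composerState(role: Role, percentUsed: number): ComposerState {
  if (role === "member") return { kind: "forbidden", status: 403 };
  if (percentUsed >= 1) return { kind: "blocked" };
  if (percentUsed >= 0.8) return { kind: "soft_upsell" };
  return { kind: "ok" };
}
```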

Prompt caching

Step | Expected
First turn of a brand-new conversation | cache_creation_tokens > 0 on the static system + org blocks. cache_hit_ratio low (~0.2).
Second turn within the same conversation, ~1 min later | cache_read_tokens > 0; ratio rises.
Steady-state (5+ turns) | cacheHitRatio between 0.85–0.95 as logged in agent.turn.completed.
Quiet period > 1 hour | Cache TTL expires; next turn re-creates (125% cost spike) then steady-state resumes.
Inspect that withCachedTail is applied | The trailing message’s last content block has cache_control: { type: 'ephemeral' }.
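
For the last row, the following sketch shows what a helper like withCachedTail might do: copy the message list and tag only the final content block of the final message. The body is an assumption based on the expected outcome above, not the project's implementation, and the `Message` and `ContentBlock` types are simplified to text blocks.

```typescript
type ContentBlock = {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
};
type Message = { role: "user" | "assistant"; content: ContentBlock[] };

// Returns a new array; the input messages are left untouched.
function withCachedTail(messages: Message[]): Message[] {
  const last = messages[messages.length - 1];
  if (last === undefined || last.content.length === 0) return messages;
  const lastIndex = last.content.length - 1;
  const content: ContentBlock[] = last.content.map((block, i) =>
    i === lastIndex ? { ...block, cache_control: { type: "ephemeral" as const } } : block,
  );
  return [...messages.slice(0, -1), { ...last, content }];
}
```

A QA check can then assert that exactly one block in the outgoing request tail carries the marker.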

Observability — PostHog / Sentry

Step | Expected
Send a normal turn | PostHog events: agent.turn.started + agent.turn.completed with cacheHitRatio, costUsdMicros, etc. No raw user text in props.
Tool succeeds | agent.tool.executed with outcome: 'success'.
Tool fails | agent.tool.executed with outcome: 'failure' + errorCode.
Destructive confirmation gets emitted | agent.tool.confirmation_pending.
Picker gets emitted | agent.tool.disambiguation_pending.
Provider 529 from Anthropic | agent.turn.failed with errorCode: 'provider_overloaded'. Sentry breadcrumb agent with level: error. SSE error carries traceId for QA.
Replay violation (synthesized by forcing an orphan tool_use) | agent.replay.violation with violations: N, repaired: true.
Daily 03:00 cron | agent.storage.snapshot with row + bytes per table.

Replay integrity

Step | Expected
Force a conversation to have a pending tool_use whose tool_result is missing | Pre-loop validateReplay returns { ok: false }; validateAndRepairReplay injects synthetic error tool_result. Loop continues without aborting.
Same conversation after repair | Next user turn proceeds normally; replay integrity green.
Force unrecoverable violation (ReplayIntegrityError) | SSE error { code: 'replay_integrity_violation' } with copy “conversation got into an inconsistent state and was reset”.
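
To make the orphan-repair case concrete, here is a minimal sketch of a repair pass. It borrows the name validateAndRepairReplay from the table, but the body, the `Msg` and `Block` types, and the synthetic error text are assumptions. It relies on the Anthropic Messages API convention that a tool_result sits in the user message directly after the assistant message holding the matching tool_use.

```typescript
type Block =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string }
  | { type: "tool_result"; tool_use_id: string; is_error?: boolean; content: string };
type Msg = { role: "user" | "assistant"; content: Block[] };
type RepairResult = { messages: Msg[]; violations: number; repaired: boolean };

function validateAndRepairReplay(messages: Msg[]): RepairResult {
  // Collect every tool_use id that already has a result somewhere.
  const answered = new Set<string>();
  for (const m of messages)
    for (const b of m.content)
      if (b.type === "tool_result") answered.add(b.tool_use_id);

  const out: Msg[] = [];
  let carry: Block[] = []; // synthetic results waiting for the next user slot
  let violations = 0;
  for (const m of messages) {
    if (carry.length > 0 && m.role === "user") {
      // Results go first in the user message that follows the tool_use.
      out.push({ role: "user", content: [...carry, ...m.content] });
      carry = [];
      continue;
    }
    if (carry.length > 0) {
      out.push({ role: "user", content: carry });
      carry = [];
    }
    out.push(m);
    if (m.role !== "assistant") continue;
    for (const b of m.content) {
      if (b.type === "tool_use" && !answered.has(b.id)) {
        violations++;
        carry.push({
          type: "tool_result",
          tool_use_id: b.id,
          is_error: true,
          content: "synthetic: no result was recorded for this tool call",
        });
      }
    }
  }
  if (carry.length > 0) out.push({ role: "user", content: carry });
  return { messages: out, violations, repaired: violations > 0 };
}
```

Running the function a second time on its own output should report zero violations, which is a convenient check for the "replay integrity green" row.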

Compaction

Step | Expected
Conversation < 24 messages past last anchor | compactIfNeeded is a no-op.
Conversation >= 24 messages past last anchor | Haiku summarizer runs within a 3s budget; persists system_note with pageContext.summarizedThroughMessageId. Subsequent listMessagesForReplay returns the system_note + only messages after it.
Summarizer times out | Logged warning; conversation continues with full history this turn; retry on next turn.
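
The anchor logic above can be sketched with two pure functions. This is an illustration under assumptions: `StoredMessage` is a hypothetical row shape, `summarizedThroughMessageId` is flattened onto the row instead of living under pageContext, and `needsCompaction` is a made-up helper for the threshold check. The summarizer call and its 3s budget are left out.

```typescript
type StoredMessage = {
  id: string;
  role: "user" | "assistant" | "system_note";
  text: string;
  summarizedThroughMessageId?: string; // flattened here for brevity
};

const COMPACT_THRESHOLD = 24;

// Returns the latest summary note plus only the messages after its anchor.
function listMessagesForReplay(all: StoredMessage[]): StoredMessage[] {
  const note = [...all].reverse().find((m) => m.role === "system_note");
  if (note === undefined || note.summarizedThroughMessageId === undefined) return all;
  const cut = all.findIndex((m) => m.id === note.summarizedThroughMessageId);
  const tail = all.slice(cut + 1).filter((m) => m.role !== "system_note");
  return [note, ...tail];
}

// True once the count of messages past the last anchor reaches the threshold.
function needsCompaction(all: StoredMessage[], threshold = COMPACT_THRESHOLD): boolean {
  const live = listMessagesForReplay(all).filter((m) => m.role !== "system_note");
  return live.length >= threshold;
}
```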

Title generation

Step | Expected
First user message of a new conversation | Haiku-generated title persisted within 1.5s. UI updates.
Anthropic key not configured | No title; logged warning; conversation continues.
Second user message | Title generation is skipped (already set).

Conversation list & detail

Step | Expected
GET /agent/conversations | Returns caller’s conversations sorted by last_message_at desc.
GET /agent/conversations/:id for someone else’s conv in the same org | 403; ConversationsService.getOwn enforces ownership.
GET /agent/conversations/:id returns messages[] with toolCalls[] per assistant message | Each tool call has inverseAvailable filled from the registry.

Negative tests

  • POST /agent/messages as a member → 403.
  • POST /agent/confirm/<bad-id> → 200 SSE with tool_execution_not_found.
  • POST /agent/confirm/<resolved-id> → tool_already_resolved.
  • POST /agent/pick/<id> with id not in candidates → invalid_pick.
  • Send a 20_001-char message → 400 (Zod max).
  • Abort the SSE mid-stream → server cleanly aborts the Anthropic stream; assistant row reflects partial state.
  • ANTHROPIC_API_KEY unset → agent_disabled on send.

Performance

  • p95 time-to-first-token < 1500ms on a warm cache.
  • p95 turn latency on a structured workout build (resolve_batch + create + set_sections): < 8s.
  • Steady-state cache-hit ratio ≥ 0.85 (logged on every agent.turn.completed).
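
The p95 targets above need a fixed percentile definition so that runs are comparable. A minimal sketch using the nearest-rank method follows; the choice of nearest-rank and the `percentile` helper itself are assumptions, not something this plan prescribes.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  const index = Math.min(sorted.length - 1, Math.max(0, rank - 1));
  return sorted[index] as number;
}

// Example gate for the time-to-first-token target (milliseconds).
function meetsTtftTarget(ttftMs: number[], limitMs = 1500): boolean {
  return percentile(ttftMs, 95) < limitMs;
}
```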

Localization

  • Send a prompt in Hebrew → agent replies in Hebrew. The static system prompt instructs locale parity.
  • Send a prompt in Russian on a ru-locale org → agent replies in Russian.
  • Switch locale mid-conversation → agent follows the new language from the next turn.