Ask Hilo - Backend Design Document

How the Ask Hilo conversational health assistant is built - the features, how they connect, and why each part is shaped the way it is.


Section 1: Product Overview & Scope

Ask Hilo is the conversational health assistant inside the Hilo mobile app. The backend hosts chat sessions over a user's wearable-derived health data - blood pressure, sleep, heart rate, steps - streams the coaching agent's answers back live, manages long-running health goals, answers product questions from a vectorised FAQ, remembers durable user facts across sessions, and can flag a conversation to human support via Intercom.

The consumers of this system are the iOS and Android Hilo apps - they talk to this backend over plain HTTPS plus one live-streaming endpoint. The small React app in frontend/ is an internal testing UI only; it is not what users see.

1.1 What the Backend Does

1.2 How the Mobile App Connects

Every request carries a shared API key (X-API-Key) plus four client-context headers: X-User-ID, X-Timezone, X-Platform, X-App-Version. Responses come back as JSON; the chat endpoint is the one exception - it returns a live event stream.


Section 2: System Architecture

The backend is a single FastAPI service that talks to five different datastores, each chosen for a specific job: relational app state in MySQL, user profiles in a separate MySQL owned by the wider Hilo platform, raw watch readings in MongoDB, FAQ vectors in PostgreSQL, and ephemeral streaming + job queues in Redis. Background work - like the nightly goal evaluator - runs in separate worker processes, not inside the web server.

1. Ask Hilo System Overview - Mobile client, FastAPI surface, datastores, background workers and external services

The master picture: the Hilo mobile app calls the FastAPI surface, which reaches five datastores and three external services. Separate worker and scheduler processes share Redis and the databases.

Reading the diagram: the mobile app calls the FastAPI surface, which reaches the datastores and external services. Background workers share the datastores and OpenAI. The dotted line from the frontend marks it as an internal test client; the dotted observe line marks optional Langfuse tracing.

2. Startup & Shutdown Lifecycle - Ordered dependency health-checks at boot and clean teardown

At startup the app validates every critical dependency in order and tears every client down in reverse on shutdown, so a misconfigured datastore fails the process fast rather than at the first user request.

Reading the diagram: each node is a sequential startup step. The process only reaches 'serve traffic' after every critical dependency is reachable. Langfuse is the only branch and never blocks boot.
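
The ordered boot and reverse teardown described above can be sketched roughly as follows. This is a minimal illustration, not the service's actual lifecycle code: the `Dependency` tuple and the `start_all` / `stop_all` helpers are hypothetical names, and the optional Langfuse branch is not modelled.

```python
from typing import Callable, List, Tuple

# (name, connect, close) - all three are hypothetical stand-ins for real clients.
Dependency = Tuple[str, Callable[[], None], Callable[[], None]]


def stop_all(started: List[Dependency]) -> None:
    """Tear clients down in the reverse of their startup order."""
    for _name, _connect, close in reversed(started):
        close()


def start_all(deps: List[Dependency]) -> List[Dependency]:
    """Connect in order; on the first failure, close what was opened and re-raise."""
    started: List[Dependency] = []
    try:
        for dep in deps:
            dep[1]()  # health-check / connect; raises if the dependency is unreachable
            started.append(dep)
    except Exception:
        stop_all(started)
        raise
    return started
```

Re-raising after the partial teardown is what makes a misconfigured datastore fail the process at boot instead of at the first user request.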

2.1 Datastores at a glance

System | What it holds
App MySQL | Sessions, messages, goals, memory, background tasks, Intercom handover records
Hilo User MySQL | User profiles - owned by the wider Hilo platform, read on nearly every chat turn
MongoDB | Raw wearable readings the chat tools query
PostgreSQL + pgvector | FAQ embeddings for semantic search
Redis | Per-turn streaming logs, background job queues, scheduler leader lock

2.2 Why It Is Built This Way

Five separate datastores, each purpose-specific.

Concerns differ: relational app state and user profiles fit MySQL (split so profile reads do not contend with chat writes), high-volume watch-reading documents fit MongoDB, vector similarity fits pgvector, and ephemeral streaming/queueing fits Redis.

Background work in separate processes, coordinated through Redis.

Keeps long-running periodic work off the web event loop; a Redis leader lock ensures only one scheduler is active at a time, even when multiple replicas are running.


Section 3: Chat Conversation & SSE Streaming

The chat core accepts a user message, runs the agent (with tool calls and a safety judge), and streams the assistant's answer back to the mobile app as live events. The defining design choice: generation is decoupled from the client connection. The request kicks off a background task that writes events to a per-turn log; the HTTP connection is just a thin subscriber. If the socket drops mid-stream, generation keeps running, and the client can reconnect from exactly the last event it received - on any instance, with no sticky sessions.

Three things share one mechanism - a per-turn Redis event log plus a durable claim row in MySQL:

  1. Generation survives a drop - the answer is produced independently of the connection.
  2. Resume exactly where you left off - each event has a stable id; the client reconnects with the last id it saw.
  3. No double-billing - every send carries an idempotency key; a retry after success replays the stored answer instead of paying for a second turn.

3. Happy-path chat turn - Mobile app, streaming endpoint, background generation, events back

A complete successful turn showing the streaming subscriber separated from the background generation task that writes the per-turn log.

Reading the diagram: the route never streams the model itself. It spawns generation as a detached task that writes the per-turn Redis log, and returns a thin relay. The two run concurrently - if the app socket drops, generation keeps writing and the turn still completes.

4. Streaming architecture - Producer task, append-only per-turn log, subscriber, resume, idempotent claim

How connection-independent generation, resume-across-reconnect, and idempotent send all share one per-turn log plus a durable MySQL claim.

Reading the diagram: durable truth lives in MySQL (the unique-key claim plus persisted messages); live token delivery lives in the ephemeral Redis log. The dashed arrows are the retry paths - a reconnect re-attaches to the live log, and a retry after a completed turn replays the stored answer without re-paying OpenAI.

5. Turn & claim lifecycle states - The state machine of an idempotency claim

The state machine of an idempotency claim and the terminal events the client observes for each outcome.

Reading the diagram: a claim starts running on the first send. It ends 'done' (clean turn) or 'failed' (error or timeout). A retry on a running claim re-attaches; a retry on a failed claim re-runs; a retry on a done claim replays the persisted answer.

3.1 Events the client may see

Event | Meaning
token | Incremental text chunk from the assistant.
tool_call / tool_result | The agent invoked a data tool and received its result.
done | The main answer is finished. The client is unblocked here even if the safety judge has not yet run.
eval_pending / eval_result | The post-LLM judge is running, then its verdict (approved or revise).
correction_* | If the judge said 'revise', the rewritten answer streams under these events.
suggestions | Up to three follow-up question chips.
timeout / error | Terminal events when the turn exceeded its budget or hit an unrecoverable error.
end | The only clean-finish signal. A socket close without 'end' is a drop and the client should reconnect.

3.2 Why It Is Built This Way

Decouple generation from the client connection.

A dropped socket must not kill (or cause a re-pay of) an already-paid-for model turn. The answer is generated and persisted independently, so it is never lost - recoverable by reconnect or by re-reading the transcript.

Map each event's stream id directly onto the event-stream id field.

Enables exact resume from the precise missed event, on any instance, with no sticky sessions.
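
A rough sketch of that mapping, assuming the per-turn log can be read as an ordered list of `(stream_id, event_name, data)` tuples. The helper names, the sample ids, and the replay-from-start fallback for an unknown id are assumptions made for illustration; the `id:` field and the `Last-Event-ID` reconnect header are standard Server-Sent Events behaviour.

```python
from typing import Iterable, List, Optional, Tuple

Event = Tuple[str, str, str]  # (stream_id, event_name, data)


def format_sse(stream_id: str, event: str, data: str) -> str:
    """Render one event as an SSE frame; the log's id becomes the `id:` field.

    Assumes single-line data; multi-line data would need one `data:` line per line.
    """
    return f"id: {stream_id}\nevent: {event}\ndata: {data}\n\n"


def events_after(log: Iterable[Event], last_event_id: Optional[str]) -> List[Event]:
    """Return the events strictly after last_event_id (all of them if it is None)."""
    events = list(log)
    if last_event_id is None:
        return events
    for index, (stream_id, _event, _data) in enumerate(events):
        if stream_id == last_event_id:
            return events[index + 1:]
    return events  # unknown id: this sketch falls back to replaying from the start
```

Because the resume position is carried by the client rather than held in server memory, any instance that can read the log can serve the reconnect.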

Use a unique idempotency-key insert as the atomic 'send processed?' claim.

Makes a send idempotent across retries: first arrival runs, concurrent retry attaches to the live log, retry-after-done replays the persisted answer - never a second paid turn.
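
The claim logic can be illustrated with a unique-key insert. The sketch below uses an in-memory SQLite database purely as a stand-in for the MySQL claim table; the table name, column names, and the `claim` / `finish` helpers are hypothetical.

```python
import sqlite3


def open_claims() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE chat_claims ("
        " idempotency_key TEXT PRIMARY KEY,"
        " status TEXT NOT NULL)"
    )
    return conn


def claim(conn: sqlite3.Connection, key: str) -> str:
    """Return what the caller should do: 'run', 'attach', 'replay' or 'rerun'."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO chat_claims (idempotency_key, status) VALUES (?, 'running')",
                (key,),
            )
        return "run"  # first arrival wins the unique-key insert
    except sqlite3.IntegrityError:
        pass  # the key already exists: decide from its current status
    (status,) = conn.execute(
        "SELECT status FROM chat_claims WHERE idempotency_key = ?", (key,)
    ).fetchone()
    if status == "running":
        return "attach"
    if status == "done":
        return "replay"
    with conn:  # failed claim: only one retry may flip it back to running
        cursor = conn.execute(
            "UPDATE chat_claims SET status = 'running'"
            " WHERE idempotency_key = ? AND status = 'failed'",
            (key,),
        )
    return "rerun" if cursor.rowcount == 1 else "attach"


def finish(conn: sqlite3.Connection, key: str, ok: bool) -> None:
    """Mark the claim terminal: 'done' on a clean turn, 'failed' otherwise."""
    with conn:
        conn.execute(
            "UPDATE chat_claims SET status = ? WHERE idempotency_key = ?",
            ("done" if ok else "failed", key),
        )
```

The database's uniqueness constraint, not application code, decides which of two concurrent sends runs the turn.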

Emit the 'done' event immediately, then run the safety judge and suggestions in parallel afterwards.

The client gets its answer without waiting on post-processing; the judge can still upgrade the message via the correction events afterward.


Section 4: LLM Layer & Tool Calling

The agent uses OpenAI's GPT-4o by default. On every turn it can call any of around twenty tools - blood-pressure analytics, sleep/heart-rate/steps, goal create/list/confirm, durable memory save/recall, FAQ search, basic math, the user's current time, and an Intercom handover. The agent loops: stream some text, decide if it needs a tool, call it, feed the result back, decide again - until it produces a final answer with no more tool calls.

Each tool is wired to one backing store. The blood-pressure, sleep, heart-rate, and steps tools query MongoDB; goals, memory, and handover write to MySQL; FAQ search hits PostgreSQL's vector index; calc and clock touch no database at all. The user's id, watch id, and timezone are injected server-side on every tool call, so the model can never reach another user's data.

6. Agentic Tool-Calling Loop - How a single chat turn streams tokens, dispatches tools, terminates

One chat turn streams tokens, dispatches tool calls in parallel, feeds results back, and terminates - bounded by a wall-clock budget rather than an iteration cap.

Reading the diagram: the loop has no fixed iteration limit - it repeats call, stream, tool dispatch, feed-back until the model emits text with no tool calls. All tool calls in one turn run in parallel. A wall-clock budget is the only hard bound on total turn duration.
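
The shape of that loop can be sketched with asyncio. This is an illustration under stated assumptions: `call_model` stands in for the streaming OpenAI call, the `tools` dictionary is a hypothetical registry, and returning the string "timeout" is a simplification of the real timeout event.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List, Tuple

ToolCall = Tuple[str, Dict[str, Any]]   # (tool_name, arguments)
ModelStep = Tuple[str, List[ToolCall]]  # (text, requested tool calls)


async def run_turn(
    call_model: Callable[[List[Any]], Awaitable[ModelStep]],
    tools: Dict[str, Callable[..., Awaitable[Any]]],
    budget_seconds: float,
) -> str:
    """Loop until the model answers with no tool calls, bounded only by wall-clock time."""
    history: List[Any] = []

    async def loop() -> str:
        while True:  # deliberately no iteration cap
            text, calls = await call_model(history)
            if not calls:
                return text  # final answer: no more tool calls requested
            # All tool calls requested in one round run concurrently.
            results = await asyncio.gather(
                *(tools[name](**args) for name, args in calls)
            )
            history.append((calls, list(results)))

    try:
        return await asyncio.wait_for(loop(), timeout=budget_seconds)
    except asyncio.TimeoutError:
        return "timeout"
```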

7. Tool-to-Data-Source Map - Every registered tool and the backing store it hits

Every registered tool mapped to its data source. Goals, memory, and handover use MySQL; the blood-pressure, sleep, heart-rate, and steps tools use MongoDB; calc and clock touch nothing; FAQ uses its own PostgreSQL pool.

Reading the diagram: green nodes are databases; orange is OpenAI and the external Intercom API. Memory and FAQ each hit both a database and OpenAI embeddings; the goal tools additionally invoke a small compiler model. Calc and clock touch nothing.

4.1 What the agent can actually do

Tool | What it does | Data source
get_latest_bp_reading | The user's most recent BP reading. | MongoDB
get_bp_reading_history | BP readings over a window, with optional filters. | MongoDB
get_bp_metrics_summary | Averages, in-target %, best / worst, daypart baselines. | MongoDB
get_morning_vs_evening_delta | Time-of-day BP pattern. | MongoDB
get_rolling_bp_average | Day-by-day rolling trend curve. | MongoDB
summarize_last_n_days | Broad recap (the preferred multi-metric summary). | MongoDB
get_user_sleep_metrics | Average sleep hours + daily trend. | MongoDB
get_user_step_metrics | Step count average + daily trend. | MongoDB
get_user_heart_rate_metrics | Heart rate average, resting estimate, daily trend. | MongoDB
save_memory / recall_memory | Save a durable user fact; search older facts semantically. | MySQL + OpenAI embedding
create_goal / confirm_goal | Compile a natural-language goal and activate it after user confirms. | MySQL + small compiler model
list_goals / get_goal_progress | List the user's goals and show progress. | MySQL
update_goal / delete_goal | Edit or remove a goal. | MySQL
search_faq | Semantic FAQ / product-doc search. | PostgreSQL + OpenAI embedding
calculate | Deterministic arithmetic - so reported numbers are computed, not hallucinated. | None
get_current_datetime | Current time in the user's timezone. Called first to anchor relative phrases like 'today'. | None
handover_to_intercom_fin | Escalate an out-of-scope support question. | MySQL + Intercom REST

4.2 Why It Is Built This Way

Agent loop bounded by wall-clock budget, not a max-iteration count.

Lets the model chain as many tool rounds as a task genuinely needs (e.g. list_goals, then create_goal, then confirm_goal), while a hard time budget keeps any one turn from running unbounded.

Dispatch tool calls concurrently, and force the user's identifiers server-side on every tool call.

Concurrency cuts latency when the model needs several tools at once; forcing identifiers means the model can never reach another user's data or fabricate an audit id.
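
One way to picture the identifier forcing is a merge step in which server-side context always overwrites anything the model supplied. The argument names and the `build_tool_args` helper below are hypothetical; only the principle comes from the design above.

```python
from typing import Any, Dict

# Hypothetical argument names for the identifiers the server injects.
FORCED_KEYS = ("user_id", "watch_id", "timezone")


def build_tool_args(model_args: Dict[str, Any], request_ctx: Dict[str, str]) -> Dict[str, Any]:
    """Merge model-supplied arguments with server-side identity; the server always wins."""
    args = {key: value for key, value in model_args.items() if key not in FORCED_KEYS}
    for key in FORCED_KEYS:
        args[key] = request_ctx[key]  # taken from the authenticated request, never from the model
    return args
```

Even if the model emits a different `user_id`, the value is discarded before the tool runs.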

Calc evaluates arithmetic deterministically rather than letting the model do it.

So every number the agent reports - deltas, percentages, averages - is computed, not hallucinated.
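
A deterministic evaluator of this kind can be built on Python's `ast` module. The sketch below is an assumption about how such a tool might look, not the service's actual calc implementation; it accepts only numeric literals and the four basic operators.

```python
import ast
import operator

_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def calculate(expression: str) -> float:
    """Evaluate plain arithmetic; anything else (names, calls, attributes) is rejected."""

    def evaluate(node: ast.AST) -> float:
        if isinstance(node, ast.Expression):
            return evaluate(node.body)
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](evaluate(node.left), evaluate(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](evaluate(node.operand))
        raise ValueError("unsupported expression")

    return evaluate(ast.parse(expression, mode="eval"))
```

Walking the parsed tree, rather than calling `eval`, keeps the tool from executing anything other than arithmetic.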

Assemble a byte-identical system prompt per session.

Pinning the memory snapshot to session-open time keeps the prompt stable across turns, so OpenAI's prompt cache reduces cost and latency without any external cache.


Section 5: Safety & Response Quality

Every chat turn is wrapped in three safety layers. The numbering matches the audit log - there is no Layer 2.

Layer 1 - before the model runs. Four quick regex checks. Emergency keywords (chest pain, can't breathe, self-harm) return crisis hotlines and the model is never called. Medical advice requests ('should I take 20mg?') return a polite decline. Personal info (phone, email, SSN, card) is redacted in place but the message still goes through, because health users routinely include numbers. Prompt-injection attempts are silently blocked.

Layer 3 - after the model has answered, regex-checked. Looks for dosage advice, prescription verbs, diagnoses, BP categorisation. If something slips through, a soft correction is appended to the streamed reply; the model is never re-run. Sentences telling the user to talk to their doctor are explicitly exempt - that is the correct, mandated safety behaviour.

Layer 4 - after the model has answered, judged by another model. An OpenAI reasoning model (o4-mini) inspects the answer against the same tool data the agent saw and checks nine 'violation' rules (hallucination, diagnosis, prescription, persona break, etc.). If it says 'revise', the answer is rewritten once - a single trusted rewrite with no second judge pass. The judge is fail-open: any judge error counts as 'approved' so the user is never blocked by a degraded judge.

8. Layered Safety Pipeline - Every chat turn passes through these layers

A user message flows top-to-bottom through pre-model checks, the agent generation loop, post-model regex, and post-model judge. Hard blocks short-circuit before the model is ever called.

Reading the diagram: red is the Layer 1 guard that can short-circuit before any LLM call; blue boxes are the post-LLM stages; orange is the LLM judge and retry agent. Note that personal-info detection does NOT block - it redacts and proceeds.

9. Post-LLM Judge and Retry Flow - Judge once, rewrite at most once

After the reply and 'done' event are sent, the judge runs concurrently with suggestion generation. A 'revise' verdict triggers a single trusted rewrite; everything else is fail-open.

Reading the diagram: the judge and suggestion generation run concurrently so the user is never blocked waiting on the judge. The retry runs the agent once and trusts the rewrite - there is no re-judge of the corrected reply. Any judge failure produces an 'approved' verdict so safety evaluation never blocks the response.

5.1 What the judge checks for

Rule | Category | What it catches
V1 | hallucination | Fabricated health figure that contradicts the tool result, or stale data presented as current.
V2 | safety | Diagnosis applied to the person ('you have hypertension') vs. describing the reading.
V3 | safety | Prescriptive advice or any medication touch. Clinician redirect ('talk to your doctor') is always fine.
V4 | safety | Adopting a clinician persona or complying with 'act as a doctor' jailbreaks.
V5 | safety | Answering a diagnosis / dosage / prognosis question instead of refusing.
V6 | safety | Leaking internal instructions, tool names, or identifiers.
V7 | hallucination | Fabricated non-numeric claim - invented memory, doctor name, or device capability.
V8 | safety | Causal over-claim - asserting a definitive cause vs. an association.
V9 | safety | Unsafe praise of a single low reading with no supporting trend.

5.2 Why It Is Built This Way

Layer 1 runs before persisting the user message; only the redacted text is stored and sent.

Raw personal information never reaches the database, the trace, or the model.

Personal info is redacted, not hard-blocked.

Health users routinely include numbers (readings, device ids) that broad PII patterns would catch; a hard block would refuse legitimate questions.
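
As an illustration of redact-and-proceed, the sketch below replaces matches in place and returns the message instead of rejecting it. The two patterns are deliberately simple, hypothetical examples and are not the service's actual Layer 1 rules.

```python
import re

# Illustrative patterns only; real PII detection would be broader.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"(?<!\d)\+?\d[\d\s().-]{8,}\d(?!\d)")


def redact(text: str) -> str:
    """Replace personal info in place; the rest of the message passes through unchanged."""
    text = EMAIL.sub("[email]", text)
    text = PHONE.sub("[phone]", text)
    return text
```

In this sketch a short reading such as 128/82 survives untouched, while the email address and phone number are masked before the text would be stored or sent on.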

Layer 4 judge is fail-open.

The judge must never block the response path. A degraded judge degrades to a no-op rather than withholding the user's answer.
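
Fail-open behaviour amounts to a wrapper of roughly this shape. The `safe_verdict` name and the treatment of an unrecognised verdict as 'approved' are assumptions for illustration.

```python
from typing import Callable


def safe_verdict(judge: Callable[[str], str], answer: str) -> str:
    """Return the judge's verdict, or 'approved' if the judge fails in any way."""
    try:
        verdict = judge(answer)
    except Exception:
        return "approved"  # fail-open: a degraded judge becomes a no-op
    # Assumption: a malformed verdict is also treated as a judge failure.
    return verdict if verdict in ("approved", "revise") else "approved"
```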

Single-pass retry - the rewrite is trusted with no second judge pass.

Bounds cost and latency to one judge call plus at most one rewrite per turn; the retry prompt is narrowly scoped to fix only the flagged problem.


Section 6: Goals

A user can set a personal health goal in plain language - e.g. "keep my evening blood pressure under 130 five days out of seven" - and the backend tracks progress automatically against the user's watch data. The system is built in two cleanly-separated halves:

  1. The model compiles the goal once, at create time. The natural-language goal is translated into a deterministic recipe - a MongoDB query that produces a daily number, plus a tiny comparison expression that decides hit / miss. Three sandbox checks then run before saving: a pipeline whitelist (no dangerous operators), a math-expression check, and a replay against the user's actual last few periods of data. If validation fails, the model gets one retry with the error fed back in.
  2. Then the model is out of the loop. Every night a worker re-runs the saved recipe against the user's data, decides hit or miss, and writes one row. Goals roll up to 'achieved', 'missed', or 'errored' automatically. The model is never called during scheduled evaluation - cost stays bounded and behaviour is reproducible.
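
The first of those sandbox checks, the pipeline whitelist, might look roughly like the sketch below. The allowed and forbidden sets are illustrative assumptions (the operators themselves are real MongoDB aggregation operators); the service's actual lists are not documented here.

```python
from typing import Any, Iterator, List

# Assumed lists for illustration; the real whitelist may differ.
ALLOWED_STAGES = {"$match", "$project", "$group", "$sort", "$limit", "$addFields"}
FORBIDDEN_OPERATORS = {"$where", "$function", "$accumulator", "$out", "$merge", "$lookup"}


def _keys(value: Any) -> Iterator[str]:
    """Yield every dictionary key found anywhere inside a nested structure."""
    if isinstance(value, dict):
        for key, inner in value.items():
            yield key
            yield from _keys(inner)
    elif isinstance(value, list):
        for item in value:
            yield from _keys(item)


def validate_pipeline(pipeline: List[Any]) -> List[str]:
    """Return a list of problems; an empty list means the pipeline passed."""
    errors: List[str] = []
    for index, stage in enumerate(pipeline):
        if not isinstance(stage, dict) or len(stage) != 1:
            errors.append(f"stage {index}: must be a single-key object")
            continue
        (name,) = stage  # the stage's only key, e.g. "$match"
        if name not in ALLOWED_STAGES:
            errors.append(f"stage {index}: {name} is not allowed")
        for key in _keys(stage[name]):
            if key in FORBIDDEN_OPERATORS:
                errors.append(f"stage {index}: operator {key} is forbidden")
    return errors
```

Returning the error messages, rather than a boolean, fits the retry step: the text can be fed back to the compiler model for its one retry.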

10. Goal Creation & Compilation Pipeline - From chat utterance to a persisted, validated, replay-checked goal

Goal content is created exclusively through the chat tool surface. The compiler runs once, and its output must pass three sandbox checks - the last of which is a real-data replay - before any row is written.

Reading the diagram: blue nodes are the chat tool and service path; orange is the one-time compiler with its retry loop; green are the datastores. A row is written to MySQL only after validation and replay succeed. Red nodes are the dead ends where no goal is persisted.

11. Periodic Goal Evaluation Worker - Dispatcher fan-out and per-goal scoring against watch_readings

A daily worker sweeps due goals, fans them into chunk jobs, and scores each goal deterministically (no model call) against the user's watch data.

Reading the diagram: the dispatcher is fire-and-forget - it queues chunk jobs and exits. Each chunk worker re-checks due-ness, runs the saved recipe against MongoDB, writes one row per goal per period, and only then rolls the history into a possible status change.

12. Goal Lifecycle States - Status transitions across confirmation, evaluation, and lifecycle controls

A goal moves through six statuses. The compiler creates it in 'awaiting_confirmation'; user actions and the evaluator drive the rest.

Reading the diagram: 'awaiting_confirmation' (orange) is the only state the compiler writes; a description update re-enters it via a full recompile. 'achieved', 'missed', and 'errored' are terminal, but 'errored' can be re-activated.

6.1 What the system can measure

The compiler is restricted to a closed list of fields. The user's watch produces one document per day:

Field | Notes
Average blood pressure (systolic / diastolic) | Daily average in mmHg.
Last reading of the day | The final BP reading + its time-of-day.
Heart rate | Single daily value.
Average sleep | Hours per night.
Average steps | Daily step count.
In-target percent | Percent of intraday readings inside the target band.
BP category | Optimal / Normal / High-Normal / Hypertension 1 / Hypertension 2.

6.2 How daily hits roll up to a verdict

Kind | Achieved when | Missed when
streak | Current consecutive-hit streak meets the target. | (Never auto-missed; stays active.)
count_hits_in_window | Total hits within the window meet the target. | The full window elapses and the target is not met.
sustained_pattern | The full window elapses and the target is met. | The full window elapses and the target is not met.
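
The roll-up rules above can be written as one small pure function. This sketch assumes `hits` holds one boolean per evaluated period inside the current window (oldest first), that "meets the target" means a hit count of at least `target`, and that an undecided goal stays 'active'; the function and parameter names are hypothetical.

```python
from typing import List


def roll_up(kind: str, hits: List[bool], target: int, window: int) -> str:
    """Return 'achieved', 'missed' or 'active' for one goal."""
    total = sum(hits)
    window_elapsed = len(hits) >= window

    if kind == "streak":
        streak = 0
        for hit in reversed(hits):  # count consecutive hits ending at the latest period
            if not hit:
                break
            streak += 1
        return "achieved" if streak >= target else "active"  # never auto-missed

    if kind == "count_hits_in_window":
        if total >= target:
            return "achieved"  # can be achieved before the window ends
        return "missed" if window_elapsed else "active"

    if kind == "sustained_pattern":
        if not window_elapsed:
            return "active"  # only decided once the full window has elapsed
        return "achieved" if total >= target else "missed"

    raise ValueError(f"unknown goal kind: {kind}")
```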

6.3 Why It Is Built This Way

The model is used only at compile time; recurring evaluation is fully deterministic.

Bounds cost and latency, makes scoring reproducible, and confines prompt-injection risk to the one-time compile step rather than every daily run.

The compiler's output passes three independent sandbox checks before any row is written.

The model emits code that will be executed; the layers block dangerous operators and replay the goal against real data to prove the predicate actually runs.

Goals require an explicit confirm step before they start tracking.

The preview-then-confirm flow lets the user verify the model's interpretation against a real preview series before tracking starts.


Section 7: Memory & FAQ Retrieval

Two semantic-search subsystems. They share the same embedding model but live in different databases for different reasons.

User Memory is a per-user store of durable facts - doctor's instructions, medications, dietary preferences, experiment outcomes, significant health events (explicitly NOT blood-pressure readings or goal progress, which live in structured tables). The agent decides what to save via a tool; nothing is extracted automatically. Stored memories show up two ways: automatically, the most recent twenty are prefixed into the system prompt at the start of each session; on demand, the agent calls a 'recall' tool for older specific facts. Memory lives in MySQL with embeddings stored as JSON. Similarity is computed in Python over a small candidate pool - small enough not to need a dedicated vector index.

The FAQ knowledge base answers product questions (charging, calibration, app features) and is the true vector-search subsystem. FAQ rows live in PostgreSQL with a real vector column and a similarity index. The agent calls 'search_faq', the question gets embedded, and the database returns the top matches. Ingestion is CSV-driven and idempotent - it only re-embeds new or changed rows.
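
One common way to make CSV ingestion idempotent is to store a content hash per row and re-embed only when it changes. The sketch below assumes that approach; the column names (`id`, `question`, `answer`) and helper names are hypothetical, and the actual embedding call is left out.

```python
import csv
import hashlib
import io
from typing import Dict, List, Tuple


def content_hash(question: str, answer: str) -> str:
    """Stable fingerprint of the text that would be embedded."""
    return hashlib.sha256(f"{question}\n{answer}".encode("utf-8")).hexdigest()


def rows_needing_embedding(
    csv_text: str, stored_hashes: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (faq_id, text_to_embed, hash) for rows that are new or changed."""
    todo: List[Tuple[str, str, str]] = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        digest = content_hash(row["question"], row["answer"])
        if stored_hashes.get(row["id"]) != digest:
            # Ingestion embeds the combined question + answer text.
            todo.append((row["id"], row["question"] + "\n" + row["answer"], digest))
    return todo
```

Running the same CSV twice yields an empty work list the second time, so no embedding calls are made for unchanged rows.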

13. User Memory: Write and Recall Paths - Tool-driven saves to MySQL; recall and the session prelude

Memory is written and read only through tools. Embeddings are stored as JSON in MySQL and scored in Python - there is no dedicated vector index here.

Reading the diagram: blue nodes are application / agent layers, orange is the external embedding call, and green nodes are MySQL operations. Both write and read paths embed text via OpenAI, but recall scoring happens in Python, not in the database.

14. FAQ Retrieval: Ingestion and Query Time - CSV seed to PostgreSQL vector index; semantic search at query time

FAQ is the true vector-search subsystem. CSV rows are embedded and stored in PostgreSQL; at query time the search tool runs a nearest-neighbour query.

Reading the diagram: the ingestion path embeds combined question + answer text and stores into PostgreSQL; the query path embeds only the user's question. Both call the same embedding model. All ranking and filtering happens inside the database.

7.1 Why It Is Built This Way

Memory writes are entirely tool-driven, with no background extraction pipeline.

Gives the coaching agent explicit, auditable control over what becomes a durable fact - and lets the tool description enforce policy.

Memory uses JSON embeddings in MySQL scored in Python; FAQ uses a real vector index in PostgreSQL.

Memory candidate pools are small (a few hundred rows per user) so a simple numerical pass is fast enough. FAQ can grow larger and is read-heavy, so it gets the dedicated index.
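
Scoring in Python over JSON-stored embeddings can be as small as the sketch below. The row shape `(fact_text, embedding_json)`, the helper names, and the tiny two-dimensional vectors used in any example call are assumptions; real embeddings have far more dimensions.

```python
import json
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall(
    query_embedding: Sequence[float],
    rows: List[Tuple[str, str]],
    top_k: int = 3,
) -> List[str]:
    """Rank (fact_text, embedding_json) rows by similarity and return the best facts."""
    scored = [
        (cosine(query_embedding, json.loads(embedding_json)), text)
        for text, embedding_json in rows
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _score, text in scored[:top_k]]
```

With a few hundred rows per user this is a linear scan over a short list, which is why no vector index is needed on the memory side.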

Memory recall is split into an automatic prelude plus an on-demand search tool.

Recent memories are cheap to always include for context; older or specific facts are fetched only when the model explicitly needs them.


Section 8: Intercom Human Handover

When the user asks a product or support question that the FAQ couldn't answer, and the user explicitly agrees to be flagged, the agent calls a tool that opens an Intercom conversation. Medical, dosage, symptom, and other safety-covered questions are deliberately routed to refusal templates instead - never to a handover.

The handover is recorded as an audit row in MySQL: it starts as 'pending', then becomes 'accepted' on success or 'failed' on error. A failure never blocks the chat turn - the assistant apologises in one clause and moves on; the failure is still recorded for audit.

Return path is not built. If an Intercom agent or Fin replies, that reply has no path back to the mobile app yet - no webhook, no Messenger SDK, no polling. This is a known gap on the roadmap.

15. Intercom Handover - Full Outbound Flow - User asks for a human, conversation is created in Intercom, and why the reply cannot return

End-to-end sequence of a successful handover. The return leg is dashed because no inbound integration exists in the backend yet.

Reading the diagram: solid arrows are calls that actually happen in code today. The two-step Intercom call (contacts then conversations) creates the handover. The final dashed / crossed arrow shows the missing return leg - an agent reply in Intercom has no path back to the mobile app.

16. Handover State Lifecycle - Status transitions of a handover record

The persisted handover row moves through three states. There is no path back from a terminal state; a later question that needs support creates a brand-new row.

Reading the diagram: every handover begins as 'pending' (the row is written before any network call so an attempt is always durably recorded). It then transitions exactly once - to 'accepted' if both Intercom calls succeed, or to 'failed' if any error is raised.

8.1 Why It Is Built This Way

Handover is triggered by an explicit user agreement and a failed FAQ lookup - never by the safety layer.

Medical / dosage / symptom questions go to refusal templates, not handover. Handover is reserved for genuine out-of-scope product questions.

The user id and session id are forced server-side, not trusted from the agent.

Tenant isolation and audit integrity - a fabricated id from the model could attach a handover to the wrong user or session.

Errors are persisted as a 'failed' audit row and returned as a status envelope; the tool never throws into the agent.

A support / Intercom outage must never block or crash the chat turn. The coach apologises in one clause and moves on; the failure is still recorded.
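
The never-throw contract can be sketched as follows. A plain list stands in for the MySQL audit table and `create_conversation` for the two-step Intercom call; both, along with the envelope fields, are hypothetical.

```python
from typing import Any, Callable, Dict, List


def handover(
    audit_rows: List[Dict[str, Any]],
    user_id: str,
    session_id: str,
    create_conversation: Callable[[str], str],
) -> Dict[str, Any]:
    """Record the attempt first, then call out; always return a status envelope."""
    row: Dict[str, Any] = {"user_id": user_id, "session_id": session_id, "status": "pending"}
    audit_rows.append(row)  # written before any network call, so the attempt is always recorded
    try:
        conversation_id = create_conversation(user_id)
    except Exception as exc:
        row["status"] = "failed"
        row["error"] = str(exc)
        return {"status": "failed"}  # the agent sees a status, never an exception
    row["status"] = "accepted"
    return {"status": "accepted", "conversation_id": conversation_id}
```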


Section 9: Auth, Sessions & Data Model

Authentication is deliberately minimal. The mobile app sends a shared API key in every request plus four context headers (user id, timezone, platform, app version). The API key is timing-safe-compared against a single secret; missing or wrong returns 401. User identity itself is owned by the wider Hilo platform - this backend reads profiles from a separate user database but never writes to it.

'Session' here means a chat conversation, not an auth session. There is no server-side login state. There is also a dev-only username / password / JWT login route, but it is never registered in production.

17. Auth Middleware Decision Path - Per-request gate run before any route handler

Every HTTP request flows through the auth middleware before reaching any route. It enforces a shared API key and the four client-context headers.

Reading the diagram: top-to-bottom is the order of checks. Red nodes are the two terminal rejections (401 for the API key, 400 for missing context headers). There is no per-user credential check - once the API key passes and the headers are present, the user id is trusted downstream.
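
The order of checks can be expressed as a small pure function. This is a sketch rather than the real middleware: `check_request` is a hypothetical name, and the headers are modelled as a plain case-sensitive dictionary, whereas a real framework would match header names case-insensitively. `hmac.compare_digest` is the standard-library timing-safe comparison.

```python
import hmac
from typing import Mapping

REQUIRED_CONTEXT = ("X-User-ID", "X-Timezone", "X-Platform", "X-App-Version")


def check_request(headers: Mapping[str, str], expected_key: str) -> int:
    """Return the HTTP status the gate would produce: 401, 400, or 200 to continue."""
    supplied = headers.get("X-API-Key", "")
    # Timing-safe comparison against the single shared secret.
    if not hmac.compare_digest(supplied.encode("utf-8"), expected_key.encode("utf-8")):
        return 401
    if any(not headers.get(name) for name in REQUIRED_CONTEXT):
        return 400
    return 200  # the user id is trusted downstream from here on
```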

18. App MySQL Database - Entity Relationships - Tables owned by this service

The tables this app owns in its own MySQL DB. User id and watch id are plain string columns (the join keys into the external Hilo User DB), not foreign keys.

Reading the diagram: only the chat-messages and goal-evaluation relationships are real database foreign keys. The user id and watch id columns elsewhere link to the EXTERNAL Hilo User database, so there is no foreign key to a users table here.

19. Database Topology - Who Owns What - Cross-datastore split

The backend's persistent data is split across four datastores (Redis, the fifth store, holds only ephemeral state). Identity is owned externally; the app owns its own MySQL; watch data is Mongo; FAQ vectors are PostgreSQL.

Reading the diagram: the app writes only to its own MySQL (green). The Hilo User MySQL is read-only (identity is owned upstream). The watch readings in MongoDB feed the goal evaluator and the chat data tools.

9.1 Why It Is Built This Way

Single shared API key plus a trusted user-id header as the production auth model.

Identity is owned upstream by the wider Hilo platform; this backend treats the mobile client as a trusted first-party caller. The middleware only proves the caller holds the shared secret, then trusts the user id as context. This keeps the hot chat path cheap and avoids duplicating an identity provider.

User identity accessed via raw SQL rather than as a managed ORM model.

The user table is owned by an external system with a fixed schema; modelling it as an ORM entity here would imply ownership. Reading it via raw SQL keeps migrations and writes out of scope.

User-id and watch-id stored as plain string columns, not foreign keys.

Those identifiers reference the external Hilo User database, so a cross-database foreign key is impossible. The app indexes them instead and enforces scoping in queries.


Section 10: Background Workers & Scheduling

Three process types run, sharing one Redis: the API server, one or more workers that pull jobs from queues, and exactly one active scheduler that fires the cron jobs. The scheduler uses a Redis lock so even with multiple replicas only one is the active leader; the leader heartbeats every ten seconds and self-terminates if it loses the lock - so a standby can take over without producing a split-brain.

There are three background jobs today:

  1. Nightly goal evaluation (default 02:00 UTC). Paginates due goals and fans them out to chunk workers that score each goal against MongoDB.
  2. Daily chat-claim cleanup (03:30 UTC). Reaps stuck idempotency claims and deletes old rows so the table stays bounded.
  3. On-demand QA evaluation pipeline. Runs the live agent over a CSV of test questions, kicked off by a dev-only endpoint. Used for batch quality measurement, not customer traffic.

20. Worker topology and leader election - How the scheduler, workers, and API server cooperate over Redis

The API does not start workers; three process types run independently and coordinate through Redis. Multiple scheduler replicas race for a single Redis lock so only one fires cron at any time.

Reading the diagram: Scheduler A wins the leader lock and keeps it alive with a heartbeat; Scheduler B sees the key exists and idles, retrying every ten seconds. Only the leader registers cron and enqueues jobs. Any number of workers consume those queues in parallel - workers need no leader election because each enqueued job is delivered to exactly one worker.

21. Background jobs at a glance - Cadence, responsibility, and what each one touches

The current background surface: two cron jobs plus one ad-hoc job, each entering through its own queue.

Reading the diagram: the left column is what triggers each job (two crons plus one dev HTTP route). The middle column is the job itself; note that the goal sweep is only a dispatcher that fans out chunk jobs. The right column is every datastore each job reads or writes.

10.1 Why It Is Built This Way

Workers and scheduler run as standalone processes, not inside the API.

Decouples the worker tier so API, workers, and scheduler scale and restart independently.

Single-leader scheduler via a Redis lock with heartbeat and hard stand-down.

Running multiple schedulers would double-register and double-fire jobs. The lock guarantees exactly one leader and the hard-exit on lock loss prevents a split-brain.
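
The lock-and-heartbeat pattern can be sketched with the `set(..., nx=True, ex=...)`, `get`, and `expire` calls that redis-py provides. To keep the example self-contained, a tiny in-memory stand-in replaces the Redis client (it does not model key expiry); the key name, the TTL, and the helper names are hypothetical, and string values assume a client created with `decode_responses=True`.

```python
from typing import Dict, Optional

LOCK_KEY = "scheduler:leader"  # hypothetical key name
LOCK_TTL_SECONDS = 30          # hypothetical TTL, longer than the heartbeat interval


class FakeRedis:
    """In-memory stand-in exposing only the three client calls used below."""

    def __init__(self) -> None:
        self.store: Dict[str, str] = {}

    def set(self, name: str, value: str, nx: bool = False, ex: Optional[int] = None):
        if nx and name in self.store:
            return None  # the real client also returns None when NX fails
        self.store[name] = value
        return True

    def get(self, name: str) -> Optional[str]:
        return self.store.get(name)

    def expire(self, name: str, time: int) -> bool:
        return name in self.store


def try_acquire(client, instance_id: str) -> bool:
    """Become leader only if no one else holds the lock."""
    return bool(client.set(LOCK_KEY, instance_id, nx=True, ex=LOCK_TTL_SECONDS))


def heartbeat(client, instance_id: str) -> bool:
    """Extend the lock if we still hold it; False means stand down immediately."""
    if client.get(LOCK_KEY) != instance_id:
        return False
    client.expire(LOCK_KEY, LOCK_TTL_SECONDS)
    return True
```

The get-then-expire pair in `heartbeat` is not atomic; a production version would more likely perform the check and the extension in a single server-side script.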

Dispatcher / chunk fan-out for goal evaluation, not one monolithic sweep.

The dispatcher enqueues chunk jobs and exits, so multiple worker processes can evaluate chunks in parallel. Keeps any single job short.


Section 11: Production Deployment (AWS)

In production Ask Hilo runs in a single AWS VPC. The API runs as containers on ECS Fargate behind an Application Load Balancer. Background workers and the scheduler run on an EC2 instance. The application's own databases (MySQL, PostgreSQL, Redis) are managed AWS services. The external data sources - the client's MongoDB watch-readings store and the Hilo user-profile database - are reached over the network but never provisioned by this stack. Every datastore sits in a private subnet; only the load balancer is internet-facing.

22. Production Deployment Topology - AWS services and the data-store ownership boundary

The production picture: the mobile app reaches the load balancer, which routes to the FastAPI tasks on Fargate. Workers and the scheduler run on EC2. Solid lines are stores this stack provisions; dotted lines are external systems reached over the network.

Reading the diagram: blue is application compute and networking, green is databases, orange is Redis and external services, purple is secrets and logging. Only the load balancer is public; everything else lives in private subnets.

11.1 Data-Store Ownership

API on Fargate, workers on EC2, and exactly one owned application database.

The API is stateless and bursty, so it scales by Fargate task count. The worker and scheduler are long-running processes (with a single active scheduler elected through the Redis lock) which fit a persistent EC2 host. The deployment creates and migrates only the application's own MySQL - the client's MongoDB and the Hilo user-profile MySQL are external dependencies this stack connects to but never owns.