The Problem — Stateless by Design
Large Language Models (LLMs) like GPT, Claude, and Llama are phenomenal at generating human-like text, but they share a fundamental architectural limitation: they are stateless. Each inference is independent, with no inherent memory of past interactions.
Modern LLMs have "context windows": a bounded amount of text (roughly 8K to 200K+ tokens, depending on the model) that can be processed in a single inference call. This allows them to consider information provided within a single conversation session, but it is fundamentally different from persistent memory:
- Context windows are ephemeral: Information only exists for the duration of a single session
- Context windows are limited: Even 200K tokens isn't enough for years of medical history
- Context windows are expensive: Processing large contexts on every inference call is slow and costly
- Context windows lack structure: They're just text dumps, not organized knowledge
For healthcare applications where patient history accumulates over months or years, context windows are a band-aid, not a solution. We needed to build true persistent memory.
Our Architecture
The JSS AI Labs Memory Engine is a multi-stage pipeline designed to ingest, index, retrieve, and synthesize information with medical-grade accuracy and safety. Here's how each stage works:
Multi-Modal Ingestion Pipeline
Medical information comes in many forms: free-text conversations, structured intake forms, uploaded lab reports, doctor's notes, and prescriptions. Our ingestion pipeline handles all of these:
- Conversational Parsing: Extract medical entities (symptoms, medications, dates) from natural language using NER (Named Entity Recognition) fine-tuned on medical corpora
- Document Understanding: OCR and layout analysis for scanned documents, preserving structure and relationships
- Structured Data Integration: Direct ingestion of structured medical records (FHIR format when available)
- Temporal Anchoring: Every piece of information is timestamped and linked to a point in the patient journey
The goal is to create a rich, structured representation of patient information, not just store raw text.
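To make the conversational-parsing and temporal-anchoring steps concrete, here is a minimal sketch. It uses a tiny hand-written lexicon as a stand-in for the fine-tuned medical NER model (the entity lists, the `ExtractedFact` shape, and the function names are illustrative, not our production API):

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical mini-lexicon standing in for a medical-NER model.
SYMPTOMS = {"nausea", "headache", "chest pain"}
MEDICATIONS = {"ibuprofen", "metformin", "prenatal vitamins"}

@dataclass
class ExtractedFact:
    entity: str
    kind: str            # "symptom" | "medication"
    observed_at: str     # ISO timestamp (the temporal anchor)

def extract_facts(utterance: str, session_time: datetime) -> list[ExtractedFact]:
    """Toy conversational-parsing step: match known entities and
    anchor each one to the session timestamp."""
    text = utterance.lower()
    facts = []
    for term in SYMPTOMS:
        if term in text:
            facts.append(ExtractedFact(term, "symptom", session_time.isoformat()))
    for term in MEDICATIONS:
        if term in text:
            facts.append(ExtractedFact(term, "medication", session_time.isoformat()))
    return facts

now = datetime(2025, 3, 1, tzinfo=timezone.utc)
facts = extract_facts("I've had nausea since starting metformin", now)
```

In the real pipeline the lexicon lookup is replaced by a transformer-based NER model, but the output contract is the same: structured facts, each carrying its temporal anchor.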
Hybrid Indexing (Vector + Graph)
Once information is ingested, it needs to be indexed for efficient retrieval. We use a hybrid approach combining two complementary indexing strategies:
1. Vector Embeddings (Semantic Search):
- Text chunks are converted to high-dimensional vector embeddings using medical-domain models
- Stored in a vector database (pgvector on PostgreSQL) for fast similarity search
- Enables semantic retrieval: "chest pain" matches "thoracic discomfort" even without exact keyword overlap
- Chunking strategy uses overlapping sliding windows so context isn't lost at chunk boundaries
2. Knowledge Graph (Explicit Relationships):
- Medical entities (symptoms, medications, conditions) are nodes in a graph
- Relationships are edges (causes, treats, contradicts, precedes)
- Enables reasoning: "If patient has gestational diabetes AND is asking about diet, retrieve diabetic nutrition guidelines"
- Temporal edges track how conditions evolve over time
Vector search finds semantically similar information; graph search finds logically related information. Together, they provide comprehensive retrieval.
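The two retrieval paths above can be sketched in a few lines. This is a deliberately tiny in-memory version (hand-rolled cosine similarity and an adjacency list stand in for pgvector and the graph store; the embeddings and relations are made up for illustration):

```python
import math
from collections import defaultdict

# --- Vector side (stands in for pgvector cosine search) ---
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Tiny fake embeddings; in production these come from a medical-domain encoder.
chunks = {
    "note-1": ([0.9, 0.1], "patient reports thoracic discomfort"),
    "note-2": ([0.1, 0.9], "discussed diet preferences"),
}

def vector_search(query_vec, k=1):
    ranked = sorted(chunks, key=lambda cid: cosine(query_vec, chunks[cid][0]),
                    reverse=True)
    return ranked[:k]

# --- Graph side (explicit medical relationships) ---
edges = defaultdict(list)
def relate(src, rel, dst):
    edges[src].append((rel, dst))

relate("gestational diabetes", "treated_by", "diabetic nutrition guidelines")

def graph_search(entity):
    return [dst for _, dst in edges[entity]]

# Hybrid retrieval = union of both strategies
semantic_hits = vector_search([0.8, 0.2])          # a "chest pain"-like query vector
logical_hits = graph_search("gestational diabetes")
```

Note how the semantic path surfaces "thoracic discomfort" for a chest-pain-like query with no keyword overlap, while the graph path follows an explicit `treated_by` edge that no embedding similarity would guarantee.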
Intelligent Retrieval with Re-ranking
When a user asks a question, we don't just dump their entire medical history into the LLM context. Instead, we intelligently retrieve the most relevant information:
- Query Analysis: Decompose the user's question into sub-queries and extract key medical entities
- Hybrid Retrieval: Simultaneously query the vector database (semantic similarity) and knowledge graph (logical relationships)
- Re-ranking: Use a cross-encoder model to re-rank retrieved chunks based on relevance to the original query
- Temporal Filtering: Prioritize recent information but preserve critical historical context
- Context Assembly: Construct a focused, structured context (typically 4K-8K tokens) containing only the most relevant information
This retrieval process happens in milliseconds, providing the LLM with precisely the information it needs without overwhelming it with irrelevant data.
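A compressed sketch of the re-ranking, temporal-filtering, and context-assembly steps follows. The scoring formula is illustrative (a real cross-encoder scores query-chunk pairs jointly; here a recency-weighted blend stands in for it), and the token budget mirrors the 4K-8K figure above:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    days_old: int
    base_score: float   # similarity from the first-pass retriever
    tokens: int

def rerank_score(chunk: Chunk) -> float:
    """Stand-in for a cross-encoder: first-pass score blended
    with a recency bonus (newer = higher)."""
    recency = 1.0 / (1.0 + chunk.days_old / 30.0)
    return 0.8 * chunk.base_score + 0.2 * recency

def assemble_context(candidates, token_budget=8000):
    """Greedy assembly: take the best-ranked chunks that fit the budget."""
    ranked = sorted(candidates, key=rerank_score, reverse=True)
    context, used = [], 0
    for c in ranked:
        if used + c.tokens > token_budget:
            continue
        context.append(c.text)
        used += c.tokens
    return context

cands = [
    Chunk("lab result: glucose elevated", days_old=3, base_score=0.9, tokens=120),
    Chunk("intake form from last year", days_old=400, base_score=0.9, tokens=300),
    Chunk("unrelated scheduling note", days_old=1, base_score=0.2, tokens=80),
]
ctx = assemble_context(cands)
```

With equal first-pass scores, the recent lab result outranks the year-old intake form, which is exactly the "prioritize recent, preserve historical" behavior described above.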
Context-Aware Synthesis
Finally, the LLM generates a response using:
- Retrieved patient context (from the hybrid search)
- Verified medical knowledge (from curated medical corpora)
- Conversation history (recent turns for natural flow)
- System prompts (medical guardrails, empathy instructions, safety constraints)
The synthesis is grounded in facts, personalized to the user, and designed to avoid hallucinations by explicitly citing sources.
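The four inputs above are assembled into a single grounded prompt before the LLM call. A minimal sketch (section labels and the sample contents are hypothetical, not our production prompt format):

```python
def build_prompt(system_guardrails, patient_context, medical_refs,
                 recent_turns, question):
    """Assemble the synthesis prompt; each block is labeled so the
    model can cite which source a claim came from."""
    sections = [
        ("SYSTEM", system_guardrails),
        ("PATIENT CONTEXT", "\n".join(patient_context)),
        ("VERIFIED MEDICAL SOURCES", "\n".join(medical_refs)),
        ("RECENT CONVERSATION", "\n".join(recent_turns)),
        ("QUESTION", question),
    ]
    return "\n\n".join(f"### {name}\n{body}" for name, body in sections)

prompt = build_prompt(
    "Cite a source for every medical claim. If unsure, say so.",
    ["2025-03-01: reported nausea (patient history)"],
    ["[guideline-ref] antenatal care guidance on nausea management"],
    ["user: the nausea is back"],
    "What can I do about morning sickness?",
)
```

Labeling each block separately is what makes source citation enforceable: the model is instructed to tie every claim back to a named section rather than to unattributed context.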
The "Memory State" Concept
A key innovation in our system is the concept of a Memory State — a persistent, evolving representation of everything we know about a user's medical journey. Think of it as a living medical profile that grows smarter over time.
The Memory State includes:
- Core Profile: Due date, age, pre-existing conditions, allergies, medications
- Symptom Timeline: Chronological log of reported symptoms and their evolution
- Interaction History: Semantic summary of past conversations (not raw transcripts)
- Preferences: Communication style, information density, dietary preferences
- Risk Factors: Automatically extracted and flagged conditions that require careful monitoring
We distinguish between:
- Short-term memory: Details from the current conversation session (in the LLM context window)
- Long-term memory: Everything from past sessions (in the vector database and knowledge graph)
When you return after a week, the system retrieves relevant long-term memory and brings it into the short-term context, making the conversation feel continuous.
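That hand-off from long-term to short-term memory can be sketched with a toy store. Here a naive keyword overlap stands in for the hybrid vector-plus-graph search described earlier (the class and its scoring are illustrative only):

```python
class LongTermStore:
    """Toy stand-in for the vector DB + knowledge graph: keyword match only."""
    def __init__(self):
        self.memories = []   # (timestamp, text)

    def add(self, ts, text):
        self.memories.append((ts, text))

    def search(self, query, k=3):
        terms = set(query.lower().split())
        scored = [(len(terms & set(t.lower().split())), ts, t)
                  for ts, t in self.memories]
        scored.sort(reverse=True)
        return [t for score, ts, t in scored[:k] if score > 0]

store = LongTermStore()
store.add("2025-02-20", "patient reported nausea in week 8")
store.add("2025-02-25", "discussed iron supplement dosage")

# A week later: seed the new session's short-term (in-context) buffer.
short_term = store.search("my nausea came back", k=2)
```

The retrieved memories are injected into the context window at session start, which is what makes a week-old conversation feel like it never ended.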
Medical Guardrails
Healthcare AI demands higher safety standards than general-purpose AI. We've implemented multiple layers of guardrails:
- PII Redaction: Personally identifiable information is stripped before being sent to external LLM APIs
- Hallucination Prevention: If the answer isn't in verified sources or patient history, the system says "I don't know" rather than inventing information
- Source Citation: Every medical claim includes a reference to its source (peer-reviewed literature, clinical guidelines, or patient history)
- Safety Filters: Responses that could indicate medical emergencies trigger escalation prompts ("Please contact your doctor immediately")
- Uncertainty Calibration: The system expresses confidence levels and is trained to defer to medical professionals when appropriate
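Two of these layers, PII redaction and emergency escalation, are simple enough to sketch directly. The patterns and trigger phrases below are illustrative samples, not our full rule set:

```python
import re

# Illustrative trigger phrases; the production list is clinically curated.
EMERGENCY_TERMS = {"heavy bleeding", "severe chest pain", "can't breathe"}

PII_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),           # US SSN-style number
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),   # email address
]

def redact_pii(text: str) -> str:
    """Strip identifiable tokens before text reaches an external LLM API."""
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text

def needs_escalation(text: str) -> bool:
    """Flag messages that should trigger an 'contact your doctor' prompt."""
    lowered = text.lower()
    return any(term in lowered for term in EMERGENCY_TERMS)

msg = "I'm having heavy bleeding, reach me at jane@example.com"
safe = redact_pii(msg)
escalate = needs_escalation(msg)
```

Redaction runs on the outbound path (before the API call) and escalation on the inbound path (before synthesis), so an emergency is flagged even if the model itself would have answered calmly.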
What's Next — Edge Deployment & Federated Memory
Our current architecture runs on cloud infrastructure, but we're working on two major evolutions:
Edge Inference: Running smaller, optimized models directly on user devices for zero-latency responses and enhanced privacy. For routine queries, patient data never leaves the device; only encrypted updates sync to the cloud for backup and cross-device access.
Federated Memory: Learning from aggregate user patterns without centralizing raw data. For example, if thousands of users report that a certain symptom correlates with a certain week of pregnancy, the system can incorporate this insight while preserving individual privacy. This creates a virtuous cycle: better insights lead to better advice, attracting more users, generating more insights.
The technical foundation we've built — hybrid indexing, intelligent retrieval, context-aware synthesis — is generalizable beyond maternal health. This is the operating system for persistent memory in high-trust AI applications.
For more on our technology stack, visit our technical architecture page. To see the Memory Engine in action, check out our comparison of RAG vs Fine-Tuning for healthcare AI.
