The Problem — Stateless by Design
Large Language Models (LLMs) like GPT, Claude, and Llama are phenomenal at generating human-like text, but they share a fundamental architectural limitation: they are stateless. Each inference is independent, with no inherent memory of past interactions.
Modern LLMs have "context windows": a bounded amount of text (roughly 8K to 200K+ tokens, depending on the model) that can be processed in a single inference call. This allows them to consider information provided within a single conversation session, but it is fundamentally different from persistent memory:
- Context windows are ephemeral: Information only exists for the duration of a single session
- Context windows are limited: Even 200K tokens isn't enough for years of medical history
- Context windows are expensive: Processing large contexts on every inference call is slow and costly
- Context windows lack structure: They're just text dumps, not organized knowledge
For healthcare applications where patient history accumulates over months or years, context windows are a band-aid, not a solution. We needed to build true persistent memory.
Our Architecture
The JSS AI Labs Memory Engine is a multi-stage pipeline designed to ingest, index, retrieve, and synthesize information with medical-grade accuracy and safety. Here's how each stage works:
Multi-Modal Ingestion Pipeline
Medical information comes in many forms: free-text conversations, structured intake forms, uploaded lab reports, doctor's notes, and prescriptions. Our ingestion pipeline handles all of these:
- Conversational Parsing: Extract medical entities (symptoms, medications, dates) from natural language using NER (Named Entity Recognition) fine-tuned on medical corpora
- Document Understanding: OCR and layout analysis for scanned documents, preserving structure and relationships
- Structured Data Integration: Direct ingestion of structured medical records (FHIR format when available)
- Temporal Anchoring: Every piece of information is timestamped and linked to a point in the patient journey
The goal is to create a rich, structured representation of patient information, not just store raw text.
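To make the conversational-parsing and temporal-anchoring steps concrete, here is a minimal sketch. It uses a tiny hand-written lexicon as a stand-in for the fine-tuned medical NER model (the entity lists, the `ExtractedFact` shape, and the function names are illustrative, not our production API):

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical mini-lexicon standing in for a medical-NER model.
SYMPTOMS = {"nausea", "headache", "chest pain"}
MEDICATIONS = {"ibuprofen", "metformin", "prenatal vitamins"}

@dataclass
class ExtractedFact:
    entity: str
    kind: str            # "symptom" | "medication"
    observed_at: str     # ISO timestamp (the temporal anchor)

def extract_facts(utterance: str, session_time: datetime) -> list[ExtractedFact]:
    """Toy conversational-parsing step: match known entities and
    anchor each one to the session timestamp."""
    text = utterance.lower()
    facts = []
    for term in SYMPTOMS:
        if term in text:
            facts.append(ExtractedFact(term, "symptom", session_time.isoformat()))
    for term in MEDICATIONS:
        if term in text:
            facts.append(ExtractedFact(term, "medication", session_time.isoformat()))
    return facts

now = datetime(2025, 3, 1, tzinfo=timezone.utc)
facts = extract_facts("I've had nausea since starting metformin", now)
```

In the real pipeline the lexicon lookup is replaced by a transformer-based NER model, but the output contract is the same: structured facts, each carrying its temporal anchor.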
Hybrid Indexing (Vector + Graph)
Once information is ingested, it needs to be indexed for efficient retrieval. We use a hybrid approach combining two complementary indexing strategies:
1. Vector Embeddings (Semantic Search):
- Text chunks are converted to high-dimensional vector embeddings using medical-domain models
- Stored in a vector database (pgvector on PostgreSQL) for fast similarity search
- Enables semantic retrieval: "chest pain" matches "thoracic discomfort" even without exact keyword overlap
- Chunking strategy uses overlapping sliding windows so context isn't lost at chunk boundaries
2. Knowledge Graph (Explicit Relationships):
- Medical entities (symptoms, medications, conditions) are nodes in a graph
- Relationships are edges (causes, treats, contradicts, precedes)
- Enables reasoning: "If patient has gestational diabetes AND is asking about diet, retrieve diabetic nutrition guidelines"
- Temporal edges track how conditions evolve over time
Vector search finds semantically similar information; graph search finds logically related information. Together, they provide comprehensive retrieval.
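The two retrieval paths above can be sketched in a few lines. This is a deliberately tiny in-memory version (hand-rolled cosine similarity and an adjacency list stand in for pgvector and the graph store; the embeddings and relations are made up for illustration):

```python
import math
from collections import defaultdict

# --- Vector side (stands in for pgvector cosine search) ---
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Tiny fake embeddings; in production these come from a medical-domain encoder.
chunks = {
    "note-1": ([0.9, 0.1], "patient reports thoracic discomfort"),
    "note-2": ([0.1, 0.9], "discussed diet preferences"),
}

def vector_search(query_vec, k=1):
    ranked = sorted(chunks, key=lambda cid: cosine(query_vec, chunks[cid][0]),
                    reverse=True)
    return ranked[:k]

# --- Graph side (explicit medical relationships) ---
edges = defaultdict(list)
def relate(src, rel, dst):
    edges[src].append((rel, dst))

relate("gestational diabetes", "treated_by", "diabetic nutrition guidelines")

def graph_search(entity):
    return [dst for _, dst in edges[entity]]

# Hybrid retrieval = union of both strategies
semantic_hits = vector_search([0.8, 0.2])          # a "chest pain"-like query vector
logical_hits = graph_search("gestational diabetes")
```

Note how the semantic path surfaces "thoracic discomfort" for a chest-pain-like query with no keyword overlap, while the graph path follows an explicit `treated_by` edge that no embedding similarity would guarantee.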
Intelligent Retrieval with Re-ranking
When a user asks a question, we don't just dump their entire medical history into the LLM context. Instead, we intelligently retrieve the most relevant information:
- Query Analysis: Decompose the user's question into sub-queries and extract key medical entities
- Hybrid Retrieval: Simultaneously query the vector database (semantic similarity) and knowledge graph (logical relationships)
- Re-ranking: Use a cross-encoder model to re-rank retrieved chunks based on relevance to the original query
- Temporal Filtering: Prioritize recent information but preserve critical historical context
- Context Assembly: Construct a focused, structured context (typically 4K-8K tokens) containing only the most relevant information
This retrieval process happens in milliseconds, providing the LLM with precisely the information it needs without overwhelming it with irrelevant data.
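A compressed sketch of the re-ranking, temporal-filtering, and context-assembly steps follows. The scoring formula is illustrative (a real cross-encoder scores query-chunk pairs jointly; here a recency-weighted blend stands in for it), and the token budget mirrors the 4K-8K figure above:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    days_old: int
    base_score: float   # similarity from the first-pass retriever
    tokens: int

def rerank_score(chunk: Chunk) -> float:
    """Stand-in for a cross-encoder: first-pass score blended
    with a recency bonus (newer = higher)."""
    recency = 1.0 / (1.0 + chunk.days_old / 30.0)
    return 0.8 * chunk.base_score + 0.2 * recency

def assemble_context(candidates, token_budget=8000):
    """Greedy assembly: take the best-ranked chunks that fit the budget."""
    ranked = sorted(candidates, key=rerank_score, reverse=True)
    context, used = [], 0
    for c in ranked:
        if used + c.tokens > token_budget:
            continue
        context.append(c.text)
        used += c.tokens
    return context

cands = [
    Chunk("lab result: glucose elevated", days_old=3, base_score=0.9, tokens=120),
    Chunk("intake form from last year", days_old=400, base_score=0.9, tokens=300),
    Chunk("unrelated scheduling note", days_old=1, base_score=0.2, tokens=80),
]
ctx = assemble_context(cands)
```

With equal first-pass scores, the recent lab result outranks the year-old intake form, which is exactly the "prioritize recent, preserve historical" behavior described above.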
Context-Aware Synthesis
Finally, the LLM generates a response using:
- Retrieved patient context (from the hybrid search)
- Verified medical knowledge (from curated medical corpora)
- Conversation history (recent turns for natural flow)
- System prompts (medical guardrails, empathy instructions, safety constraints)
The synthesis is grounded in facts, personalized to the user, and designed to avoid hallucinations by explicitly citing sources.
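The four inputs above are assembled into a single grounded prompt before the LLM call. A minimal sketch (section labels and the sample contents are hypothetical, not our production prompt format):

```python
def build_prompt(system_guardrails, patient_context, medical_refs,
                 recent_turns, question):
    """Assemble the synthesis prompt; each block is labeled so the
    model can cite which source a claim came from."""
    sections = [
        ("SYSTEM", system_guardrails),
        ("PATIENT CONTEXT", "\n".join(patient_context)),
        ("VERIFIED MEDICAL SOURCES", "\n".join(medical_refs)),
        ("RECENT CONVERSATION", "\n".join(recent_turns)),
        ("QUESTION", question),
    ]
    return "\n\n".join(f"### {name}\n{body}" for name, body in sections)

prompt = build_prompt(
    "Cite a source for every medical claim. If unsure, say so.",
    ["2025-03-01: reported nausea (patient history)"],
    ["[guideline-ref] antenatal care guidance on nausea management"],
    ["user: the nausea is back"],
    "What can I do about morning sickness?",
)
```

Labeling each block separately is what makes source citation enforceable: the model is instructed to tie every claim back to a named section rather than to unattributed context.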
The "Memory State" Concept
A key innovation in our system is the concept of a Memory State — a persistent, evolving representation of everything we know about a user's medical journey. Think of it as a living medical profile that grows smarter over time.
The Memory State includes:
- Core Profile: Due date, age, pre-existing conditions, allergies, medications
- Symptom Timeline: Chronological log of reported symptoms and their evolution
- Interaction History: Semantic summary of past conversations (not raw transcripts)
- Preferences: Communication style, information density, dietary preferences
- Risk Factors: Automatically extracted and flagged conditions that require careful monitoring
We distinguish between:
- Short-term memory: Details from the current conversation session (in the LLM context window)
- Long-term memory: Everything from past sessions (in the vector database and knowledge graph)
When you return after a week, the system retrieves relevant long-term memory and brings it into the short-term context, making the conversation feel continuous.
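That hand-off from long-term to short-term memory can be sketched with a toy store. Here a naive keyword overlap stands in for the hybrid vector-plus-graph search described earlier (the class and its scoring are illustrative only):

```python
class LongTermStore:
    """Toy stand-in for the vector DB + knowledge graph: keyword match only."""
    def __init__(self):
        self.memories = []   # (timestamp, text)

    def add(self, ts, text):
        self.memories.append((ts, text))

    def search(self, query, k=3):
        terms = set(query.lower().split())
        scored = [(len(terms & set(t.lower().split())), ts, t)
                  for ts, t in self.memories]
        scored.sort(reverse=True)
        return [t for score, ts, t in scored[:k] if score > 0]

store = LongTermStore()
store.add("2025-02-20", "patient reported nausea in week 8")
store.add("2025-02-25", "discussed iron supplement dosage")

# A week later: seed the new session's short-term (in-context) buffer.
short_term = store.search("my nausea came back", k=2)
```

The retrieved memories are injected into the context window at session start, which is what makes a week-old conversation feel like it never ended.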
Medical Guardrails
Healthcare AI demands higher safety standards than general-purpose AI. We've implemented multiple layers of guardrails:
- PII Redaction: Personally identifiable information is stripped before being sent to external LLM APIs
- Hallucination Prevention: If the answer isn't in verified sources or patient history, the system says "I don't know" rather than inventing information
- Source Citation: Every medical claim includes a reference to its source (peer-reviewed literature, clinical guidelines, or patient history)
- Safety Filters: Responses that could indicate medical emergencies trigger escalation prompts ("Please contact your doctor immediately")
- Uncertainty Calibration: The system expresses confidence levels and is trained to defer to medical professionals when appropriate
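Two of these layers, PII redaction and emergency escalation, are simple enough to sketch directly. The patterns and trigger phrases below are illustrative samples, not our full rule set:

```python
import re

# Illustrative trigger phrases; the production list is clinically curated.
EMERGENCY_TERMS = {"heavy bleeding", "severe chest pain", "can't breathe"}

PII_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),           # US SSN-style number
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),   # email address
]

def redact_pii(text: str) -> str:
    """Strip identifiable tokens before text reaches an external LLM API."""
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text

def needs_escalation(text: str) -> bool:
    """Flag messages that should trigger an 'contact your doctor' prompt."""
    lowered = text.lower()
    return any(term in lowered for term in EMERGENCY_TERMS)

msg = "I'm having heavy bleeding, reach me at jane@example.com"
safe = redact_pii(msg)
escalate = needs_escalation(msg)
```

Redaction runs on the outbound path (before the API call) and escalation on the inbound path (before synthesis), so an emergency is flagged even if the model itself would have answered calmly.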
What's Next — Edge Deployment & Federated Memory
Our current architecture runs on cloud infrastructure, but we're working on two major evolutions:
Edge Inference: Running smaller, optimized models directly on user devices for zero-latency responses and enhanced privacy. For routine queries, patient data never leaves the device; only encrypted updates sync to the cloud for backup and cross-device access.
Federated Memory: Learning from aggregate user patterns without centralizing raw data. For example, if thousands of users report that a certain symptom correlates with a certain week of pregnancy, the system can incorporate this insight while preserving individual privacy. This creates a virtuous cycle: better insights lead to better advice, attracting more users, generating more insights.
The technical foundation we've built — hybrid indexing, intelligent retrieval, context-aware synthesis — is generalizable beyond maternal health. This is the operating system for persistent memory in high-trust AI applications.
For more on our technology stack, visit our technical architecture page. To see the Memory Engine in action, check out our comparison of RAG vs Fine-Tuning for healthcare AI.
