From hospital management system to clinical AI: an end-to-end case study
A hospital runs on its software the way a body runs on its nervous system — quietly, until something misfires. This is a walk through how we build a hospital management platform end-to-end, and then how we add an LLM stack on top of it that gives clinicians their evenings back, moves patients through beds faster, and gets claims paid the first time. Two parts, one discipline: get the system of record right first, then let AI act on top of it — never instead of it.
"Add AI to healthcare" is the easiest sentence to say and the hardest to ship responsibly. The reason is that a hospital is not a greenfield app; it is a dense web of regulated workflows, life-safety constraints, and forty years of accumulated clinical data in formats that fight you. An LLM that hallucinates a medication dose is not a bad demo — it is a patient-safety event. So the order of operations matters more here than almost anywhere else. You earn the right to add intelligence by first building a system of record that is correct, auditable, and interoperable. Then the AI has solid ground to stand on. Below is the whole arc: the management platform we build, the AI layer we put on it, and the numbers that tell you whether it worked.
Part one — the system of record we build first
A hospital management system is really a federation of workflows that all have to agree on one source of truth about a patient. We build it as a set of bounded services around a shared, governed clinical data layer rather than as one monolith, because the registration desk, the pharmacy, and the billing office change at different speeds and must fail independently. The core domains are familiar to anyone who has worked in the space, and each is a service with its own data and its own contracts:
- Registration & ADT — admit, discharge, transfer. This is the heartbeat of the hospital: who is here, where they are, and under whose care. Every other system subscribes to it.
- Clinical data / EHR — encounters, problems, allergies, vitals, notes, results, and medications, modeled on FHIR R4 resources so the data is interoperable by construction rather than by a later export project.
- Orders & results (CPOE) — computerized provider order entry for labs, imaging, and medications, with results flowing back from ancillary systems.
- Pharmacy & medication administration — order verification, interaction checking, dispensing, and the electronic medication administration record.
- Scheduling & bed management — clinics, theatres, and the live bed census that determines whether the emergency department can move a patient up.
- Revenue cycle — charge capture, coding, claims, and remittance against payers.
The piece that makes or breaks the whole platform is the one nobody outside healthcare expects: the integration layer. Hospitals do not run one vendor's software; they run dozens, and they have to talk. So a healthcare-grade interface engine sits at the center, speaking HL7 v2 to the legacy estate (lab analyzers, radiology, older ancillary systems) and FHIR to anything modern, normalizing both into the canonical clinical model. This is the unglamorous spine that everything else hangs from, and it is where most "we'll add AI later" projects quietly die — because if your data is trapped in point-to-point HL7 feeds with no canonical model, there is nothing clean for the AI to read.
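To make the normalization step concrete, here is a minimal sketch of turning one HL7 v2 ADT message into a canonical event. It is illustrative only: `AdtEvent`, `parse_adt`, and the sample message are hypothetical names and data, not part of the platform described here. A production interface engine would handle escape sequences, field repetitions, and the full segment grammar rather than splitting strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdtEvent:
    """Hypothetical canonical event; deliberately much smaller than a FHIR resource."""
    event_type: str   # e.g. "A01" (admit), "A03" (discharge)
    patient_id: str
    family_name: str
    given_name: str
    location: str     # point of care / room / bed, as sent


def parse_adt(message: str) -> AdtEvent:
    """Extract the few fields this sketch needs from an HL7 v2 ADT message."""
    # HL7 v2 segments are separated by carriage returns; tolerate newlines too.
    segments = {}
    for line in message.replace("\n", "\r").split("\r"):
        if line.strip():
            fields = line.split("|")
            segments.setdefault(fields[0], fields)  # keep the first of each segment

    msh, pid, pv1 = segments["MSH"], segments["PID"], segments["PV1"]

    # MSH is special: the field separator itself counts as MSH-1, so after
    # splitting on "|" the list index is one less than the field number.
    event_type = msh[8].split("^")[1]          # MSH-9, e.g. "ADT^A01"

    # For other segments, list index n is field n.
    patient_id = pid[3].split("^")[0]          # PID-3, patient identifier list
    name_parts = pid[5].split("^")             # PID-5, family^given
    family_name = name_parts[0]
    given_name = name_parts[1] if len(name_parts) > 1 else ""
    location = pv1[3].replace("^", "/")        # PV1-3, assigned patient location

    return AdtEvent(event_type, patient_id, family_name, given_name, location)


# Made-up sample message, for illustration only.
SAMPLE = "\r".join([
    r"MSH|^~\&|ADT|HOSP|EHR|HOSP|202401011200||ADT^A01|MSG0001|P|2.5",
    "PID|1||12345^^^HOSP^MR||DOE^JANE||19800101|F",
    "PV1|1|I|WARD3^12^B",
])

event = parse_adt(SAMPLE)
print(event)
```

The point of the sketch is the shape, not the parser: every feed, whatever its dialect, ends up as the same small typed event before anything downstream reads it.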
The one rule that governs everything: the AI proposes, the record decides
Before any of the AI features, we commit to a single architectural rule that the rest of the design obeys: the LLM never writes to the system of record directly. It reads governed clinical data, it produces a draft, and a credentialed human accepts, edits, or rejects that draft before a single FHIR resource changes. This is the healthcare equivalent of keeping the slow, clever model off the hot path. The system of record is the source of clinical and legal truth; an LLM is a probabilistic drafting tool. Wiring a probabilistic tool to write directly into a life-safety record is how you turn a productivity feature into a liability. So every AI capability below shares the same shape: asynchronous, grounded, and gated on a clinician's signature.
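One way to picture that rule in code is a write path that only accepts a proposal through a credentialed sign-off. The sketch below is a toy under stated assumptions: `Proposal`, `RecordStore`, `sign_and_commit`, and the user names are hypothetical, and a real system would check credentials against an identity provider rather than an in-memory set.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class DraftStatus(Enum):
    PENDING = "pending"
    SIGNED = "signed"


@dataclass
class Proposal:
    """An AI draft. It carries no write capability of its own."""
    patient_id: str
    text: str
    status: DraftStatus = DraftStatus.PENDING
    signed_by: str | None = None


class RecordStore:
    """Stand-in for the system of record (hypothetical)."""

    def __init__(self, credentialed_users: set[str]) -> None:
        self._credentialed = credentialed_users
        self.notes: list[tuple[str, str, str]] = []  # (patient, text, signer)

    def sign_and_commit(self, proposal: Proposal, user: str,
                        edited_text: str | None = None) -> None:
        # The only write path: a credentialed human accepts or edits the draft.
        if user not in self._credentialed:
            raise PermissionError(f"{user} is not credentialed to sign")
        if proposal.status is not DraftStatus.PENDING:
            raise ValueError("proposal has already been resolved")
        if edited_text is not None:
            proposal.text = edited_text
        proposal.status = DraftStatus.SIGNED
        proposal.signed_by = user
        self.notes.append((proposal.patient_id, proposal.text, user))


store = RecordStore(credentialed_users={"dr_lee"})
draft = Proposal(patient_id="p1", text="AI draft of the progress note")

try:
    store.sign_and_commit(draft, user="llm-service")  # the model cannot sign
except PermissionError as exc:
    print("blocked:", exc)

store.sign_and_commit(draft, user="dr_lee", edited_text="Reviewed and corrected note")
print(store.notes)
```

The design choice worth noticing is that the model's service identity simply has no route to a write, so the rule is enforced by the interface rather than by convention.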
Part two — the LLM stack we add on top
With a clean canonical data layer and that one rule in place, the AI layer becomes tractable. Each capability is grounded in the patient's actual FHIR record plus a vetted knowledge base, never in the model's free-floating memory.
- Ambient clinical documentation. The single highest-leverage feature. Ambient capture of the patient encounter is transcribed and turned into a structured draft note — history, assessment, plan — that the clinician reviews and signs. This is what pulls clinicians out of after-hours charting.
- RAG over the chart and the guidelines. A retrieval layer indexes the patient's own record (problems, meds, recent results) alongside the hospital's approved clinical guidelines and formulary. Every generated answer is grounded in retrieved, cited passages, so a clinician can click from a suggestion straight to the source — the difference between a decision-support tool and a liability.
- Discharge summary generation. Drafting the discharge summary and patient-friendly instructions from the encounter — historically a task that lags a patient's medical readiness by hours and clogs beds.
- Coding & prior-authorization assist. Suggesting ICD-10/CPT codes from the documented encounter and pre-checking a claim against payer rules before submission, catching the omissions that cause denials.
- Patient triage & messaging. A grounded assistant that answers routine patient questions, handles scheduling, and routes anything clinical to a human — reducing inbox load without practicing medicine.
- Bed-flow forecasting. Predicting likely discharges in the next 24–48 hours so bed managers and the emergency department can plan, turning census from a reactive scramble into a forecast.
Retrieval is the safety mechanism, not a performance trick. In a consumer chatbot, RAG makes answers fresher. In a hospital, grounding every generation in the patient's own cited record and the approved guideline set is what stands between "clinical decision support" and "a confident machine inventing a dose." The citation trail is a feature, not decoration.
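A citation trail is only useful if it is enforced. As a rough sketch, assuming the model is asked to tag each sentence with bracketed passage IDs such as `[med-1]` placed before the closing period, a check like the one below can flag any sentence that cites nothing, or cites something that was never retrieved. The `Passage` type, the ID format, and the sentence splitter are all illustrative assumptions, and the splitter is naive (an abbreviation followed by a space would split a sentence early).

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    passage_id: str
    source: str   # e.g. "chart" or "guideline"
    text: str


CITATION = re.compile(r"\[([A-Za-z0-9_-]+)\]")


def citation_problems(answer: str, retrieved: list[Passage]) -> list[str]:
    """Return a list of problems; an empty list means every sentence is grounded."""
    known = {p.passage_id for p in retrieved}
    problems = []
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    for sentence in sentences:
        cited = CITATION.findall(sentence)
        if not cited:
            problems.append(f"no citation: {sentence}")
        for passage_id in cited:
            if passage_id not in known:
                problems.append(f"unknown source [{passage_id}]: {sentence}")
    return problems


retrieved = [
    Passage("med-1", "chart", "Active order: warfarin."),
    Passage("gl-7", "guideline", "Monitor INR during anticoagulation."),
]

grounded = "The patient is on warfarin [med-1]. The guideline advises INR monitoring [gl-7]."
ungrounded = "Start 5 mg daily. Monitor INR [gl-9]."

print(citation_problems(grounded, retrieved))
print(citation_problems(ungrounded, retrieved))
```

A check like this only proves that a citation exists and points at retrieved material. Whether the passage actually supports the sentence still needs evaluation and the clinician's review.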
How a single feature is wired, end to end
Take ambient documentation, because it shows the whole pattern. An encounter is captured and lands on a queue, so it is already off any synchronous path. A worker de-identifies the transcript for any step that does not strictly require PHI, retrieves the relevant slice of the patient's FHIR record for grounding, and asks the model for a structured draft rather than free prose, so the output maps cleanly onto note fields and can be validated. The draft is checked, surfaced to the clinician, and written back only once the clinician has signed it. The contract makes the human gate and the grounding explicit:
```python
# asr, fhir, llm, guardrails, redact and PROGRESS_NOTE_SCHEMA are platform
# components not defined in this excerpt; the snippet shows the contract only.
async def draft_clinical_note(encounter: Encounter) -> NoteDraft:
    # async by construction: never blocks the clinical UI
    transcript = await asr.transcribe(encounter.audio)

    # ground the model in THIS patient's governed record
    context = fhir.retrieve(
        patient=encounter.patient_id,
        resources=["Condition", "MedicationRequest",
                   "AllergyIntolerance", "Observation"],
        as_of=encounter.time,              # point-in-time correct
    )

    draft = await llm.generate(
        task="structured_progress_note",
        transcript=redact(transcript),     # PHI minimized where possible
        grounding=context,                 # retrieved, cited
        schema=PROGRESS_NOTE_SCHEMA,       # structured, validatable output
    )

    draft = guardrails.check(draft, context)  # dose/allergy/contradiction checks
    return draft.requires_signoff()           # NOTHING is written until a clinician signs
```
Three things in that snippet are the whole philosophy. Point-in-time grounding means the note reflects the record as it was at the encounter, not whatever changed afterward. Structured output against a schema means the generation is validatable — you can check that a proposed medication exists in the record and does not collide with a documented allergy, mechanically, before a human ever sees it. And requires_signoff() is the load-bearing line: the function's job is to produce a proposal, full stop. Everything else in the stack is variations on this theme.
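To show what "mechanically" can mean, here is a small sketch of one such check: every proposed medication must appear in the patient's active list, and none may match a documented allergy. The types and `check_medications` are hypothetical, not the platform's guardrail API, and the matching is by exact normalized string. A real check would work on coded terminologies and account for cross-reactivity between related drugs.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedMedication:
    name: str         # as written in the draft note
    ingredient: str   # normalized ingredient, assumed to be coded upstream


@dataclass(frozen=True)
class PatientContext:
    active_medications: frozenset[str]    # lower-cased names from the record
    allergy_ingredients: frozenset[str]   # lower-cased documented allergens


def check_medications(proposed: list[ProposedMedication],
                      context: PatientContext) -> list[str]:
    """Return findings to show the clinician; an empty list means nothing was flagged."""
    findings = []
    for med in proposed:
        if med.name.lower() not in context.active_medications:
            findings.append(f"{med.name}: not found in the active medication list")
        if med.ingredient.lower() in context.allergy_ingredients:
            findings.append(
                f"{med.name}: conflicts with documented allergy to {med.ingredient}"
            )
    return findings


context = PatientContext(
    active_medications=frozenset({"metformin"}),
    allergy_ingredients=frozenset({"penicillin"}),
)
draft_meds = [
    ProposedMedication(name="Metformin", ingredient="metformin"),
    ProposedMedication(name="Penicillin V", ingredient="penicillin"),
]

for finding in check_medications(draft_meds, context):
    print(finding)
```

Findings of this kind are attached to the draft for the reviewing clinician rather than silently corrected, which keeps the check consistent with the proposal-only rule.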
The results: what the AI layer actually moved
An architecture is only as good as the outcomes it produces, and in healthcare the outcomes are measurable in clinician hours, patient flow, and cash. The table below shows the direction and magnitude of impact this kind of layer delivers when it is built on a clean system of record. The figures are representative of what this class of deployment targets and what comparable ambient-AI and revenue-cycle rollouts have publicly reported — ranges, not a single guaranteed number, because the baseline a given hospital starts from varies widely.
| Metric | Before | After | Change |
|---|---|---|---|
| Documentation time per encounter | ~16 min | ~7 min | −55% |
| After-hours charting ("pajama time") | ~6 hrs/wk | ~2.5 hrs/wk | −3.5 hrs/wk (about −58%) |
| Discharge summary turnaround | hours–next day | minutes to draft | same-day |
| Claim denial rate | ~10–12% | ~7–8% | −30–40% |
| Average length of stay | baseline | baseline − 0.3 to 0.5 days | −0.3 to −0.5 days (faster flow) |
| Patient message response time | hours | near-instant draft | ↓ inbox burden |
| Clinician-reported burnout signal | high | improved | retention win |
Two parts of that table deserve a word, because they pay for the project twice over. The documentation rows are not a vanity metric: clinician burnout is a staffing and retention crisis, and the hours a physician spends charting after their kids are asleep are the hours that drive them out of the profession. Giving back roughly three and a half hours a week of after-hours charting is a recruiting advantage with a real dollar value. The claim-denial row is the one a CFO underwrites the whole program on: denied claims are revenue already earned and then lost to rework, and industry estimates commonly put roughly two-thirds of denials as recoverable, yet a large share are never reworked. Catching the omissions before submission converts directly into collected cash, which is how an AI layer stops being a cost center and starts funding itself.
The metric that is not in the table is the one we watch hardest: clinician edit rate on AI drafts. If physicians accept drafts wholesale, nobody is really checking, and a drifting model would go unnoticed. If they reject everything, the feature is dead weight. A healthy, stable edit rate, with meaningful revision and high eventual acceptance, is the real signal that the system is both used and supervised. We instrument it from day one.
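As a rough illustration of how edit rate could be instrumented, the sketch below compares each AI draft with the note the clinician finally signed, using word-level similarity from Python's standard `difflib`. The function names and the two thresholds are assumptions chosen for the example, not calibrated values.

```python
from difflib import SequenceMatcher


def edit_fraction(draft: str, signed: str) -> float:
    """Share of the text that changed between draft and signed note (0.0 to 1.0)."""
    matcher = SequenceMatcher(None, draft.split(), signed.split(), autojunk=False)
    return 1.0 - matcher.ratio()


def summarize_edit_rate(pairs, verbatim_below=0.02, rewritten_above=0.80):
    """Aggregate (draft, signed) pairs into the signals worth alerting on."""
    fractions = [edit_fraction(draft, signed) for draft, signed in pairs]
    if not fractions:
        return {"n": 0}
    n = len(fractions)
    return {
        "n": n,
        "mean_edit_fraction": sum(fractions) / n,
        # Too many of these suggests nobody is really reviewing.
        "accepted_verbatim": sum(f < verbatim_below for f in fractions) / n,
        # Too many of these suggests the drafts are not useful.
        "heavily_rewritten": sum(f > rewritten_above for f in fractions) / n,
    }


pairs = [
    ("a b c d", "a b c d"),   # signed unchanged
    ("a b c d", "a b x d"),   # one word revised
    ("a b c d", "w x y z"),   # rewritten from scratch
]
summary = summarize_edit_rate(pairs)
print(summary)
```

Tracked per clinician and per model version over time, a drift in either tail is a cue to look closer before it shows up as a safety or adoption problem.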
The governance and compliance plane
Spanning all of it, exactly as in any regulated system we build, is the governance plane, and in healthcare it is HIPAA-shaped and load-bearing, not paperwork. An immutable audit log records every access and every AI proposal: who saw which PHI, what the model was shown, what it drafted, and who signed it. PHI minimization and de-identification mean the system carries the minimum necessary at every hop, and any processing that does not strictly require identifiers runs on de-identified data. Role-based access control enforces that a given user, or a given AI workflow, can only reach the data its role permits, and the model inherits the requesting clinician's permissions rather than holding god-mode access. Evaluation and drift monitoring treat every model release as a versioned artifact whose outputs are tested against a fixed clinical eval set, so a regression is caught before it reaches a patient. Any third-party model provider operates under a Business Associate Agreement, with data-handling terms that forbid training on the hospital's PHI. You do not bolt this on after a pilot; it is the substrate the pilot runs on.
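One common way to make an audit log tamper-evident is to chain entries by hash, so that changing any earlier entry invalidates everything after it. The sketch below assumes that approach purely for illustration; `AuditLog` and its fields are hypothetical, timestamps are left out to keep the example deterministic, and real immutability would also depend on write-once storage and access controls around the log itself.

```python
import hashlib
import json

GENESIS = "0" * 64   # placeholder "previous hash" for the first entry
FIELDS = ("actor", "action", "resource", "prev")


def _digest(body: dict) -> str:
    """Stable SHA-256 over the entry body (sorted keys for determinism)."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


class AuditLog:
    """Append-only, hash-chained log (in-memory sketch)."""

    def __init__(self) -> None:
        self._entries: list[dict] = []

    def append(self, actor: str, action: str, resource: str) -> dict:
        prev = self._entries[-1]["hash"] if self._entries else GENESIS
        body = {"actor": actor, "action": action, "resource": resource, "prev": prev}
        entry = {**body, "hash": _digest(body)}
        self._entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute the chain; False if any entry was altered or reordered."""
        prev = GENESIS
        for entry in self._entries:
            body = {key: entry[key] for key in FIELDS}
            if entry["prev"] != prev or entry["hash"] != _digest(body):
                return False
            prev = entry["hash"]
        return True


log = AuditLog()
log.append("dr_lee", "viewed", "Patient/p1")
log.append("note-drafter", "proposed", "DraftNote/d42")
log.append("dr_lee", "signed", "DraftNote/d42")
print(log.verify())

log._entries[1]["actor"] = "someone_else"   # simulate tampering
print(log.verify())
```

The chain does not prevent tampering on its own; it makes tampering detectable, which is what lets a single decision be replayed and defended later.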
The arc is the same one we apply to every system with consequences: get the source of truth right first, keep the probabilistic component off the path where a wrong answer is unrecoverable, ground every generation in real cited data, and instrument the whole thing so a single decision can be replayed and defended. It is the same minimalism behind our Sweep iOS app — zero third-party dependencies, 100% on-device, 23 of 23 tests green — scaled up to a hospital: fewer unaccountable moving parts on the path that matters, and discipline around everything that feeds it. AI did not replace the hospital management system. It made the people running it faster, and it could only do that because the system underneath was built to be trusted.
Key takeaways
- Build the system of record first: bounded services (ADT, EHR, orders, pharmacy, scheduling/beds, revenue cycle) federated through an HL7/FHIR interface engine into one canonical clinical store. The AI is only as good as that foundation.
- Adopt one inviolable rule: the LLM proposes, a credentialed clinician signs off, and only then is anything written back. Never wire a probabilistic model to write directly into a life-safety record.
- Ground every generation in the patient's own FHIR record plus vetted guidelines, with citations — retrieval is the safety mechanism, not a freshness trick.
- Use structured, schema-validated outputs so proposals can be mechanically checked (dose, allergy, contradiction) before a human ever reads them.
- Measure outcomes that matter: documentation time and after-hours charting, discharge turnaround, claim-denial rate, length of stay — and watch clinician edit rate as the signal the system is both used and supervised.
- Make HIPAA governance the substrate: immutable audit log, PHI minimization and de-identification, role-based access the model inherits, eval/drift monitoring, and BAAs that forbid training on your PHI.