Case Study · Healthcare

From hospital management system to clinical AI: an end-to-end case study

A hospital runs on its software the way a body runs on its nervous system — quietly, until something misfires. This is a walk through how we build a hospital management platform end-to-end, and then how we add an LLM stack on top of it that gives clinicians their evenings back, moves patients through beds faster, and gets claims paid the first time. Two parts, one discipline: get the system of record right first, then let AI act on top of it — never instead of it.

"Add AI to healthcare" is the easiest sentence to say and the hardest to ship responsibly. The reason is that a hospital is not a greenfield app; it is a dense web of regulated workflows, life-safety constraints, and forty years of accumulated clinical data in formats that fight you. An LLM that hallucinates a medication dose is not a bad demo — it is a patient-safety event. So the order of operations matters more here than almost anywhere else. You earn the right to add intelligence by first building a system of record that is correct, auditable, and interoperable. Then the AI has solid ground to stand on. Below is the whole arc: the management platform we build, the AI layer we put on it, and the numbers that tell you whether it worked.

Part one — the system of record we build first

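A hospital management system is really a federation of workflows that all have to agree on one source of truth about a patient. We build it as a set of bounded services around a shared, governed clinical data layer rather than as one monolith, because the registration desk, the pharmacy, and the billing office change at different speeds and must fail independently. The core domains are familiar to anyone who has worked in the space, and each is a service with its own data and its own contracts:

  • Registration / ADT: admit, transfer, discharge.
  • EHR / clinical: FHIR R4 resources, notes, results.
  • Orders / CPOE: labs, imaging, medications.
  • Pharmacy / eMAR: verification, interaction checks, dispensing.
  • Scheduling / beds: census, theatres, clinics.
  • Revenue cycle: coding, claims, remittance.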

The piece that makes or breaks the whole platform is the one nobody outside healthcare expects: the integration layer. Hospitals do not run one vendor's software; they run dozens, and they have to talk. So a healthcare-grade interface engine sits at the center, speaking HL7 v2 to the legacy estate (lab analyzers, radiology, older ancillary systems) and FHIR to anything modern, normalizing both into the canonical clinical model. This is the unglamorous spine that everything else hangs from, and it is where most "we'll add AI later" projects quietly die — because if your data is trapped in point-to-point HL7 feeds with no canonical model, there is nothing clean for the AI to read.
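To make "normalizing both into the canonical clinical model" concrete, here is a minimal Python sketch that maps an HL7 v2 PID segment and a FHIR R4 Patient resource onto the same record. The CanonicalPatient type and its fields are hypothetical, and the parser assumes the default | and ^ delimiters. A production interface engine also reads the delimiters declared in MSH, unescapes values, handles partial dates and multiple identifiers properly, and covers far more segments and resources.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class CanonicalPatient:
    """Hypothetical canonical shape that both feeds are normalized into."""
    mrn: str
    family: str
    given: str
    birth_date: date
    sex: str  # FHIR administrative-gender: female | male | other | unknown


# HL7 v2 administrative sex (PID-8) to FHIR administrative-gender;
# codes not listed here (e.g. "A", "N") fall back to "unknown".
HL7_SEX = {"F": "female", "M": "male", "O": "other", "U": "unknown"}


def from_hl7_pid(segment: str) -> CanonicalPatient:
    """Normalize one HL7 v2 PID segment (default | and ^ delimiters assumed)."""
    fields = segment.split("|")
    if fields[0] != "PID":
        raise ValueError("expected a PID segment")
    fields += [""] * (9 - len(fields))                # pad so PID-3..PID-8 always exist
    mrn = fields[3].split("~")[0].split("^")[0]       # PID-3: first repetition, ID component
    name = fields[5].split("~")[0].split("^") + [""]  # PID-5: family^given^middle...
    dob = datetime.strptime(fields[7][:8], "%Y%m%d").date()  # PID-7: YYYYMMDD[HHMM...]
    sex = HL7_SEX.get(fields[8], "unknown")           # PID-8
    return CanonicalPatient(mrn, name[0], name[1], dob, sex)


def from_fhir_patient(resource: dict) -> CanonicalPatient:
    """Normalize a FHIR R4 Patient resource that has already been parsed from JSON."""
    if resource.get("resourceType") != "Patient":
        raise ValueError("expected a FHIR Patient resource")
    name = resource["name"][0]
    return CanonicalPatient(
        mrn=resource["identifier"][0]["value"],       # simplification: first identifier
        family=name.get("family", ""),
        given=(name.get("given") or [""])[0],
        birth_date=date.fromisoformat(resource["birthDate"]),  # full dates only
        sex=resource.get("gender", "unknown"),
    )
```

Whichever feed a patient arrived on, downstream services (and later the AI layer) read one shape.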

Figure: Hospital management platform with a clinical AI layer (system-of-record core, clinical AI layer, governance and HIPAA compliance plane)
The system of record (top, Royal Blue) is built first: bounded services federated through an HL7/FHIR interface engine into one canonical clinical store. The clinical AI layer (bottom, Madison Blue) reads that governed data and proposes drafts; a clinician signs off before anything is written back. A HIPAA governance plane spans both.

The one rule that governs everything: the AI proposes, the record decides

Before any of the AI features, we commit to a single architectural rule that the rest of the design obeys: the LLM never writes to the system of record directly. It reads governed clinical data, it produces a draft, and a credentialed human accepts, edits, or rejects that draft before a single FHIR resource changes. This is the healthcare equivalent of keeping the slow, clever model off the hot path. The system of record is the source of clinical and legal truth; an LLM is a probabilistic drafting tool. Wiring a probabilistic tool to write directly into a life-safety record is how you turn a productivity feature into a liability. So every AI capability below shares the same shape: asynchronous, grounded, and gated on a clinician's signature.
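One way to make this rule structural rather than a matter of discipline is to give drafts and signed notes different types, so the write path cannot accept a draft at all. The sketch below assumes hypothetical Draft, SignedNote, sign and RecordStore names and a placeholder list of signing roles; in a real deployment the signer's identity and role would come from the authenticated session, not from string arguments.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone


class SignoffRequired(Exception):
    """Raised when anything unsigned tries to reach the system of record."""


@dataclass(frozen=True)
class Draft:
    patient_id: str
    body: str
    model_version: str


@dataclass(frozen=True)
class SignedNote:
    draft: Draft          # the original proposal, kept for the audit trail
    final_body: str       # what the clinician actually approved, edits included
    signed_by: str
    signed_at: datetime


# Hypothetical policy: roles allowed to sign a clinical note.
SIGNING_ROLES = {"physician", "nurse_practitioner", "physician_assistant"}


def sign(draft: Draft, clinician_id: str, role: str,
         edited_body: str | None = None) -> SignedNote:
    """The only path that turns a Draft into something the record will accept."""
    if role not in SIGNING_ROLES:
        raise SignoffRequired(f"role {role!r} may not sign clinical notes")
    body = draft.body if edited_body is None else edited_body
    return SignedNote(draft, body, clinician_id, datetime.now(timezone.utc))


@dataclass
class RecordStore:
    notes: list[SignedNote] = field(default_factory=list)

    def write_back(self, note: object) -> None:
        # The gate: a raw Draft (or anything else) is refused, whatever it says.
        if not isinstance(note, SignedNote):
            raise SignoffRequired("only clinician-signed notes may be written back")
        self.notes.append(note)
```

Keeping the original Draft inside the SignedNote means the record can always show what the model proposed next to what the clinician approved.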

Part two — the LLM stack we add on top

With a clean canonical data layer and that one rule in place, the AI layer becomes tractable. Each capability is grounded in the patient's actual FHIR record plus a vetted knowledge base, never in the model's free-floating memory.

Retrieval is the safety mechanism, not a performance trick. In a consumer chatbot, RAG makes answers fresher. In a hospital, grounding every generation in the patient's own cited record and the approved guideline set is what stands between "clinical decision support" and "a confident machine inventing a dose." The citation trail is a feature, not decoration.
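A minimal sketch of how the citation trail can be enforced rather than merely requested: each retrieved snippet enters the prompt under a citable ID, and after generation any sentence that cites nothing, or cites an ID that was never retrieved, is flagged before a clinician sees it. The Source type, the bracketed [id] citation convention and the naive sentence splitter are assumptions for illustration, and the model call itself is omitted.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    source_id: str   # e.g. a FHIR reference or a guideline section ID (illustrative)
    text: str


CITATION = re.compile(r"\[([^\[\]]+)\]")       # captures the ID inside [...]
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")    # naive splitter: punctuation + whitespace


def build_grounded_prompt(question: str, sources: list[Source]) -> str:
    """Place each retrieved source in the prompt under its own citable ID."""
    header = ("Answer using ONLY the sources below. End every sentence with the "
              "[source_id] it relies on. If the sources do not answer it, say so.")
    body = "\n".join(f"[{s.source_id}] {s.text}" for s in sources)
    return f"{header}\n\n{body}\n\nQuestion: {question}"


def ungrounded_sentences(answer: str, sources: list[Source]) -> list[str]:
    """Return sentences with no citation, or citing an ID that was never retrieved."""
    allowed = {s.source_id for s in sources}
    flagged = []
    for sentence in (s.strip() for s in SENTENCE_END.split(answer)):
        if not sentence:
            continue
        cited = CITATION.findall(sentence)
        if not cited or any(c not in allowed for c in cited):
            flagged.append(sentence)
    return flagged
```

A non-empty result is a reason to regenerate or to surface the gap to the clinician, never to pass the answer through silently.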

How a single feature is wired, end to end

Take ambient documentation, because it shows the whole pattern. An encounter is captured and lands on a queue — already off any synchronous path. A worker de-identifies the transcript for any step that does not strictly require PHI, retrieves the relevant slice of the patient's FHIR record for grounding, and asks the model for a structured draft rather than free prose, so the output maps cleanly onto note fields and can be validated. The draft is checked, surfaced to the clinician, and only written back as a signed note. The contract makes the human gate and the grounding explicit:

```python
from __future__ import annotations  # lets the hints below name platform types lazily

# asr, fhir, llm, guardrails, redact, PROGRESS_NOTE_SCHEMA, Encounter and
# NoteDraft stand for the platform's own clients and types (not shown here).

async def draft_clinical_note(encounter: Encounter) -> NoteDraft:
    # async by construction: runs on a queue worker, never blocks the clinical UI
    transcript = await asr.transcribe(encounter.audio)

    # ground the model in THIS patient's governed record; awaited, because a
    # blocking FHIR call here would stall every other task on the event loop
    context = await fhir.retrieve(
        patient=encounter.patient_id,
        resources=["Condition", "MedicationRequest",
                   "AllergyIntolerance", "Observation"],
        as_of=encounter.time,            # point-in-time correct
    )

    draft = await llm.generate(
        task="structured_progress_note",
        transcript=redact(transcript),   # PHI minimized where possible
        grounding=context,               # retrieved, cited
        schema=PROGRESS_NOTE_SCHEMA,     # structured, validatable output
    )

    draft = guardrails.check(draft, context)   # dose/allergy/contradiction checks
    return draft.requires_signoff()      # NOTHING is written until a clinician signs
```

Three things in that snippet are the whole philosophy. Point-in-time grounding means the note reflects the record as it was at the encounter, not whatever changed afterward. Structured output against a schema means the generation is validatable: you can check mechanically, before a human ever sees it, that a proposed medication exists in the record and does not collide with a documented allergy. And requires_signoff() is the load-bearing line: the function's job is to produce a proposal, full stop. Everything else in the stack is a variation on this theme.
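Here is a minimal sketch of that mechanical check, assuming the draft's medication section comes back as JSON with hypothetical code, name and daily_dose_mg fields. The GroundingContext fields, the illustrative MED-* codes and the formulary dose ceilings are assumptions. Matching allergies by identical drug code is a deliberate simplification; real allergy checking maps allergens to ingredients and drug classes.

```python
from __future__ import annotations

import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedMedication:
    code: str
    name: str
    daily_dose_mg: float


@dataclass(frozen=True)
class GroundingContext:
    record_med_codes: frozenset[str]     # codes on the patient's MedicationRequests
    allergy_codes: frozenset[str]        # documented allergies, same code system (simplification)
    max_daily_dose_mg: dict[str, float]  # ceilings from a vetted formulary (hypothetical)


def parse_medications(raw: str) -> list[ProposedMedication]:
    """Schema step: a missing or malformed field raises instead of passing through."""
    meds = json.loads(raw)["medications"]
    if not isinstance(meds, list):
        raise ValueError("'medications' must be a list")
    return [ProposedMedication(str(m["code"]), str(m["name"]), float(m["daily_dose_mg"]))
            for m in meds]


def check_medications(meds: list[ProposedMedication], ctx: GroundingContext) -> list[str]:
    """Return human-readable findings; an empty list means no mechanical objection."""
    findings = []
    for med in meds:
        if med.code in ctx.allergy_codes:
            findings.append(f"ALLERGY: {med.name} matches a documented allergy")
        if med.code not in ctx.record_med_codes:
            findings.append(f"NOT IN RECORD: {med.name} has no matching MedicationRequest")
        limit = ctx.max_daily_dose_mg.get(med.code)
        if limit is not None and med.daily_dose_mg > limit:
            findings.append(f"DOSE: {med.name} {med.daily_dose_mg:g} mg/day "
                            f"exceeds {limit:g} mg/day")
    return findings
```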

The results: what the AI layer actually moved

An architecture is only as good as the outcomes it produces, and in healthcare the outcomes are measurable in clinician hours, patient flow, and cash. The table below shows the direction and magnitude of impact this kind of layer delivers when it is built on a clean system of record. The figures are representative of what this class of deployment targets and what comparable ambient-AI and revenue-cycle rollouts have publicly reported — ranges, not a single guaranteed number, because the baseline a given hospital starts from varies widely.

| Metric | Before | After | Change |
|---|---|---|---|
| Documentation time per encounter | ~16 min | ~7 min | −55% |
| After-hours charting ("pajama time") | ~6 hrs/wk | ~2.5 hrs/wk | −3.5 hrs/wk |
| Discharge summary turnaround | hours to next day | minutes to draft | same-day |
| Claim denial rate | ~10–12% | ~7–8% | −30–40% |
| Average length of stay | baseline | −0.3 to −0.5 days | faster flow |
| Patient message response time | hours | near-instant draft | lower inbox burden |
| Clinician-reported burnout signal | high | improved | retention win |

Two of those outcomes deserve a word, because they pay for the project twice over. The documentation rows are not a vanity metric: clinician burnout is a staffing and retention crisis, and the hours a physician spends charting after their kids are asleep are the hours that drive them out of the profession. Giving an hour a day back is a recruiting advantage with a real dollar value. The claim-denial row is the one a CFO underwrites the whole program on: denied claims are revenue already earned and then lost to rework, and roughly two-thirds of denials are recoverable, yet many are never reworked at all. Catching the omissions before submission converts directly into collected cash, which is how an AI layer stops being a cost center and starts funding itself.
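The "funds itself" claim is simple arithmetic. The sketch below uses the midpoints of the table's denial-rate ranges (11% before, 7.5% after); every other input (claim volume, average claim value, the share of denials that would never have been reworked, rework cost per denial) is a hypothetical placeholder to replace with a hospital's own figures. It also assumes reworked denials are eventually paid in full, which makes the estimate conservative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DenialImpact:
    denials_prevented: float
    revenue_retained: float      # would have been written off, now paid first time
    rework_cost_avoided: float   # labour no longer spent reworking denials


def denial_impact(claims_per_year: int, avg_claim_value: float,
                  denial_rate_before: float, denial_rate_after: float,
                  share_never_reworked: float,
                  rework_cost_per_denial: float) -> DenialImpact:
    """Annual effect of cutting the denial rate (all inputs are assumptions)."""
    prevented = claims_per_year * (denial_rate_before - denial_rate_after)
    return DenialImpact(
        denials_prevented=prevented,
        # the share that was never reworked was simply lost revenue
        revenue_retained=prevented * share_never_reworked * avg_claim_value,
        # the remainder would have been reworked (and, by assumption, paid)
        rework_cost_avoided=prevented * (1 - share_never_reworked) * rework_cost_per_denial,
    )


# Hypothetical mid-size hospital; only the two denial rates come from the table.
impact = denial_impact(claims_per_year=200_000, avg_claim_value=1_500.0,
                       denial_rate_before=0.11, denial_rate_after=0.075,
                       share_never_reworked=0.6, rework_cost_per_denial=25.0)
```

With those placeholder inputs, about 7,000 denials are prevented per year, roughly $6.3M that would have been written off is collected on first submission, and about $70,000 of rework labour is avoided.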

The metric that is not in the table is the one we watch hardest: clinician edit rate on AI drafts. If physicians accept drafts wholesale, the model may be drifting and nobody is checking. If they reject everything, the feature is dead weight. A healthy, stable edit rate — meaningful revision, high eventual acceptance — is the real signal the system is both used and supervised. We instrument it from day one.
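A minimal sketch of how the edit-rate signal could be computed over a window of reviewed drafts, using Python's difflib to measure how much of each draft the clinician changed. The Review type, the treatment of outright rejections, and both thresholds are hypothetical placeholders. Character-level similarity is a crude proxy; a production version would more likely diff structured note fields.

```python
from __future__ import annotations

from dataclasses import dataclass
from difflib import SequenceMatcher
from statistics import mean


@dataclass(frozen=True)
class Review:
    draft: str
    signed: str | None   # None: the clinician rejected the draft outright


def edit_fraction(draft: str, signed: str) -> float:
    """0.0 = accepted verbatim; approaches 1.0 as the text is rewritten."""
    return 1.0 - SequenceMatcher(None, draft, signed).ratio()


def edit_rate_signal(reviews: list[Review], min_mean_edit: float = 0.02,
                     max_reject_rate: float = 0.5) -> tuple[str, float, float]:
    """Classify a window of reviews as (status, reject_rate, mean_edit)."""
    if not reviews:
        raise ValueError("no reviews in window")
    accepted = [r for r in reviews if r.signed is not None]
    reject_rate = 1 - len(accepted) / len(reviews)
    mean_edit = mean(edit_fraction(r.draft, r.signed) for r in accepted) if accepted else 1.0
    if reject_rate >= max_reject_rate:
        status = "dead weight"          # clinicians are throwing drafts away
    elif mean_edit < min_mean_edit:
        status = "rubber-stamp risk"    # near-verbatim acceptance: is anyone checking?
    else:
        status = "healthy"              # drafts are used and meaningfully revised
    return status, reject_rate, mean_edit
```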

The governance and compliance plane

Spanning all of it, exactly as in any regulated system we build, is the governance plane — and in healthcare it is HIPAA-shaped and load-bearing, not paperwork. An immutable audit log records every access and every AI proposal: who saw which PHI, what the model was shown, what it drafted, and who signed it. PHI minimization and de-identification mean the system carries the minimum necessary at every hop, and any processing that does not strictly require identifiers runs on de-identified data. Role-based access control enforces that a given user — or a given AI workflow — can only reach the data its role permits, and the model inherits the requesting clinician's permissions rather than holding god-mode access. And evaluation and drift monitoring treats every model output as a versioned artifact tested against a fixed clinical eval set, so a regression is caught before it reaches a patient. Any third-party model provider operates under a Business Associate Agreement, with data-handling terms that forbid training on the hospital's PHI. You do not bolt this on after a pilot; it is the substrate the pilot runs on.
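Immutability is normally enforced by the storage layer (write-once or append-only stores), but hash-chaining the log also makes tampering evident. A minimal sketch, assuming a hypothetical AuditLog class, illustrative action names, and entries that carry identifiers rather than raw PHI; it does not by itself satisfy HIPAA's audit-control requirements.

```python
from __future__ import annotations

import hashlib
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone

GENESIS = "0" * 64   # placeholder "previous hash" for the first entry


def _digest(record: dict) -> str:
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass
class AuditLog:
    """Append-only log where each entry commits to the hash of the one before it."""
    entries: list[dict] = field(default_factory=list)

    def append(self, actor: str, action: str, detail: dict) -> str:
        record = {
            "ts": datetime.now(timezone.utc).isoformat(),
            "actor": actor,      # clinician ID, or an AI workflow acting for one
            "action": action,    # e.g. "read_phi", "ai_draft", "sign"
            "detail": detail,    # what was shown, drafted or signed (IDs, not PHI)
            "prev": self.entries[-1]["hash"] if self.entries else GENESIS,
        }
        record["hash"] = _digest(record)   # hash covers every field above
        self.entries.append(record)
        return record["hash"]

    def verify(self) -> bool:
        """Editing or reordering an entry, or deleting one mid-chain, breaks it.
        Detecting truncation of the tail needs the latest hash stored elsewhere."""
        prev = GENESIS
        for rec in self.entries:
            body = {k: v for k, v in rec.items() if k != "hash"}
            if rec["prev"] != prev or _digest(body) != rec["hash"]:
                return False
            prev = rec["hash"]
        return True
```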

The arc is the same one we apply to every system with consequences: get the source of truth right first, keep the probabilistic component off the path where a wrong answer is unrecoverable, ground every generation in real cited data, and instrument the whole thing so a single decision can be replayed and defended. It is the same minimalism behind our Sweep iOS app — zero third-party dependencies, 100% on-device, 23 of 23 tests green — scaled up to a hospital: fewer unaccountable moving parts on the path that matters, and discipline around everything that feeds it. AI did not replace the hospital management system. It made the people running it faster, and it could only do that because the system underneath was built to be trusted.

Key takeaways

  • Build the system of record first: bounded services (ADT, EHR, orders, pharmacy, scheduling/beds, revenue cycle) federated through an HL7/FHIR interface engine into one canonical clinical store. The AI is only as good as that foundation.
  • Adopt one inviolable rule: the LLM proposes, a credentialed clinician signs off, and only then is anything written back. Never wire a probabilistic model to write directly into a life-safety record.
  • Ground every generation in the patient's own FHIR record plus vetted guidelines, with citations — retrieval is the safety mechanism, not a freshness trick.
  • Use structured, schema-validated outputs so proposals can be mechanically checked (dose, allergy, contradiction) before a human ever reads them.
  • Measure outcomes that matter: documentation time and after-hours charting, discharge turnaround, claim-denial rate, length of stay — and watch clinician edit rate as the signal the system is both used and supervised.
  • Make HIPAA governance the substrate: immutable audit log, PHI minimization and de-identification, role-based access the model inherits, eval/drift monitoring, and BAAs that forbid training on your PHI.

Have a system of record that's ready for an AI layer — or one that needs building first?

We design and ship production healthcare AI and the platforms underneath it — idea to production in weeks, governance built in from day one.
