Inside a real-time AI fraud & risk platform: an end-to-end architecture

Algorion Engineering · June 16, 2026 · 11 min read

A fraud-and-risk platform has to answer one brutal question on every transaction: authorize, decline, or step up — and it has to answer in under 100 milliseconds, at the throughput a payment network demands, with real money and real regulators on the line. This is what an end-to-end architecture for that system looks like, and where the AI layer genuinely belongs.

Payment fraud decisioning is one of the least forgiving problems in distributed systems. A recommendation engine that is 80ms slow is annoying; a fraud check that is 80ms slow times out the authorization and either declines a good customer or waves through a bad one. Every decision is logged, auditable, and potentially the subject of a regulator's question months later. And the adversary on the other side is adaptive — the fraud patterns that mattered last quarter are already stale. The temptation today is to "add an LLM" and call it modern. Done naively, that is exactly how you blow your latency budget and your credibility. Below is how we design these systems so the AI layer is built in from the foundation rather than stapled on at the end — and, just as importantly, kept off the hot path where it does not belong.

The shape of the problem

Strip away the vendor logos and a real-time risk platform is two systems wearing one name. The first is a synchronous hot path: a request arrives, and within a strict latency budget you must return a decision the payment flow can act on. The second is an asynchronous analytical plane where the slow, smart, expensive work happens — investigation, link analysis, model training, and narrative explanation — feeding learnings back into the hot path over minutes and hours rather than milliseconds. Conflating the two is the single most common architectural mistake we see. The discipline that makes this system work is knowing, for every component, which side of that line it lives on and why.

Figure: The synchronous hot path (top, Royal Blue) returns a decision in under 100 ms using only fast rules and a gradient-boosted risk score. The LLM lives in the separate asynchronous lane (bottom, Madison Blue), where it investigates, explains, and proposes but never blocks an authorization. A governance plane spans both.

Ingestion: the event gateway

Everything begins as an event. A card swipe, a tap, an online checkout — each becomes a structured authorization request that lands on an event gateway fronting a durable log such as Kafka. We make this layer boring on purpose, because boring is what you want under a payment network. Two properties matter most. First, idempotency: networks retry, and the same authorization can arrive twice. Each event carries a stable idempotency key so a duplicate is recognized and collapsed rather than double-counted. Second, delivery semantics: a payment log is at-least-once by nature, so true exactly-once is an illusion you pay dearly to chase. We design consumers to be idempotent instead, which gives the operational equivalent of exactly-once without the distributed-transaction tax. The gateway also does the unglamorous regulatory work up front — PII minimization and tokenization, so the raw PAN is swapped for a token at the edge and the sensitive material never travels deeper into the system than it must.
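A minimal sketch of what an idempotent consumer can look like, assuming each event exposes its key as a field. The names `AuthEvent`, `IdempotentConsumer`, `idempotency_key`, and `card_token` are hypothetical, and the in-memory set stands in for whatever durable, expiring store a real deployment would use.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AuthEvent:
    idempotency_key: str   # stable across network retries (hypothetical field name)
    card_token: str        # tokenized at the edge; the raw PAN never appears here
    amount_cents: int


@dataclass
class IdempotentConsumer:
    """Collapses duplicate deliveries from an at-least-once log."""
    seen: set = field(default_factory=set)
    total_cents: int = 0

    def handle(self, event: AuthEvent) -> bool:
        """Apply the event once; return False if it was a duplicate."""
        if event.idempotency_key in self.seen:
            return False
        self.seen.add(event.idempotency_key)
        self.total_cents += event.amount_cents
        return True


consumer = IdempotentConsumer()
first = AuthEvent("auth-123", "tok_abc", 2_500)
retry = AuthEvent("auth-123", "tok_abc", 2_500)   # same key, redelivered
results = [consumer.handle(first), consumer.handle(retry)]
```

The second delivery is recognized and skipped, so the running total counts the authorization once. In a real consumer the key check and the state update would need to be atomic, which this single-process sketch does not address.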

Features: online store vs offline store, and the skew that kills you

A risk model is only as good as the features it sees. The hard part is that those features have two homes with different jobs. The online feature store serves the hot path: single-digit-millisecond reads of pre-computed aggregates — velocity counts, device reputation, "spend in the last 60 seconds," distance from last known location — usually out of an in-memory store like Redis. The offline store holds the full history used to train models. The cardinal sin here is training/serving skew: if the value a feature had at training time differs from the value served at decision time, your offline accuracy is a fiction.
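To make "spend in the last 60 seconds" concrete, here is one possible way to maintain that aggregate for a single card. It is an illustrative sketch, not a description of any particular feature store: `SpendWindow` is a hypothetical name, timestamps are plain seconds, and events are assumed to arrive in time order.

```python
from collections import deque


class SpendWindow:
    """Rolling sum of spend over the trailing `window_s` seconds for one card."""

    def __init__(self, window_s: float = 60.0) -> None:
        self.window_s = window_s
        self.events: deque = deque()   # (event_time, amount_cents), oldest first
        self.total_cents = 0

    def add(self, event_time: float, amount_cents: int) -> None:
        self.events.append((event_time, amount_cents))
        self.total_cents += amount_cents

    def spend_as_of(self, now: float) -> int:
        # Evict everything that has fallen out of the trailing window.
        while self.events and self.events[0][0] <= now - self.window_s:
            _, amount = self.events.popleft()
            self.total_cents -= amount
        return self.total_cents


window = SpendWindow()
window.add(0.0, 1_000)
window.add(30.0, 500)
window.add(70.0, 200)
recent_spend = window.spend_as_of(70.0)   # the t=0 purchase has aged out
```

Because the sum is maintained incrementally, the read on the hot path is a cheap lookup rather than a scan of history, which is the property that makes pre-computed aggregates viable under a tight latency budget.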

The defense is point-in-time correctness. When you build a training set, every feature must be read as of the instant the historical decision was made — never with data that only existed afterward. Get this wrong and you leak the future into the model: it learns from a chargeback that had not happened yet and looks brilliant offline, then collapses in production. We treat point-in-time joins as a non-negotiable property of the feature platform, not a convenience.
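A point-in-time read can be sketched as an "as of" lookup over a feature's write history. The function name `feature_as_of` and the sample data are hypothetical; the only assumption is that each feature value is stored with the time it was written, sorted by that time.

```python
from bisect import bisect_right


def feature_as_of(history, decision_time):
    """Return the latest value written at or before decision_time.

    `history` is a list of (written_at, value) pairs sorted by written_at.
    Returns None when no value existed yet.
    """
    times = [written_at for written_at, _ in history]
    idx = bisect_right(times, decision_time)
    return history[idx - 1][1] if idx else None


# Hypothetical history for one card: (written_at, chargebacks_seen_so_far).
chargeback_history = [(100, 0), (250, 1), (900, 2)]

# A decision made at t=300 may only see the value written at t=250.
value_at_decision = feature_as_of(chargeback_history, 300)

# Reading the latest value instead would leak the chargeback recorded at t=900.
leaky_value = chargeback_history[-1][1]
```

The difference between `value_at_decision` and `leaky_value` is exactly the leak described above: the second read hands the model information that did not exist when the historical decision was made.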

Label leakage is the most expensive bug in fraud ML, because it is invisible until production. A model trained on accidentally-future data passes every offline check and then fails silently with real money behind it. Point-in-time-correct feature reads are how you make that bug structurally impossible rather than something you hope a reviewer catches.

The decisioning core: fast, deterministic, explainable

This is the heart of the hot path, and it is deliberately not where the LLM lives. The decision is produced by two collaborating layers. A deterministic rules engine encodes hard policy and known-bad patterns — blocklists, regulatory constraints, velocity ceilings — and can decline or force step-up authentication outright. Alongside it, a gradient-boosted tree model consumes the online features and emits a calibrated risk score in well under a millisecond. The two are combined into one decision: authorize, decline, or step up. We favor gradient-boosted trees here for a reason beyond speed — they are inspectable. When a regulator or a customer asks why a transaction was declined, "feature X exceeded threshold Y, contributing Z to the score" is an answer; an opaque embedding is not.
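One way to turn a score into the kind of answer described above is to rank per-feature contributions and log the top few as reason codes. This sketch assumes the contributions have already been computed by some attribution method for tree models, which the text does not specify; `top_reasons` and the feature names are hypothetical.

```python
def top_reasons(contributions, k=3):
    """Return the k features that pushed the score up the most, largest first.

    `contributions` maps feature name -> signed contribution to the risk score.
    Features that lowered the score are not reasons for a decline and are dropped.
    """
    positive = [(name, value) for name, value in contributions.items() if value > 0]
    positive.sort(key=lambda item: item[1], reverse=True)
    return positive[:k]


contributions = {
    "txn_count_60s": 0.31,
    "distance_from_last_km": 0.12,
    "device_reputation": -0.05,   # lowered the score, so never reported
    "merchant_risk": 0.02,
}
reasons = top_reasons(contributions, k=2)
```

Logging these pairs alongside the decision gives an analyst or a regulator a concrete, reproducible explanation for that specific transaction.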

The contract this core exposes is strict, and the latency budget is part of the contract, not an aspiration:

def decide(req: AuthRequest) -> Decision:
    # Hard deadline is part of the contract, not a hope. deadline() is
    # assumed to abort the block on expiry and swallow the timeout, so
    # control falls through to the fail-safe return at the bottom.
    with deadline(ms=80):
        feats = online_store.read(
            keys=req.feature_keys,
            as_of=req.event_time,        # point-in-time correct
        )
        if feats.max_staleness_ms > 500:   # freshness SLA
            return step_up(req, reason="stale_features")

        # Deterministic rules run first and can short-circuit the model.
        if (hit := rules.evaluate(req, feats)).blocks:
            return Decision(action="decline", reason=hit.code)

        score = gbm.score(feats)            # < 1 ms, calibrated
        return policy.apply(req, score)     # authorize / decline / step_up

    # Deadline exceeded: fail safe, never hang the network.
    return step_up(req, reason="risk_timeout")

Notice the failure behavior. If features are stale or the deadline is blown, the system does not guess and it does not hang — it returns a step-up. This is the fraud world's version of the fail-open/fail-closed trade-off. A pure fail-open (authorize on error) invites loss; a pure fail-closed (decline on error) burns good customers and revenue. The pragmatic default is to fail toward a safe state — typically a step-up challenge — which neither approves blindly nor declines a legitimate buyer outright. Which way you lean is a risk-appetite decision the business owns, expressed explicitly in code and reviewable in the audit log.
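Expressing the risk appetite explicitly in code might look like the following sketch. The class `RiskPolicy`, its field names, and the threshold values are all hypothetical and chosen only for illustration; the point is that the failure-mode lean is a named, reviewable setting rather than an accident of exception handling.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(str, Enum):
    AUTHORIZE = "authorize"
    DECLINE = "decline"
    STEP_UP = "step_up"


@dataclass(frozen=True)
class RiskPolicy:
    step_up_at: float = 0.60            # illustrative thresholds, not recommendations
    decline_at: float = 0.90
    on_error: Action = Action.STEP_UP   # the fail-safe lean, owned by the business

    def apply(self, score: Optional[float]) -> Action:
        """Map a calibrated score in [0, 1] to an action; None means scoring failed."""
        if score is None:
            return self.on_error
        if score >= self.decline_at:
            return Action.DECLINE
        if score >= self.step_up_at:
            return Action.STEP_UP
        return Action.AUTHORIZE


default_policy = RiskPolicy()
strict_policy = RiskPolicy(on_error=Action.DECLINE)   # a fail-closed variant
```

Switching between the fail-safe and fail-closed variants is a one-line change that shows up in review and in version history, which is what makes the choice auditable.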

The AI/LLM layer, built ground-up — and kept off the hot path

Here is the thesis the whole post turns on: the LLM is designed into the data and feedback loops from day one, not bolted on as a chatbot at the end. "Ground-up" does not mean "in the critical path." It means the AI layer consumes the same governed event and feature data the models do, its outputs are logged and evaluated like any other model artifact, and it sits inside the same governance perimeter. What it must never do is sit synchronously between an authorization request and its decision. A model that thinks for two seconds has no business in a 100ms budget. So the LLM lives in the asynchronous lane, where latency is measured in seconds-to-hours and where its strengths actually matter:

Narrative explanation. Turning the structured reasons behind a flag — the rules that fired, the top score contributors — into a clear case summary an analyst can read in seconds instead of reverse-engineering.
Case investigation and link analysis. Walking the fraud graph to surface that five "unrelated" accounts share a device fingerprint and a funding instrument, and writing up the connected component as a coherent narrative.
Adaptive rule synthesis. Proposing candidate rules from emerging patterns — drafted, never auto-deployed. A human reviews, a shadow run measures, and only then does it ship.
Entity resolution. Reconciling messy identity signals across events into stable entities that downstream features and graph analysis can rely on.
Analyst copilot. A grounded assistant that answers "show me everything connected to this transaction" against governed data, with citations back to the underlying events.
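The link-analysis step in the list above reduces, at its simplest, to finding connected components in a graph of accounts and the attributes they share. The sketch below uses union-find as one reasonable technique; the source does not say which algorithm is used, and `linked_accounts` and the sample identifiers are hypothetical.

```python
from collections import defaultdict


def linked_accounts(edges):
    """Group accounts that share any attribute (device, funding instrument, ...).

    `edges` is an iterable of (account_id, attribute) pairs.
    Returns a list of sets, one per connected component with 2+ accounts.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]   # path halving
            node = parent[node]
        return node

    def union(a, b):
        parent[find(a)] = find(b)

    for account, attribute in edges:
        # Tag nodes so an account id can never collide with an attribute value.
        union(("acct", account), ("attr", attribute))

    groups = defaultdict(set)
    for node in list(parent):
        if node[0] == "acct":
            groups[find(node)].add(node[1])
    return [group for group in groups.values() if len(group) > 1]


edges = [
    ("A", "device:fp1"),
    ("B", "device:fp1"),
    ("B", "card:tok9"),
    ("C", "card:tok9"),
    ("D", "device:fp7"),   # shares nothing, so it is not reported
]
rings = linked_accounts(edges)
```

Accounts A and C never share an attribute directly, yet they land in the same component through B. That transitive connection is the structured finding an LLM in the async lane could then write up as a narrative for the analyst.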

Because the LLM reads untrusted data — transaction memos, merchant strings, support tickets — we treat prompt injection as the default threat, exactly as we would in any agent. Retrieved content is fenced as data, never instructions; the model cannot expand its own permissions; and anything it proposes that touches money or policy is gated behind human approval. Its outputs are not "AI magic" exempt from scrutiny — they are model outputs, logged, versioned, and evaluated against a fixed test set like everything else.
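The approval gate can be made structural rather than procedural. In this sketch a drafted rule carries its own review state and refuses to be deployable until both a human sign-off and a shadow run are recorded. `RuleProposal`, its fields, and the sample rule text are hypothetical, and the fencing of retrieved content is not shown here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RuleProposal:
    rule_text: str
    drafted_by: str                     # e.g. an identifier for the drafting model
    approved_by: Optional[str] = None   # set only by a human reviewer
    shadow_run_passed: bool = False

    def can_deploy(self) -> bool:
        """A draft ships only after human approval and a passing shadow run."""
        return self.approved_by is not None and self.shadow_run_passed


proposal = RuleProposal(
    rule_text="decline if txn_count_60s > 20",
    drafted_by="llm-drafter-v1",
)
deployable_as_drafted = proposal.can_deploy()
```

Nothing the model writes into `rule_text` can change the result of `can_deploy()`, which keeps the permission boundary outside the model's reach.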

The fastest way to ruin a fraud platform is to put a slow, non-deterministic model in front of a synchronous payment. The LLM earns its place by making humans faster and models smarter in the async lane — not by trying to be the goalkeeper on a 100ms clock. Ground-up integration is about the data and governance loops, not about latency-critical placement.

Human-in-the-loop and the feedback flywheel

Flagged and sampled decisions flow into a case-management system where analysts investigate, aided by the AI lane, and record a disposition: fraud or legitimate, with reasons. That disposition is the most valuable data the system produces, because it becomes a label. Labels flow back into the offline store, where they retrain the next generation of models and tune the rules. This is the flywheel: real outcomes continuously sharpen the features, the score, and the policy. The loop has to be closed deliberately — labels carry their decision timestamp so point-in-time correctness survives retraining, and the same disposition feeds drift monitoring so a sudden shift in the fraud/legit ratio raises an alarm before it quietly erodes the model.
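A deliberately simple version of the ratio alarm compares the fraud rate in a recent window of dispositions against a baseline. The function names, the relative-change tolerance, and the minimum sample size are assumptions made for this sketch; a production monitor would likely use a proper statistical test.

```python
def fraud_rate(dispositions):
    """Share of dispositions labelled 'fraud'; 0.0 for an empty window."""
    if not dispositions:
        return 0.0
    return sum(d == "fraud" for d in dispositions) / len(dispositions)


def ratio_shifted(baseline, recent, max_relative_change=0.5, min_samples=50):
    """True when the recent fraud rate moved more than the tolerance vs baseline."""
    if len(recent) < min_samples:
        return False   # too few labels to say anything
    base = fraud_rate(baseline)
    if base == 0.0:
        return fraud_rate(recent) > 0.0
    return abs(fraud_rate(recent) - base) / base > max_relative_change


baseline = ["fraud"] * 10 + ["legit"] * 990   # 1% fraud
spike = ["fraud"] * 5 + ["legit"] * 95        # 5% fraud
steady = ["fraud"] * 1 + ["legit"] * 99       # 1% fraud
alarm_on_spike = ratio_shifted(baseline, spike)
alarm_on_steady = ratio_shifted(baseline, steady)
```

The minimum-sample guard matters in practice: analyst dispositions arrive slowly, and a window with a handful of labels would otherwise trigger alarms on noise.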

The governance plane: where regulators live

Spanning everything is a governance and observability plane, and in fintech it is not optional polish — it is the part that keeps the business licensed. It carries four things. An immutable audit log records every decision with its inputs, the model and rule versions that produced it, and the reason codes, so any authorization can be reconstructed and explained months later. A model registry versions every model and the data it was trained on, which is what makes champion/challenger and shadow deployment safe: a new model runs alongside production scoring real traffic without affecting decisions until it has proven itself. Feature lineage traces every feature from raw event to served value, so when a feature drifts you can find every model that depends on it. And drift and eval monitoring watches inputs and outputs for the distribution shifts that signal either a new fraud campaign or a decaying model.
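Shadow deployment and the audit record fit together naturally, as this sketch tries to show. The challenger scores the same traffic, its output is logged next to the champion's, and only the champion's score is returned. The function `decide_with_shadow`, the record fields, and the version strings are hypothetical, and a plain list stands in for the immutable log.

```python
def decide_with_shadow(features, champion, challenger, audit_log):
    """Act on the champion's score; record the challenger's for later comparison.

    `champion` and `challenger` are (version, scoring_function) pairs.
    """
    champion_version, champion_fn = champion
    challenger_version, challenger_fn = challenger

    score = champion_fn(features)
    try:
        shadow_score = challenger_fn(features)
    except Exception:
        shadow_score = None   # a broken challenger must never affect decisions

    audit_log.append({
        "inputs": dict(features),
        "champion": champion_version,
        "score": score,
        "challenger": challenger_version,
        "shadow_score": shadow_score,
    })
    return score


audit_log = []
decided = decide_with_shadow(
    {"txn_count_60s": 3},
    champion=("gbm-v7", lambda f: 0.20),
    challenger=("gbm-v8", lambda f: 0.90),
    audit_log=audit_log,
)
```

For brevity the challenger is called inline here. On a real hot path it would more plausibly be scored off to the side so that it cannot consume the latency budget.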

This plane is also what satisfies the regulators directly. Explainability — the kind required for adverse-action notices and a customer's right to an explanation under regimes like GDPR — is a property you design in, by keeping the hot-path models inspectable and logging reason codes on every decision. Model-risk-management expectations in the spirit of SR 11-7 — independent validation, documented assumptions, ongoing monitoring — map directly onto the registry, the shadow-deployment discipline, and the drift monitors. You do not retrofit this. You build the audit trail as you build the decision, or you do not really have it.

The thread running through all of it is the same engineering discipline we apply to any system with consequences: keep the critical path fast, deterministic, and explainable; push the slow and the clever off to the side where they can think; and instrument everything so a single decision can be replayed and defended. It is the same minimalism behind our Sweep iOS app, which ships with zero third-party dependencies, runs 100% on-device, and passes 23 of 23 tests — fewer moving parts on the path that matters, every failure mode accounted for. A fraud platform is vastly larger, but the instinct is identical: be ruthless about what is allowed on the hot path, and disciplined about everything that feeds it.

Key takeaways

Treat the platform as two systems: a synchronous hot path under a hard p99 latency budget, and an asynchronous analytical plane for the slow, smart work.
Make ingestion boring and safe — idempotent consumers over at-least-once delivery, with PII tokenized at the edge.
Serve features from an online store but guarantee point-in-time correctness against the offline store, or training/serving skew and label leakage will destroy you in production.
Keep the decision core fast, deterministic, and inspectable (rules + gradient-boosted trees); fail toward a safe step-up default, never a hang.
Build the LLM in ground-up — same governed data, same logging and evals, same governance — but keep it in the async lane for investigation, explanation, and proposals, never on the 100ms path.
Make governance load-bearing: audit log, model registry, champion/challenger and shadow deploys, feature lineage, drift monitoring, and explainability for adverse-action and SR 11-7-style oversight.
