The four late-onset sepsis models, compared across eight dimensions.

Cohort design, signal stack, ML methodology, validation, performance, alarm policy, limitations, contributions. The Substack post walks the argument; this page is the grid it was distilled from.

Why this page exists.

Episode 03 of Road to CEPAS 2026 is a close read of the four anchor papers in Branch A — continuous-physiology machine learning for late-onset sepsis (LOS) in preterm infants. The Substack post argues that the headline AUCs are similar but not the same, and that the two decisions which matter most — signal stack and alarm policy — are the two the literature documents least clearly.

This page is the working file behind that argument: the full eight-dimension comparison, read across Berg 2023 (WKZ Utrecht), Kausch 2023 (UVA and two partner NICUs), Yang 2024 (Eindhoven/Máxima) and Meeus 2024 (Antwerp/Innocens). Three supporting-cast papers from the Episode 02 map — Honoré, Leon, Rio — sit at the foot. Full citations and licensing are in the references.

A framing note. Laying four headline AUCs in a row invites the eye to read them as four measurements of one quantity. They are not. Each answers a different question — different outcome, cohort, prediction horizon, evaluation unit. The grid below is built to make those differences legible rather than hide them.

The four papers at a glance.

Table 1 — Anchor papers, summary parameters. AUC figures are each paper's headline number under its own conditions; see Dimension 05 for why they are not directly comparable.
Parameter	Berg 2023	Kausch 2023	Yang 2024	Meeus 2024
Group	WKZ Utrecht (NL)	UVA + Columbia + WashU (US)	Eindhoven / Máxima (NL)	Antwerp / Innocens (BE)
Cohort	2,519 infants / 389 LOS	2,494 analysed / 302 LOS events	119 infants / 51 LOS	865 infants / 206 LOS+NEC episodes
Population	Whole NICU, GA ≤32 wk	VLBW (<1500 g)	Preterm <32 wk, selected	Preterm <32 wk
Outcome	Culture-positive LOS	Culture-positive LOS	Culture-positive LOS	LOS and NEC, jointly
Signals	HR, SpO2	HR, SpO2 (+ cross-correlation)	HR, RR, SpO2	HR, RR, SpO2, PI, temp, FiO2, glucose
Sampling	1 / min	0.5 Hz	1 Hz	1 / min
Chosen model	Logistic regression	Logistic regression (POWS)	XGBoost	XGBoost
Headline AUC	0.73 train / 0.79 test (t=0)	0.82 train / ≈0.79 external	0.875 (6 h pre-CRASH)	0.93 (CV, reference point)
Validation	Single-centre, longitudinal sim	Multi-NICU external	Single-centre, 7-fold CV	Single-centre + temporal hold-out
Pre-clinical detection	≈47–60% of patients	Risk rises 23–24 h before blood culture	96% patient-wise (small n)	69% all / 81% severe episodes

Dimension 01

Cohort design.

Berg — Retrospective, single-centre, 2008–2019. Every consecutively admitted infant GA ≤32 wk regardless of admission reason: the broadest, least selected population of the four. Controls drawn from negative-culture and out-of-window patients, matched on GA and time-since-birth.
Kausch — Retrospective, three NICUs, 2012–2021. Restricted to VLBW (<1500 g) — a narrower, sicker slice. Strict definition: positive culture after day 3, ≥5 days antibiotics, ≥2 antibiotic-free days before.
Yang — Retrospective, single-centre, 2016–2018. The smallest and most selected cohort: 51 LOS + 68 controls. Clean cases, clean controls — ideal for modelling, least representative of a real ward.
Meeus — Retrospective, single-centre, 2012–2020. Outcome is LOS and NEC jointly — 206 combined episodes. Not a LOS-only paper.
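Kausch's case definition reads naturally as a predicate. A minimal sketch, assuming a hypothetical record layout; the class and field names are illustrative and not taken from the paper, and the criteria are coded as summarised on this page rather than from the paper's exact wording.

```python
from dataclasses import dataclass


@dataclass
class CultureEvent:
    """Hypothetical record for one positive blood culture (field names are illustrative)."""
    day_of_life: int                  # postnatal day the culture was drawn
    antibiotic_days_after: int        # days of antibiotics that followed the culture
    antibiotic_free_days_before: int  # antibiotic-free days preceding the culture


def is_los_event(ev: CultureEvent) -> bool:
    """Apply the three criteria as summarised on this page."""
    return (
        ev.day_of_life > 3                        # positive culture after day 3
        and ev.antibiotic_days_after >= 5         # at least 5 days of antibiotics
        and ev.antibiotic_free_days_before >= 2   # at least 2 antibiotic-free days before
    )


print(is_los_event(CultureEvent(day_of_life=12, antibiotic_days_after=7, antibiotic_free_days_before=4)))
print(is_los_event(CultureEvent(day_of_life=2, antibiotic_days_after=7, antibiotic_free_days_before=2)))
```

Each extra criterion removes borderline episodes from the positive class, which is one reason event counts from different definitions should not be read against each other.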

Takeaway

Four different denominators. Whole-NICU preterm, VLBW-only, a small hand-cleaned cohort, and a joint LOS+NEC cohort. Before a single AUC is compared, the patients and the outcome already differ.

Dimension 02

Signal stack.

Really three decisions in one: which signals, sampled how fast, from which device.

Berg — Two signals, 1/min: HR and SpO2 only. Deliberately excluded CRP and blood pressure (clinician-initiated, leakage risk) and temperature and RR (artefact-prone).
Kausch — Two signals plus their cross-correlation, 0.5 Hz. Key finding: performance is essentially unchanged whether HR comes from ECG or pulse oximetry — the model can run on a standalone oximeter.
Yang — Three signals, 1 Hz, 102 features. The sampling-rate ablation: raw waveform 0.886 → 1 Hz 0.875 → 1/min 0.825 → 1/h 0.687. Minute-by-minute is the deployable floor; hourly destroys the signal.
Meeus — Seven signals, 1/min: the widest stack. Breadth does not, however, yield a clearly higher AUC on comparable terms.
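The sampling-rate ablation in the Yang entry amounts to recomputing the same features from coarser copies of the same signal. The sketch below shows one plausible mechanism for the loss, using block averaging on synthetic data. The averaging rule, the function name and the toy heart-rate trace are all assumptions; this page does not record how the paper downsampled.

```python
from statistics import mean


def downsample(samples, factor):
    """Average consecutive blocks of `factor` samples; a trailing partial block is dropped."""
    n_blocks = len(samples) // factor
    return [mean(samples[i * factor:(i + 1) * factor]) for i in range(n_blocks)]


# Synthetic 1 Hz heart rate for two minutes: 150 bpm baseline with a 10 s dip to 90 bpm.
hr_1hz = [150.0] * 120
for t in range(30, 40):
    hr_1hz[t] = 90.0

hr_1min = downsample(hr_1hz, 60)

print(min(hr_1hz))  # 90.0: the dip is fully visible at 1 Hz
print(hr_1min)      # [140.0, 150.0]: the same dip is diluted to a 10 bpm shift
```

A short deceleration that is obvious at 1 Hz becomes a small offset at one value per minute, and would all but vanish in an hourly mean.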

Takeaway

Adding channels is not the lever. What moves performance is sampling resolution and a well-chosen feature. The single most deployment-relevant fact in the field, that a pulse-oximeter-only model works, gets one paragraph in one paper.
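The cross-correlation feature in the Kausch entry can be sketched as the strongest correlation between HR and SpO2 over a small range of time shifts. This is an illustration under assumptions: the lag range, the normalisation and the function names are hypothetical, and the paper's exact definition may differ.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def max_cross_correlation(hr, spo2, max_lag):
    """Return (correlation, lag) with the largest |correlation| over lags in [-max_lag, max_lag].

    A positive lag pairs hr[t] with spo2[t + lag], i.e. SpO2 follows HR.
    """
    best = (0.0, 0)
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = hr[:len(hr) - lag], spo2[lag:]
        else:
            a, b = hr[-lag:], spo2[:len(spo2) + lag]
        r = pearson(a, b)
        if abs(r) > abs(best[0]):
            best = (r, lag)
    return best


# Synthetic window: a heart-rate dip followed two samples later by a desaturation.
hr = [150, 150, 150, 120, 100, 120, 150, 150, 150, 150, 150, 150]
spo2 = [97, 97, 97, 97, 97, 91, 87, 91, 97, 97, 97, 97]

r, lag = max_cross_correlation(hr, spo2, max_lag=3)
print(round(r, 3), lag)
```

The point of such a feature is that it encodes coupling between two channels, something neither channel's own statistics can express.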

Dimension 03

ML methodology.

Berg — LR vs GAM vs XGBoost; all within 0.01 AUC, no significant difference by DeLong's test. Chose LR for interpretability.
Kausch — LR (cubic splines) vs neural net vs XGBoost vs random forest; similar across all. Chose LR — "equal performance, better explainability."
Yang — Seven methods including two deep-learning architectures. XGBoost won; deep learning underperformed.
Meeus — XGBoost only, no bake-off. SHAP for explainability.

Takeaway

Three of four ran a model comparison, and in none did added complexity pay off: Berg and Kausch found the candidates near-equivalent, and in Yang deep learning lost to XGBoost. The bottleneck is feature engineering, not algorithm sophistication.

Dimension 04

Validation rigour.

Berg — 75/25 patient-grouped split, bootstrap CIs, longitudinal alarm simulation. Single-centre; no external validation.
Kausch — The field's strongest: trained on one NICU, externally validated on two others. TRIPOD-reported, calibration assessed per site. External drop ≈0.03 AUC.
Yang — 7-fold cross-validation, nested for hyperparameters. Single-centre; no external validation — named as the main limitation.
Meeus — Cross-validation plus a temporal hold-out (year-2020 patients). Still single-centre; generalisability stated as the main limitation.
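Two of the ingredients named above, a patient-grouped split and bootstrap confidence intervals, can be sketched with the standard library alone. This is a generic illustration, not any paper's pipeline: the function names, the percentile method and the toy data are assumptions.

```python
import random


def grouped_split(patient_ids, test_fraction=0.25, seed=0):
    """Split row indices at patient level so no infant appears in both sets."""
    unique = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_test = max(1, round(len(unique) * test_fraction))
    test_patients = set(unique[:n_test])
    train_idx = [i for i, p in enumerate(patient_ids) if p not in test_patients]
    test_idx = [i for i, p in enumerate(patient_ids) if p in test_patients]
    return train_idx, test_idx


def auc(scores, labels):
    """AUC as the probability that a positive outscores a negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(scores, labels, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC.

    Resamples rows and skips resamples with only one class, so the input must
    contain both classes. With several windows per infant, resampling patients
    instead of rows would respect the grouping.
    """
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:
            stats.append(auc([scores[i] for i in idx], ys))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


patient_ids = ["a", "a", "b", "b", "c", "c", "d", "d"]
train_idx, test_idx = grouped_split(patient_ids, test_fraction=0.25, seed=1)

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
print(auc(scores, labels), bootstrap_auc_ci(scores, labels, n_boot=200))
```

Neither step substitutes for external validation: both estimate uncertainty within one hospital's data.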

Takeaway

Only Kausch has shown its model survives contact with another hospital. Every headline number except Kausch's external figures is an internal number, and internal numbers systematically flatter the model.

Dimension 05

Performance.

Each number, with its conditions attached — that is the point.

Berg 0.79 — cross-sectional AUC at t=0 (0.73 train / 0.79 test), whole-NICU, culture-positive LOS. Longitudinal: ≈58–67% recall under the multi-threshold policy; ≈47–60% of patients detected pre-clinically.
Kausch ≈0.79 — prediction within 24 h, VLBW-only, external. Pulse-rate ≈ ECG. Risk rises significantly 23–24 h before blood culture.
Yang 0.875 — measured 6 h before CRASH, on 119 hand-cleaned infants. Patient-wise sensitivity 96.1%, but specificity 19.1%.
Meeus 0.93 — for LOS+NEC jointly, on cross-validated time windows (a metric Meeus et al. flag as optimistic on imbalanced data). Sliding-window: 69% all / 81% severe; median time gain ≈10 h.
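Yang's patient-wise figures are easier to read as head counts. The sketch below converts the quoted percentages back into approximate counts, using the cohort sizes from Dimension 01 (51 LOS, 68 controls). The counts are inferred here by rounding and are an assumption, not figures taken from the paper.

```python
def patient_wise_counts(n_cases, n_controls, sensitivity, specificity):
    """Turn patient-wise sensitivity and specificity into approximate head counts."""
    true_pos = round(n_cases * sensitivity)      # cases with at least one alarm
    true_neg = round(n_controls * specificity)   # controls with no alarm
    false_pos = n_controls - true_neg            # controls with at least one alarm
    flagged = true_pos + false_pos
    return {
        "true_pos": true_pos,
        "false_pos": false_pos,
        "flagged": flagged,
        "ppv": true_pos / flagged,
    }


# Cohort sizes and percentages as quoted on this page; counts are back-calculated.
m = patient_wise_counts(n_cases=51, n_controls=68, sensitivity=0.961, specificity=0.191)
print(m["true_pos"], m["false_pos"], m["flagged"], round(m["ppv"], 3))
```

On these inferred counts roughly 104 of 119 infants would be flagged at least once, so the 96% sensitivity is bought largely by alarming on most controls as well.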

Takeaway — the turn

Four different outcomes, horizons, or cohorts. The 0.79–0.93 range is a set of four different measurements, not four readings of one quantity. The metric that survives translation is the fraction of LOS episodes caught pre-clinically: set Yang's small-n outlier aside and it lands at roughly half to two-thirds.

Dimension 06

Alarm policy.

What converts an AUC into something a clinician actually experiences: chosen differently from paper to paper, rarely justified.

Berg — Hourly scores; thresholds set to fixed unit-wide rates. 8-hour refractory period (a nursing shift); a muted alarm re-fires only if a higher threshold is crossed. Burden reported as alarms per patient-day.
Kausch — An alert switches on at a threshold and stays on until 24 h with no further crossing. No refractory period, no escalation.
Yang — Adopts Berg's framework wholesale — same 8-hour silencing, same multi-threshold escalation — and cites it.
Meeus — Hourly probabilities aggregated into 24-hour buckets; reports alarm-days per week, echoing Berg's burden accounting without the refractory machinery.
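The silencing-with-escalation policy in the Berg entry can be sketched as a small state machine over hourly scores. This is an illustration of the description above, not Berg's implementation. The threshold values, the choice to mute the eight scores following an alarm, and the rule that an escalated alarm restarts the silencing window are all assumptions.

```python
def simulate_alarms(hourly_scores, thresholds, refractory_hours=8):
    """Return (hour, level) for each alarm under silencing with escalation.

    thresholds: ascending list; level k means score >= thresholds[k].
    After an alarm at level k, the next `refractory_hours` scores are muted
    unless one of them reaches a strictly higher level.
    """
    alarms = []
    muted_until = -1   # last muted hour (inclusive) of the current silencing window
    muted_level = -1   # level of the alarm that started that window
    for hour, score in enumerate(hourly_scores):
        level = sum(score >= t for t in thresholds) - 1  # -1: below every threshold
        if level < 0:
            continue
        in_refractory = hour <= muted_until
        if not in_refractory or level > muted_level:
            alarms.append((hour, level))
            muted_until = hour + refractory_hours
            muted_level = level
    return alarms


def alarms_per_patient_day(alarms, n_hours):
    """Alarm burden normalised to 24 h of monitoring."""
    return len(alarms) / (n_hours / 24)


# Hypothetical score trace and thresholds, for illustration only.
scores = [0.1, 0.4, 0.5, 0.7, 0.8] + [0.4] * 9
alarms = simulate_alarms(scores, thresholds=[0.3, 0.6])
print(alarms)  # [(1, 0), (3, 1), (12, 0)]
```

In this trace the score sits above the lower threshold for 13 consecutive hours, yet the policy produces three alarms: one on first crossing, one on escalation, one when the silencing window expires.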

Takeaway

Three distinct policies across four papers, since Yang adopts Berg's. The policy, not the model, drives the detection headline. It is the highest-leverage deployment decision and the least clearly documented.
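To see how much the policy alone changes what a clinician experiences, the same score trace can be run through a latched alert of the kind described for Kausch and a daily bucket of the kind described for Meeus. Both functions are sketches under assumptions: scores are taken as hourly, any hour at or above the threshold counts as a crossing, and the bucket is flagged on its maximum. None of these details is confirmed by this page.

```python
def latched_alert(hourly_scores, threshold, hold_hours=24):
    """Alert state per hour: on at a crossing, off once `hold_hours` pass without another."""
    state = []
    last_crossing = None
    for hour, score in enumerate(hourly_scores):
        if score >= threshold:
            last_crossing = hour
        on = last_crossing is not None and hour - last_crossing < hold_hours
        state.append(on)
    return state


def alarm_days(hourly_probs, threshold, bucket_hours=24):
    """One flag per bucket: True if the bucket's maximum probability reaches the threshold."""
    flags = []
    for start in range(0, len(hourly_probs), bucket_hours):
        bucket = hourly_probs[start:start + bucket_hours]
        flags.append(max(bucket) >= threshold)
    return flags


# Hypothetical three-day trace with two isolated one-hour spikes.
trace = [0.2] * 72
trace[5] = 0.9
trace[50] = 0.9

state = latched_alert(trace, threshold=0.5)
days = alarm_days(trace, threshold=0.5)

print(sum(state))  # hours with the alert showing
print(days)        # [True, False, True]
```

Two one-hour spikes become 46 alert-hours under the latch and two alarm-days out of three under the bucket, from an identical model output.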

Dimension 07

Honest limitations.

Berg — Imputed blood-culture timestamps; culture-positive-only labelling; blood-culture time as a suspicion proxy may overestimate lead time; may fire on non-sepsis deterioration.
Kausch — Missing-data exclusions; culture-proven-only definition; co-morbidities alter features; a prospective multicentre study still needed.
Yang — Selected, non-representative cohort; maturation difference between groups; small dataset; no external validation; deep learning underperformed.
Meeus — LOS labelling is error-prone; retrospective-labelling risk; no generalisability data; AUROC optimistic on imbalanced data; metrics are backward-looking.
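The caveat that AUROC is optimistic on imbalanced data can be made concrete with Bayes' rule. At a fixed sensitivity and specificity, and therefore a fixed point on the ROC curve, the positive predictive value falls as positives become rarer. The operating point and prevalences below are illustrative and do not come from any of the four papers.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Same operating point, three prevalences.
for prevalence in (0.5, 0.1, 0.01):
    print(prevalence, round(ppv(0.80, 0.90, prevalence), 3))
# 0.5 0.889
# 0.1 0.471
# 0.01 0.075
```

The ROC curve is identical in all three rows, which is why a window-level AUROC says little about how many alarms at the bedside turn out to be true.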

Takeaway

The limitations rhyme. Every paper concedes culture-positive labelling is imperfect; every single-centre paper concedes generalisability is unproven. The field is honest in its discussion sections: the deployment gap is acknowledged, just never closed.

Dimension 08

Unique contributions.

Berg — Largest unselected single-centre cohort; the most extensive retrospective clinical-impact simulation in the LOS literature; the alarm-fatigue framework that became a field convention.
Kausch — The only true multi-NICU external validation; the pulse-rate-equals-ECG finding; HR–SpO2 cross-correlation as a feature; a direct benchmark against the legacy HRC index.
Yang — The sampling-rate ablation; the seven-method bake-off showing deep learning does not help; concrete evidence of Berg's alarm framework propagating.
Meeus — Joint LOS+NEC prediction; the widest signal stack; the only commercial spinoff (Innocens BV); a pointed critique of AUROC on imbalanced data.

Supporting cast.

Three further Branch A groups mapped in Episode 02, included in the close read as context rather than as anchors.

Honoré 2023 — Karolinska / KTH. 325 infants, only 20 LOS cases; Naïve Bayes, chosen to avoid overfitting a tiny positive set; AUROC 0.82 up to 24 h before suspicion.
Leon 2021 — Rennes. 49 infants; HRV from ECG only, visibility-graph features; AUROC 0.877 in the 6 h before antibiotics. The methodologically distinctive single-signal end of the spectrum.
Rio 2022 — Lausanne. Not a model paper: an independent validation of the commercial HeRO/HRC index. Strongly gestational-age-dependent (sensitivity 76% at <28 wk vs 25% at >32 wk); optimal cutoff 2.76, not the FDA-cleared 2.0.

References & sourcing.

van den Berg M, Medina O, Loohuis I, et al. Development and clinical impact assessment of a machine-learning model for early prediction of late-onset sepsis. Comput Biol Med 2023;163:107156. doi:10.1016/j.compbiomed.2023.107156 (Open access · CC BY-NC-ND)
Kausch SL, Brandberg JG, Qiu J, et al. Cardiorespiratory signature of neonatal sepsis: development and validation of prediction models in 3 NICUs. Pediatr Res 2023;93:1913–1921. doi:10.1038/s41390-022-02444-7 (NIH / PMC author manuscript)
Yang M, Peng Z, van Pul C, et al. Continuous prediction and clinical alarm management of late-onset sepsis in preterm infants using vital signs from a patient monitor. Comput Methods Programs Biomed 2024;255:108335. doi:10.1016/j.cmpb.2024.108335 (Open access · CC BY)
Meeus M, Beirnaert C, Mahieu L, et al. Clinical decision support for improved neonatal care: development of a machine learning model for the prediction of late-onset sepsis and necrotizing enterocolitis. J Pediatr 2024;266:113869. doi:10.1016/j.jpeds.2023.113869 (All rights reserved)
Honoré A, Forsberg D, Adolphson K, et al. Vital sign-based detection of sepsis in neonates using machine learning. Acta Paediatr 2023;112:686–696. doi:10.1111/apa.16660 (Open access · CC BY-NC-ND)
Leon C, Carrault G, Pladys P, Beuchée A. Early detection of late onset sepsis in premature infants using visibility graph analysis of heart rate variability. IEEE J Biomed Health Inform 2021;25:1006–1017. doi:10.1109/JBHI.2020.3021662 (IEEE · author manuscript)

This page discusses and compares the papers above; it does not reproduce them. Every source links to its publisher of record. No figures, tables or full text are reproduced here. The eight-dimension comparison is the author's own synthesis.