Eight dimensions · four anchor papers · the grid behind the post
The four late-onset sepsis models, compared across eight dimensions.
Cohort design, signal stack, ML methodology, validation, performance, alarm policy, limitations, contributions. The Substack post walks the argument; this page is the grid it was distilled from.
Why this page exists.
Episode 03 of Road to CEPAS 2026 is a close read of the four anchor papers in Branch A — continuous-physiology machine learning for late-onset sepsis (LOS) in preterm infants. The Substack post argues that the headline AUCs are similar but not the same, and that the two decisions which matter most — signal stack and alarm policy — are the two the literature documents least clearly.
This page is the working file behind that argument: the full eight-dimension comparison, read across Berg 2023 (WKZ Utrecht), Kausch 2023 (UVA and two partner NICUs), Yang 2024 (Eindhoven/Máxima) and Meeus 2024 (Antwerp/Innocens). Three supporting-cast papers from the Episode 02 map — Honoré, Leon, Rio — sit at the foot. Full citations and licensing are in the references.
A framing note. Laying four headline AUCs in a row invites the eye to read them as four measurements of one quantity. They are not. Each answers a different question — different outcome, cohort, prediction horizon, evaluation unit. The grid below is built to make those differences legible rather than hide them.
The four papers at a glance.
Table 1 — Anchor papers, summary parameters. AUC figures are each paper's headline number under its own conditions; see Dimension 05 for why they are not directly comparable.
| | Berg 2023 | Kausch 2023 | Yang 2024 | Meeus 2024 |
| --- | --- | --- | --- | --- |
| Group | WKZ Utrecht (NL) | UVA + Columbia + WashU (US) | Eindhoven / Máxima (NL) | Antwerp / Innocens (BE) |
| Cohort | 2,519 infants / 389 LOS | 2,494 analysed / 302 LOS events | 119 infants / 51 LOS | 865 infants / 206 LOS+NEC episodes |
| Population | Whole NICU, GA ≤32 wk | VLBW (<1500 g) | Preterm <32 wk, selected | Preterm <32 wk |
| Outcome | Culture-positive LOS | Culture-positive LOS | Culture-positive LOS | LOS and NEC, jointly |
| Signals | HR, SpO2 | HR, SpO2 (+ cross-correlation) | HR, RR, SpO2 | HR, RR, SpO2, PI, temp, FiO2, glucose |
| Sampling | 1 / min | 0.5 Hz | 1 Hz | 1 / min |
| Chosen model | Logistic regression | Logistic regression (POWS) | XGBoost | XGBoost |
| Headline AUC | 0.73 train / 0.79 test (t=0) | 0.82 train / ≈0.79 external | 0.875 (6 h pre-CRASH) | 0.93 (CV, reference point) |
| Validation | Single-centre, longitudinal sim | Multi-NICU external | Single-centre, 7-fold CV | Single-centre + temporal hold-out |
| Pre-clinical detection | ≈47–60% of patients | Risk rises 23–24 h before | 96% patient-wise (small n) | 69% all / 81% severe episodes |
Dimension 01
Cohort design.
Berg — Retrospective, single-centre, 2008–2019. Every consecutively admitted infant GA ≤32 wk regardless of admission reason: the broadest, least selected population of the four. Controls drawn from negative-culture and out-of-window patients, matched on GA and time-since-birth.
Kausch — Retrospective, three NICUs, 2012–2021. Restricted to VLBW (<1500 g) — a narrower, sicker slice. Strict definition: positive culture after day 3, ≥5 days antibiotics, ≥2 antibiotic-free days before.
Yang — Retrospective, single-centre, 2016–2018. The smallest and most selected cohort: 51 LOS + 68 controls. Clean cases, clean controls — ideal for modelling, least representative of a real ward.
Meeus — Retrospective, single-centre, 2012–2020. Outcome is LOS and NEC jointly — 206 combined episodes. Not a LOS-only paper.
Takeaway
Four different denominators. Whole-NICU preterm, VLBW-only, a small hand-cleaned cohort, and a joint LOS+NEC cohort. Before a single AUC is compared, the patients and the outcome already differ.
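Kausch's case definition is the most rule-like of the four, so it can be written down as a predicate. The following is a minimal sketch, not the paper's code: the `Culture` record and its field names are hypothetical, and reading "after day 3" as `day_of_life > 3` is an assumption about the boundary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Culture:
    """One blood culture. All field names are hypothetical, for illustration only."""
    day_of_life: int                  # postnatal day the culture was drawn
    positive: bool                    # culture grew an organism
    antibiotic_days_after: int        # days of antibiotics following the draw
    antibiotic_free_days_before: int  # antibiotic-free days preceding the draw


def is_los_event(c: Culture) -> bool:
    """Apply the three-part definition as summarised on this page.

    Assumption: "after day 3" is read as day_of_life > 3.
    """
    return (
        c.positive
        and c.day_of_life > 3
        and c.antibiotic_days_after >= 5
        and c.antibiotic_free_days_before >= 2
    )


if __name__ == "__main__":
    cultures = [
        Culture(10, True, 7, 4),   # meets all three criteria
        Culture(2, True, 7, 2),    # drawn too early to count as late-onset
        Culture(12, True, 3, 5),   # antibiotic course shorter than 5 days
        Culture(15, False, 7, 5),  # culture-negative
    ]
    for c in cultures:
        print(c, is_los_event(c))
```

A definition this strict trades sensitivity for label purity: treated but culture-negative episodes fall out of the positive class, which is one reason the denominators across the four papers do not line up.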
Dimension 02
Signal stack.
Really three decisions in one: which signals, sampled how fast, from which device.
Berg — Two signals, 1/min: HR and SpO2 only. Deliberately excluded CRP and blood pressure (clinician-initiated, leakage risk) and temperature and RR (artefact-prone).
Kausch — Two signals plus their cross-correlation, 0.5 Hz. Key finding: performance is essentially unchanged whether HR comes from ECG or pulse oximetry — the model can run on a standalone oximeter.
Yang — Three signals, 1 Hz, 102 features. The sampling-rate ablation: raw waveform 0.886 → 1 Hz 0.875 → 1/min 0.825 → 1/h 0.687. Minute-by-minute is the deployable floor; hourly destroys the signal.
Meeus — Seven signals, 1/min — the widest stack. But breadth does not produce a cleanly higher comparable AUC.
Takeaway
More channels is not the lever. What moves performance is sampling resolution and a well-chosen feature. The single most deployment-relevant fact in the field — pulse-oximeter-only works — gets one paragraph in one paper.
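One plausible mechanism behind Yang's sampling-rate ablation is that averaging to coarser intervals smooths away short events. The sketch below illustrates this on synthetic data only. It is not Yang's analysis, and plain block averaging is an assumption, since the resampling method is not described on this page.

```python
from statistics import mean


def block_average(signal, factor):
    """Downsample by averaging consecutive blocks of `factor` samples.

    Assumption: plain block averaging. Trailing samples that do not fill
    a whole block are dropped.
    """
    usable = len(signal) - len(signal) % factor
    return [mean(signal[i:i + factor]) for i in range(0, usable, factor)]


def max_drop(signal, baseline):
    """Largest dip below `baseline` still visible in `signal`."""
    return baseline - min(signal)


if __name__ == "__main__":
    baseline = 150.0              # synthetic heart rate, beats per minute
    hr_1hz = [baseline] * 3600    # one hour sampled at 1 Hz
    for t in range(1800, 1830):   # a 30-second deceleration to 90 bpm
        hr_1hz[t] = 90.0

    per_minute = block_average(hr_1hz, 60)
    per_hour = block_average(hr_1hz, 3600)

    print("1 Hz  :", max_drop(hr_1hz, baseline))
    print("1/min :", max_drop(per_minute, baseline))
    print("1/h   :", max_drop(per_hour, baseline))
```

In this toy case the 60 bpm dip shrinks to 30 bpm at one sample per minute and to 0.5 bpm at one sample per hour, which mirrors the direction of the ablation without reproducing its numbers.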
Dimension 03
ML methodology.
Berg — LR vs GAM vs XGBoost; all within 0.01 AUC, no significant difference by DeLong's test. Chose LR for interpretability.
Kausch — LR (cubic splines) vs neural net vs XGBoost vs random forest; similar across all. Chose LR — "equal performance, better explainability."
Yang — Seven methods including two deep-learning architectures. XGBoost won; deep learning underperformed.
Meeus — XGBoost only, no bake-off. SHAP for explainability.
Takeaway
Three of four ran a model comparison; all three found architecture barely matters once features are right. The bottleneck is feature engineering, not algorithm sophistication.
Dimension 04
Validation.
Berg — Single-centre: a train/test split (0.73 train / 0.79 test) followed by a longitudinal simulation. No external validation.
Kausch — The field's strongest: trained on one NICU, externally validated on two others. TRIPOD-reported, calibration assessed per site. External drop ≈0.03 AUC.
Yang — 7-fold cross-validation, nested for hyperparameters. Single-centre; no external validation — named as the main limitation.
Meeus — Cross-validation plus a temporal hold-out (year-2020 patients). Still single-centre; generalisability stated as the main limitation.
Takeaway
Only Kausch has shown its model survives contact with another hospital. Every headline number except Kausch's external figures is an internal number, and internal numbers systematically flatter the model.
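Nested cross-validation, as Yang describes it, keeps hyperparameter tuning inside each training fold so the outer test fold stays untouched. A minimal sketch of the index bookkeeping follows. Splitting at the patient level and the inner fold count of 3 are assumptions for illustration, since this page records only "7-fold, nested".

```python
import random


def k_folds(items, k, seed=0):
    """Shuffle `items` and split them into k near-equal folds."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_cv_splits(patient_ids, outer_k=7, inner_k=3, seed=0):
    """Yield (outer_train, outer_test, inner_splits) at the patient level.

    inner_splits is a list of (inner_train, inner_val) pairs drawn only from
    outer_train, so hyperparameters are never tuned on outer_test patients.
    inner_k=3 is a hypothetical choice, not taken from the paper.
    """
    outer = k_folds(patient_ids, outer_k, seed)
    for i, test in enumerate(outer):
        train = [p for j, fold in enumerate(outer) if j != i for p in fold]
        inner = k_folds(train, inner_k, seed + 1)
        inner_splits = [
            ([p for m, fold in enumerate(inner) if m != n for p in fold], val)
            for n, val in enumerate(inner)
        ]
        yield train, test, inner_splits


if __name__ == "__main__":
    # 119 synthetic patient IDs, matching the size of Yang's cohort.
    for train, test, inner in nested_cv_splits(range(119)):
        sizes = [(len(tr), len(va)) for tr, va in inner]
        print(len(train), len(test), sizes)
```

Even done correctly, this controls only for tuning leakage within one hospital's data. It says nothing about transfer to another unit, which is the gap the takeaway above points at.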
Dimension 05
Performance.
Each number, with its conditions attached — that is the point.
Berg 0.79 — cross-sectional AUC at t=0 (0.73 train / 0.79 test), whole-NICU, culture-positive LOS. Longitudinal: ≈58–67% recall under the multi-threshold policy; ≈47–60% of patients detected pre-clinically.
Kausch ≈0.79 — prediction within 24 h, VLBW-only, external. Pulse-rate ≈ ECG. Risk rises significantly 23–24 h before blood culture.
Yang 0.875 — measured 6 h before CRASH, on 119 hand-cleaned infants. Patient-wise sensitivity 96.1%, but specificity 19.1%.
Meeus 0.93 — for LOS+NEC jointly, on cross-validated time windows (a metric Meeus et al. flag as optimistic on imbalanced data). Sliding-window detection: 69% of all episodes / 81% of severe episodes; median time gain ≈10 h.
Takeaway — the turn
Four different outcomes, horizons, or cohorts. The 0.78–0.88 band is a coincidence of four measurements, not four readings of one quantity. The metric that survives translation is the fraction of LOS episodes caught pre-clinically — set Yang's small-n outlier aside and it lands at roughly half to two-thirds.
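Yang's patient-wise figures show how much a sensitivity hides without its specificity. The sketch below turns the two percentages back into approximate patient counts. It assumes both were computed over the full cohort of 51 LOS cases and 68 controls, which this page does not state explicitly, so the counts are a back-of-envelope reconstruction rather than reported results.

```python
def confusion_counts(n_pos, n_neg, sensitivity, specificity):
    """Recover approximate confusion-matrix counts from rates and class sizes."""
    tp = round(sensitivity * n_pos)
    tn = round(specificity * n_neg)
    return {"tp": tp, "fn": n_pos - tp, "tn": tn, "fp": n_neg - tn}


def ppv(counts):
    """Positive predictive value: share of alarmed patients who had LOS."""
    return counts["tp"] / (counts["tp"] + counts["fp"])


if __name__ == "__main__":
    # Assumption: 96.1% and 19.1% refer to all 51 cases and 68 controls.
    counts = confusion_counts(n_pos=51, n_neg=68,
                              sensitivity=0.961, specificity=0.191)
    print(counts)
    print(f"PPV = {ppv(counts):.3f}")
```

Under that assumption the rates correspond to 49 of 51 cases flagged, alongside 55 of 68 controls, so fewer than half of the flagged infants in this balanced cohort actually had LOS. On a real ward with far lower prevalence the share would be lower still.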
Dimension 06
Alarm policy.
What converts an AUC into something a clinician actually experiences — chosen four different ways, rarely justified.
Berg — Hourly scores; thresholds set to fixed unit-wide rates. 8-hour refractory period (a nursing shift); a muted alarm re-fires only if a higher threshold is crossed. Burden reported as alarms per patient-day.
Kausch — An alert switches on at a threshold and stays on until 24 h with no further crossing. No refractory period, no escalation.
Yang — Adopts Berg's framework wholesale — same 8-hour silencing, same multi-threshold escalation — and cites it.
Meeus — Hourly probabilities aggregated into 24-hour buckets; reports alarm-days per week, echoing Berg's burden accounting without the refractory machinery.
Takeaway
Four genuinely different policies — and the policy, not the model, drives the detection headline. It is the highest-leverage deployment decision and the least clearly written-about.
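To make the difference concrete, here is a minimal sketch of two of the policies applied to the same hourly score series: a Berg-style refractory policy with escalation and a Kausch-style latched alert. Threshold values, the score series, and the choice to restart the mute window on escalation are all assumptions for illustration; neither function is the authors' implementation.

```python
def highest_threshold_crossed(score, thresholds):
    """Index of the highest threshold at or below `score`, or -1 if none."""
    level = -1
    for i, t in enumerate(thresholds):
        if score >= t:
            level = i
    return level


def refractory_alarms(scores, thresholds, refractory_h=8):
    """Berg-style sketch: ascending thresholds with an 8 h mute after each alarm.

    While muted, an alarm re-fires only if a higher threshold is crossed.
    Assumption: an escalation restarts the mute window.
    Returns a list of (hour, level) alarms.
    """
    alarms = []
    muted_until = -1   # last hour (inclusive) of the current mute window
    muted_level = -1   # threshold level that started the mute
    for hour, score in enumerate(scores):
        level = highest_threshold_crossed(score, thresholds)
        if level < 0:
            continue
        in_mute = hour <= muted_until
        if not in_mute or level > muted_level:
            alarms.append((hour, level))
            muted_until = hour + refractory_h
            muted_level = level
    return alarms


def latched_alert(scores, threshold, hold_h=24):
    """Kausch-style sketch: alert stays on until `hold_h` hours pass with no crossing.

    Returns one boolean per hour, True while the alert is on.
    """
    state = []
    last_crossing = None
    for hour, score in enumerate(scores):
        if score >= threshold:
            last_crossing = hour
        state.append(last_crossing is not None and hour - last_crossing < hold_h)
    return state


if __name__ == "__main__":
    scores = [0.1] * 48  # synthetic hourly risk scores over two days
    scores[5], scores[7], scores[9], scores[20] = 0.35, 0.4, 0.7, 0.4

    print(refractory_alarms(scores, thresholds=[0.3, 0.6]))
    print(sum(latched_alert(scores, threshold=0.3)), "alert-hours")
```

On this synthetic series the refractory policy produces three discrete alarms (hours 5, 9 and 20) while the latched policy produces a single alert lasting 39 hours. Same scores, same model, two very different bedside experiences.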
Dimension 07
Honest limitations.
Berg — Imputed blood-culture timestamps; culture-positive-only labelling; blood-culture time as a suspicion proxy may overestimate lead time; may fire on non-sepsis deterioration.
Kausch — Missing-data exclusions; culture-proven-only definition; co-morbidities alter features; a prospective multicentre study still needed.
Yang — Selected, non-representative cohort; maturation difference between groups; small dataset; no external validation; deep learning underperformed.
Meeus — LOS labelling is error-prone; retrospective-labelling risk; no generalisability data; AUROC optimistic on imbalanced data; metrics are backward-looking.
Takeaway
The limitations rhyme. Every paper concedes culture-positive labelling is imperfect, every single-centre paper concedes generalisability is unproven. The field is honest in its discussion sections — the deployment gap is acknowledged, just never closed.
Dimension 08
Unique contributions.
Berg — Largest unselected single-centre cohort; the most extensive retrospective clinical-impact simulation in the LOS literature; the alarm-fatigue framework that became a field convention.
Kausch — The only true multi-NICU external validation; the pulse-rate-equals-ECG finding; HR–SpO2 cross-correlation as a feature; a direct benchmark against the legacy HRC index.
Yang — The sampling-rate ablation; the seven-method bake-off showing deep learning does not help; concrete evidence of Berg's alarm framework propagating.
Meeus — Joint LOS+NEC prediction; the widest signal stack; the only commercial spinoff (Innocens BV); a pointed critique of AUROC on imbalanced data.
Supporting cast.
Three further Branch A groups mapped in Episode 02, included in the close read as context rather than as anchors.
Honoré 2023 — Karolinska / KTH. 325 infants, only 20 LOS cases; Naïve Bayes, chosen to avoid overfitting a tiny positive set; AUROC 0.82 up to 24 h before suspicion.
Leon 2021 — Rennes. 49 infants; HRV from ECG only, visibility-graph features; AUROC 0.877 in the 6 h before antibiotics. The methodologically distinctive single-signal end of the spectrum.
Rio 2022 — Lausanne. Not a model paper: an independent validation of the commercial HeRO/HRC index. Strongly gestational-age-dependent (sensitivity 76% at <28 wk vs 25% at >32 wk); optimal cutoff 2.76, not the FDA-cleared 2.0.
References & sourcing.
van den Berg M, Medina O, Loohuis I, et al. Development and clinical impact assessment of a machine-learning model for early prediction of late-onset sepsis. Comput Biol Med 2023;163:107156. doi:10.1016/j.compbiomed.2023.107156. Open access · CC BY-NC-ND
Kausch SL, Brandberg JG, Qiu J, et al. Cardiorespiratory signature of neonatal sepsis: development and validation of prediction models in 3 NICUs. Pediatr Res 2023;93:1913–1921. doi:10.1038/s41390-022-02444-7. NIH / PMC author manuscript
Yang M, Peng Z, van Pul C, et al. Continuous prediction and clinical alarm management of late-onset sepsis in preterm infants using vital signs from a patient monitor. Comput Methods Programs Biomed 2024;255:108335. doi:10.1016/j.cmpb.2024.108335. Open access · CC BY
Meeus M, Beirnaert C, Mahieu L, et al. Clinical decision support for improved neonatal care: development of a machine learning model for the prediction of late-onset sepsis and necrotizing enterocolitis. J Pediatr 2024;266:113869. doi:10.1016/j.jpeds.2023.113869. All rights reserved
Honoré A, Forsberg D, Adolphson K, et al. Vital sign-based detection of sepsis in neonates using machine learning. Acta Paediatr 2023;112:686–696. doi:10.1111/apa.16660. Open access · CC BY-NC-ND
Leon C, Carrault G, Pladys P, Beuchée A. Early detection of late onset sepsis in premature infants using visibility graph analysis of heart rate variability. IEEE J Biomed Health Inform 2021;25:1006–1017. doi:10.1109/JBHI.2020.3021662. IEEE · author manuscript
Rio L, Ramelet A-S, Ballabeni P, et al. Monitoring of heart rate characteristics to detect neonatal sepsis. Pediatr Res 2022;92:1070–1074. doi:10.1038/s41390-021-01913-9. All rights reserved
This page discusses and compares the papers above; it does not reproduce them. Every source links to its publisher of record. No figures, tables or full text are reproduced here. The eight-dimension comparison is the author's own synthesis.