Episode 03 · Comparative deep-read
Eight dimensions · four anchor papers · the grid behind the post

The four late-onset sepsis models, compared across eight dimensions.

Cohort design, signal stack, ML methodology, validation, performance, alarm policy, limitations, contributions. The Substack post walks the argument; this page is the grid it was distilled from.

Why this page exists.

Episode 03 of Road to CEPAS 2026 is a close read of the four anchor papers in Branch A — continuous-physiology machine learning for late-onset sepsis (LOS) in preterm infants. The Substack post argues that the headline AUCs are similar but not the same, and that the two decisions which matter most — signal stack and alarm policy — are the two the literature documents least clearly.

This page is the working file behind that argument: the full eight-dimension comparison, read across Berg 2023 (WKZ Utrecht), Kausch 2023 (UVA and two partner NICUs), Yang 2024 (Eindhoven/Máxima) and Meeus 2024 (Antwerp/Innocens). Three supporting-cast papers from the Episode 02 map — Honoré, Leon, Rio — sit at the foot. Full citations and licensing are in the references.

A framing note. Laying four headline AUCs in a row invites the eye to read them as four measurements of one quantity. They are not. Each answers a different question — different outcome, cohort, prediction horizon, evaluation unit. The grid below is built to make those differences legible rather than hide them.

The four papers at a glance.

Table 1 — Anchor papers, summary parameters. AUC figures are each paper's headline number under its own conditions; see Dimension 05 for why they are not directly comparable.
| Parameter | Berg 2023 | Kausch 2023 | Yang 2024 | Meeus 2024 |
|---|---|---|---|---|
| Group | WKZ Utrecht (NL) | UVA + Columbia + WashU (US) | Eindhoven / Máxima (NL) | Antwerp / Innocens (BE) |
| Cohort | 2,519 infants / 389 LOS | 2,494 analysed / 302 LOS events | 119 infants / 51 LOS | 865 infants / 206 LOS+NEC episodes |
| Population | Whole NICU, GA ≤32 wk | VLBW (<1500 g) | Preterm <32 wk, selected | Preterm <32 wk |
| Outcome | Culture-positive LOS | Culture-positive LOS | Culture-positive LOS | LOS and NEC, jointly |
| Signals | HR, SpO2 | HR, SpO2 (+ cross-correlation) | HR, RR, SpO2 | HR, RR, SpO2, PI, temp, FiO2, glucose |
| Sampling | 1 / min | 0.5 Hz | 1 Hz | 1 / min |
| Chosen model | Logistic regression | Logistic regression (POWS) | XGBoost | XGBoost |
| Headline AUC | 0.73 train / 0.79 test (t=0) | 0.82 train / ≈0.79 external | 0.875 (6 h pre-CRASH) | 0.93 (CV, reference point) |
| Validation | Single-centre, longitudinal sim | Multi-NICU external | Single-centre, 7-fold CV | Single-centre + temporal hold-out |
| Pre-clinical detection | ≈47–60% of patients | Risk rises 23–24 h before | 96% patient-wise (small n) | 69% all / 81% severe episodes |

Dimension 01

Cohort design.

Takeaway

Four different denominators. Whole-NICU preterm, VLBW-only, a small hand-cleaned cohort, and a joint LOS+NEC cohort. Before a single AUC is compared, the patients and the outcome already differ.


Dimension 02

Signal stack.

Really three decisions in one: which signals, sampled how fast, from which device.

Takeaway

More channels is not the lever. What moves performance is sampling resolution and a well-chosen feature. The single most deployment-relevant fact in the field — pulse-oximeter-only works — gets one paragraph in one paper.
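To make "a well-chosen feature" concrete, here is a minimal sketch of one way a lagged HR/SpO2 cross-correlation feature could be computed. It is an illustration only: the function names, the lag range and the toy signal values are assumptions, not the definition used by Kausch or any other anchor paper.

```python
from math import sqrt
from typing import Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def max_lagged_corr(hr: Sequence[float], spo2: Sequence[float], max_lag: int) -> Tuple[int, float]:
    """Return (lag, r) with the largest |r|; a positive lag means SpO2 trails HR."""
    best = (0, pearson(hr, spo2))
    for lag in range(1, max_lag + 1):
        r_pos = pearson(hr[:-lag], spo2[lag:])   # compare hr[t] with spo2[t + lag]
        r_neg = pearson(hr[lag:], spo2[:-lag])   # compare hr[t + lag] with spo2[t]
        for cand in ((lag, r_pos), (-lag, r_neg)):
            if abs(cand[1]) > abs(best[1]):
                best = cand
    return best


# Toy window (hypothetical values): a heart-rate dip, followed two samples
# later by a proportional SpO2 dip.
hr = [150, 152, 149, 130, 120, 135, 148, 151, 150, 149, 151, 150]
spo2 = [95.0, 95.0] + [95 + 0.2 * (h - 150) for h in hr[:-2]]

best_lag, best_r = max_lagged_corr(hr, spo2, max_lag=4)
print(best_lag)  # 2: the SpO2 dip trails the HR dip by two samples
```

A feature like this needs only the two channels a pulse oximeter already provides, which is why the choice of feature can matter more than the number of channels.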


Dimension 03

ML methodology.

Takeaway

Three of four ran a model comparison; all three found architecture barely matters once features are right. The bottleneck is feature engineering, not algorithm sophistication.


Dimension 04

Validation rigour.

Takeaway

Only Kausch has shown its model survives contact with another hospital. Every headline number except Kausch's external figures is an internal number, and internal numbers systematically flatter the model.


Dimension 05

Performance.

Each number, with its conditions attached — that is the point.

Takeaway — the turn

Four different outcomes, horizons, or cohorts. The 0.78–0.88 band is a coincidence of four measurements, not four readings of one quantity. The metric that survives translation is the fraction of LOS episodes caught pre-clinically — set Yang's small-n outlier aside and it lands at roughly half to two-thirds.
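A toy sketch of why the evaluation unit alone can move an AUC. The scores below are invented and come from none of the four papers, and treating every window of a septic infant as a positive is a deliberate simplification. The same scores are evaluated once per window and once per patient (using each patient's maximum score).

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple


def auc(pos: Sequence[float], neg: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))


# Hypothetical cohort: patient id -> (label, per-window risk scores); label 1 = LOS.
patients: Dict[str, Tuple[int, List[float]]] = {
    "A": (1, [0.2, 0.3, 0.9]),
    "B": (1, [0.1, 0.2, 0.6]),
    "C": (0, [0.1, 0.4, 0.2]),
    "D": (0, [0.3, 0.5, 0.1]),
}

# Window-wise: every window is one evaluation unit.
win_pos = [s for label, scores in patients.values() if label == 1 for s in scores]
win_neg = [s for label, scores in patients.values() if label == 0 for s in scores]
window_auc = auc(win_pos, win_neg)

# Patient-wise: each patient contributes one unit, their maximum score.
pat_pos = [max(scores) for label, scores in patients.values() if label == 1]
pat_neg = [max(scores) for label, scores in patients.values() if label == 0]
patient_auc = auc(pat_pos, pat_neg)

print(round(window_auc, 3), patient_auc)  # 0.597 1.0
```

Nothing about the model changed between the two numbers; only the unit of evaluation did. That is the sense in which the four headline figures are not readings of one quantity.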


Dimension 06

Alarm policy.

What converts an AUC into something a clinician actually experiences — chosen four different ways, rarely justified.

Takeaway

Four genuinely different policies — and the policy, not the model, drives the detection headline. It is the highest-leverage deployment decision and the least clearly written-about.
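A minimal sketch of how the policy, with the model held fixed, changes what a clinician sees. The risk trace, the 0.6 threshold and the three-sample persistence rule are hypothetical and are not taken from any of the four papers.

```python
from typing import Optional, Sequence


def first_alarm_simple(risk: Sequence[float], threshold: float) -> Optional[int]:
    """Index of the first sample at or above the threshold, or None if never reached."""
    for i, r in enumerate(risk):
        if r >= threshold:
            return i
    return None


def first_alarm_persistent(risk: Sequence[float], threshold: float, hold: int) -> Optional[int]:
    """Index at which risk has been at or above the threshold for `hold` consecutive samples."""
    run = 0
    for i, r in enumerate(risk):
        run = run + 1 if r >= threshold else 0
        if run >= hold:
            return i
    return None


# Hypothetical hourly risk scores for one episode; clinical suspicion at index 8.
risk = [0.10, 0.15, 0.62, 0.20, 0.25, 0.65, 0.70, 0.72, 0.80, 0.85]
clinical_onset = 8

simple = first_alarm_simple(risk, threshold=0.6)
persistent = first_alarm_persistent(risk, threshold=0.6, hold=3)

lead_simple = clinical_onset - simple          # hours of warning, simple threshold
lead_persistent = clinical_onset - persistent  # hours of warning, persistence rule
print(lead_simple, lead_persistent)  # 6 1
```

The simple threshold fires on a transient spike and reports six hours of lead time; the persistence rule ignores the spike and reports one. Both count as a pre-clinical detection of the same episode from the same scores, which is why a detection headline cannot be read without its policy.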


Dimension 07

Honest limitations.

Takeaway

The limitations rhyme. Every paper concedes that culture-positive labelling is imperfect, and every single-centre paper concedes that generalisability is unproven. The field is honest in its discussion sections: the deployment gap is acknowledged, just never closed.


Dimension 08

Unique contributions.


Supporting cast.

Three further Branch A groups mapped in Episode 02, included in the close read as context rather than as anchors.


References & sourcing.

  1. van den Berg M, Medina O, Loohuis I, et al. Development and clinical impact assessment of a machine-learning model for early prediction of late-onset sepsis. Comput Biol Med 2023;163:107156. doi:10.1016/j.compbiomed.2023.107156  Open access · CC BY-NC-ND
  2. Kausch SL, Brandberg JG, Qiu J, et al. Cardiorespiratory signature of neonatal sepsis: development and validation of prediction models in 3 NICUs. Pediatr Res 2023;93:1913–1921. doi:10.1038/s41390-022-02444-7  NIH / PMC author manuscript
  3. Yang M, Peng Z, van Pul C, et al. Continuous prediction and clinical alarm management of late-onset sepsis in preterm infants using vital signs from a patient monitor. Comput Methods Programs Biomed 2024;255:108335. doi:10.1016/j.cmpb.2024.108335  Open access · CC BY
  4. Meeus M, Beirnaert C, Mahieu L, et al. Clinical decision support for improved neonatal care: development of a machine learning model for the prediction of late-onset sepsis and necrotizing enterocolitis. J Pediatr 2024;266:113869. doi:10.1016/j.jpeds.2023.113869  All rights reserved
  5. Honoré A, Forsberg D, Adolphson K, et al. Vital sign-based detection of sepsis in neonates using machine learning. Acta Paediatr 2023;112:686–696. doi:10.1111/apa.16660  Open access · CC BY-NC-ND
  6. Leon C, Carrault G, Pladys P, Beuchée A. Early detection of late onset sepsis in premature infants using visibility graph analysis of heart rate variability. IEEE J Biomed Health Inform 2021;25:1006–1017. doi:10.1109/JBHI.2020.3021662  IEEE · author manuscript
  7. Rio L, Ramelet A-S, Ballabeni P, et al. Monitoring of heart rate characteristics to detect neonatal sepsis. Pediatr Res 2022;92:1070–1074. doi:10.1038/s41390-021-01913-9  All rights reserved

This page discusses and compares the papers above; it does not reproduce them. Every source links to its publisher of record. No figures, tables or full text are reproduced here. The eight-dimension comparison is the author's own synthesis.