Episode 04 · Critical self-appraisal
One paper · two decisions · read against its own supplement

Our own work, warts and all.

The same two decisions from Episode 03, signal stack and alarm policy, turned onto the one paper I cannot critique from behind a citation: Berg 2023, the WKZ Utrecht late-onset sepsis model on which I am senior author. The Substack post makes the case; this page is the working file behind it.

Why this page exists.

Episode 04 of Road to CEPAS 2026 applies the Episode 03 reading to my own group's paper — van den Berg et al. 2023, the largest unselected single-centre cohort in this corner of the LOS literature and the most extensive retrospective impact simulation done for it to date. The Substack post argues that the paper's central strength and its hardest limitations are not separate items to be weighed against each other — they are the same decisions seen from two sides.

This page is the working file: the two decisions laid out against the figures in the paper and its supplement, the proxy the whole impact story rests on, and the supplementary discrepancy I could not reconstruct. It is organised around the same two decisions Episode 03 identified as the field's least-documented, rather than re-running the full eight-dimension grid — that grid already exists on the Episode 03 page.

A note on candour. It is easy to critique other people's papers from behind a citation. The most credible test of "prediction is not utility" is the paper I am answerable for. The figures below are all in the published paper and supplement; what this page does is put the less flattering ones next to the headline ones they came from.

The headline, and the number beside it.

The abstract reports detection of LOS in at least 47% of patients before clinical suspicion, without exceeding an alarm-fatigue threshold of three alarms per day. That is the sentence most readers take away. At that same operating point the precision is 3.93% on the training set and 4.41% on the test set, figures that do not appear in the abstract.

Both figures come from one alarm configuration; they summarise different aspects of it. The detection fraction is how many LOS episodes are flagged before the culture. The precision is the positive predictive value of the alarms that policy generates. A precision of ~4% means roughly 96 of every 100 alarms do not pan out — at the operating point chosen precisely because it matched the alarm burden a NICU already tolerates.
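The "roughly 96 of every 100" figure follows directly from the reported precision, and the same arithmetic gives the number of alarms a clinician works through per true one. A minimal sketch, using only the two precision values quoted above; the function name and the batch size of 100 are illustrative choices, not anything from the paper.

```python
def alarm_breakdown(precision: float, alarms: int = 100) -> dict:
    """Split a batch of alarms into expected true and false alarms."""
    if not 0.0 < precision <= 1.0:
        raise ValueError("precision must be in (0, 1]")
    true_alarms = precision * alarms
    return {
        "true_alarms": true_alarms,
        "false_alarms": alarms - true_alarms,
        # Expected number of alarms raised for each alarm that is a true LOS alarm.
        "alarms_per_true_alarm": 1.0 / precision,
    }


# Precision values as reported for the training and test sets.
for label, ppv in [("train", 0.0393), ("test", 0.0441)]:
    b = alarm_breakdown(ppv)
    print(f"{label}: {b['false_alarms']:.1f} of 100 alarms false, "
          f"about {b['alarms_per_true_alarm']:.1f} alarms per true alarm")
```

On these numbers a true alarm arrives about once in every 25 alarms on the training set and once in every 23 on the test set.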

Table 1 — Two summaries of one alarm configuration. Both are in Berg 2023; only the first is in the abstract. Figures as reported in the paper and Table 3.
| Metric | What it measures | Value | Where it appears |
| --- | --- | --- | --- |
| Pre-clinical detection | Fraction of LOS episodes flagged before the blood culture | ≥47% of patients | Abstract + body |
| Precision (PPV) | Share of alarms that correspond to a true LOS episode | 3.93% train / 4.41% test | Table 3 only |
| Alarm burden | Alarms per patient-day at the chosen operating point | ≤3 / day | Abstract + body |
Takeaway

The defence is genuine: three alarms a day is roughly how often the WKZ team already considers sepsis during routine care, and an unpublished internal check found 80% of documented sepsis evaluations did not lead to a blood culture. A bedside clinician already works with a high "check, probably nothing" base rate. But foregrounding the detection fraction over the false-alarm rate is a framing decision — and if "prediction is not utility" means anything, the 4% deserves as much prominence as the 47%.


Decision 01

Signal stack: one decision, two faces.

The model sees heart rate and oxygen saturation, and nothing else — low-frequency, one sample a minute, the values on every monitor on the ward. CRP, blood pressure and white-cell counts were deliberately excluded.

Takeaway

The principled choice and the disappointing precision are the same fact stated twice. You cannot keep the leakage-free design and also escape the ceiling it imposes; they are one decision.


Decision 02

Alarm policy: a convenience that became a convention.

A model produces a score every hour. A score is not an alarm. The policy that converts one into the other has three parts:

The refractory period is the revealing part. Eight hours is an operational convenience — reasonable for a working ward, but not a clinically derived parameter. No one showed that eight hours is where early detection and alarm fatigue trade off best; it is where the staff rotation happens to fall.
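To make the role of the refractory period concrete, here is a minimal sketch of one way a stream of hourly scores could be turned into alarms. It is an illustration under stated assumptions, not the paper's implementation: the threshold value, the made-up scores and the rule "suppress any alarm within the refractory period of the previous one" are all assumptions, and any escalation logic is left out.

```python
def scores_to_alarms(hourly_scores, threshold, refractory_h=8):
    """Return the hour indices at which an alarm fires.

    An alarm fires when the score reaches the threshold, unless a previous
    alarm fired fewer than `refractory_h` hours earlier.
    """
    alarms = []
    last_alarm = None
    for hour, score in enumerate(hourly_scores):
        in_refractory = last_alarm is not None and hour - last_alarm < refractory_h
        if score >= threshold and not in_refractory:
            alarms.append(hour)
            last_alarm = hour
    return alarms


# Made-up hourly scores for one hypothetical patient; 0.5 is an arbitrary threshold.
scores = [0.1, 0.7, 0.8, 0.9, 0.2, 0.6, 0.9, 0.9, 0.9, 0.9, 0.9, 0.3]
for period in (4, 8, 12):
    print(period, scores_to_alarms(scores, threshold=0.5, refractory_h=period))
```

With the same scores and the same threshold, the number of alarms changes with the refractory period alone, which is why a parameter set by rostering still shapes the reported alarm burden.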

The supplement lets us check whether these choices matter, and they do:

Table 2 — Sensitivity of the headline figures to two policy parameters, as reported in the Berg 2023 supplement. Values paraphrase the direction and approximate magnitude of change; see supplement Tables S3, S7, S8 for exact figures.
| Parameter varied | Default | Effect of varying it | Source |
| --- | --- | --- | --- |
| Refractory period | 8 h (a shift) | 4 h / 12 h moves recall and precision: moderately, not dramatically, but enough to show the headline depends on a rostering choice | S7, S8 |
| True-positive window | 36 h (−24 h to +12 h) | Restricting to −12 h to culture drops multi-alarm recall from the high-50s/60s into the high-30s to mid-50s | S3 |
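The true-positive window in Table 2 decides which alarms count as catching an episode. A minimal sketch of that dependence, assuming a simple episode-level recall (an episode counts as detected if any alarm falls inside the window); this is not the paper's multi-alarm recall, and the offsets below are made up for illustration.

```python
def episode_recall(alarm_offsets_by_episode, window=(-24.0, 12.0)):
    """Fraction of episodes with at least one alarm inside the window.

    Offsets are hours relative to the blood culture (negative = before it).
    """
    if not alarm_offsets_by_episode:
        raise ValueError("need at least one episode")
    start, end = window
    detected = sum(
        any(start <= offset <= end for offset in offsets)
        for offsets in alarm_offsets_by_episode
    )
    return detected / len(alarm_offsets_by_episode)


# Made-up alarm offsets for four hypothetical episodes, not data from the paper.
episodes = [[-20.0], [-6.0, 3.0], [8.0], []]
print(episode_recall(episodes))                        # default 36 h window
print(episode_recall(episodes, window=(-12.0, 0.0)))   # restricted to -12 h up to culture
```

The alarms are identical in both calls; only the definition of "in time" changes, and recall moves with it.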
Takeaway

The "at least 47%" is real, but it rests on a generous definition of catching the episode in time. And the policy did not stay internal: Episode 03 documented Yang 2024 reusing the eight-hour period and escalation logic, and Meeus echoing the alarm-burden accounting. The element we documented least rigorously is the one the field borrowed most directly — a shift-pattern parameter that has propagated into other models as a convention.


The proxy at the centre of everything.

Beneath both decisions is one substitution. We do not know the moment a clinician first suspected sepsis — it is not reliably recorded — so we used the blood-culture timestamp as a stand-in for "t = 0". Every claim about detecting LOS before clinical suspicion is, more precisely, a claim about detecting it before the culture was drawn.

The paper concedes this directly: because actual suspicion could arise earlier, using culture time as the proxy can overestimate early-warning performance. If a clinician was already uneasy an hour before drawing the culture, part of our measured lead time is an artefact of when the culture was recorded, not of when the model spoke first.
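The size of that artefact is simple arithmetic once the three moments are named. A minimal sketch with hypothetical timestamps; in the real data the suspicion time is exactly the quantity that is not recorded, so the third argument is an assumption for illustration only.

```python
from datetime import datetime, timedelta


def lead_times(alarm, culture, suspicion):
    """Return (measured lead, actual lead, inflation) as timedeltas.

    measured lead: culture time minus alarm time (what the proxy yields)
    actual lead:   suspicion time minus alarm time (what we would want)
    inflation:     the part of the measured lead owed to the proxy
    """
    measured = culture - alarm
    actual = suspicion - alarm
    return measured, actual, measured - actual


# Hypothetical episode: alarm at 06:00, culture drawn at 09:00,
# clinician already uneasy one hour before the culture.
alarm = datetime(2023, 1, 1, 6, 0)
culture = datetime(2023, 1, 1, 9, 0)
suspicion = culture - timedelta(hours=1)

measured, actual, inflation = lead_times(alarm, culture, suspicion)
print(measured, actual, inflation)
```

In this example the proxy credits the model with three hours of lead time where two were earned.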

There is a smaller version of the same problem. A number of cultures arrived timestamped at exactly midnight — an artefact of missing times defaulting to 00:00. We imputed those to noon, closer to morning rounds, and confirmed that dropping the contaminated records did not change the results. It is a defensible fix, but it means a portion of the t = 0 anchors are reasoned estimates of what the clock read.
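The imputation and the accompanying sensitivity check can be sketched in a few lines. This is an illustration, not the code used for the paper: the function names are hypothetical, and the rule treats every timestamp of exactly 00:00:00 as a missing time, which means a culture genuinely drawn at midnight would be moved as well.

```python
from datetime import datetime, time


def is_midnight(ts: datetime) -> bool:
    """True when the time of day is exactly 00:00:00."""
    return ts.time() == time(0, 0)


def impute_midnight_to_noon(ts: datetime) -> datetime:
    """Move a timestamp of exactly midnight to 12:00 on the same day."""
    return ts.replace(hour=12) if is_midnight(ts) else ts


def drop_midnight(timestamps):
    """Sensitivity variant: discard the affected records instead of imputing."""
    return [ts for ts in timestamps if not is_midnight(ts)]


# Made-up culture timestamps for illustration.
cultures = [datetime(2023, 3, 1, 0, 0), datetime(2023, 3, 1, 14, 30)]
print([impute_midnight_to_noon(ts) for ts in cultures])
print(drop_midnight(cultures))
```

Running the analysis once on the imputed list and once on the filtered list is the comparison described above.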

One definitional point cuts the other way. We labelled only culture-positive cases as sepsis; culture-negative "clinical sepsis" — babies treated as septic who never grew an organism — sits in the control group. When the model fires on one of them it is scored as a false positive, even where a clinician might have agreed. So some unknown fraction of the 96% of "false" alarms may not be false. A muddied outcome definition is muddied whichever way it leans.


A discrepancy I cannot fully explain.

While re-reading the supplement for this episode I found something I cannot reconstruct from the published PDF. The train/test baseline table reports a CRP-above-10 mg/L rate in the control group of roughly 80%, against 8.5% in the main-text table. The 80% figure cannot be right for a control group, and the page appears to print it twice with differing control numbers.

It is most likely a proofing fault rather than an analytical one, and I cannot say more than that from the PDF alone. I flag it here in the same spirit as the rest of the page: a critic of his own paper should not only notice the errors that flatter it. This item is carried forward as an open accuracy note rather than a resolved finding.


Simulation is not impact.

We described our longitudinal alarm analysis as the most extensive clinical impact assessment in the LOS prediction literature to date. It is worth being exact about what kind of object that is: a retrospective simulation on historical data, the model applied to records of babies whose outcomes were already settled. No clinician saw a score. No decision changed. Nothing that happened to those babies happened because of the model.

The paper's most honest move is its last. Listing what a real evaluation would require, the authors note that counterfactuals — would this sepsis have been caught later without the model? — are not measured. That is the thesis of this whole series, stated in our own paper two years before I began writing these episodes. We demonstrated prediction. We simulated impact. We did not show utility, and we said so.

Takeaway — the turn to Episode 05

The gap between the most extensive impact simulation and actual clinical impact is not a flaw — it is the frontier at which the paper honestly stops. Every model in the Episode 02 field map stops at the same line. Berg 2023 already gestures past it: two footnotes in the discussion point to the EU Medical Device and In-Vitro Diagnostic Regulations as the frame any performance claim must eventually answer to. We dropped those footnotes and moved on. Episode 05 picks them up.


References & sourcing.

  1. van den Berg M, Medina O, Loohuis I, et al. Development and clinical impact assessment of a machine-learning model for early prediction of late-onset sepsis. Comput Biol Med 2023;163:107156. doi:10.1016/j.compbiomed.2023.107156  Open access · CC BY-NC-ND
  2. Yang M, Peng Z, van Pul C, et al. Continuous prediction and clinical alarm management of late-onset sepsis in preterm infants using vital signs from a patient monitor. Comput Methods Programs Biomed 2024;255:108335. doi:10.1016/j.cmpb.2024.108335  Open access · CC BY
  3. Meeus M, Beirnaert C, Mahieu L, et al. Clinical decision support for improved neonatal care: development of a machine learning model for the prediction of late-onset sepsis and necrotizing enterocolitis. J Pediatr 2024;266:113869. doi:10.1016/j.jpeds.2023.113869  All rights reserved

This page discusses and critiques the paper above; it does not reproduce it. No figures, tables or full text are reproduced here — supplementary findings are paraphrased and pointed back to the source by table number. The synthesis and the judgements of what counts as a limitation versus a defensible choice are the author's own.