On the Reproducibility Crisis in Sleep Research

I've been sitting with this one for a while because I wasn't sure I wanted to be the undergraduate who writes a critical take on a field she's trying to enter. But the more I read, the harder it is to ignore: a lot of the canonical findings in sleep and memory research don't replicate cleanly, and I think the field is not being honest enough with itself about why.

This isn't a polemic. I find sleep neuroscience genuinely fascinating and I plan to spend the next several years inside it. But I think the reproducibility problems here are worth naming directly, because the way we respond to them will determine whether the field gets stronger or just quietly accumulates a backlog of shaky findings.

The TMR Example

Targeted memory reactivation is probably the clearest case. The basic finding, that replaying a sound cue paired with a learned object during slow-wave sleep improves later recall of where that object appeared, is compelling and has a plausible mechanistic story behind it. The original Rudoy et al. (2009) paper in Science was widely cited and spawned a decade of follow-up work. The problem is that effect sizes across replications vary enormously, several pre-registered replications have failed to find significant effects, and a 2022 meta-analysis found substantial evidence of publication bias inflating the apparent effect.

The variance in TMR effect sizes across labs is much larger than the variance you'd expect from sampling alone. That means something about the experimental setup — the stimulation timing, the sleep scoring, the task, the population — is doing a lot of work that we don't understand.
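One standard way to put a number on "more variance than sampling alone would give" is Cochran's Q and the I² statistic from meta-analysis. The sketch below is only an illustration: the five effect sizes and their variances are invented placeholders, not values from any TMR study, and the function name is my own.

```python
def cochran_q_i2(effects, variances):
    """Return (pooled effect, Cochran's Q, I^2) for a fixed-effect meta-analysis.

    effects   -- per-study standardized effect sizes
    variances -- per-study sampling variances of those effects
    """
    # Inverse-variance weights: more precise studies count for more.
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)

    # Q sums the weighted squared deviations from the pooled effect.
    # If sampling error were the only source of variation, Q would be
    # expected to land near its degrees of freedom (number of studies - 1).
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1

    # I^2: share of total variation attributed to real between-study differences.
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, q, i2


# HYPOTHETICAL inputs, invented purely for illustration.
made_up_effects = [0.60, 0.05, 0.45, -0.10, 0.30]
made_up_variances = [0.04] * 5  # roughly what n = 25 per study would give

pooled, q, i2 = cochran_q_i2(made_up_effects, made_up_variances)
print(f"pooled d = {pooled:.2f}, Q = {q:.2f} on {len(made_up_effects) - 1} df, I^2 = {i2:.0%}")
```

An I² well above zero is the formal version of the claim that labs are differing by more than chance, though with only a handful of studies the estimate itself is very noisy.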

This is not a reason to abandon TMR as a paradigm. It is a reason to be much more careful about treating any single TMR study as establishing a fact rather than generating a hypothesis.

What I Think Is Actually Going Wrong

Small samples treated as adequate

Sleep studies are expensive and logistically hard. Keeping participants in a sleep lab overnight, running full polysomnography, and then testing them the next day is not cheap. The result is that most published studies in this area run 20–30 participants, and the field has largely normalized this. But if the true effects are small — and for a lot of the behavioural outcomes, they are — then n=25 is not nearly enough to get stable estimates. The published literature is therefore likely to systematically overestimate effect sizes, because only studies that happened to get a large effect by chance clear the significance threshold.
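To make the inflation argument concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration: a true standardized effect of d = 0.2, a within-subject design with 25 participants, and a "publication filter" that only keeps studies reaching p < .05 in the predicted direction. None of these numbers come from a real dataset.

```python
import random
import statistics

TRUE_D = 0.2       # assumed true effect size (illustrative, not an estimate)
N = 25             # participants per simulated study
T_CRIT = 2.064     # two-sided 5% critical value of t with 24 degrees of freedom
N_STUDIES = 5000   # number of simulated studies


def simulate_study(rng):
    """Simulate one within-subject study; return (observed d, t statistic)."""
    # Each participant contributes one difference score (cued minus uncued).
    diffs = [rng.gauss(TRUE_D, 1.0) for _ in range(N)]
    d_obs = statistics.fmean(diffs) / statistics.stdev(diffs)
    t_stat = d_obs * N ** 0.5  # one-sample t statistic
    return d_obs, t_stat


rng = random.Random(0)  # fixed seed so the run is repeatable
results = [simulate_study(rng) for _ in range(N_STUDIES)]

all_d = [d for d, _ in results]
# Keep only the studies that were significant in the predicted direction.
published_d = [d for d, t in results if t > T_CRIT]

print(f"true effect:                  {TRUE_D:.2f}")
print(f"mean d across all studies:    {statistics.fmean(all_d):.2f}")
print(f"share reaching significance:  {len(published_d) / N_STUDIES:.1%}")
print(f"mean d among 'published':     {statistics.fmean(published_d):.2f}")
```

The arithmetic behind the result is simple: with n = 25, a study can only clear the threshold if its observed d is at least 2.064 / sqrt(25), or about 0.41. So under these assumptions every "published" estimate is at least double the true effect, regardless of how honest the individual researchers are.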

Outcome flexibility

A typical sleep and memory study measures recall, recognition, response time, error rate, and sometimes physiological measures like spindle density. With that many outcomes, the probability of finding something significant by chance is high. And while most researchers are not consciously p-hacking, the post-hoc selection of the "most interesting" finding is a subtler version of the same problem.
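The size of the multiple-outcomes problem is easy to work out. The sketch below assumes the outcomes are independent and each is tested at alpha = 0.05. Real outcomes like recall and recognition are correlated, so the true rate will be somewhat lower, but the direction of the problem is the same. The function names are my own.

```python
def familywise_error_rate(n_outcomes, alpha=0.05):
    """Chance of at least one false positive across independent tests."""
    return 1.0 - (1.0 - alpha) ** n_outcomes


def bonferroni_threshold(n_outcomes, alpha=0.05):
    """Per-test threshold that keeps the familywise rate at or below alpha."""
    return alpha / n_outcomes


for k in (1, 3, 5, 10):
    print(f"{k:>2} outcomes: P(at least one false positive) = "
          f"{familywise_error_rate(k):.3f}, "
          f"Bonferroni threshold = {bonferroni_threshold(k):.4f}")
```

With the five outcomes listed above, the chance of at least one spurious "significant" result is about 23% under these assumptions, which is why naming a single primary outcome in advance matters so much.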

My Actual Position

I think sleep genuinely matters for memory. The animal literature is too mechanistically clear to dismiss, and the human correlational evidence is too consistent. But I think our ability to detect and measure sleep's effects on memory with current human paradigms is much weaker than the published record suggests. Pre-registration and larger samples won't fix everything, but they'd fix a lot.

A Note on Why This Matters

Sleep research has a large public presence. Books like Why We Sleep reached millions of readers with claims that turned out to be significantly overstated. That's partly a science communication problem, but it's also a consequence of a literature that had overclaimed its findings. Journalists and popular authors read the abstracts of papers and report what they say. If the abstracts claim more than the data support, the misinformation travels.

I don't think the answer is to be pessimistic about the field. I think the answer is for researchers — including future researchers like me — to be more precise about what the evidence actually shows and more disciplined about how we design and report studies. The underlying phenomena are real and important. We owe it to them to study them carefully.
