
Edited by: Claudia Voelcker-Rehage, Technische Universität Chemnitz, Germany

Reviewed by: Tilo Strobach, Medical School Hamburg, Germany; Florian Schmiedek, German Institute for International Educational Research, Germany

*Correspondence: David Moreau

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution and reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

The prospect of enhancing cognition is undoubtedly among the most exciting research questions currently bridging psychology, neuroscience, and evidence-based medicine. Yet, convincing claims in this line of work stem from designs that are prone to several shortcomings, thus threatening the credibility of training-induced cognitive enhancement. Here, we present seven pervasive statistical flaws in intervention designs: (i) lack of power; (ii) sampling error; (iii) continuous variable splits; (iv) erroneous interpretations of correlated gain scores; (v) single transfer assessments; (vi) multiple comparisons; and (vii) publication bias. Each flaw is illustrated with a Monte Carlo simulation to present its underlying mechanisms, gauge its magnitude, and discuss potential remedies. Although not restricted to training studies, these flaws are typically exacerbated in such designs, due to ubiquitous practices in data collection or data analysis. The article reviews these practices, so as to avoid common pitfalls when designing or analyzing an intervention. More generally, it is also intended as a reference for anyone interested in evaluating claims of cognitive enhancement.

Can cognition be enhanced via training? Designing effective interventions to enhance cognition has proven one of the most promising and difficult challenges of modern cognitive science. Promising, because the potential is enormous, with applications ranging from developmental disorders to cognitive aging, dementia, and traumatic brain injury rehabilitation. Yet difficult, because establishing sound evidence for an intervention is particularly challenging in psychology: the gold standard of double-blind randomized controlled experiments is not always feasible, due to logistic shortcomings or to common difficulties in disguising the underlying hypothesis of an experiment. These limitations have important consequences for the strength of evidence in favor of an intervention. Several of them have been extensively discussed in recent years, resulting in stronger, more valid, designs.

For example, the importance of using active control groups has been underlined in many instances (e.g., Boot et al.,

Our objective is three-fold. First, we aim to bring attention to core methodological and statistical issues when designing or analyzing training experiments. Using clear illustrations of how pervasive these problems are, we hope to help design better, more potent interventions. Second, we stress the importance of simulations to improve the understanding of research designs and data analysis methods, and the influence they have on results at all stages of a multifactorial project. Finally, we also intend to stimulate broader discussions by reaching wider audiences, and help individuals or organizations assess the effectiveness of an intervention to make informed decisions in light of all the evidence available, not just the most popular or the most publicized information. We strive, throughout the article, to make every idea as accessible as possible and to favor clear visualizations over mathematical jargon.

A note on the structure of the article. For each flaw we discuss, we include three steps: (1) a brief introduction to the problem and a description of its relation to intervention designs; (2) a Monte Carlo simulation and its visual illustration^{1}

A slightly more technical question pertains to the use of Monte Carlo simulations. Broadly speaking, Monte Carlo methods refer to the use of computational algorithms to simulate repeated random sampling, in order to obtain numerical estimates of a process. The idea that we can refine knowledge by simulating stochastic processes repeatedly rather than via more traditional procedures (e.g., direct integration) might be counterintuitive, yet this method is well suited to the specific examples we are presenting here for a few reasons. Repeated stochastic simulations allow creating mathematical models of ecological processes: the repetition represents research groups, throughout the world, randomly sampling from the population and conducting experiments. Such simulations are also particularly useful in complex problems where a number of variables are unknown or difficult to assess, as they can provide an account of the values a statistic can take when constrained by initial parameters, or a range of parameters. Finally, Monte Carlo simulations can be clearly represented visually. This facilitates the graphical translation of a mathematical simulation, thus allowing a discussion of each flaw with little statistical or mathematical background.
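The repeated-sampling idea described above can be sketched in a few lines. The example below is in Python rather than R and is only an illustration, not the authors' code; the population parameters (mean 100, SD 15), the sample size of 20, and the number of simulated "labs" are assumptions chosen for the example.

```python
import random
import statistics

def simulate_sample_means(n_labs=5000, n=20, mu=100.0, sigma=15.0, seed=1):
    """Each simulated lab draws n scores from the same population and reports its sample mean."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(n_labs)
    ]

means = simulate_sample_means()
# The average of the lab means should sit close to the population mean,
# and their spread should be close to sigma / sqrt(n).
print(round(statistics.fmean(means), 2))
print(round(statistics.stdev(means), 2), round(15.0 / 20 ** 0.5, 2))
```

The spread of the simulated means is the quantity that analytic formulas would give directly; the simulation simply recovers it by brute force, which is what makes the approach attractive when no closed-form answer is available.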

We begin our exploration with a pervasive problem in almost all experimental designs, particularly in training interventions: low statistical power. In a frequentist framework, two types of errors can arise at the decision stage in a statistical analysis: Type I (false positive, probability α) and Type II (false negative, probability β). The former occurs when the null hypothesis (H_{0}) is true but rejected, whereas the latter occurs when the alternative hypothesis (H_{A}) is true but H_{0} is retained. In the context of an intervention, a Type II error means that the experimental treatment was effective but statistical inference led to the erroneous conclusion that it was not. Accordingly, the power of a statistical test is the probability of rejecting H_{0} given that it is false. The more power, the lower the probability of Type II errors, such that power is (1−β). Importantly, higher statistical power translates to a better chance of detecting an effect if it exists, but also a better chance that an effect is genuine if it is significant (Button et al.,

Because α is set arbitrarily by the experimenter, power could be increased by directly increasing α. This simple solution, however, has an important pitfall: since α represents the probability of Type I errors, any increase will produce more false positives (rejections of H_{0} when it should be retained) in the long run. Therefore, in practice experimenters need to take into account the tradeoff between Type I and Type II errors when setting α. Typically, α < β, because missing an existing effect (β) is thought to be less prejudicial than falsely rejecting H_{0} (α); however, specific circumstances where the emphasis is on discovering new effects (e.g., exploratory approaches) sometimes justify α increases (for example, see Schubert and Strobach,

Discussions regarding experimental power are not new. Issues related to power have long been discussed in the behavioral sciences, yet they have drawn heightened attention recently (e.g., Button et al., ^{2}

Figure: Power as a function of the alternative mean (H_{A}), estimated from a Monte Carlo simulation. The mean under H_{0} is consistent with IQ test score standards, and H_{A} varies from 100 to 120 by incremental steps of 0.1. The blue line shows the result of the simulation, with the estimated smoothed curve in orange. The red line denotes power = .80.
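A simulation of the kind summarized in this caption can be sketched as follows. This is a Python illustration, not the authors' R code. The group size of 20, the SD of 15, the number of simulated experiments, and the coarse grid of alternative means are assumptions; the critical value 2.024 is the two-sided t threshold for α = .05 with 38 degrees of freedom.

```python
import random
import statistics

T_CRIT_DF38 = 2.024  # two-sided critical t, alpha = .05, df = 38 (20 per group)

def t_statistic(x, y):
    """Two-sample t statistic with pooled variance (equal group sizes assumed)."""
    n = len(x)
    pooled_var = (statistics.variance(x) + statistics.variance(y)) / 2
    return (statistics.fmean(x) - statistics.fmean(y)) / (2 * pooled_var / n) ** 0.5

def empirical_power(mean_alt, n=20, mean_null=100.0, sd=15.0, n_sim=2000, seed=1):
    """Share of simulated experiments in which H0 is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sim):
        control = [rng.gauss(mean_null, sd) for _ in range(n)]
        treated = [rng.gauss(mean_alt, sd) for _ in range(n)]
        if abs(t_statistic(treated, control)) > T_CRIT_DF38:
            rejections += 1
    return rejections / n_sim

for mean_alt in (100, 105, 110, 115, 120):
    print(mean_alt, empirical_power(mean_alt))
```

When the alternative mean equals 100 the rejection rate should hover around α, and it should climb toward 1 as the alternative mean moves away from 100; under these assumptions, power of .80 is only reached for fairly large differences.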

Power analyses are especially relevant in the context of interventions because sample size is usually limited by the design and its inherent costs—training protocols require participants to come back to the laboratory multiple times for testing and in some cases for the training regimen itself. Yet despite the importance of precisely determining power before an experiment, power analyses include several degrees of freedom that can radically change outcomes and thus recommended sample sizes (Cohen,

Suppose, for example, that we wish to evaluate the effectiveness of an intervention by comparing gain scores in experimental and control groups. Using a two-sample

We should emphasize that we are not implying that every significant finding with low power should be discarded; however, caution is warranted when underpowered studies coincide with unlikely hypotheses, as this combination can lead to high rates of Type I errors (Krzywinski and Altman,

A pernicious consequence of low statistical power is sampling error. Because a sample is an approximation of the population, a point estimate or statistic calculated for a specific sample may differ from the underlying parameter in the population (Figures

Figure: When H_{0} is true, small sample sizes will spuriously produce substantial differences (absolute median effect size, blue line), as illustrated here with another Monte Carlo simulation.
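The pattern described in this caption can be reproduced with a short sketch. It is written in Python for illustration and is not the authors' code; the sample sizes, the number of simulated experiments, and the use of Cohen's d with a pooled SD are assumptions.

```python
import random
import statistics

def cohens_d(x, y):
    """Standardized mean difference using the pooled SD (equal group sizes assumed)."""
    pooled_sd = ((statistics.variance(x) + statistics.variance(y)) / 2) ** 0.5
    return (statistics.fmean(x) - statistics.fmean(y)) / pooled_sd

def median_abs_d(n, n_sim=2000, seed=1):
    """Median absolute effect size when both groups come from the same population."""
    rng = random.Random(seed)
    effects = []
    for _ in range(n_sim):
        group_a = [rng.gauss(100, 15) for _ in range(n)]
        group_b = [rng.gauss(100, 15) for _ in range(n)]
        effects.append(abs(cohens_d(group_a, group_b)))
    return statistics.median(effects)

for n in (5, 10, 20, 50, 100, 200):
    print(n, round(median_abs_d(n), 3))
```

Even though no true difference exists in this sketch, the typical observed effect size is far from zero with small groups and shrinks only gradually as the group size grows.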

Let us consider a typical scenario in intervention designs. Assume we randomly select a sample of 40 individuals from an underlying population and assign each participant either to the experimental or the control group. We now have 20 participants in each group, which we assume are representative of the whole population. This assumption, however, is rarely met in typical designs (e.g., Campbell and Stanley,

In addition, failure to take into account extraneous variables is not the only problem with sampling. Another common weakness relates to differences in pretest scores. As set by α, random sampling will generate significantly different baseline scores on a given task 5% of the time in the long run, despite drawing from the same underlying population (see Figure

There are different ways to circumvent this problem, and one in particular that has been the focus of attention recently in training interventions is to increase power. As we have mentioned in the previous section, this can be accomplished either by using larger samples, or by studying larger effects, or both (Figure

Lack of power and the related issue of sampling error are two limitations of experimental designs that often require substantial investment to remedy. In contrast, splitting a continuous variable is a deliberate decision at the analysis stage. Although popular in intervention studies, it is rarely, if ever, justified.

Typically, a continuous variable reflecting performance change throughout training is split into a categorical variable, often dichotomous. Because the idea is to identify individuals who respond to the training regimen and those who do not benefit as much, this approach is often called “responder analysis”. Most commonly, the dichotomization is achieved via a median split: the median score on a continuous variable (e.g., training performance) is computed, and subjects are divided into those who fall below and those who fall above this score (e.g., low responders vs. high responders).
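The median-split procedure just described can be sketched as follows, together with one way of seeing what it costs. This is a Python illustration under assumed values: a true correlation of .5 between training performance and a transfer measure, normally distributed scores, and a large simulated sample. For bivariate normal data, dichotomizing one variable at its median is known to shrink the correlation to roughly 80% of its original value.

```python
import random
import statistics

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def median_split_demo(n=20000, rho=0.5, seed=1):
    """Correlation with the continuous predictor vs. with its median-split version."""
    rng = random.Random(seed)
    training = [rng.gauss(0, 1) for _ in range(n)]
    transfer = [rho * t + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1) for t in training]
    cutoff = statistics.median(training)
    responder = [1.0 if t > cutoff else 0.0 for t in training]  # high vs. low responders
    return pearson_r(training, transfer), pearson_r(responder, transfer)

r_continuous, r_split = median_split_demo()
print(round(r_continuous, 3), round(r_split, 3), round(r_split / r_continuous, 3))
```

The drop from the first to the second correlation is information discarded by the split: two participants just above and just below the median are treated as maximally different, while two participants at opposite ends of the same half are treated as identical.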

Median splits are almost always prejudicial (Cohen,

In intervention designs, a detrimental consequence of turning continuous variables into categorical ones and separating low and high performers

In short, regression toward the mean is the tendency for a given observation that is extreme, or far from the mean, to be closer to the mean on a second measurement. When a population is normally distributed, extreme scores are less likely than average scores, which makes it unlikely to observe two extreme scores in a row. Regression toward the mean is the consequence of imperfect correlations between scores from one session to the next; singling out an extreme score on a specific measure therefore increases the likelihood that it will regress toward the mean on another measurement.
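A minimal sketch makes the phenomenon concrete. It is written in Python for illustration; the test-retest correlation of .7, the IQ-like scale (mean 100, SD 15), and the selection of the lowest 10% at pretest are assumptions. No intervention is simulated, so any change in the selected group is regression toward the mean alone.

```python
import random
import statistics

def regression_to_mean_demo(n=20000, rho=0.7, mean=100.0, sd=15.0, bottom=0.10, seed=1):
    """Pretest and posttest means of the lowest pretest scorers, with no true change."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        z_pre = rng.gauss(0, 1)
        z_post = rho * z_pre + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1)
        pairs.append((mean + sd * z_pre, mean + sd * z_post))
    pairs.sort(key=lambda p: p[0])            # order participants by pretest score
    selected = pairs[: int(n * bottom)]       # keep the lowest scorers at pretest
    pre_mean = statistics.fmean(p[0] for p in selected)
    post_mean = statistics.fmean(p[1] for p in selected)
    return pre_mean, post_mean

pre_mean, post_mean = regression_to_mean_demo()
print(round(pre_mean, 1), round(post_mean, 1))
```

The selected group scores noticeably higher at posttest despite the absence of any treatment, which is the apparent "gain" that an uncontrolled design would attribute to the intervention.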

This phenomenon might be puzzling because it seems to violate the assumption of independent events. Indeed, regression toward the mean can be mistaken for a deterministic linear change from one measurement to the next, whereas it simply reflects the idea that in a bivariate distribution with the correlation between two variables

This is particularly problematic in training interventions because numerous studies are designed to measure the effectiveness of a treatment after an initial selection based on baseline scores. For example, many studies intend to assess the impact of a cognitive intervention in schools after enrolling the lowest-scoring participants on a pretest measure (e.g., Graham et al.,

Despite the questionable relevance of this practice, countless studies have used median splits on training performance scores in the cognitive training literature (Jaeggi et al.,

The goal in most training interventions is to show that training leads to transfer, that is, gains in tasks that were not part of the training. Decades of research have shown that training on a task results in enhanced performance on this particular task, paving the way for entire programs of research focusing on deliberate practice (e.g., Ericsson et al.,

For the purpose of simplicity, suppose we design a training intervention in which we set out to measure only two dependent variables: the ability directly trained (e.g., working memory capacity, WMC) and the ability we wish to demonstrate transfer to (e.g., intelligence,

To make things worse, analyses of correlation in gains are often combined with median splits to look for different patterns in a group of responders (i.e., individuals who improved on the training task) and in a group of non-responders (i.e., individuals who did not improve on the training task). The underlying rationale is that if training is effective, only those who improved in the training task should show transfer. This approach, however, combines the flaw presented here with the ones discussed in the previous section, thereby increasing the chances of reaching erroneous conclusions. Limitations of this approach have been examined before and illustrated via simulations (Tidwell et al.,

The remedy to this intuitive but erroneous interpretation of correlated gains lies in alternative statistical techniques. Transfer can be established when the experimental group shows larger gains than controls, demonstrated by a significant interaction on a repeated measures ANOVA (with treatment group as the between-subject factor and session as the within-group factor) or its Bayesian analog. Because this analysis does not correct for group differences at pretest, one should always report ^{3}

A different, perhaps more general problem concerns the validity of improvements typically observed in training studies. How should we interpret gains on a specific task or on a cognitive construct? Most experimental tasks used by psychologists to assess cognitive abilities were designed and intended for comparison between individuals or groups, rather than as a means to quantify individual or group improvements. This point may seem trivial, but it hardly is—the underlying mechanisms tapped by training might be task-specific, rather than domain-general. In other words, one might improve via specific strategies that help perform well on a task or set of tasks, without any guarantee of meaningful transfer. In some cases, even diminishment can be viewed as a form of enhancement (Earp et al.,

Reaching a precise understanding about the nature and meaning of cognitive improvement is a difficult endeavor, but in a field with far-reaching implications for society such as cognitive training, it is worth reflecting upon what training is thought and intended to achieve. Although informed by prior research (e.g., Ellis,

Beyond matters of analysis and interpretation, the choice of specific tasks used to demonstrate transfer is also critical. Any measurement, no matter how accurate, contains error. More than anywhere else perhaps, this is true in the behavioral sciences: human beings differ from one another on multiple factors that contribute to task performance in any ability. One of the keys to reducing error is to increase the number of measurements. This idea might not be straightforward at first. If measurements are imperfect, why would multiplying them, and therefore the error associated with them, give a better estimate of the ability one wants to probe? Multiple measurements are superior to single measurements because random errors are largely independent across measurements and tend to cancel out when scores are combined, removing some, if not most, of the error.
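The benefit of combining measurements can be illustrated with a short sketch. It is written in Python under assumed values: abilities and measurement errors are standard normal and independent, and the composite is a simple average of k tasks. Under these assumptions, the correlation between the composite and the true ability is expected to be 1/sqrt(1 + 1/k).

```python
import random
import statistics

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def composite_validity(k, n_people=5000, error_sd=1.0, seed=1):
    """Correlation between true ability and the average of k noisy measurements."""
    rng = random.Random(seed)
    ability = [rng.gauss(0, 1) for _ in range(n_people)]
    composite = [
        statistics.fmean(a + rng.gauss(0, error_sd) for _ in range(k))
        for a in ability
    ]
    return pearson_r(ability, composite)

for k in (1, 2, 4, 9):
    print(k, round(composite_validity(k), 3), round(1 / (1 + 1 / k) ** 0.5, 3))
```

Each additional task adds its own error, but because the errors are independent they partly cancel in the average, so the composite tracks the underlying ability more closely than any single task.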

This notion is ubiquitous. Teachers rarely give final grades based on one assessment, but rather average intermediate grades to get better, fairer estimates. Politicians do not rely on single polls to decide on a course of action in a campaign; they combine several of them to increase precision. Whenever precision matters most, we increase the number of measurements before combining them. In tennis, men play best-of-three-set matches in most competitions, but best-of-five-set matches in the most prestigious tournaments, the Grand Slams. The idea is to minimize the noise, or random sources of error, and maximize the signal, or the influence of a true ability (tennis skill in this example).

This is not the unreasoned caprice of picky scientists—by increasing the number of measurements, we do get better estimates of latent constructs. Nobody says it more eloquently than Randy Engle in

Because constructs are not directly observable (i.e., latent), we rely on combinations of multiple measurements to provide accurate estimates of cognitive abilities. Measurements can be combined into composite scores, that is, scores that minimize measurement error to better reflect the underlying construct of interest. Because they typically improve both reliability and validity in measurements (Carmines and Zeller,

Different solutions exist to minimize measurement error, besides ensuring experimental conditions were adequate to guarantee valid measurements. One possibility is to use the median score. Although not ideal, this is an improvement over single testing. Another solution is to average all scores and create a unit-weighted composite score (i.e., mean, Figure

Directly in line with this idea, more advanced statistical techniques such as latent curve models (LCM) and latent change score models (LCSM), typically implemented in a SEM framework, can allow finer assessment of training outcomes (for example, see Ghisletta and McArdle,

If including too few dependent variables is problematic, too many can also be detrimental. At the core of this apparent conundrum lies the multiple comparisons problem, another subtle but pernicious limitation in experimental designs. Following up on one of our previous examples, suppose we are comparing a novel cognitive remediation program targeting learning disorders with traditional feedback learning. Before and after the intervention, participants in the two groups can be compared on measures of reading fluency, reading comprehension, WMC, arithmetic fluency, arithmetic comprehension, processing speed, and a wide array of other cognitive constructs. They can be compared across motivational factors, or in terms of attrition rate. And questionnaires might provide data on extraversion, happiness, quality of life, and so on. For each dependent variable, one could test for differences between the group receiving the traditional intervention and the group enrolled in the new program, with the rationale that differences between groups reflect an inequality of the treatments.

With the multiplication of pairwise comparisons, however, experimenters run the risk of finding differences by chance alone, rather than because of the intervention itself.^{4}

This problem is well known, and procedures have been developed to account for it. One evident answer is to reduce Type I errors by using a more stringent threshold. With α = .01, the percentage of significant differences arising spuriously in our previous scenario drops to 10% (10 tasks), 14% (15 tasks), and 18% (20 tasks). Lowering the significance threshold is exactly what the Bonferroni correction does (Figure
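The percentages quoted above follow from a simple formula if the tests are assumed to be independent: the probability of at least one false positive among k tests is 1 − (1 − α)^k. The sketch below, in Python and for illustration only, evaluates this expression and the Bonferroni-adjusted threshold α/k; the function names are hypothetical.

```python
def familywise_error(alpha, k):
    """Probability of at least one false positive among k independent tests."""
    return 1 - (1 - alpha) ** k

def bonferroni_alpha(alpha, k):
    """Per-test threshold that keeps the familywise error rate at or below alpha."""
    return alpha / k

for alpha in (0.05, 0.01):
    for k in (10, 15, 20):
        print(alpha, k, round(100 * familywise_error(alpha, k)), "%")

# With the Bonferroni threshold, the familywise rate stays just under the nominal alpha.
for k in (10, 15, 20):
    print(k, round(familywise_error(bonferroni_alpha(0.05, k), k), 4))
```

With α = .01 the formula gives 10%, 14%, and 18% for 10, 15, and 20 tasks, matching the figures in the text. The independence assumption is a simplification, since cognitive measures are typically correlated.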

Given that the problem exists, one potential solution is replication. Obviously, this is not always feasible, can turn out to be expensive, and is not entirely foolproof. Other techniques have been developed to answer this challenge, with good results. For example, the recent rise of Monte Carlo methods or their non-parametric equivalents, such as the bootstrap and the jackknife, offers interesting alternatives. In interventions that include brain imaging data, these techniques can be used to calculate cluster-size thresholds, a procedure that relies on the assumption that contiguous signal changes are more likely to reflect true neural activity (Forman et al.,

In line with this idea, one approach that has gained popularity over the years is based on the false discovery rate (FDR). FDR correction is intended to control false discoveries by adjusting α only in the tests that result in a discovery (true or false), thus allowing a reduction of Type I errors while leaving more power to detect truly significant differences. The resulting

To determine this probability, we first need to determine how many interventions overall will yield a positive result (i.e., the experimental group will be significantly different from the control group at posttest). In our hypothetical scenario, we would detect, with a power of .80, 800 true positives. These are interventions that were effective (

The FDR is the number of false positives divided by the number of all positive results, that is, 36% in this example. More than 1/3 of the positive studies will not reflect a true underlying effect. The positive predictive value (PPV), the probability that a significant effect is genuine, is approximately two thirds in this scenario (64%). This is worth pausing on for a moment: more than a third of our positive results, reaching significance with standard frequentist methods, would be misleading. Furthermore, the FDR increases if either power or the percentage of effective training interventions in the population of studies decreases (Figure
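The arithmetic behind these figures can be laid out explicitly. The sketch below is a Python illustration; it assumes 10,000 interventions of which 10% are truly effective and α = .05, values chosen because, with a power of .80, they reproduce the 800 true positives and the 36% FDR quoted in the text. The function name is hypothetical.

```python
def fdr_and_ppv(n_studies, prop_effective, power, alpha):
    """Expected true positives, false positives, FDR and PPV for a population of studies."""
    true_positives = n_studies * prop_effective * power
    false_positives = n_studies * (1 - prop_effective) * alpha
    fdr = false_positives / (true_positives + false_positives)
    return true_positives, false_positives, fdr, 1 - fdr

tp, fp, fdr, ppv = fdr_and_ppv(n_studies=10_000, prop_effective=0.10, power=0.80, alpha=0.05)
print(round(tp), round(fp), round(fdr, 2), round(ppv, 2))

# Lower power or a smaller share of effective interventions both raise the FDR.
print(round(fdr_and_ppv(10_000, 0.10, 0.40, 0.05)[2], 2))
print(round(fdr_and_ppv(10_000, 0.05, 0.80, 0.05)[2], 2))
```

Because PPV and FDR sum to 1, anything that reduces the number of true positives relative to false positives (weaker designs, or a field in which few interventions truly work) directly erodes the credibility of a significant result.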

Our final stop in this statistical journey is to discuss publication bias, a consequence of research findings being more likely to get published based on the direction of the effects reported or on statistical significance. At the core of the problem lies the overuse of frequentist methods, and particularly Null Hypothesis Significance Testing (NHST), in medicine and the behavioral sciences, with an emphasis on the likelihood of the collected data or more extreme data if the H_{0} is true (in probabilistic notation, p(d|H_{0})) rather than the probability of interest, p(H_{0}|d), that is, the probability that the H_{0} is true given the data collected. In intervention studies, one typically wishes to know the probability that an intervention is effective given the evidence, rather than the less informative likelihood of the evidence if the intervention were ineffective (for an in-depth analysis, see Kirk,

Because of the underlying logic of NHST, only rejecting the H_{0} is truly informative; retaining H_{0} does not provide evidence to prove that it is true.^{5} […] H_{0} is untrue, but it is equally plausible that the strength of evidence is insufficient to reject H_{0} (i.e., lack of power). What this means in practice is that null findings (findings that do not allow us to confidently reject H_{0}) are difficult to interpret, because they can be due to the absence of an effect or to weak experimental designs. Publication of null findings is therefore rare, a phenomenon that contributes to biasing the landscape of scientific evidence: only positive findings get published, leading to the false belief that interventions are effective, whereas a more comprehensive assessment might lead to more nuanced conclusions (e.g., Dickersin,

Single studies are never definitive; rather, researchers rely on meta-analyses pooling together all available studies meeting a set of criteria and of interest to a specific question to get a better estimate of the accumulated evidence. If only studies corroborating the evidence for a particular treatment get published, the resulting literature becomes biased. This is particularly problematic in the field of cognitive training, due to the relative novelty of this line of investigation, which increases the volatility of one’s belief, and because of its potential to inform practices and policies (Bossaer et al.,
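The way selective publication distorts a literature can be sketched with a small simulation. It is written in Python for illustration and is not the authors' code. The true effect (d = 0.2), the group size of 20, the number of studies, and the rule that only significant results in the expected direction are published are all assumptions; 2.024 is the two-sided critical t for α = .05 with 38 degrees of freedom.

```python
import random
import statistics

T_CRIT_DF38 = 2.024  # two-sided critical t, alpha = .05, df = 38 (20 per group)

def simulate_literature(true_d=0.2, n=20, n_studies=5000, seed=1):
    """Mean effect size across all studies vs. across 'published' studies only."""
    rng = random.Random(seed)
    all_d, published_d = [], []
    for _ in range(n_studies):
        control = [rng.gauss(0, 1) for _ in range(n)]
        treated = [rng.gauss(true_d, 1) for _ in range(n)]
        pooled_sd = ((statistics.variance(control) + statistics.variance(treated)) / 2) ** 0.5
        d = (statistics.fmean(treated) - statistics.fmean(control)) / pooled_sd
        t = d * (n / 2) ** 0.5            # t statistic implied by d for equal group sizes
        all_d.append(d)
        if t > T_CRIT_DF38:               # hypothetical rule: publish significant gains only
            published_d.append(d)
    share_published = len(published_d) / n_studies
    return statistics.fmean(all_d), statistics.fmean(published_d), share_published

mean_all, mean_published, share_published = simulate_literature()
print(round(mean_all, 2), round(mean_published, 2), round(share_published, 2))
```

Under these assumptions, a meta-analysis restricted to the published subset would estimate an effect several times larger than the true one, simply because small studies can only reach significance when they overestimate the effect.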

Two directions seem particularly promising to circumvent publication bias. First, researchers often try to make an estimate of the size of publication bias when summarizing the evidence for a particular intervention. This process can be facilitated by examining a representation of all the published studies, with a measure of precision plotted as a function of the intervention effect. In the absence of publication bias, it is expected that studies with larger samples, and therefore better precision, will fall around the average effect size observed, whereas studies with smaller sample size, lacking precision, will be more dispersed. This results in a funnel shape within which most observations fall. Deviations from this shape can raise concerns regarding the objectivity of the published evidence, although it should be noted that other explanations might be equally valid (Lau et al.,

Second, ongoing initiatives are intended to facilitate the publication of all findings, irrespective of the outcome, on online repositories. Digital storage has become cheap, allowing platforms to archive data at limited cost. Such repositories already exist in other fields (e.g., arXiv), but have not been developed fully in medicine and in the behavioral sciences. Additional incentives to pre-register studies are another step in that direction, for example by allowing researchers to get preliminary publication approval based on study design and intended analyses, rather than on the direction of the findings. Publishing all results would eradicate publication bias (van Assen et al.,

Based on Monte Carlo simulations, we have demonstrated that several statistical flaws undermine typical findings in cognitive training interventions. This critique echoes others, which have pointed out the limitations of current research practices (e.g., Ioannidis,

Importantly, not all interventions suffer from these flaws. A number of training experiments are excellent, with strong designs and adequate data analyses. Arguably, these studies have emerged in response to prior methodological concerns and through facilitated communication across scientific fields, such as between evidence-based medicine and psychology, stressing further the importance of discussing good research practices. One example that illustrates the benefits of this dialog is the use of active control groups, which is becoming the norm rather than the exception in the field of cognitive training. When feasible, other important components are being integrated within research procedures, such as random allocation to conditions, standardized data collection and double-blind designs. Following current trends in the cognitive training literature, interventions should be evaluated according to their methodological and statistical strengths: more value, or weight, should be given to flawless studies or interventions with fewer methodological problems, whereas less importance should be conferred to studies that suffer from several of the flaws we mentioned (Moher et al.,

Related to this idea, most simulations in this article stress the limit of frequentist inference in its NHST implementation. This idea is not new (e.g., Bakan, _{0} cannot be considered as convincing evidence against the effectiveness of cognitive training, despite the prevalence of this line of reasoning in this literature.

As a result, we believe cognitive interventions are particularly suited to alternatives such as Neyman-Pearson Hypothesis Testing (NPHT) and Bayesian inference. These approaches are not free of caveats, yet they provide interesting alternatives to the prevalent framework. Because NPHT allows non-significant results to be interpreted as evidence for the null-hypothesis (Neyman and Pearson,

In closing, we remain optimistic about current directions in evidence-based cognitive interventions—experimental standards have been improved (Shipstead et al.,

DM designed and programmed the simulations, ran the analyses, and wrote the paper. IJK and KEW provided valuable suggestions. All authors approved the final version of the manuscript.

Part of this work was supported by philanthropic donations from the Campus Link Foundation, the Kelliher Trust and Perpetual Guardian (as trustee of the Lady Alport Barker Trust) to DM and KEW.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

We are deeply grateful to Michael C. Corballis for providing invaluable suggestions and comments on an earlier version of this manuscript. DM and KEW are supported by philanthropic donations from the Campus Link Foundation, the Kelliher Trust and Perpetual Guardian (as trustee of the Lady Alport Barker Trust).

^{1}Step (2) was implemented in R (R Core Team,

^{2}For a given α and effect size, low power results in low Positive Predictive Value (PPV), that is, a low probability that a significant effect observed in a sample reflects a true effect in the population. The PPV is closely related to the False Discovery Rate (FDR) mentioned in the section on multiple comparisons of this article, such that PPV + FDR = 1.

^{3}More advanced statistical techniques (e.g., latent change score models) can help to refine claims of transfer in situations where multiple outcome variables are present (e.g. McArdle and Prindle,

^{4}Multiple comparisons introduce additional problems in training designs, such as practice effects from one task to another within a given construct (i.e., hierarchical learning, Bavelier et al.,

^{5}David Bakan distinguished between sharp and loose null hypotheses, the former referring to the difference between population means being strictly zero, whereas the latter assumes this difference to be