# fMRI clustering and false-positive rates

Scientific and Statistical Computing Core, National Institute of Mental Health, US Department of Health and Human Services, National Institutes of Health, Bethesda, MD 20892

Recently, Eklund et al. (1) analyzed clustering methods in standard fMRI packages: AFNI (which we maintain), FSL, and SPM. They claim that (i) false-positive rates (FPRs) in traditional approaches are greatly inflated, questioning the validity of “countless published fMRI studies”; (ii) nonparametric methods produce valid, but slightly conservative, FPRs; (iii) a common flawed assumption is that the spatial autocorrelation function (ACF) of fMRI noise is Gaussian-shaped; and (iv) a 15-y-old bug in AFNI’s 3dClustSim significantly contributed to producing “particularly high” FPRs compared with other software. We repeated simulations from ref. 1 [Beijing_Zang data (2), cf. ref. 3] and comment on each point briefly.

## AFNI and 3dClustSim

Fig. 1 A–D compares results of the “buggy” and “fixed” 3dClustSim. For each simulation, the typical difference was small: $\Delta\mathrm{FPR} \lesssim 3$–$5\%$ at per-voxel $P = 0.01$ and $\lesssim 1$–$2\%$ for $P = 0.001$. The bug had only a minor impact.

Fig. 1. FPRs for various software scenarios, with 1,000 two-sample one-sided $t$-tests (as in ref. 1; cf. ref. 3 for more details) using 20 subjects’ data in each sample. For “buggy” (A and B) and “fixed” (C and D), cluster-size thresholds were selected using the Gaussian shape model with the FWHM being the median of the 40 individual subjects’ values: “buggy” via 3dClustSim before the bug fix, “fixed” via 3dClustSim after the bug fix. For “mixed ACF” (E and F), the cluster-size threshold was selected using a non-Gaussian ACF model allowing for heavy tails (3). For “nonparam” (G and H), 3dttest++ was used to perform spatial model-free, nonparametric permutation testing (3); paired, two-sided, and tests with covariates gave similar results. Two different per-voxel P-value thresholds are shown. The black line shows the nominal 5% FPR (out of 1,000 trials), and the gray band shows its binomial 95% confidence interval, 3.65–6.35%. As in ref. 1, different smoothing values were tested (4–10 mm), and four test designs were used: B1 = 10-s block; B2 = 30-s block; E1 = regular event-related; E2 = randomized event-related.
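The gray band quoted in the caption can be checked with a few lines of arithmetic. The sketch below assumes the usual normal approximation to the binomial (the caption does not say which interval method was used); the function name `nominal_fpr_band` is ours, chosen for illustration.

```python
import math

def nominal_fpr_band(p=0.05, n_trials=1000, z=1.96):
    """Normal-approximation 95% band for an observed FPR when the true rate is p."""
    half_width = z * math.sqrt(p * (1.0 - p) / n_trials)
    return p - half_width, p + half_width

low, high = nominal_fpr_band()
print(f"{100 * low:.2f}%-{100 * high:.2f}%")  # prints "3.65%-6.35%"
```

An observed FPR inside this band is statistically indistinguishable from the nominal 5% given 1,000 simulated tests.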

Figures 1 and 2 of ref. 1 actually show similar FPRs for AFNI, FSL-OLS, and SPM: Most tests were in a range of $20$–$40\%$ FPR at $P = 0.01$ and $5$–$15\%$ FPR at $P = 0.001$ (nor did their famous 70% FPR come from AFNI). The data given in the Results section of ref. 1 simply do not support the statement in the Discussion section that AFNI had “particularly high” FPRs.

## Smoothness

To test the effect of assuming a Gaussian ACF in fMRI noise, an empirical “mixed ACF” allowing for longer tails was computed from residuals (3). All FPRs (Fig. 1 E and F) decreased. Block designs remained $> 5\%$, likely reflecting dependence of the noise’s spatial smoothness on temporal frequency. Heavy tails in spatial smoothness indeed have significant consequences for clustering.
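To make the role of the tails concrete, the sketch below compares a pure Gaussian ACF with a Gaussian-plus-exponential mixture, which is the form we understand the “mixed ACF” of ref. 3 to take. The text above does not spell out the functional form, so treat it as an assumption; the parameter values `a`, `b`, and `c` are purely illustrative and are not fitted to any data.

```python
import math

def gaussian_acf(r, b):
    """Gaussian-shaped spatial autocorrelation at distance r (mm), width b (mm)."""
    return math.exp(-r * r / (2.0 * b * b))

def mixed_acf(r, a, b, c):
    """Assumed mixed model: weight a on a Gaussian core, (1 - a) on an exponential tail."""
    return a * gaussian_acf(r, b) + (1.0 - a) * math.exp(-r / c)

# Illustrative (hypothetical) parameters: half the weight in a long exponential tail.
a, b, c = 0.5, 4.0, 10.0
for r in (0.0, 5.0, 10.0, 20.0):
    print(f"r = {r:4.1f} mm  gaussian = {gaussian_acf(r, b):.4g}  mixed = {mixed_acf(r, a, b, c):.4g}")
```

Both models equal 1 at zero distance, but at distances of a few widths the Gaussian correlation is essentially zero while the mixed model still carries appreciable correlation. Noise with such tails forms larger chance clusters than a Gaussian model predicts, so a Gaussian-based cluster-size threshold is too lenient.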

## Nonparametric Approach

A spatial model-free, nonparametric randomization approach was added to AFNI’s group-level GLM program, 3dttest++ (3). All FPRs (Fig. 1 G and H) were within the nominal confidence interval. Although this approach shows promise (as in ref. 1), it may not be feasible to generalize nonparametric permutations to complicated covariate structures and models (e.g., complex ANOVA, analysis of covariance, or linear mixed effects) (4, 5).
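The general idea of a spatial model-free cluster threshold can be sketched on a toy one-dimensional “image”. This is not the 3dttest++ implementation: it assumes a simple label-permutation scheme, uses a difference of group means in place of a $t$-statistic, and defines a cluster as a run of contiguous suprathreshold voxels. All function names and parameter values are hypothetical.

```python
import random
import statistics

def mean_diff_map(group_a, group_b):
    """Voxelwise difference of group means (a stand-in for a t-statistic map)."""
    n_vox = len(group_a[0])
    return [
        statistics.fmean(s[v] for s in group_a) - statistics.fmean(s[v] for s in group_b)
        for v in range(n_vox)
    ]

def max_cluster_size(stat_map, voxel_thresh):
    """Longest run of contiguous suprathreshold voxels in a 1D map."""
    best = run = 0
    for value in stat_map:
        run = run + 1 if value > voxel_thresh else 0
        best = max(best, run)
    return best

def cluster_size_threshold(group_a, group_b, voxel_thresh, n_perm=200, alpha=0.05, seed=0):
    """Build the permutation null of the maximum cluster size and return its
    (1 - alpha) quantile; observed clusters must exceed this size."""
    rng = random.Random(seed)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    null_sizes = []
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign subjects to groups at random
        perm_map = mean_diff_map(pooled[:n_a], pooled[n_a:])
        null_sizes.append(max_cluster_size(perm_map, voxel_thresh))
    null_sizes.sort()
    return null_sizes[min(n_perm - 1, int((1 - alpha) * n_perm))]

# Toy usage: 20 + 20 synthetic "subjects" of pure white noise, 100 voxels each.
rng = random.Random(1)
group_a = [[rng.gauss(0, 1) for _ in range(100)] for _ in range(20)]
group_b = [[rng.gauss(0, 1) for _ in range(100)] for _ in range(20)]
threshold = cluster_size_threshold(group_a, group_b, voxel_thresh=0.5)
observed = max_cluster_size(mean_diff_map(group_a, group_b), 0.5)
print("cluster-size threshold:", threshold, "observed max cluster:", observed)
```

Because the null distribution is built from the data themselves, no assumption about the shape of the spatial ACF is needed. The difficulty noted above is that simple relabeling of this kind is only valid when subjects are exchangeable under the null hypothesis, which is harder to arrange for models with complicated covariate structures.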

## Inflated FPRs

Several cases showed significant FPR inflation across existing fMRI software within the testing framework of ref. 1. However, deviations from nominal FPR were not uniformly large and depended strongly on several factors. Fig. 1 and figure 1 of ref. 1 show quite good cluster results for stricter per-voxel P values (which ref. 6 found to be predominantly used in fMRI analyses) and for event-related stimuli (emphasizing the importance of good experimental design): FPR inflation was often $\lesssim 10\%$ (Beijing) or $\lesssim 5\%$ (Cambridge), affecting only clusters with marginally significant volume.

We strongly disagree with Eklund et al.’s (1) summary statement: “Alarmingly, the parametric methods can give a very high degree of false positives (up to $70\%$, compared with the nominal $5\%$).” For comparison, their own nonparametric method’s results actually showed up to 40% FPR. When characterizing results, medians or percentile ranges are generally more informative summary statistics than maxima. Looking backward, the typical ranges show much smaller FPR inflation than what had been highlighted, and looking forward they provide useful suggestions for experimental design and analyses (lower voxelwise $P$, event-related paradigms, etc.). By concentrating on the highest observed FPRs, the conclusions of Eklund et al. (1) are unnecessarily alarmist.

## Acknowledgments

This work was supported by the National Institute of Mental Health and National Institute of Neurological Disorders and Stroke Intramural Research Programs ZICMH002888 of the NIH, US Department of Health and Human Services. This work used the computational resources of the NIH High-Performance Computing Biowulf cluster (http://www.danielhellerman.com/).

## Footnotes

• ¹To whom correspondence should be addressed. Email: robertcox@mail.nih.gov.
• Author contributions: R.W.C. designed research; R.W.C. performed research; R.W.C. contributed new reagents/analytic tools; R.W.C. and R.C.R. analyzed data; and R.W.C., G.C., D.R.G., R.C.R., and P.A.T. wrote the paper.

• The authors declare no conflict of interest.

## References

1. ?
.
2. ?
.
3. ?
.
4. ?
.
5. ?
.
6. ?
.
