Permutation methods are a class of statistical tests that, under minimal assumptions, can provide exact control of false positives (i.e., type I error). The central assumption is simply that of exchangeability, that is, swapping data points keeps the data just as likely as the original. With the increasing availability of inexpensive large-scale computational resources and openly shared, large datasets, permutation methods are becoming popular in neuroimaging due to their flexibility, and because they raise less concern about yielding nominal error rates than parametric tests, which rely on assumptions and/or approximations that may be difficult to meet in real data. This becomes even more important in the presence of multiple testing, in that assumptions may not be satisfied for each and every test, and the correlation across tests may be difficult to account for. However, even exchangeability can be violated in the presence of dependence among observations, and it may not always be clear what to permute. The aim of this blog post is to emphasize the relevance of linking the null hypothesis and the dependence structure within the data to what should be shuffled in a permutation test. We provide a few practical examples, and offer some glimpses of the theory along the way.
Example 1: Permutation mechanics
Let’s begin by reviewing the mechanics of a permutation test. Consider a comparison between two groups, for example whether hippocampal volume is different between subjects with Alzheimer’s disease (AD) and demographically matched cognitively normal controls (that is, a group with similar age, sex, education level, etc). If we assume that in both groups the hippocampal volumes are independent samples from a Gaussian distribution, a classical parametric two-sample t-test can be used to test for a difference between means of the two groups. However, this distributional assumption may not be true, and departures from this assumption can potentially lead to incorrect conclusions. In these circumstances, permutation tests perform better than parametric tests by providing a valid statistical test with much weaker assumptions. Specifically, under the null hypothesis that the hippocampal volume has no actual difference between AD cases and controls, the group membership (or the label of case and control) becomes arbitrary, that is, any subject from one group might as well have been from the other.
While it may seem implausible that this would be the case for patients and controls, in fact this is what we are testing: all else being equal (that is, exchangeable), any difference found must relate to the means, which is what we are interested in. In fact, a classical parametric two-sample test (with equal variance) makes not just the same assumption, but further assumes that patients and controls come from the same Gaussian distribution. Permutation tests do not require Gaussianity; it suffices that the data are merely exchangeable. Exchangeability further relaxes another important assumption of parametric tests: independence. Data that are not independent may still be exchangeable, either globally or under certain restrictions, as presented in more detail in Example 3 below.
With exchangeability, we compute the t statistic under each permutation, and produce the permutation distribution of the statistic under the null. The permutation distribution is the empirical cumulative distribution function (cdf) obtained from the data themselves, as opposed to from some idealized distribution, as is the case with parametric tests. The observed test statistic can be considered a random sample from the permutation distribution because it is equally likely to have arisen from any case-control re-labeling given the null hypothesis.
The p-value is the probability of finding a test statistic for the group comparison at least as high as the one observed, provided that there is no actual difference (i.e., the null hypothesis is true). So, the p-value can be calculated by randomly permuting the group labels many times, each time recalculating the test statistic; at the end of the process, we check how often a statistic as large as or larger than the original (before any shuffling had been applied) was observed, and divide that by the number of permutations performed. Figure 1 shows an example in which there are three subjects in each group; before any permutation is done, the test statistic is t = +0.7361. After exhaustively computing all 20 possible permutations, we see that 4 of these (including the non-permuted) are higher than or equal to +0.7361. Thus, the p-value is 4/20 = 0.20. If we had decided beforehand that our significance level would be 0.05, we would say that the result of this test is not significant, that is, there is no significant difference in hippocampal volume between AD patients and controls.
Figure 1: Consider the hippocampal volume measured in 6 subjects, three with Alzheimer’s disease, and three cognitively normal controls. The values measured are shown in the boxes (ranging between 3498 and 3588), controls in blue, AD patients in green. The test statistic for a difference Controls > AD is t = +0.7361. If there is no actual difference between the two groups, then the group assignment can be randomly permuted. For each such permutation, a new test statistic is calculated. In this example, four t statistics (shown in red) computed after random permutations of the group assignments, out of the 20 performed, were equal to or larger than the observed, non-permuted statistic. The p-value is therefore 4/20 = 0.20.
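In code, the whole procedure takes only a few lines. The sketch below (in Python with numpy; the six volumes are made up for illustration and are not the actual values behind Figure 1) enumerates all 20 relabelings exhaustively, exactly as in the figure:

```python
import itertools
import numpy as np

def t_stat(x, y):
    """Two-sample t statistic (equal-variance form), testing mean(x) > mean(y)."""
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled * (1 / nx + 1 / ny))

def exact_permutation_p(x, y):
    """Exhaustive (exact) one-sided permutation p-value: the fraction of all
    relabelings whose statistic is as large as or larger than the observed one."""
    data = np.concatenate([x, y])
    n, nx = len(data), len(x)
    t_obs = t_stat(x, y)
    relabelings = list(itertools.combinations(range(n), nx))
    count = 0
    for idx in relabelings:
        mask = np.zeros(n, dtype=bool)
        mask[list(idx)] = True  # these subjects play the role of group x
        if t_stat(data[mask], data[~mask]) >= t_obs:
            count += 1
    return count / len(relabelings)

# Hypothetical volumes in mm^3 (NOT the actual values behind Figure 1)
controls = np.array([3588.0, 3551.0, 3546.0])
patients = np.array([3498.0, 3532.0, 3544.0])
p = exact_permutation_p(controls, patients)
```

Because the unpermuted labeling is itself one of the 20 relabelings, the smallest attainable p-value with six subjects is 1/20 = 0.05; a sample this small can never produce a p-value below that.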
Example 2: Permutation in the presence of nuisance variables
Suppose in Example 1 that there were other variables that could potentially explain some of the variability seen in hippocampal volume. Some of these variables could even be associated with diagnosis itself. For example, it may be the case that, in this particular study, AD patients were older than cognitively normal controls. To account for these nuisance variables, we can formulate the problem as a multiple regression, in which hippocampal volume is the dependent variable, whereas the case-control status, along with other potential nuisance variables, are the independent variables. We would then test whether the regression coefficient corresponding to the case-control label is significantly different from zero. Now it is less clear what should be permuted. If we permute just the group labels, what should we do with the other variables in the model? It turns out that various approaches have been considered in the literature.
Systematic evaluations show that, among a host of permutation and regression strategies, the method attributed to Freedman and Lane provides accurate false positive control in the presence of nuisance variables and is robust to extreme outliers in the data. In the Freedman-Lane method, we regress out all nuisance variables from the hippocampal volume measurements to obtain the residuals of this nuisance-only model, and use the permuted residuals as the new dependent variable in the multiple regression, from which we construct the permutation distribution for the test statistic (i.e., the regression coefficient of interest). Intuitively, once the nuisance has been regressed out, what remains should be indistinguishable between AD patients and controls if the null hypothesis is true, and thus, can be permuted.
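The Freedman-Lane steps can be sketched as follows (a minimal illustration in Python with numpy; the data, sample size, and effect of age are all hypothetical, and a real analysis would use many more permutations and a full design matrix):

```python
import numpy as np

rng = np.random.default_rng(0)

def t_for_coef(y, X, j):
    """t statistic for coefficient j in an ordinary least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[j] / np.sqrt(cov[j, j])

def freedman_lane(y, x, Z, n_perm=1000):
    """Freedman-Lane p-value (two-sided) for the coefficient of x,
    with Z holding the nuisance variables (including an intercept)."""
    X = np.column_stack([x, Z])
    t_obs = t_for_coef(y, X, 0)
    gamma, *_ = np.linalg.lstsq(Z, y, rcond=None)  # nuisance-only model
    fitted, resid = Z @ gamma, y - Z @ gamma
    count = 1                                      # the unpermuted statistic counts
    for _ in range(n_perm - 1):
        y_star = fitted + rng.permutation(resid)   # permute the nuisance residuals
        if abs(t_for_coef(y_star, X, 0)) >= abs(t_obs):
            count += 1
    return count / n_perm

# Hypothetical data: volume depends on age (nuisance) but not on diagnosis
n = 40
age = rng.uniform(60, 85, n)
group = np.repeat([0.0, 1.0], n // 2)              # 0 = control, 1 = AD
volume = 5000 - 10 * age + rng.normal(0, 50, n)
Z = np.column_stack([np.ones(n), age])
p = freedman_lane(volume, group, Z)
```

Only the residuals of the nuisance-only model are shuffled; the nuisance fit itself is kept in place, so the permutation distribution reflects what the data would look like if diagnosis had no effect but age still did.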
We note that whichever regression and permutation strategy is adopted, it is crucial that what is permuted is what would render the subjects different were the alternative hypothesis true. It is not relevant to permute aspects of the dataset that would not be affected should the null hypothesis be false, that is, should an effect actually exist. This is important because, when an experiment becomes complex (e.g., with multiple factors, levels, nuisance variables, and/or multiple response variables), it can be easy to permute aspects of the data that are not informative with respect to the null hypothesis. One should not lose sight of what is being tested, and permute the data accordingly.
Example 3: Permutation in the presence of dependence among observations
Data are not always freely exchangeable. It may be the case, for example, that there are repeated measurements from the same subjects among the observations. Or maybe some or all subjects are twins, siblings, or otherwise relatives. Cases such as these restrict the possibilities for permutations, but even so, permutation tests continue to be possible. They proceed in a similar manner as in the examples above, but care needs to be taken when selecting the permutations that are allowed. Exchangeability as defined above — that is, permuting the data keeps them just as likely as originally observed — must be preserved. More technically, it means that the joint distribution of all the data points must remain unchanged under the null. For example, in a twin study, one could permute the subjects within twin pairs, and pairs of twins could be permuted as a whole, but one sibling should never be mixed with the sibling from a different family; see an example in Figure 2. These restrictions, unfortunately, tend to reduce power compared to the analyses in which all subjects are independent and freely exchangeable. However, all other benefits of permutation tests are kept.
Figure 2: Observations that are not independent restrict the possible rearrangements of the data. In this figure, each white circle represents an observation (e.g., a measurement from a subject), the blue (+) or red (−) dots indicate whether the branches that originate at that dot are or are not exchangeable, respectively, and therefore indicate observations that can be permuted with each other. On the left, 10 unrelated subjects who are freely exchangeable. On the right, 18 subjects, some of whom were recruited along with their siblings (FS), and/or with their monozygotic (MZ) or dizygotic (DZ) twin. Siblings must be kept together in every rearrangement of the data, which needs to be performed in blocks; subjects within a sibship can be permuted; some families may have both twins and non-twins, which requires nested blocks. (Figure licensed under CC-BY 4.0. https://creativecommons.org/licenses/by/4.0/)
Consider a longitudinal extension of the AD patients vs. controls example, in which two measurements are obtained from each subject, one before and another after an intervention is applied. As per above, the measurements must stay together within subject. However, depending on what is being tested, we may permute the data only within-subject, or only the subjects as a whole while keeping the order of intra-subject measurements unaltered, or do both things simultaneously. Within-subject effects (that is, the effect of treatment) would require that permutations happen within-subject, whereas between-subject effects would require permutations of the subjects as a whole. Interactions in a mixed design (within and between-subject effects) could benefit from both types of permutation. Crucially, what needs to be permuted is what would be equal should the null hypothesis hold, and that would differ should the alternative hypothesis actually be true.
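The two kinds of restricted shuffling for this longitudinal design can be illustrated with a toy sketch (Python with numpy; the array layout, one row per subject with pre/post columns, and the data themselves are assumptions for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)

def within_subject_shuffle(data):
    """Swap (or not) the pre/post measurements of each subject at random.
    `data` has one row per subject and two columns (pre, post); this is the
    shuffling appropriate for a within-subject (treatment) effect."""
    flip = rng.integers(0, 2, size=len(data)).astype(bool)
    out = data.copy()
    out[flip] = out[flip, ::-1]  # reverse the pair for the flipped subjects
    return out

def between_subject_shuffle(labels):
    """Permute group labels across subjects; each subject's pre/post pair
    travels together, as required for a between-subject (group) effect."""
    return rng.permutation(labels)

# Hypothetical paired data: 8 subjects, pre/post measurements, two groups
data = rng.normal(size=(8, 2))
labels = np.repeat([0, 1], 4)
shuffled = within_subject_shuffle(data)
relabels = between_subject_shuffle(labels)
```

Note that neither function ever moves a single measurement from one subject to another: the block structure of Figure 2 is respected by construction.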
Example 4: Comparison between models
Now suppose that, in our AD example, in addition to hippocampal volume, we have also measured the amygdala volume for each subject, and are interested in investigating whether hippocampal volume is a better biomarker of AD than amygdala volume (for example, in terms of standardized mean difference between cases and controls as measured by the Cohen’s d statistic). It is tempting to permute the case-control label, but this strategy turns out to be wrong, as it completely breaks the associations between the hippocampal/amygdala volume and disease status, which should be retained under the null hypothesis. In fact, in this example, it is unclear what to permute. As a second example, if we want to test whether the mean of hippocampal volume in AD cases is significantly different from a fixed value (e.g., the typical size of the hippocampus in normal aging subjects), it can be seen that there is nothing to permute. In these circumstances where a permutation test is difficult to apply, we need to resort to other methods such as the bootstrap for statistical inference.
The bootstrap is an established data-based simulation method, which is often used to assign measures of accuracy, such as standard error, bias, and confidence intervals, to a statistical estimate. It essentially uses the observed data to define an empirical distribution that estimates the unknown underlying data-generation mechanism, and then generates bootstrap samples and bootstrap replications of the statistic of interest using the empirical distribution, from which measures of accuracy can be calculated.
The bootstrap can be applied to virtually any statistic and in a wide variety of situations. For example, by sampling cases and controls with replacement independently, we can calculate the standard error or construct confidence intervals for the Cohen’s d statistic of hippocampal volume and of amygdala volume, as well as for the difference between the two. Given the strong connection between confidence intervals and hypothesis testing, a p-value can also be produced indicating whether the difference in Cohen’s d is significantly different from zero. In fact, the bootstrap can be applied to hypothesis testing, including the questions described in Examples 1-3. However, unlike the permutation p-value, which is exact, the bootstrap significance is only approximate and thus less accurate.
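A sketch of this bootstrap (Python with numpy; all volumes, sample sizes, and variable names are hypothetical, and the simple percentile interval used here is only one of several bootstrap confidence-interval constructions):

```python
import numpy as np

rng = np.random.default_rng(2)

def cohens_d(x, y):
    """Standardized mean difference with a pooled standard deviation."""
    nx, ny = len(x), len(y)
    sp = np.sqrt(((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1))
                 / (nx + ny - 2))
    return (np.mean(x) - np.mean(y)) / sp

def bootstrap_d_difference(hc_ctl, hc_ad, am_ctl, am_ad, n_boot=2000):
    """Bootstrap the difference in Cohen's d (hippocampus minus amygdala).
    Cases and controls are resampled with replacement independently, but the
    two regions of a given subject are resampled together (same indices),
    preserving the within-subject dependence between the regions."""
    diffs = np.empty(n_boot)
    n_ctl, n_ad = len(hc_ctl), len(hc_ad)
    for b in range(n_boot):
        i = rng.integers(0, n_ctl, n_ctl)            # resampled controls
        j = rng.integers(0, n_ad, n_ad)              # resampled cases
        diffs[b] = (cohens_d(hc_ctl[i], hc_ad[j])
                    - cohens_d(am_ctl[i], am_ad[j]))
    ci = tuple(np.percentile(diffs, [2.5, 97.5]))    # percentile interval
    return diffs, ci

# Hypothetical volumes (mm^3): 30 controls and 30 AD cases per structure
hc_ctl = rng.normal(3550, 40, 30); hc_ad = rng.normal(3480, 40, 30)
am_ctl = rng.normal(1700, 30, 30); am_ad = rng.normal(1680, 30, 30)
diffs, ci = bootstrap_d_difference(hc_ctl, hc_ad, am_ctl, am_ad)
```

If the resulting interval excludes zero, the difference in Cohen’s d between the two structures would be declared significant at the corresponding level; reusing the same resampled indices for both regions is what respects the data-generation mechanism, as cautioned in the next paragraph.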
Therefore, permutation is a natural and favorable choice when the null/alternative hypothesis is well defined and what to permute is clear. Bootstrap is useful when the primary goal is to quantify the accuracy of an estimate or when a permutation test is not available in a hypothesis test (e.g., nothing to permute). That said, we also caution that bootstrap relies on an accurate empirical estimation of the true underlying probability distribution. Thus the sampling procedure requires careful consideration in order to respect the data generation mechanism in the presence of complex data structures. For example, block bootstrap is often used to replicate correlations within the data, while variants of the wild bootstrap are used to capture heteroscedasticity in the sample.
Practical advice: It's easy to get started with permutation methods in brain imaging. Most software packages have some sort of permutation test implemented. AFNI's 3dttest++ now uses permutation by default for cluster inference with the -ClustSim option; BrainVoyager has a randomisation plugin (permutation tests are sometimes called randomisation tests); Freesurfer can do permutation with mri_glmfit-sim; FSL has its randomise tool; and SPM has the SnPM toolbox. Finally, PALM is a standalone tool for permutation that works with different types of input data and has various advanced features.
At its best, multi-modal imaging offers rich insight into many aspects of brain structure and function. At the same time, its development has been hampered by challenges: for example, simultaneous EEG-fMRI has additional safety concerns, and the EEG data require extra analysis steps to account for artifacts from the magnetic field and rapidly changing field gradients. Despite these issues, there is increasing attention to the merits of this approach, with high profile journals dedicating special issues to multi-modal data fusion.
To find out about the promises and pitfalls of multi-modal imaging, we sent a series of questions to members of the OHBM Multi-Modal Imaging Task Force. This team comprises experts in different imaging domains, and aims to promote and develop multi-modal imaging. We found out about the state of the field from Alain Dagher, neurologist and PET/fMRI expert at the Montreal Neurological Institute, Urs Ribary, cognitive neuroscientist and EEG/fMRI expert in British Columbia, Gitte Knudsen, neurologist and translational neurobiologist at Copenhagen University, and Shella Keilholz, physicist and fMRI expert at Emory University and Georgia Tech.
OHBM: First, what advice would you give to those who are keen to get into multi-modal imaging?
Alain Dagher (AD): Make sure you have a strong grasp of both methods.
Urs Ribary (UR): First, focus on understanding the neurophysiological and biochemical aspects of the brain; then learn individual methods (MRI-fMRI, MEG/EEG, PET, or others…); finally, learn the additional technologies and techniques that will allow you to integrate these different sources of information.
Gitte Knudsen (GK): You need to train at a site where there is high-level expertise in both modalities, and preferentially integrated. If you cannot readily become attached to an academic site that masters true multimodality, do your master thesis/PhD in a centre where they master one or two of the modalities and then move on to another site with the complementary expertise.
Shella Keilholz (SK): Well I would tell them that if they want to do it, just go for it! It’s a great way to increase the impact of your research, especially when the additional modality allows you to make inferences about causality or fundamental mechanisms that you can’t obtain with a single methodology. Sometimes it seems overwhelmingly difficult to add another modality but we have always been able to find collaborators who generously help us get started.
OHBM: It seems the tools for collecting the data are more readily available (e.g. MRI compatible EEG setups). What is the biggest remaining hurdle in conducting multimodal studies? Is data-fusion between modalities improving?
AD: The increased cost and complexity is generally what holds this back. [Further note from Jean Chen, OHBM blogteam member: “For example, an integrated PET/MRI system is more costly than a regular PET or MRI system. Whilst it may not be as expensive as buying a PET and an MRI system separately, new money is often required to get into multi-modal imaging.”].
GK: The biggest hurdle is, first, to master more than one tool to perfection and second, to ask the right scientific questions that can only be addressed using a multimodality approach. Data-fusion between modalities is a challenge, but slowly improving.
UR: Yes, data fusion is improving, but not so much the underlying knowledge of neurophysiology (why to integrate). There are also clearly issues with money (more recordings are more expensive) and with time (it requires more knowledge and work, and everybody wants to publish quickly). On the other hand, data fusion is not something that has to be done alone, and can be done efficiently in collaborations.
SK: One of the biggest challenges in multimodal research is designing experiments and analyses that maximize the use of the information obtained from both modalities. It requires thinking beyond the conventional paradigms for each of the modalities involved.
OHBM: The increased use of simultaneous PET-MR scanners has clear advantages for cancer imaging. What benefits do you feel it may hold for other areas of neuroimaging?
UR: A clear benefit would be the ability to combine biochemical information with information about brain structure, function and dynamics.
AD: There are many benefits. For example if you take the combination of BOLD and neurotransmitter imaging, since neurotransmitter signalling fluctuates, simultaneous measurement of, for example, dopamine signalling and task-related BOLD has great potential. This then also allows powerful task designs with pharmacological manipulations.
GK: It allows us to measure neurotransmitter release and receptor occupancies and hemodynamic responses simultaneously. We can then use this with pharmacological, physiological or other stimuli. Another great advantage is that it saves time (becoming a one stop shop) for patients with neurological or psychiatric disorders, and so can be useful for those who are not able to tolerate multiple scanning sessions. Unfortunately, despite saving time and possibly resources, the simultaneous acquisition of these different types of information has not yet been truly exploited.
OHBM: The last decades have seen the development of a number of new radioligands for imaging tau and amyloid pathology, microglial activation with translocator protein, phosphodiesterases, and other exciting clinical markers. Are these helping drive multi-modal imaging research? Which emerging PET tracers are you most excited about and why?
AD: For me, the most exciting tracers have been those used to image tau and amyloid, providing otherwise unavailable information about neurodegenerative diseases. Previously we only had brain atrophy as a proxy of disease.
GK: If we’re still talking about hybrid scanners, then we are most interested in developing tracers that target components in the brain that are under rapid regulation. In these cases the methodology can capture these regulations and relate them to, for example, the hemodynamic responses. I’m currently excited about radioligands that are sensitive to neurotransmitter release, as well as emerging PET tracers that are informative of brain processes key to many different types of functions/pathologies. For example, tracers that indicate neuroplasticity or stem cells.
UR: Everything helps! I’ve been impressed with recent research relating imaging of neurotransmitters to cognitive functions in health and disease. In addition, with the ability to image GABA, an inhibitory neurotransmitter, it has been fascinating to see how it may contribute to, and even control, brain development and dynamic network functions. Last, it’s helped us understand the brain as a fine-tuned electrochemical system which controls all brain functions.
OHBM: Simultaneous EEG-fMRI offers high spatial and temporal precision - but how have labs coped with the challenge of integrating and analysing this wealth of data?
AD: This has been especially problematic for EEG. What we need is good open-source processing software for integrating this information, along with online tutorials and courses to teach people how to use them.
UR: I believe that there’s still not enough work in this area. We need to have a much greater understanding of how structure, overall function and brain dynamics integrate in order to understand how typical/atypical brain networks function. Here the question is not so much about using information from different methods to prove each other but instead to complement each other.
OHBM: EEG-fMRI has clear benefits in conditions like epilepsy, for identifying seizure focus and spread. What applications has it had in other conditions - and what do researchers hope to achieve with it?
AD: Cognitive neuroscience can certainly benefit from the combination of higher spatial and temporal resolution in brain mapping.
GK: EEG-fMRI also has promise for use in sleep physiology, sleep disorders and coma.
UR: Any typical cognitive functions and any pathology which are ALL based on structure, function and dynamics....
OHBM: What do you think are the main strengths of multi-modal MRI work? Do you feel it offers hope for developing valid and reliable MR-biomarkers?
UR: Absolutely! Science is not a mystery, the more complementary information we have, the better we understand the human brain. It will help us to diagnose/monitor sub-types of pathologies and give much greater precision when tracking the effects of interventions....
AD: I do believe using multiple MR measures makes sense for biomarker development and understanding pathophysiology. Pathological processes (e.g. in Alzheimer’s Disease) can affect the brain in multiple but likely stereotyped ways. We can also increase our power to detect pathology (e.g. inflammation, white and grey matter tissue loss, connectivity information) by combining multiple measures.
OHBM: What additional challenges do animal studies have in terms of sequence development or protocol considerations? How do you find these studies enrich those in humans?
AD: Clearly a major issue is the small size of animal brains. We also have to account for the animals typically being anaesthetised when scanned, which has implications for physiology.
GK: Sometimes data from preclinical studies can help optimize a project to be conducted later in humans.
UR: The real benefit of these preclinical studies is that it allows us to perform complementary invasive studies not possible on humans, such as MRI-histology studies. We do however need to continue developing better, or more realistic, settings in animal research in order to better correlate those findings with human brain research.
SK: One of the challenges that we’ve found is that tools that are available on human MRI systems (simultaneous multislice EPI, for example) are not easily implemented on animal systems due to hardware limitations. As Alain says, the other main issue is the use of anesthesia in animals, a special challenge for functional neuroimaging studies, as discussed in our review. Luckily, many of the basic properties of the brain remain relatively intact under light anesthesia, which has been critical for the validation of human neuroimaging methods against “ground truth” modalities like microelectrode recording. People talk of animal research as preclinical or translational, but we like to think of it more as circular. For example, one can take a neuroimaging finding in humans (e.g., fMRI response to tactile stimulation) and look at its basis in the rat using MRI and electrophysiology. Then perhaps one sees that this response is altered in human patients with a particular disorder (maybe stroke). One can then go back to a rat model of stroke and see if the same alteration is present, which helps to validate the stroke model. Then one can look for the neural basis of the alteration using MRI and electrophysiology and identify specific alterations in patients that may be detectable with EEG…etc, etc. We think that human and animal neuroimaging work should inform each other.
OHBM: Thanks all for your insight! We look forward to the multi-modal imaging symposium at OHBM 2018 in Singapore.