The Generalizability Crisis Revisited

In the summer of 2015, I remember coming across a New York Times article in a hotel lobby about a study by Brian Nosek and the Open Science Collaboration. In a paper recently published in Science, the researchers tested whether some of the best-known effects in psychology could be replicated by independent teams. These included famous results such as the priming effect, in which a subject who read a list of words related to a concept such as old age would start to act as if they were indeed old - walking slower, spending way too much time in the bathroom, and repeatedly asking where they left their glasses.

Out of the 100 studies they looked at, only 39 were successfully replicated, which according to my calculations is a success rate of approximately 39% - a number that, on the subjective scale, falls somewhere between “Uh-oh” and “We done goofed”. Clearly this was not good for science, and for psychology in particular, and although social psychologists were a convenient whipping boy, it became obvious that the problem ran deep in other branches of science as well. Neuroimaging in particular lost some of its prestige with the publication of Eklund et al.’s 2016 article on inflated false-positive rates in fMRI studies, casting further doubt on the reliability of imaging research.

Around this time, many researchers tried to discern the cause of all these problems. Was it a flawed reward system, in which only statistically significant results are noticed and rewarded, while null results are ignored or, at best, seen as a sign of incompetence? Or was it a hangover from the small study designs of the 90s and early 2000s, which could be gotten away with when studying robust effects in the brain, but which left studies underpowered when testing for more subtle effects?

One argument made by Yarkoni (2022) is that construct validity is necessary for results to mean anything, let alone to generalize. Construct validity refers to whether a measurement actually captures what it is supposed to be measuring; how much someone donated to charity, for example, might be taken as a measure of a psychological construct such as selflessness, but it could also be confounded with the personal interests of the donor - perhaps the donation was made for tax purposes, or given to a charity run by his brother.

According to Yarkoni, many psychological studies fail this basic test of construct validity, and the measurement becomes even further removed from the construct, as well as more costly, when more sophisticated techniques are involved - namely, brain imaging. An MRI scan usually costs anywhere from $500 to $1,000 per hour, and a typical fMRI study can cost tens of thousands of dollars. Furthermore, fMRI is several steps removed from the underlying neural activity: the blood-oxygenation-level-dependent (BOLD) signal measured in an fMRI scan is an indirect consequence of neural activity, and interpreting it requires several assumptions about blood flow and neurovascular coupling that may not hold uniformly throughout the brain.
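To make concrete just how indirect the BOLD signal is, here is a toy sketch (my own illustration, not something from Yarkoni’s paper) of the canonical double-gamma hemodynamic response function: a brief burst of neural activity shows up in the measured signal only several seconds later, blurred over time, and the exact shape varies across brain regions and people.

    import numpy as np
    from scipy.stats import gamma

    # Canonical double-gamma HRF (SPM-style): a peak around 5 s after a neural
    # event, followed by a small undershoot. These are textbook defaults, not
    # tuned to any particular brain region.
    t = np.arange(0, 30, 0.1)
    hrf = gamma.pdf(t, 6) - gamma.pdf(t, 16) / 6
    hrf /= hrf.max()

    # A brief burst of neural activity at t = 0 ...
    neural = np.zeros_like(t)
    neural[0] = 1.0

    # ... shows up in the BOLD signal only seconds later, smeared out in time.
    bold = np.convolve(neural, hrf)[: t.size]
    print(f"BOLD response peaks {t[bold.argmax()]:.1f} s after the neural event")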

On top of this, virtually all statistics used in psychological science rely on what are called random effects. This means estimating the mean and standard deviation of the population we are drawing from, so that findings from our sample can be generalized to that population. Then, if the sample mean is sufficiently large relative to its variability, we calculate a p-value to quantify how likely we would be to draw a sample with a mean at least that extreme if there were truly no effect in the population.
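As a minimal sketch of that logic, with numbers invented purely for illustration, the calculation looks something like this in Python:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    # Made-up example: an effect measured in 25 subjects, drawn from a population
    # whose true mean is 20 (arbitrary units) with a standard deviation of 50.
    sample = rng.normal(loc=20, scale=50, size=25)

    # Treating subjects as a random sample from that population, the one-sample
    # t-test asks: how likely is a sample mean this far from zero, given the
    # sample's variability, if the population mean were truly zero?
    t_stat, p_value = stats.ttest_1samp(sample, popmean=0)
    print(f"sample mean = {sample.mean():.1f}, t = {t_stat:.2f}, p = {p_value:.3f}")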

The rub is that the validity of the inference depends on the researcher’s assumptions: would this result generalize to a different age group, testing site, or even a slightly different set of stimuli? The Stroop task furnishes a good example of a generalizable phenomenon. Originally designed to test whether subjects could override their habit of reading a word and respond to the color of the font instead, the Stroop task has been modified to encompass many different types of incongruency, such as direction, location, and emotion. Across all of these variations, the underlying Stroop effect appears to hold: no matter what kind of task design or population you study, people tend to respond more slowly and commit more errors when responding to an incongruent stimulus as compared to a congruent one.

Note that for such an effect to be considered generalizable, it takes years or decades of replications across different populations, testing sites, and, ideally, different sets of stimuli. It also helps that the original effect and most of the follow-up studies showed large effect sizes. Many other psychological findings, on the other hand, are not as robust, and even if they yield a statistically significant result in one study, care should be taken before generalizing the effect to the broader population. Yarkoni used a study by Alogna et al. (2014) as an example. This was an attempted replication of the “verbal overshadowing” effect, in which participants who described the physical appearance of a perpetrator caught on camera were less able to recognize the same perpetrator after a delay, compared to participants who did a control task, such as naming as many states and capitals as they could.

Even though Alogna et al. (2014) found a significant effect similar to the original study, Yarkoni maintains that this is not enough to call the effect generalizable. If we take it for granted that subjects should be treated as a random effect in order for our statistics to generalize to the population, then it stands to reason that other elements of the experiment, such as the stimuli and even the testing location, should also be treated as random factors. As it stands, virtually all psychological studies treat variables such as stimuli and testing location as fixed effects - meaning that the analysis ignores the variability associated with those factors, which makes it easier for the variable of interest (in this case, recognition of the perpetrator) to reach significance, but strictly speaking licenses conclusions only about those particular stimuli and locations.
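To make the distinction concrete, here is a rough sketch of what treating both subjects and stimuli as random effects could look like, using simulated data and the MixedLM class from statsmodels (which handles crossed random effects by treating the whole dataset as a single group with separate variance components); the design, effect sizes, and column names are all made up for illustration.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)

    # Hypothetical verbal-overshadowing-style design: 30 subjects each see 8
    # video stimuli, in either a "describe" (1) or "control" (0) condition.
    subjects = np.repeat(np.arange(30), 8)
    stimuli = np.tile(np.arange(8), 30)
    condition = rng.integers(0, 2, size=subjects.size)

    # Both subjects and stimuli get their own random offsets, plus trial noise.
    accuracy = (0.5 - 0.1 * condition
                + rng.normal(0, 0.05, 30)[subjects]    # subject random effect
                + rng.normal(0, 0.05, 8)[stimuli]      # stimulus random effect
                + rng.normal(0, 0.1, subjects.size))   # trial-level noise

    df = pd.DataFrame({"accuracy": accuracy, "condition": condition,
                       "subject": subjects, "stimulus": stimuli})

    # Crossed random effects: one all-encompassing group, with variance
    # components for subject and stimulus, so the condition effect is tested
    # against variability over both.
    df["group"] = 1
    model = smf.mixedlm("accuracy ~ condition", df, groups="group",
                        vc_formula={"subject": "0 + C(subject)",
                                    "stimulus": "0 + C(stimulus)"})
    print(model.fit().summary())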

Therefore the question is: how much of our experiment should be treated as a random factor? Treating everything as a random effect will make achieving statistical significance much more difficult, but, on the other hand, any effect that survives such a threshold is much more likely to generalize to the broader population. Other factors should also be considered: if many studies from independent research groups using similar stimuli reach the same conclusion about a psychological phenomenon, that should weigh in favor of it being a true effect - i.e., one that is sufficiently general throughout the population to be relied upon. Boundary conditions and special populations should always be considered - nobody would expect a color-word Stroop effect in someone who is illiterate - but often what we are looking for is an effect that is broad enough to merit consideration when designing something like the ergonomics of a new office space, or when deciding whether a clinical intervention is likely to be effective.

Regarding the latter, there is evidence that new neuromodulation interventions do work, and they are often based on fMRI studies that have articulated the functional architecture of the brain. A new technique called Stanford Neuromodulation Therapy, for example, targets the dorsolateral prefrontal cortex (DLPFC) with transcranial magnetic stimulation - a region chosen on the basis of several neuroimaging studies showing that it is hypoactive in people with major depressive disorder. The therapy appears to work, the remission of depressive symptoms tends to last, and it has been approved by the FDA. Similarly, a recent review of neuromodulation studies of addiction found that some of the strongest effects in clinical studies come from targeting the frontal pole and ventromedial prefrontal cortex - areas that previous fMRI studies have shown to be involved in reward sensitivity and craving. Again, these studies have not treated every experimental factor as a random effect, but the converging evidence from multiple studies appears to lead to meaningful clinical outcomes for different types of patients.

Yarkoni raises important points about many psychological studies being underpowered, poorly designed, and overgeneralized, and the field would benefit from greater rigor and larger sample sizes. However, we should also train our judgment about which effects appear to be real - which includes critically examining the study design, whether the methods and data are publicly available and reproducible, and whether independent studies have confirmed the effect. The recent success of neuromodulation as a therapy for different illnesses should also give us some confidence in the body of neuroimaging literature that has accumulated over the years on psychological phenomena from cognitive control to reward processing, as it has provided the foundation for efficacious clinical interventions. The growing normalization of sharing data and code, along with large open-access databases for analysis, will likely continue to yield important insights into how to treat different kinds of mental illness.

Trends in Best Practices for fMRI Research

If you are a newcomer to the field of neuroimaging, you may find the range of software packages, methods, and concepts bewildering; even after learning some of the basics of fMRI analysis, or how to analyze an EEG dataset from start to finish, you may have questions such as:

  • What are other, more experienced researchers doing?

  • What is the best way to organize and analyze my data? Is this BIDS thing for real, or just a fad?

  • Will univariate analyses be around for a while, or will they eventually be replaced by multivariate techniques?

  • Was O.J. guilty?

It is natural to ponder all of these, and more, as you advance in your career as a neuroimaging researcher. Although it’s impossible for anyone to answer all of these with complete certainty, we can make some educated guesses about the direction of the field as a whole, including how results are displayed and reported, which statistical techniques are considered necessary, and what other tools the modern researcher should have in their toolkit. I discussed all of this, and more, in a talk hosted by the University of Connecticut, which you can watch below.

How Does Magnetic Resonance Imaging Work?

For anyone who has tried explaining MRI physics to the layman, the expression on his face follows a very particular progression: First the eyes are narrowed and attentive, the brow slightly furrowed, as you speak of water and hydrogen, blood and oxygen, tissue and bone. These are tangible, they are real; the man can feel them on his own body, or he has an easy enough time picturing them. His look of concentration wavers a bit once you talk about spin, and how it’s an intrinsic property of subatomic particles, how it’s both like and unlike the spin he experienced as a child on the merry-go-round. And he can be forgiven for looking puzzled when you describe how spins either align with a magnetic field or against it, partly because of the field but partly because of chance, and how these spins, in and of themselves, are either up or down, never passing through some intermediate stage. And although we have many figures and paintings of spinning electrons, we evidently draw them from memory, as we would a distant loved one; for the electron is a very shy lady indeed, and no one has ever taken her picture or seen her in the flesh.

But that shadow of doubt is gone in an instant, his demeanor ready for more, once you begin talking about magnets. Magnets! Everyone has played with them; everyone understands intuitively the nature of the poles, attracting their opposites, repelling their identical twins. Everyone has observed them acting through solid matter: tables, books, hands; none of these stop the magnet from pulling on filings, metal, other magnets. An invisible force, whose effects are plain as day. It is only when you begin talking about gyromagnetic ratios and resonance that his mind begins to falter. Yes, the protons precess at an incredible rate; yes, we can push them periodically just as we would a child on a swing, tipping them onto their side. And then the absorbed energy is released, and the signal is picked up by sensitive receiver coils inside the scanner. So far, so good.
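To put a number on “an incredible rate”: the precession (Larmor) frequency is simply the gyromagnetic ratio multiplied by the strength of the magnetic field, which for hydrogen works out as follows.

    # Larmor frequency: how fast hydrogen protons precess in the scanner's field.
    # The gyromagnetic ratio below is for hydrogen (1H), in MHz per tesla.
    GYROMAGNETIC_RATIO = 42.58

    for b0_tesla in (1.5, 3.0, 7.0):  # common clinical and research field strengths
        print(f"{b0_tesla} T scanner: protons precess at ~{GYROMAGNETIC_RATIO * b0_tesla:.1f} MHz")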

But magnetic gradients? K-space? At this point our listener’s inner eye becomes clouded over. There is something about Fourier transforms, and how each point in k-space corresponds to the magnitude of the image - or was that the contrast? In any case, he will attempt to understand it the next day, or the day after that; but it invariably comes to pass that our thinker finds himself frustrated and, not seeing any point in continuing to wrestle with it - unless he is a very eager student indeed - he quits.
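For readers who would rather poke at this directly, a few lines of NumPy (a stand-in for the scanner’s actual reconstruction pipeline, not a simulation of it) show the relationship: the image and k-space are Fourier transforms of one another, the full k-space recovers the image exactly, and the center of k-space carries the overall contrast while the periphery carries the fine detail.

    import numpy as np

    # Build a simple 2D "phantom": a bright square on a dark background.
    image = np.zeros((128, 128))
    image[48:80, 48:80] = 1.0

    # The scanner effectively samples k-space, the 2D Fourier transform of the image.
    kspace = np.fft.fftshift(np.fft.fft2(image))

    # Reconstructing the image is an inverse Fourier transform of k-space.
    recon = np.abs(np.fft.ifft2(np.fft.ifftshift(kspace)))
    print(np.allclose(recon, image))  # True: full k-space recovers the image exactly

    # Keep only the centre of k-space (low spatial frequencies): the result is a
    # blurry image that preserves overall contrast but loses the edge detail.
    mask = np.zeros_like(kspace)
    mask[56:72, 56:72] = 1
    blurry = np.abs(np.fft.ifft2(np.fft.ifftshift(kspace * mask)))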

Nobody would claim that a video would clear up all of his confusion, but it might go a long way toward making MRI physics more accessible. The video above contains an impressive illustration of how MRI machines work, a brief but effective description of MRI physics, and an animation of how images are reconstructed from k-space. I recommend this to any student who has found himself bewildered by the topic, and I hope that it helps everyone appreciate just how complex and wonderful these machines are.

Now if we could just get a picture of that electron.

MRtrix Fixel-Based Analysis

One of the more advanced features of MRtrix is Fixel-Based Analysis (FBA), a technique for measuring both the fiber density and the fiber cross-section of a given piece of white matter. The developers of the package coined the term “fixel” to rhyme with “voxel” (kind of), indicating that both contain values representing some metric of brain activity or brain structure. The typical voxels we think of contain a single number representing contrast - either the contrast between grey matter, white matter, and other tissue types, in the case of a T1-weighted anatomical image, or the intensity of the BOLD signal in a T2*-weighted functional image.

A fixel, on the other hand, is a specific fiber population within a voxel - a voxel containing crossing fiber bundles, for example, contains one fixel per bundle - and it is the smallest unit of resolution for measuring white-matter metrics such as fiber density and fiber cross-section. These terms are defined in more detail in the Raffelt et al. 2017 paper, in which fiber density refers to how densely axons are packed within the voxel along the fixel’s direction, while fiber cross-section refers to the cross-sectional area occupied by the fiber bundle. These differences are illustrated in the following figure, taken from the Raffelt et al. 2017 study:

The goal of Fixel-Based Analysis is to compare groups and determine which fixels show a difference in fiber density, fiber cross-section, or a combination of the two (referred to as Fiber Density & Cross-Section, or FDC). Many patient populations, such as persons with Alzheimer’s or other age-related dementias, show markedly different fiber density and cross-section in major white-matter pathways compared with healthy controls, and FBA is a way to visualize and quantify these differences.
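As I understand it from the Raffelt et al. (2017) paper, FDC is simply the product of the two metrics. Conceptually (setting aside the actual fixel file format, which MRtrix manipulates with its own command-line tools), the combination looks like this; the numbers are made up:

    import numpy as np

    # Conceptual sketch only: fiber density (FD) and fiber cross-section (FC)
    # values for a handful of hypothetical fixels. In a real analysis these live
    # in MRtrix fixel directories and the multiplication is done with mrcalc.
    fd = np.array([0.45, 0.60, 0.30])   # microstructural: how densely packed the axons are
    fc = np.array([1.10, 0.85, 0.95])   # morphological: relative cross-sectional area

    # FDC combines both into a single measure of the bundle's capacity to carry information.
    fdc = fd * fc
    print(fdc)   # [0.495 0.51  0.285]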

I have written a tutorial demonstrating how to do this for the BTC Preop dataset, available on OpenNeuro, which includes glioma patients as well as controls. As there are 36 participants in total, I recommend running this analysis on a supercomputing cluster; in fact, you probably won’t be able to run it without one, because with 36 datasets, commands such as “population_template” and “fixelcfestats” can take dozens if not hundreds of hours to run, and the intermediate files they generate are huge. All of this points to running the analysis on a cluster with plenty of storage, and then either downloading the final results to visualize on your local computer or mounting a volume of the cluster on your machine.

The supercomputing code for Fixel-Based Analysis, adapted from the code outlined on the MRtrix FBA tutorial page, can be found here. The tutorial may be updated to reflect better supercomputing practices - for example, using a job array instead of creating and submitting an individual template file for each subject - but it should work for most purposes.
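For what it’s worth, here is a rough sketch of that job-array idea in Python; the directory name and the per-subject script “fba_subject.sh” are placeholders rather than part of the tutorial.

    import os
    import subprocess

    # Rough sketch of the job-array approach: one SLURM array job covers every
    # subject, instead of generating and submitting a separate file per subject.
    # "BIDS" and "fba_subject.sh" are placeholders for your own dataset folder
    # and per-subject processing script.
    subjects = sorted(d for d in os.listdir("BIDS") if d.startswith("sub-"))

    subprocess.run(
        ["sbatch", f"--array=0-{len(subjects) - 1}", "fba_subject.sh"],
        check=True,
    )

    # Inside fba_subject.sh, the subject is then selected from the same sorted
    # list using the SLURM_ARRAY_TASK_ID environment variable that SLURM provides.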