Leave One Subject Out Cross Validation - The Video

Due to the extraordinary popularity of the leave-one-subject-out (LOSO) post I wrote a couple of years ago, and seeing as how I've been using it lately and want to remember how to do it, here is a short eight-minute video on how to do it in SPM. While the method itself is straightforward enough to follow - GLMs are estimated for each group of subjects excluding one subject, and then estimates are extracted from the resulting ROIs for just that subject - the major difficulty is batching it, especially if there are many subjects.

Unfortunately I haven't been able to figure this out satisfactorily; the only advice I can give is that once you have a script that can run your second-level analysis, loop over it while leaving out consecutive subjects for each GLM. This will leave you with the same number of second-level GLMs as there are subjects, and each of these can be used to load up contrasts and observe the resulting clusters from that analysis. Then you extract data from your ROIs for that subject which was left out for the GLM and build up a vector of datapoints for each subject from each GLM, and do t-tests on it, put chocolate sauce on it, eat it, whatever you want. Seriously. Don't tell me I'm the only one who's thought of this.

Once you have your second-level GLM for each subject, I recommend using the following set of commands to get that subject's unbiased data (I feel slightly ridiculous just writing that: "unbiased data"; as though the data gives a rip about anything one way or the other, aside from maybe wanting to be left alone, and just hang out with its friends):

1. Load up your contrast, selecting your uncorrected p-value and cluster size;
2. Click on your ROI and highlight the corresponding coordinates in the Results windown;
3. Find out what the path is to the contrasts for each subject for that second-level contrast by typing "SPM.xY.P"; that will be the template you will alter to get the single subject's data - for example, "/data/myStudy/subject_101/con_0001.img" - and then you can save this to a variable, such as "subject_101_contrast";
4. Average that subject's data across the unbiased ROI (there it is again! I can't get away from it) using something like "mean(spm_get_data(subject_101_contrast, xSPM.XYZ), 2)";
5. Save the resulting value to a vector, and update this for each additional subject.



Beta Series Analysis in SPM

The following is a script to run beta series analysis in SPM; which, conceptually, is identical to beta series analysis in AFNI, just made more complicated by the use of Matlab commands, many of which I would otherwise remain totally ignorant of.

Also, if you believe that I did this on my own, you are a scoundrel and a fool. Virtually all of the code is from my colleague Derek Nee, a postdoc who used to work in my lab and who remains my superior in all things, scientific and otherwise. He also once got me with the old "If you say gullible slowly, it sounds like orange" joke.

In any case, all that I did was repackage it into a function to be more accessible by people outside of our lab. You may see a couple snippets of code that are useful outside of just doing beta series analysis; for example, the spm_get_data command, which can be used for parameter extraction from the command line without using marsbar. Also note that the script assumes that each trial for a condition has been labeled separately with the regressor name followed by a "_" and then the occurrence of the trial. So, for example, if I were looking at 29 instances of pressing the left button, and the regressor name was LeftButtonPress, I would have LeftButtonPress_1, LeftButtonPress_2, LeftButtonPress_3...LeftButtonPress_29.

Obviously, this is very tedious to do by hand. The Multiple Conditions option through the SPM GUI for setting up 1st level analyses is indispensable; but to cover that is another tutorial, one that will probably cross over into our discussion of GLMs.


The script:

function DoBetaSeries(rootdir, subjects, spmdir, seedroi, Conds, MapNames)
% function BetaSeries(rootdir, subjects, spmdir, seedroi, Conds, MapNames)
% INPUT:
%  rootdir  - Path to where subjects are stored
%  subjects - List of subjects (can concatenate with brackets)
%  spmdir   - Path to folder containing SPM file
%  seedroi  - Absolute path to file containing ROI in NIFTI format
%  Conds    - List of conditions for beta series analysis
%  MapNames - Output for list of maps, one per Conds cell
%       
%   Example use:
%       BetaSeries('/data/study1/fmri/', [101 102 103],
%       'model/BetaSeriesDir/', '/data/study1/Masks/RightM1.nii',
%       {'TapLeft'}, {'TapLeft_RightM1'})
%
%       conditions to be averaged together should be placed together in a cell
%       separate correlation maps will be made for each cell
%       Conds = {'Incongruent' 'Congruent' {'Error' 'Late'}}; 
%
%       For MapNames, there should be one per Conds cell above
%       e.g., with the Conds above, MapNames = {'Incongruent_Map',
%       'Congruent_Map', 'Error_Late_Map'}
%
%       Once you have z-maps, these can be entered into a 2nd-level
%       1-sample t-test, or contrasts can be taken between z-maps and these
%       contrasts can be taken to a 1-sample t-test as well.
%
% 

if nargin < 5
 disp('Need rootdir, subjects, spmdir, seedroi, Conds, MapNames. See "help BetaSeries" for more information.')
    return
end


%Find XYZ coordinates of ROI
Y = spm_read_vols(spm_vol(seedroi),1);
indx = find(Y>0);
[x,y,z] = ind2sub(size(Y),indx);
XYZ = [x y z]';


%Find each occurrence of a trial for a given condition
%These will be stacked together in the Betas array
for i = 1:length(subjects)
    subj = num2str(subjects(i));
    disp(['Loading SPM for subject ' subj]);
    %Can change the following line of code to CD to the directory
    %containing your SPM file, if your directory structure is different
    cd([rootdir subj filesep spmdir]);
    load SPM;
    
    for cond = 1:length(Conds)
        Betas = [];
        currCond = Conds{cond};
        if ~iscell(currCond)
            currCond = {currCond};
        end
        for j = 1:length(SPM.Vbeta) 
            for k = 1:length(currCond)
                if ~isempty(strfind(SPM.Vbeta(j).descrip,[currCond{k} '_'])) 
                    Betas = strvcat(Betas,SPM.Vbeta(j).fname);
                end
            end
        end
              

    %Extract beta series time course from ROI
    %This will be correlated with every other voxel in the brain
    if ischar(Betas)
        P = spm_vol(Betas);
    end

    est = spm_get_data(P,XYZ);
    est = nanmean(est,2);

        
        
        %----Do beta series correlation between ROI and rest of brain---%

            MapName = MapNames{cond};
            disp(['Performing beta series correlation for ' MapName]);
            
            Vin = spm_vol(Betas);
            nimgo = size(Vin,1);
            nslices = Vin(1).dim(3);

            % create new header files
            Vout_r = Vin(1);   
            Vout_p = Vin(1);
            [pth,nm,xt] = fileparts(deblank(Vin(1).fname));
            Vout_r.fname = fullfile(pth,[MapNames{cond} '_r.img']);
            Vout_p.fname = fullfile(pth,[MapNames{cond} '_p.img']);

            Vout_r.descrip = ['correlation map'];
            Vout_p.descrip = ['p-value map'];

            Vout_r.dt(1) = 16;
            Vout_p.dt(1) = 16;

            Vout_r = spm_create_vol(Vout_r);
            Vout_p = spm_create_vol(Vout_p);

            % Set up large matrix for holding image info
            % Organization is time by voxels
            slices = zeros([Vin(1).dim(1:2) nimgo]);
            stack = zeros([nimgo Vin(1).dim(1)]);
            out_r = zeros(Vin(1).dim);
            out_p = zeros(Vin(1).dim);


            for i = 1:nslices
                B = spm_matrix([0 0 i]);
                %creates plane x time
                for j = 1:nimgo
                    slices(:,:,j) = spm_slice_vol(Vin(j),B,Vin(1).dim(1:2),1);
                end

                for j = 1:Vin(1).dim(2)
                    stack = reshape(slices(:,j,:),[Vin(1).dim(1) nimgo])';
                    [r p] = corr(stack,est);
                    out_r(:,j,i) = r;
                    out_p(:,j,i) = p;

                end

                Vout_r = spm_write_plane(Vout_r,out_r(:,:,i),i);
                Vout_p = spm_write_plane(Vout_p,out_p(:,:,i),i);

            end

                
            %Convert correlation maps to z-scores
            %NOTE: If uneven number of trials in conditions you are
            %comparing, leave out the "./(1/sqrt(n-3)" term in the zdata
            %variable assignment, as this can bias towards conditions which
            %have more trials
            
                disp(['Converting correlation maps for subject ' subj ', condition ' MapNames{cond} ' to z-scores'])
                P = [MapNames{cond} '_r.img'];
                n = size(Betas,1);
                Q = MapNames{cond};
                Vin = spm_vol([MapNames{cond} '_r.img']);

                % create new header files
                Vout = Vin;   

                [pth,nm,xt] = fileparts(deblank(Vin(1).fname));
                Vout.fname = fullfile(pth,[Q '_z.img']);

                Vout.descrip = ['z map'];

                Vout = spm_create_vol(Vout);

                data = spm_read_vols(Vin);
                zdata = atanh(data)./(1/sqrt(n-3));

                spm_write_vol(Vout,zdata);

    end
end



This can either be copied and pasted into a script, or downloaded as a .m file here.




Unbiased FMRI Analysis: Leave One Subject Out

Neuroimaging researchers are incessantly bedeviled by the problem of biased region of interest (ROI) analysis. One is constantly lured by the siren song of significant results and large effect sizes radiating from the stygian depths of a non-independent ROI; and while one can at times point toward their use of independent ROIs from other studies, there is always the lurking suspicion that the researcher already knew where the activation was before the ROI was chosen. I have witnessed men, otherwise Samsons in the field and Solomons in counsel, who have had their heads shorn by the harlot of biased analysis.

The most straightforward and appropriate way to do this, of course, is with a region defined on a priori assumptions about where your quarry might lie, based on theory or based on the results of other studies. This ensures that any results extracted from that region are uninfluenced by the model used to generate the statistical maps, therefore circumventing the issue of "double-dipping", or circular analyses (see Kriegeskorte et al, 2009). Another method is to use anatomical regions based on atlases, which again should be motivated by theory.

However, there is yet another option that I was unaware of until recently: Leaving one subject out (LOSO). According to this procedure, non-independence can be mitigated by constructing a general linear model (GLM) with every subject in the study except for one; statistics such as beta weights, time courses, etc., can then be extracted from the resulting parametric map for the subject that was left out, as this subject is no longer contributing to the signal observed in the given region. This process is then repeated and the appropriate parameter extracted for each subject. It is unlikely that there will be perfect overlap between all of the subjects included in each LOSO analysis, but if the effect is real and robust, then it should survive each of these non-overlapping regions.

One consideration with this procedure is what threshold to use for each LOSO analysis. One approach is to hold the p-value constant, in which case a higher t-threshold is used for each analysis due to a reduction in the degrees of freedom. The other approach is to hold the t-value constant, leading to a slightly increased p-value. Both approaches are defensible, although if there are wide variations in the ROI results with each approach, one may want to reconsider the reliability of their finding.

More details can be found in the paper by Esterman et al (2010); I hope this provides the necessary edification and enlightenment for those benighted souls wading about in the filth of their own squalor.

SPM Design Specification and Estimation

At long last, after several long minutes - perhaps hours - or grueling preprocessing, you are ready to specify your general linear model. The concept is straightforward enough: Specify when a certain condition happened, input how long that condition took to happen (zero duration in the case of an instantaneous event), and what kind of basis function you want to convolve with that condition. Basis functions is a topic all on its own (you'll find out more when you're older!), but for the time being, realize that the canonical hemodynamic response function will suffice for most of your cases; even though sometimes it is a laughably wrong assumption about the shape of your hemodynamic response. But hey, it's the best we've got.

More details on the ins and outs of model specification, along with an example of what might be going through your head as you do this, can be found in the following video.