A Computational Model of Arbitration between Model-Based and Model-Free Learning (Featuring Django Unchained!)

Decision-making has fascinated both neuroscientists and economists for decades; and in particular, what makes this such an intriguing topic isn't when people are making good decisions, but when they are screwing up in major ways. Although making terrible decisions doesn't necessarily bar you from having success - just look at our past six or seven presidents - cutting down on terrible decisions can sometimes make your life easier, especially when it comes to avoiding decisions that could be bad for you, such as licking a steak knife.

A recent Neuron paper by Lee, Shimojo, and O'Doherty examined how the brain switches between relying on habitual actions to make decisions, versus generating a cognitive model of which decisions might be associated with which outcomes and then deciding based on a prediction about what should be optimal, similar to making a decision tree or flowchart outlining all the different possibilities associated with each action. These decision-making strategies are referred to as model-free and model-based decision systems, respectively; and reliance on only one system, especially in a context where that system might be inappropriate, would lead to inefficiencies and sometimes disastrous consequences, such as asking out your girlfriend's sister. O'Doherty, who seems to churn out high-impact journal papers with the effortlessness of a Pez Dispenser, has been working on these and related problems for a while; and this most recent publication, to me, represents an important step forward in computational modeling and how such decision-making processes are reified in the brain.

Before discussing the paper, let me clarify a couple of important distinctions about the word "errors," particularly since one of the layers of the model discussed in the paper calculates different kinds of error. When computational modelers talk about errors, they can mean several different things. The most common description of an error, however, is some sort of discrepancy between what an organism is trying to do, or what an individual is expecting, and what that organism actually does or actually receives. Errors of commission, which are simply screwing up or making an unintended mistake, have been extensively studied, especially in popular decision-making and reaction-time paradigms such as the Stroop task; but recently other forms of error have been defined, such as reward prediction error, which is the discrepancy between what was expected and what was actually received. The authors contrast this reward prediction error with a related concept called state prediction error, which is the discrepancy between an internal model of the environment and the actual state that someone is in. So, actions that are appropriate or likely to be rewarded in one state may no longer be valid once the state is detected to have shifted or somehow changed.
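To make the two kinds of error a little more concrete, here is a minimal Python sketch. It assumes a common textbook formulation (reward prediction error as received minus expected reward, and state prediction error as one minus the probability the internal model assigned to the state that actually occurred); the function names, state labels, and learning rate are all made up for illustration and are not taken from the paper.

```python
def reward_prediction_error(expected_reward, received_reward):
    """Discrepancy between the reward that was received and the one expected."""
    return received_reward - expected_reward


def state_prediction_error(transition_probs, observed_state):
    """How surprising the observed state is under the internal model.

    Assumed form: 1 minus the probability the model gave to that state.
    """
    return 1.0 - transition_probs[observed_state]


def update_transition_probs(transition_probs, observed_state, learning_rate):
    """Shift belief toward the observed state, keeping probabilities summing to 1."""
    spe = state_prediction_error(transition_probs, observed_state)
    # Shrink every belief, then add the error-driven boost to the observed state.
    updated = {s: p * (1.0 - learning_rate) for s, p in transition_probs.items()}
    updated[observed_state] = transition_probs[observed_state] + learning_rate * spe
    return updated


# Hypothetical beliefs about what happens next; labels are illustrative only.
beliefs = {"expected_state": 0.9, "surprising_state": 0.1}
surprise = state_prediction_error(beliefs, "surprising_state")
new_beliefs = update_transition_probs(beliefs, "surprising_state", learning_rate=0.5)
```

The point of the sketch is only that the two errors are computed over different things: one over rewards, the other over which state you ended up in.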

While this may sound like so much jargon and namby-pamby scientific argot, state prediction errors and reward prediction errors are actually all around us, if we have eyes to see. To take one example, near the end of Django Unchained, our protagonist, Django, has killed all of Calvin Candie's henchmen in a final climactic shootout in the CandyLand foyer. Stephen, thinking that Django has spent all six revolver rounds in the shootout - including a particularly sadistic dismemberment of Billy Crash - believes that he still has some options left open for dealing with Django, such as continuing to talk trash. However, when Django reveals that he has a second revolver, Stephen's internal model of his environment needs to update to take this new piece of information into account; actions that would have been plausible under the previous state he believed himself to be in are no longer viable.

A reward prediction error, on the other hand, can be observed in the second half of the scene, where Django lights the dynamite to demolish the CandyLand mansion. After walking some distance away from the manse, Django turns around to look at the explosion; clearly, he predicts the house to explode in an enormous fireball, and also predicts it to occur at a certain time. If the dynamite failed to go off, or if it went off far too early or too late, that would lead to a prediction error. This distinction between the binary occurrence or non-occurrence of an event and its timing has been detailed in a recent computational model of prediction and decision-making behavior by Alexander & Brown (2011), and also illustrates how a movie such as Django Unchained can not only provide wholesome entertainment for the whole family, but also serve as a teaching tool for learning models.

This brings us to the present paper, which attempted to locate where in the brain such an arbitration process is done in order to select a model-based or model-free decision system. A model-free system, as described above, takes less cognitive effort and control, since using habitual or "cached" behaviors to guide decisions is relatively quick and automatic; model-based systems, on the other hand, require more cognitive control and a mapping out of the prospective outcomes associated with each decision, but can be more useful than reflexive behaviors when more reflection is appropriate.
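For the programmers in the audience, the difference between "cached" and "mapped out" can be sketched in a few lines of Python. This is a toy illustration under my own assumptions, not the paper's model: the state names, probabilities, and values are hypothetical. The model-free system just looks up a stored number, while the model-based system recomputes the expected value from what it believes each action leads to.

```python
def model_free_value(cached_values, state, action):
    """Habitual system: look up a cached value, with no look-ahead."""
    return cached_values[(state, action)]


def model_based_value(transitions, outcome_values, state, action):
    """Planning system: expected value computed from a learned transition model."""
    return sum(
        prob * outcome_values[next_state]
        for next_state, prob in transitions[(state, action)].items()
    )


# Hypothetical numbers: "left" used to pay off, so its cached value is high.
cached_values = {("start", "left"): 0.8, ("start", "right"): 0.2}
transitions = {
    ("start", "left"): {"red_coin": 0.9, "blue_coin": 0.1},
    ("start", "right"): {"red_coin": 0.1, "blue_coin": 0.9},
}
# Suppose red coins have just stopped being worth anything.
outcome_values = {"red_coin": 0.0, "blue_coin": 1.0}

habit_left = model_free_value(cached_values, "start", "left")
plan_left = model_based_value(transitions, outcome_values, "start", "left")
plan_right = model_based_value(transitions, outcome_values, "start", "right")
```

Here the habit still rates "left" highly because nothing has overwritten the cache yet, whereas the planner immediately prefers "right" once the outcome values change. That flexibility is what the extra cognitive effort buys you.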

The task required participants to make either a left or right button press, which would make a new icon appear on the screen, and after a few button presses, a coin would appear. However, the coin was only rewarding in certain circumstances; in one condition, or "state," only certain colors of coins would be accepted and turned into rewards, while in the other condition, any type of coin would be rewarding. This was designed to favor either model-free or model-based control in certain situations. It also allowed the authors to compare how an arbitration model would correlate with behavior that is either more flexible under model-based conditions or more fixed under model-free conditions, using a dynamical threshold to shift behavior from model-based to model-free systems over time. The arbitration model also computes the reliability of the model-based and model-free systems, which is affected by each system's prediction errors on previous trials, to determine which should be implemented.
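As a rough illustration of the arbitration idea, here is a deliberately simplified Python sketch. To be clear about what is assumed: the paper's arbitrator is more elaborate than this, and the running-average reliability update, the sigmoid, and every parameter value below are my own stand-ins. The only thing carried over from the description above is the logic that small recent prediction errors make a system more reliable, and the more reliable system is more likely to be put in charge.

```python
import math


def update_reliability(reliability, prediction_error, learning_rate=0.2, max_error=1.0):
    """Running estimate of how accurate a system's recent predictions have been.

    Assumed form: accuracy is 1 for zero error and 0 for the largest possible error.
    """
    accuracy = 1.0 - min(abs(prediction_error), max_error) / max_error
    return reliability + learning_rate * (accuracy - reliability)


def prob_model_based(reliability_mb, reliability_mf, temperature=0.2):
    """Probability of handing control to the model-based system (assumed sigmoid)."""
    return 1.0 / (1.0 + math.exp(-(reliability_mb - reliability_mf) / temperature))


# Hypothetical run: the model-based system keeps predicting well,
# while the model-free system keeps getting surprised.
rel_mb, rel_mf = 0.5, 0.5
for _ in range(5):
    rel_mb = update_reliability(rel_mb, prediction_error=0.1)  # small state prediction errors
    rel_mf = update_reliability(rel_mf, prediction_error=0.9)  # large reward prediction errors
p_mb = prob_model_based(rel_mb, rel_mf)
```

After a few trials like these, the model-based reliability has climbed, the model-free reliability has dropped, and the probability of model-based control is above one half.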

Figure 2 from Lee et al., showing how prediction errors are computed and then used to calculate the reliability of either a model-based or model-free system, which in turn affects the probability of implementing either system.

The authors then regressed the computational signals against the fMRI data, in order to see where such computational signals would load onto observed brain activity during trials requiring either more or less model-based or model-free strategies. The reliability signals from the model-free and model-based systems were found to load on the inferior lateral PFC (ilPFC) and right frontopolar cortex (FPC), suggesting that these two cortical regions might be involved in the arbitration process to decide which system to implement, with the more reliable system being weighted more.

Figure 4, ibid, with panel A depicting orthogonal reliability signals for both model-based and model-free systems in bilateral ilPFC. Panel B shows a region of rostral anterior cingulate cortex associated with the difference in reliability between the two systems, and both the ilPFC and right FPC correlated with the highest reliability index for a particular trial for whichever system was implemented during that trial.

Next, a psychophysiological interaction (PPI) analysis was conducted to see whether signals in specific cortical or subcortical regions modulated the activity of model-free or model-based signals. This revealed that when the probability of a model-free state was high, there was a corresponding negative correlation between both the ilPFC and right FPC and regions of the putamen also observed to encode model-free signals. Significantly, no effects were found for the reverse condition, when the probability of model-based activity was high, suggesting that the arbitrator functions primarily by affecting the model-free system.

In total, these results suggest that reliability signals for different decision systems are modulated by activity in the frontocortical regions, and that signals for the model-based and model-free systems themselves are encoded by several different cortical regions, including the orbital PFC for model-based system activity, and supplementary motor area and dorsolateral PFC for model-free activity. In addition, the ventromedial PFC appears to encode a weighted signal of both model-based and model-free signals, tying together how subcortical and value-computing structures may influence the decision to either implement a model-based or model-free system, incorporating reliability information from frontopolar regions about which system should be used. Which, on the face of it, can be particularly useful when dealing with revolver-wielding, dynamite-planting psychopaths.
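The "weighted signal" idea can also be written down in a few lines. The sketch below is an assumption-laden toy rather than the authors' implementation: it mixes the two systems' action values according to the probability of model-based control and then turns the mixture into choice probabilities with a standard softmax. All names and numbers are hypothetical.

```python
import math


def integrated_values(q_model_based, q_model_free, p_model_based):
    """Mix the two systems' action values, weighted by P(model-based control)."""
    return {
        action: p_model_based * q_model_based[action]
        + (1.0 - p_model_based) * q_model_free[action]
        for action in q_model_based
    }


def softmax_choice_probs(values, inverse_temperature=3.0):
    """Convert action values into choice probabilities (standard softmax)."""
    highest = max(values.values())  # subtract the max for numerical stability
    exps = {a: math.exp(inverse_temperature * (v - highest)) for a, v in values.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}


# Hypothetical values: the planner prefers "right", the habit prefers "left".
q_mb = {"left": 0.1, "right": 0.9}
q_mf = {"left": 0.8, "right": 0.2}
mixed = integrated_values(q_mb, q_mf, p_model_based=0.5)
choice_probs = softmax_choice_probs(mixed)
```

With the weight set to 1 the mixture is just the model-based values, with 0 it is just the habit, and anything in between is a compromise whose balance is set by the arbitrator.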

Link to paper