Does unconscious racial bias affect trial judges?[1]

Professor J Rachlinski,[2] Professor S L Johnson,[3] A Wistrich,[4] and Professor C Guthrie[5] [6]

Race matters in the criminal justice system. Black defendants appear to fare worse than similarly situated white defendants. Why? Implicit bias is one possibility. Researchers, using a well-known measure called the Implicit Association Test, have found that most white Americans harbor implicit bias toward black Americans. Do judges, who are professionally committed to egalitarian norms, hold these same implicit biases? And if so, do these biases account for racially disparate outcomes in the criminal justice system? We explored these two research questions in a multi-part study involving a large sample of trial judges drawn from around the country. Our results — which are both discouraging and encouraging — raise profound issues for courts and society. We find that judges harbor the same kinds of implicit biases as others; that these biases can influence their judgment; but that given sufficient motivation, judges can compensate for the influence of these biases.

Justice is not blind.

Researchers have found that black defendants fare worse in court than do their white counterparts. In a study of bail-setting in Connecticut, for example, Ian Ayres and Joel Waldfogel found that judges set bail at amounts that were 25% higher for black defendants than for similarly situated white defendants.[7] In an analysis of judicial decision making under the Sentencing Reform Act of 1984, David Mustard found that federal judges imposed sentences on black Americans that were 12% longer than those imposed on comparable white defendants.[8] Finally, research on capital punishment shows that “killers of White victims are more likely to be sentenced to death than are killers of Black victims” and that “Black defendants are more likely than White defendants” to receive the death penalty.[9]

Understanding why racial disparities like these and others persist in the criminal justice system is vital. Only if we understand why black defendants fare less well than similarly situated white defendants can we determine how to address this deeply troubling problem.

Two potential sources of disparate treatment in court are explicit bias and implicit bias.[10] By explicit bias, we mean the kinds of bias that people knowingly — sometimes openly — embrace. Explicit bias exists and undoubtedly accounts for many of the racial disparities in the criminal justice system, but it is unlikely to be the sole culprit. Researchers have found a marked decline in explicit bias over time, even as disparities in outcomes persist.[11]

Implicit bias — by which we mean stereotypical associations so subtle that people who hold them might not even be aware of them — also appears to be an important source of racial disparities in the criminal justice system.[12] Researchers have found that most people, even those who embrace nondiscrimination norms, hold implicit biases that might lead them to treat black Americans in discriminatory ways.[13] If implicit bias is as common among judges as it is among the rest of the population, it might even account for more of the racially disparate outcomes in the criminal justice system than explicit bias.

In this article, we report the results of the first study of implicit racial bias among judges. We set out to explore whether judges hold implicit biases to the same extent as the general population and to determine whether those biases correlate with their decision making in court. Our results are both alarming and heartening:

(1) judges hold implicit racial biases;

(2) these biases can influence their judgment; and

(3) judges can, at least in some instances, compensate for their implicit biases.

Our article proceeds as follows. We begin, in “Implicit bias”, by introducing the research on implicit bias and its impact on behavior. In “The study design”, we briefly describe the methods of our study. We provide a much more detailed account in the Appendix. In “The study results”, we report our results and interpret them. Finally, in “Mitigating implicit bias in court”, we explore the implications of our results for the criminal justice system, identifying several possible measures for combating implicit racial bias.

Implicit bias

Psychologists have proposed that implicit biases might be responsible for many of the continuing racial disparities in society.[14] To assess the extent to which implicit biases account for racial disparities, researchers must first ascertain whether people hold implicit biases and then determine the extent to which implicit biases influence their actions.

Demonstrating implicit bias

In their efforts to assess whether people harbor implicit biases, psychologists have used a variety of methods.[15] Standing front and center among these methods, however, is the Implicit Association Test (IAT).[16] Developed by a research group led largely by Tony Greenwald, Mahzarin Banaji, and Brian Nosek, the IAT is the product of decades of research on bias and stereotypes[17] and has attracted enormous scholarly and popular attention.[18] More than 4.5 million people have taken the IAT.[19] The test takes different forms, but most commonly, it consists of a computer-based sorting task in which study participants pair words and faces. A typical administration of the “Race IAT” proceeds as follows.[20]

First, researchers present participants with a computer screen that has the words “white or good” in the upper left-hand corner of the screen and “black or bad” in the upper right. The researchers then inform the participants that one of four types of stimuli will appear in the center of the screen: white people’s faces, black people’s faces, good (positive) words, or bad (negative) words. The researchers then explain that the participants should press a designated key on the left side of the keyboard when a white face or a good word appears and press a designated key on the right side of the keyboard when a black face or a bad word appears. Researchers refer to the white/good and black/bad pairings as “stereotype-congruent”, because they are consistent with negative stereotypes associated with black Americans.[21] The participants complete several trials of this first task.

Then, the computer is programmed to switch the spatial location of “good” and “bad” so that the words “white or bad” appear in the upper left-hand corner and “black or good” appear in the upper right. The researchers explain to the participants that they are now supposed to press a designated key on the left side of the keyboard when a white face or a bad word appears and press a designated key on the right side of the keyboard when a black face or a good word appears. Researchers refer to these white/bad and black/good pairings as “stereotype-incongruent”, because they are inconsistent with the negative stereotypes associated with black Americans. The participants then complete several trials of this second task.[22]

Researchers have consistently found that white Americans express a strong “white preference” on the IAT.[23] They make this determination by comparing the amount of time it takes respondents to complete the two tasks identified above — that is, their “response latency”.[24] Most white Americans complete the first task (in which they sort white and good from black and bad) more quickly than the second (in which they sort black and good from white and bad).[25] In other words, most white Americans produce higher response latencies when faced with the stereotype-incongruent pairing (white/bad or black/good) than when faced with the stereotype-congruent pairing (white/good or black/bad).

Researchers have observed a different pattern of implicit biases among black Americans. Black Americans do not exhibit the same white preference that whites express, but neither do they show a mirror-image black preference.[26] Rather, black Americans express a much greater variation, with many expressing moderate to strong black preferences that are rarely found in white Americans.[27] But some also express white preferences — sometimes even strong ones.[28] On average, black Americans express a slight white preference, but the average masks wide variation in response.[29] Latinos also express a small white preference. Asian Americans show a white preference that is comparable to but somewhat weaker than that found in white Americans.[30]

The implications of the research using the IAT are a matter of some debate,[31] but the cognitive mechanisms underlying the research are clear enough. The white preference arises from well-established mnemonic links. Whites more closely associate white faces with positive words and black faces with negative words than the opposite. Thus, when they complete the white/good versus black/bad trials, they need only make a judgment about whether the stimulus that appears in the middle of the screen is positive or negative. The incongruent association, in contrast, requires that they first judge whether the stimulus is a word or a face and then decide on which side it belongs. Stereotype-incongruent associations interfere with the sorting task in much the same way that the use of green ink can make the word “blue” hard to read.[32]

The white preference on the IAT is well-documented among white Americans.[33] Researchers have conducted and published hundreds of academic studies, and several million people have participated in IAT research.[34] They have determined that the implicit biases documented through IAT research are not the product of the order in which people undertake the tasks, their handedness, or any other artifact of the experimental method.[35] The prevailing wisdom is that IAT scores reveal implicit or unconscious bias.[36]

Implicit bias and behaviour

Even if implicit bias is as widespread as the IAT studies suggest, it does not necessarily lead to, or explain, racially disparate treatment. Only if researchers can show that implicit bias influences decision makers can we infer that implicit bias is a cause of racial disparities.

Implicit bias, at least as measured by the IAT, appears to correlate with behavior in some settings. In a recent review, Greenwald and his colleagues identified 122 research reports assessing the relationship between IAT scores and observable behaviors;[37] of these, 32 involved “White-Black interracial behavior”.[38] Across these 32 studies, the researchers found a modest correlation of 0.24 between the implicit bias measures and the observed behaviors tested in the studies.[39] This means that implicit bias accounted for roughly 6% of the variation in actual behavior.[40]
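The 6% figure is simply the square of the reported correlation coefficient, which gives the proportion of variance explained:

$$ r^2 = (0.24)^2 \approx 0.058 \approx 6\% $$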

Six percent might not sound like much, but a 6% disparity could have an enormous impact on outcomes in the criminal justice system. In a typical year, judges preside over approximately 21 million criminal cases in state courts[41] and 70 000 in federal courts,[42] many of which involve black defendants. Throughout the processing of these cases, judges make many judgments concerning bail, pretrial motions, evidentiary issues, witness credibility, and so forth. Each of these judgments could be influenced by implicit biases, so the cumulative effect on bottom-line statistics like incarceration rates and sentence length is much larger than one might imagine.[43] Furthermore, 6% is only an average. Some judges likely hold extremely strong implicit biases. And some defendants are apt to trigger an unconscious bias to a much greater extent than others.[44] Even this seemingly small effect might harm tens or even hundreds of thousands of black defendants every year.
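To make the compounding point concrete, consider a purely illustrative calculation of our own (the studies cited above report variance explained, not a per-decision probability, so the following is a loose translation rather than a reported result). If each of ten independent rulings in a case carried even a 6% chance of being tainted by bias, the chance that at least one ruling in the case is affected would be

$$ 1 - (1 - 0.06)^{10} \approx 0.46 $$

That is, a per-decision effect that looks small in isolation approaches even odds over the course of a single case.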

Researchers have found, however, that people may have the ability to compensate for the effects of implicit bias.[45] If they are internally driven or otherwise motivated to suppress their own biases, people can make judgments free from biases,[46] even implicit ones.[47] In one recent study,[48] for example, a team of researchers administered the IAT to a group of physicians and asked them to diagnose and treat a hypothetical patient — identified to some of the physicians as a white man and to others as a black man — based on a description of symptoms.[49] The researchers found a correlation between IAT scores and treatment; the physicians with higher IAT scores were more likely to offer appropriate treatment to white patients than to black patients diagnosed with the same condition.[50] But among the 67 physicians who reported some awareness of the purpose of the study, those with higher IAT scores were more likely to recommend the treatment to black patients.[51] In other words, the doctors who were aware of the purpose of the study compensated for their implicit biases when the situation made them sensitive to the risk of behaving — or being observed to behave — in a biased way. “This suggests”, argue the authors, “that implicit bias can be recognised and modulated to counteract its effect on treatment decisions.”[52]

Jack Glaser and Eric Knowles found similar results in a study using the so-called “Shooter Task”.[53] In research of this type, subjects participate in a simulation akin to a video game in which they watch a person on screen pull either a gun or an innocent object, like a wallet, out of his pocket.[54] If he pulls a gun, the participants are instructed to “shoot” by pushing a button on a joystick; if he pulls a benign object, they are instructed to refrain from shooting.[55] Researchers have found that most white adults exhibit a “shooter bias” in that they are more likely to shoot a black target than a white target, regardless of what object the on-screen target pulls out of his pocket,[56] and that this effect correlates with a white preference on the IAT.[57] Glaser and Knowles found in their study, however, that those rare individuals who hold a white preference on the IAT yet are highly motivated to control prejudice were able to avoid the shooter bias.[58] In short, “those high in an implicit negative attitude toward prejudice show less influence of implicit stereotypes on automatic discrimination”.[59]

In sum, the research on implicit bias suggests that people exhibit implicit biases, that there is some evidence that implicit bias can influence behavior, and that people can overcome or compensate for implicit biases if properly motivated and if the racial context is made sufficiently salient. Whether and how this research applies to judges and the criminal justice system is an open question and one to which we turn in “The study design”.

The study design

We are aware of only two IAT studies exploring a behavior of direct interest to the criminal justice system. In one study, researchers found that college student subjects harboring a strong implicit bias in favour of whites imposed longer criminal sentences on a Latino defendant than on a white defendant.[60] In another study in Germany, researchers correlated implicit attitudes towards native Germans and Turkish immigrants among German college students with judgments of guilt of a Turkish defendant.[61] The researchers found a high correlation between negative associations with Turkish immigrants and judgments of guilt when the materials made “threatening” aspects of the Turkish defendant salient.[62] Though suggestive, these studies, standing alone, do not tell us much about implicit bias in the criminal justice system. Most importantly, they tell us nothing about a central actor in the system: the judge. Do judges hold implicit racial biases? If so, do those biases affect their judgments in court? We sought to answer these two questions in our study.[63]

Judges

We recruited judges to participate in our study at judicial education conferences, as we have in our prior work.[64] The 133 judges who participated in our study came from three different jurisdictions.[65] The judges asked us not to identify their jurisdictions,[66] but we can describe the basic characteristics of each of the three. We recruited 70 judges from a large urban center in the eastern United States.[67] These 70 judges, who are appointed to the bench for renewable terms, constitute roughly three-quarters of the judges who sit in this jurisdiction. We recruited 45 judges from a large urban center in the western United States.[68] These 45 judges, who are appointed to the bench but then stand for election, make up roughly half of the judges in their jurisdiction. We recruited our final group of judges at an optional session at a regional conference. These 18 judges, who sit in various towns and cities throughout the state in which the conference was held, are appointed to the bench but are then required to stand for election.[69]

We did not ask the judges to identify themselves by name, but we did ask them to identify their race, gender, exact title, political affiliation, and years of experience on the bench.[70] Table 1 summarises the demographic information that the judges provided. As Table 1 indicates, our sample of judges, particularly those from the eastern jurisdiction, is fairly diverse, at least in terms of gender and race.

Table 1: Demographic information of the judges (percentage within group and number)

Demographic parameter | Eastern jurisdiction (70) | Western jurisdiction (45) | Optional conference (18) | Overall (133)
Race: White | 52.9 (37) | 80.0 (36) | 66.7 (12) | 63.9 (85)
Race: Black | 42.9 (30) | 4.4 (2) | 5.6 (1) | 24.8 (33)
Race: Latino | 4.3 (3) | 11.1 (5) | 16.7 (3) | 8.3 (11)
Race: Asian | 0.0 (0) | 4.4 (2) | 11.1 (2) | 3.0 (4)
Gender: Male | 55.7 (39) | 66.7 (30) | 50.0 (9) | 58.7 (78)
Gender: Female | 44.3 (31) | 33.3 (15) | 50.0 (9) | 41.4 (55)
Political affiliation: Democrat | 86.6 (58) | 64.4 (29) | 64.7 (11) | 76.0 (98)
Political affiliation: Republican | 13.4 (9) | 35.6 (16) | 35.3 (7) | 24.0 (31)
Average years of experience | 9.8 | 10.8 | 9.3 | 10.1

Methods and materials

To explore the two questions animating this article — that is, whether judges hold implicit racial biases, and if so, whether those biases produce biased judicial decisions — we designed a multipart study requiring the participating judges to complete computer tasks[71] and then to respond to a paper questionnaire.

We proceeded as follows. We placed in front of each judge a laptop computer and a questionnaire. The computer screen and the front page of the questionnaire introduced the study and asked the judges to await instruction before beginning.[72] Once the judges were fully assembled, we announced, “Today, we shall ask you to participate actively in your own education”.[73]

We asked the judges to complete the computer tasks and to respond to the questionnaire according to the instructions provided. We assured the judges that their responses were anonymous and that we had no way of identifying them individually, but we also made clear that participation was entirely voluntary and that any judge who wanted to exclude her results from the study could do so. (Only one judge chose to do so.) We informed the judges that we would compile their cumulative results and share them with the group at the end of the session.

With these important preliminaries out of the way, we then asked the judges to begin the study. The study included a race IAT;[74] two hypothetical vignettes in which the race of the defendant was not explicitly identified but was subliminally primed; and another hypothetical vignette in which the race of the defendant was made explicit.[75] The final page of the questionnaire asked judges to provide the basic demographic information identified above.[76]

The study results

We present the results in two parts. First, we report the judges’ IAT scores, which demonstrate that judges, like the rest of us, harbor implicit racial biases. Second, we report the results of our judicial decision-making studies, which show that implicit biases can influence judicial decision making but can also be overcome, at least in our experimental setting.[77]

The Implicit Association Test

To measure implicit associations involving race, we gave the judges a computer-based race IAT comparable to the race IAT given to millions of study participants around the world.[78] We asked the judges to perform the two rounds of the IAT described above. The first required them to pair white faces with positive words and black faces with negative words. In other words, the first round required them to select stereotype-congruent pairings. The second required them to pair white faces with negative words and black faces with positive words. In other words, the second round required them to select stereotype-incongruent pairings.[79]

To determine each judge’s implicit bias score, we performed two calculations. First, we subtracted each judge’s average response latency in the stereotype-congruent round from that judge’s average response latency in the stereotype-incongruent round to calculate the IAT measure. This measure reflects the most commonly used scoring method for large samples of data collected on the Internet, and hence allows us to compare judges to ordinary adults.[80] Second, we constructed a standardised measure consisting of the average difference in response latencies for each judge divided by the standard deviation of that judge’s response latencies in the target rounds. This measure is less commonly reported, but it is more stable and produces higher correlations with other behaviors.[81]
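For concreteness, the two scoring calculations can be sketched in a few lines of Python, assuming each judge’s per-trial latencies (in milliseconds) are available as plain lists; the function and variable names are ours, not the study’s:

    import statistics

    def iat_scores(congruent_ms, incongruent_ms):
        """Compute the two IAT measures described in the text.

        congruent_ms:   per-trial latencies from the stereotype-congruent round
        incongruent_ms: per-trial latencies from the stereotype-incongruent round
        Positive scores indicate faster responses on the congruent
        (white/good, black/bad) round, that is, a white preference.
        """
        # Measure 1: difference of the average response latencies.
        raw_diff = statistics.mean(incongruent_ms) - statistics.mean(congruent_ms)

        # Measure 2: the same difference standardised by the judge's own
        # latency variability across the two target rounds.
        pooled_sd = statistics.stdev(list(congruent_ms) + list(incongruent_ms))
        return raw_diff, raw_diff / pooled_sd

    # Made-up latencies (in milliseconds) for a single judge:
    print(iat_scores([612, 588, 655, 701, 590, 640],
                     [803, 781, 910, 765, 842, 798]))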

We found a strong white preference among the white judges, as shown in Table 2. Among the 85 white judges, 74 (or 87.1%) showed a white preference on the IAT. Overall, the white judges performed the stereotype-congruent trial (white/good and black/bad) 216 milliseconds faster than the stereotype-incongruent trial (black/good and white/bad). The black judges, by contrast, demonstrated no clear preference overall. Although 19 of 43 (or 44.2%) showed a white preference, the black judges performed the stereotype-congruent trial (white/good and black/bad) a mere 26 milliseconds faster than the stereotype-incongruent trial (black/good and white/bad). Comparing the mean IAT scores of the white judges with those of the black judges revealed that the white judges expressed a significantly larger white preference.[82]

Table 2: Results of race IAT by race of judge

Race of judge (sample size) | Mean IAT score in milliseconds (and standard deviation)*: judges | Mean IAT score in milliseconds (and standard deviation)*: Internet sample | Percent of judges with lower average latencies on the white/good versus black/bad round
White (85) | 216 (201) | 158 (224) | 87.1
Black (43) | 26 (208) | 39 (244) | 44.2

Note: * Positive numbers indicate lower latencies on the white/good versus black/bad round.

Because we used a commonly administered version of the IAT, we are able to compare the results of our study to the results of other studies involving ordinary adults. We found that the black judges produced IAT scores comparable to those observed in the sample of black subjects obtained on the Internet.[83] The white judges, on the other hand, demonstrated a statistically significantly stronger white preference than that observed among a sample of white subjects obtained on the Internet.[84] For two reasons, however, this does not necessarily mean that the white judges harbor more intense white preferences than the general population. First, we did not vary the order in which we presented the materials, and this order effect could have led to artificially higher IAT scores.[85] Second, the judges performed both trials much more slowly than the other adults with whom we are making this comparison, and this, too, could have led to artificially higher IAT scores.[86] We also suspect that the judges were older, on average, than the Internet sample. To the extent that implicit racial bias is less pronounced among younger people, we would expect the judges to exhibit more implicit bias than the Internet sample.

IAT and judicial behaviour

To assess the impact of implicit bias on judicial decision making, we gave the judges three hypothetical cases: the first involving a juvenile shoplifter, the second involving a juvenile robber, and the third involving a battery. We speculated that the judges might respond differently depending upon whether we made the race of the defendant salient, so in the first two cases, we did not identify the race of the defendant explicitly, but we did so implicitly through a subliminal priming technique described below. In the third case, we made race explicit, informing some of the judges that the defendant was “Caucasian” and others that he was “African American”.[87] By comparing the judges’ individual IAT scores with their judgments in these hypothetical cases, we are able to assess whether implicit bias correlates with racially disparate outcomes in court.

Race primed

We asked the judges to decide two hypothetical cases, one involving a juvenile shoplifter and one involving a juvenile armed robber. Before giving the judges the scenarios, though, we asked them to perform a subliminal priming task, following a protocol developed by Sandra Graham and Brian Lowery.[88] The task appeared to be a simple, computer-based, spatial recognition task.[89] To complete the task, the judges were required to focus their attention on the center of the computer screen in front of them. Words appeared in one of the four corners for 153 milliseconds before being masked by a string of random letters.[90] At that speed, words are extremely difficult to process consciously.[91] Each judge saw 60 words. Half of the judges saw words associated with black Americans,[92] and half saw words with no common theme.[93] After the 60th trial, the task stopped.[94] The computer screen then instructed the judges to turn to the written materials.[95]
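The structure of the priming task is simple enough to sketch. In the minimal Python sketch below, the timing constant and trial count come from the text, but the word lists are placeholders (the actual stimuli appear in the study’s footnotes, not in this excerpt) and the display calls are stubbed out:

    import random

    PRIME_MS = 153  # each word is displayed for 153 milliseconds (from the text)
    N_TRIALS = 60   # each judge saw 60 words
    CORNERS = ["top-left", "top-right", "bottom-left", "bottom-right"]

    # Placeholder stimuli: the actual word lists appear in the study's
    # footnotes, not in this excerpt, so these entries are illustrative only.
    BLACK_ASSOCIATED = ["prime_word_1", "prime_word_2", "prime_word_3"]
    NEUTRAL = ["neutral_word_1", "neutral_word_2", "neutral_word_3"]

    def mask(word):
        """Mask the prime with a string of random letters of the same length."""
        return "".join(random.choice("BCDFGHJKLMNPQRSTVWXZ") for _ in word)

    def priming_trials(condition):
        """Yield the 60 (corner, word, mask, duration) tuples for one judge."""
        words = BLACK_ASSOCIATED if condition == "black" else NEUTRAL
        for _ in range(N_TRIALS):
            word = random.choice(words)
            # In the real task the word is drawn at the chosen corner for
            # PRIME_MS milliseconds and then immediately replaced by its mask.
            yield random.choice(CORNERS), word, mask(word), PRIME_MS

    for trial in priming_trials("black"):
        pass  # a real display loop would render each trial here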

(a) The Shoplifter case

We first presented the judges with a scenario called the “Shoplifter case”. The judges learned that William, a 13-year-old with no prior criminal record, had been arrested for shoplifting several toys from a large, upscale toy store.[96] The judges read that there is some conflicting evidence on the degree to which William resisted arrest, but there is no dispute over the fact that he had shoplifted.[97]

Following the scenario, we asked the judges three questions about William. First, we asked them what disposition they thought most appropriate. We listed seven options below the question, ranging from a dismissal of the case to a transfer to adult court.[98] Second, we asked judges to predict on a seven-point scale (from “Not at all Likely” to “Very Likely”) whether William would commit a similar crime in the future. And finally, we asked them to predict on an identical seven-point scale the likelihood that William would commit a more serious crime in the future. In short, we asked them one question about sentencing and two questions about recidivism.

The judges’ determinations were not influenced by race. As shown in Table 3, judges primed with the black-associated words did not produce significantly different judgments than the judges primed with the neutral words.[99] Our primary interest, however, was in determining whether the judges’ implicit biases correlated with their judgments. We found that the judges’ scores on the race IAT had a marginally significant effect on how the prime influenced their judgment.[100] Judges who exhibited a white preference on the IAT gave harsher sentences when they had been primed with black-associated words rather than neutral words, while judges who exhibited a black preference on the IAT gave less harsh sentences when they had been primed with black-associated words rather than neutral words. We did not find any significant relationship between the judges’ IAT scores and either of the recidivism measures, although the data showed a similar trend.[101]

Table 3: Average results on juvenile shoplifter (all three questions on a seven-point scale; higher numbers indicate harsher judgments*)

Prime (and n) | Q1: Disposition | Q2: Recidivism — same crime | Q3: Recidivism — more serious crime
Black (63) | 2.34 | 2.58 | 2.23
Neutral (70) | 2.40 | 2.36 | 1.94

Note: * The seven-point scales for questions 2 and 3 have been transposed from the original for this table, so that higher numbers consistently indicate harsher judgments.

(b) The Robbery case

The second scenario, called the “Robbery case”, described Michael, who was arrested for armed robbery at a gas station convenience store two days shy of his 17th birthday.[102] Michael, who had previously been arrested for a fight in the school lunchroom, threatened the clerk at the convenience store with a gun and made off with $267 in cash. He admitted the crime, claiming that his friends had dared him to do it. After they had read this scenario, we asked the judges the same three questions we asked them about William in the shoplifter case.

Again, the judges’ determinations were not influenced by race. As shown in Table 4, the judges primed with black-associated words did not produce significantly different ratings than the judges primed with the neutral words.[103] As noted, however, our primary interest was in the relationship between implicit bias and these judgments. As with the shoplifting case, the judges’ scores on the race IAT had a marginally significant effect on how the prime influenced their judgment in the robbery case.[104] Judges who exhibited a white preference on the IAT gave harsher sentences when they had been primed with black-associated words rather than neutral words, while judges who exhibited a black preference on the IAT gave less harsh sentences when they had been primed with black-associated words rather than neutral words. We did not find any significant relationship between the judges’ IAT scores and either of the recidivism measures, although the data showed a similar trend.[105]

To summarise, we found no overall difference between those judges primed with black-associated words and those primed with race-neutral words. This finding contrasts sharply with research conducted by Graham and Lowery, who found that police and parole officers primed with black-associated words were more likely than those primed with neutral words to make harsh judgments of juvenile offenders.[106] The officers who had seen the black-associated words deemed the juveniles more culpable, more likely to recidivate, and more deserving of a harsh punishment.[107]

Table 4: Average results on juvenile armed robber (all three questions on a seven-point scale; higher numbers indicate harsher judgments*)

Prime (and n) | Q1: Disposition | Q2: Recidivism — same crime | Q3: Recidivism — more serious crime
Black (63) | 4.92 | 3.54 | 3.17
Neutral (70) | 4.97 | 3.61 | 3.48

Note: * The seven-point scales for questions 2 and 3 have been transposed from the original for this table, so that higher numbers consistently indicate harsher judgments.

The overall lack of an effect of the racial prime, however, gives us little reason to conclude that the judges were not affected by their unconscious racial biases. We found in both the shoplifter case and the robbery case that judges who expressed a white preference on the IAT were somewhat more likely to impose harsher penalties when primed with black-associated words than when primed with neutral words, while judges who expressed a black preference on the IAT reacted in an opposite fashion to the priming conditions.

To be sure, we did not find a significant relationship between IAT scores and the judges’ judgments of recidivism. That is, white preferences on the IAT did not lead judges primed with words associated with black Americans to predict higher recidivism rates. The judges made fairly race-neutral assessments of the two defendants’ character. This result suggests that the correlation we found between IAT score and sentence might not be robust. But, of course, a judge’s neutral assessment of character would be small comfort to a juvenile defendant who received an excessive sentence due to his race.

Race made explicit

In the two primed cases, we did not explicitly provide any information about the race of the defendant (although judges might well have made assumptions about it). This is an important limitation, because judges will commonly be aware of the race of the defendant appearing in front of them. To address this concern, we also gave our judges a hypothetical vignette in which we made race explicit. To enable comparison with another study, we used a vignette developed by Samuel Sommers and Phoebe Ellsworth.[108]

We asked the judges to imagine they were presiding over a bench trial in which the prosecution charges Andre Barkley, a high school basketball player, with battering his teammate, Matthew Clinton. There is no question that Barkley injured Clinton, but Barkley claims, somewhat incredibly, that he was only acting in self-defence. We informed some of the judges that the defendant was an African American male and that the victim was a Caucasian male. We informed the rest of the judges that the defendant was Caucasian and that the victim was African American. Following the scenario, we asked all of the judges to render a verdict and to rate their confidence in their judgment on a nine-point scale (from “Very Confident” to “Not at all Confident”).[109]

We found that the white judges were equally willing to convict the defendant whether he was identified as Caucasian or as African American. Among the white judges who read about an African American defendant, 73% (33 out of 45) said they would convict, whereas 80% (35 out of 44) of the white judges who read about a Caucasian defendant said that they would convict.[110] This contrasts sharply with the results obtained by Sommers and Ellsworth, who used only white participants. They found that 90% of the participants in their study who read about an African American defendant said that they would convict as compared to 70% of the participants who read about a Caucasian defendant.[111] On the other hand, we found that black judges were significantly more willing to convict the defendant when he was identified as Caucasian rather than as African American. When the defendant was identified as Caucasian, 92% (24 out of 26) of the black judges voted to convict; when he was identified as African American, however, only 50% (9 out of 18) voted to convict. The difference between the white judges and the black judges is statistically significant.[112] Analysis of the judges’ assessments of their confidence in their verdicts produced similar results.[113]

The focus of this study, however, is on the relationship between implicit bias and judgment. As above, we wanted to assess the effect of the interaction between the judges’ IAT scores and the race of the defendant on the judges’ verdicts. Unlike our results in the first study, however, we did not find even a marginally significant interaction here.[114] Judges who exhibited strong white preferences on the IAT did not judge the white and black defendants differently, and neither did judges who expressed black preferences on the IAT. Analysis of the confidence ratings produced the same result.[115]

Because the white judges and the black judges reacted differently to the problem, we also conducted an analysis to account for these differences. To do this, we assessed the interaction between the race of the defendant and the IAT score, along with the race of the judge.[116] The three-way interaction between race of judge, race of defendant, and IAT score was significant.[117] This result means that the IAT scores of the black judges and the white judges had different effects on the judges’ reactions to the race of the defendant, as we explain below in further analyses. Analysis of the confidence ratings produced similar results.[118]

To allow us to interpret this interaction, we ran the simpler analysis separately for black and white judges. That is, we assessed the interaction between the IAT score and the race of the defendant in two separate analyses. With respect to the white judges, we found no significant results; if anything, the white judges with a greater white preference expressed a greater propensity to convict the Caucasian defendant than the African American defendant.[119] Among black judges, however, those who expressed a stronger black preference on the IAT were less likely to convict the African American defendant relative to the Caucasian defendant.[120] An analysis of confidence ratings produced similar results.[121]
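As a minimal sketch of how interaction analyses of this kind might be specified, the following fits a logistic regression of verdict on IAT score, defendant race, and judge race. The data are synthetic and the column names are ours, so this illustrates the modeling logic rather than reproducing the study’s actual analysis:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    n = 400  # synthetic respondents, for illustration only

    df = pd.DataFrame({
        "iat_score": rng.normal(0, 1, n),          # standardised IAT score
        "black_defendant": rng.integers(0, 2, n),  # 1 = African American defendant
        "black_judge": rng.integers(0, 2, n),      # 1 = black judge
    })
    # Build synthetic verdicts with a deliberate three-way interaction so
    # the model has a pattern to detect.
    linpred = (0.5 - 1.5 * df["iat_score"] * df["black_defendant"]
               * df["black_judge"])
    df["convict"] = (rng.random(n) < 1 / (1 + np.exp(-linpred))).astype(int)

    # Three-way interaction: does the effect of defendant race depend on the
    # judge's IAT score, and does that dependence differ by judge race?
    full = smf.logit("convict ~ iat_score * black_defendant * black_judge",
                     data=df).fit()
    print(full.summary())

    # As in the text, follow up with the simpler IAT-by-defendant-race
    # interaction, run separately for black and white judges.
    for judge_race, group in df.groupby("black_judge"):
        sub = smf.logit("convict ~ iat_score * black_defendant",
                        data=group).fit(disp=False)
        print(judge_race, sub.params["iat_score:black_defendant"])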

The findings among black judges can best be seen by dividing the black judges into two groups: those who expressed a black preference on the IAT and those who expressed a white preference on the IAT. Among those black judges who expressed a black preference, 100% (14 out of 14) voted to convict the Caucasian defendant, while only 40% (4 out of 10) of these judges voted to convict the African American defendant. Among those black judges who expressed a white preference, 83% (10 out of 12) voted to convict the Caucasian defendant, while 63% (5 out of 8) voted to convict the African American defendant. In effect, the black judges who expressed white preferences made verdict choices similar to those of their white colleagues, while black judges who expressed a black preference treated the African American defendant more leniently.

In sum, then, IAT scores predicted nothing among the white judges. Among the black judges, however, a black preference on the IAT was associated with a willingness to acquit the black defendant.

Interpretation of results

Our research supports three conclusions. First, judges, like the rest of us, carry implicit biases concerning race. Second, these implicit biases can affect judges’ judgment, at least in contexts where judges are unaware of a need to monitor their decisions for racial bias. Third, and conversely, when judges are aware of a need to monitor their own responses for the influence of implicit racial biases, and are motivated to suppress that bias, they appear able to do so.

Our first conclusion was perhaps the most predictable, though it is still troubling. Given the large number of Americans who have taken the IAT, and given the frequency with which white Americans display at least a moderate automatic preference for white over black, it would have been surprising if white judges had failed to exhibit the same automatic preference. Similarly, the black judges carry a more diverse array of implicit biases, just like black adults generally: some exhibit a white preference just like the white judges; others exhibit no preference; and some exhibit a black preference. Overall, like adults generally, most of the judges — white and black — showed a moderate-to-large degree of implicit bias in one direction or the other. If ordinary adults carry a “bigot in the brain”, as one recent article put it,[122] then our data suggest that an invidious homunculus might reside in the heads of most judges in the United States, with the potential to produce racially biased distortions in the administration of justice.

It is worth noting, however, that the research on so-called “chronic egalitarians” suggests that this result was not inevitable. Some whites with longstanding and intense personal commitments to eradicating bias in themselves — chronic egalitarians — do not exhibit the preference for whites over blacks on the IAT that most white adults show.[123] Despite their professional commitment to the equal application of the law, judges do not appear to have the same habits of mind as the chronic egalitarians. The proportion of white judges in our study who revealed automatic associations of white with good and black with bad was, if anything, slightly higher than the proportion found in the online surveys of white Americans. Thus, a professional commitment to equality, unlike a personal commitment to the same ideal, appears to have limited impact on automatic racial associations, at least among the judges in our study. Alternatively, the over-representation of black Americans among the criminal defendants who appear in front of judges might produce invidious associations that overwhelm their professional commitment. In either case, our findings are consistent with the implicit associations found among capital defence attorneys. White capital defence attorneys, another group which might be expected to have strong professional commitments to the norm of racial equality,[124] exhibit the same automatic preference for whites as the general population.[125]

Taken together, then, the research on judges and capital defence attorneys raises serious concerns about the role that unconscious bias might play in the criminal justice system. Jurors are drawn from randomly selected adults, and a majority of white jurors will harbor implicit white preferences. If police, prosecutors, jurors, judges, and defence attorneys all harbor anti-black preferences, then the system would appear to have limited safeguards to protect black defendants from bias. Based on IAT scores alone, both black judges and black jurors seem to be less biased than either white judges or white jurors, because black Americans show less implicit bias than white Americans. But even considerable numbers of blacks express implicit biases. Perhaps the only entity in the system that might avoid the influence of the bigot in the brain is a diversely composed jury.

That said, the rest of our results call into question the importance of IAT scores alone as a metric to evaluate the potential bias of decision makers in the legal system. Our second and third conclusions show that implicit biases can translate into biased decision making under certain circumstances, but that they do not do so consistently.

Implicit associations influenced judges — both black judges and white judges — when we manipulated the race of the defendant by subliminal methods. Judges with strong white preferences on the IAT made somewhat harsher judgments of the juvenile defendants after being exposed to the black subliminal prime, and judges with strong black preferences on the IAT were somewhat more lenient after exposure to the black subliminal prime. In effect, the subliminal processes triggered unconscious bias, and in just the way that might be expected.

The story for the explicit manipulation of race is more complicated, however. The white judges, unlike the white adults in the Sommers and Ellsworth study,[126] treated African American and Caucasian defendants comparably. But the proper interpretation of this finding is unclear. We observed a trend among the white judges in that the higher their white preference, the more favorably they treated the African American defendant in the battery case. Thus, among the white judges, implicit bias did not translate into racial disparities when the race of the defendant was clearly identified in an experimental setting.

We believe that the data demonstrate that the white judges were attempting to compensate for unconscious racial biases in their decision making. These judges were, we believe, highly motivated to avoid making biased judgments, at least in our study. Codes of judicial conduct demand that judges make unbiased decisions.[127] Moreover, impartiality is a prominent element in almost every widely accepted definition of the judicial role.[128] Judges take these norms seriously. When the materials identified the race of the defendant in a prominent way, the white judges probably engaged in cognitive correction to avoid the appearance of bias.

The white judges in our study behaved much like the subjects in other studies who were highly motivated to avoid bias in performing an assigned task.[129] What made our white judges different from the subjects studied by these other researchers is that most of the judges reported that they suspected racial bias was being studied, despite the fact that the only cue they received was the explicit mention of the defendant’s race.[130] We think this report was truthful, given that the judges behaved the same way as other white subjects who attempted to avoid the influence of implicit bias.

The black judges responded somewhat differently to the overt labeling of the defendant’s race. Like the white judges, the black judges in our study also reported being aware of the subject of the study, yet they showed a correlation between implicit associations and judgment when race was explicitly manipulated. Among these judges, a greater white preference produced a greater propensity to convict the African American defendant. In other words, the black judges clearly reacted differently when they were conscious that race was being manipulated — a difference that correlated with their score on the race IAT.

We do not conclude, however, that black judges are less concerned about avoiding biased decision making than white judges. We have no doubt that the professional norms against bias concern the black judges just as deeply as their white counterparts — if not more so. And we are mindful that research on the effect of race on judges’ decisions in actual cases demonstrates no clear effects.[131] We believe that both white and black judges were motivated to avoid showing racial bias.

Why, then, did the black judges produce different results? We can only speculate, but we suspect that both groups of judges were keen to avoid appearing to favor the white defendant (or conversely, wanted to avoid appearing to disfavor the black defendant). Black judges, however, might have been less concerned than the white judges with appearing to favor the black defendant. Those black judges who expressed a white preference, by contrast, behaved more like their white counterparts in this regard, thereby producing a correlation between verdict and IAT score among black judges.

We also cannot ignore the possibility that the judges were reacting to the race of the victim, rather than (or in addition to) the race of the defendant. In all cases, we identified the victim as being of the opposite race to the defendant. Furthermore, black judges might have reacted differently to the fact that the case involved a cross-racial crime.

Given our results, we cannot definitively ascribe continuing racial disparities in the criminal justice system to unconscious bias. We nevertheless can draw some firm conclusions. First, implicit biases are widespread among judges. Second, these biases can influence their judgment. Finally, judges seem to be aware of the potential for bias in themselves and possess the cognitive skills necessary to avoid its influence. When they are motivated to avoid the appearance of bias, and face clear cues that risk a charge of bias, they can compensate for implicit bias.

Whether the judges engage their abilities to avoid bias on a continual basis in their own courtrooms, however, is unclear. Judges are subject to the same significant professional norms to avoid prejudice in their courtrooms that they carried with them into our study. And judges might well point to our study as evidence that they avoid bias in their own courtrooms, where the race of defendants is often reasonably clear and they never face subliminal cues. But courtrooms can be busy places that do not afford judges the time necessary to engage the corrective cognitive mechanisms that they seem to possess. And even though many decisions are made on the papers only, judges might unwittingly react to names or neighborhoods that are associated with certain races. Control of implicit bias requires active, conscious effort.[132] Judges who, due to time pressure or other distractions, do not actively engage in an effort to control the “bigot in the brain” are apt to behave just as the judges in our study did when we subliminally primed them with race-related words. Moreover, our data does not permit us to determine whether a desire to control bias or to avoid the appearance of bias motivates judges in their courtrooms the way it seemed to in our study.

Furthermore, judges might be over-confident about their abilities to control their own biases. In recently collected data, we asked a group of judges attending an educational conference to rate their ability to “avoid racial prejudice in decision making” relative to other judges who were attending the same conference. Ninety-seven percent (35 out of 36) of the judges placed themselves in the top half and 50% (18 out of 36) placed themselves in the top quartile, even though by definition, only 50% can be above the median, and only 25% can be in the top quartile.[133] We worry that this result means that judges are over-confident about their ability to avoid the influence of race and hence fail to engage in corrective processes on all occasions.

To be sure, this is only one study, and it has its limitations. The results might be the product of the particular judges who participated in our study, or the materials we used, or even the fact that hypothetical scenarios were used. Most importantly, we cannot determine whether the mental processes of judges on the bench more closely resemble those of judges subliminally primed with race or those for whom race was explicitly manipulated. Thus, it is not clear how implicit racial bias influences judicial decision making in court, but our study suggests, at a minimum, that there is a sizeable risk of such influence, so we turn in “Mitigating implicit bias in court” to reforms the criminal justice system might consider implementing.

Mitigating implicit bias in court

To minimise the risk that unconscious or implicit bias will lead to biased decisions in court, the criminal justice system could take several steps. These include exposing judges to stereotype-incongruent models, providing testing and training, auditing judicial decisions, and altering courtroom practices. Taking these steps would both facilitate the reduction of unconscious biases and encourage judges to use their abilities to compensate for those biases.

Exposure to stereotype-incongruent models

Several scholars have suggested that society might try to reduce the presence of unconscious biases by exposing decision makers to stereotype-incongruent models.[134] This suggestion, in fact, probably represents the dominant policy proposal among legal scholars who write about unconscious bias.[135] We certainly agree, for example, that posting a portrait of President Obama alongside the parade of mostly white male judges in many courtrooms would be an inexpensive, laudable intervention.

Our results, however, also raise questions about the effectiveness of this proposal. The white judges from the eastern jurisdiction in our study showed a strong set of implicit biases, even though the jurisdiction consists of roughly half white judges and half black judges. Indeed, the level of implicit bias in this group of judges was only slightly smaller than that of the western jurisdiction, which included only two black judges (along with 36 white, five Latino, and two Asian judges). Exposure to a group of esteemed black colleagues apparently is not enough to counteract the societal influences that lead to implicit biases.

Consciously attempting to change implicit associations might be too difficult for judges. Most judges have little control over their dockets, which tend to include an over-representation of black criminal defendants.[136] Frequent exposure to black criminal defendants is apt to perpetuate negative associations with black Americans. This exposure perhaps explains why capital defence attorneys harbor negative associations with blacks,[137] and might explain why we found slightly greater negative associations among the white judges than the population as a whole (although as we noted above, the latter finding might have other causes).

Testing and training

The criminal justice system might test candidates for judicial office using the IAT or other devices to determine whether they possess implicit biases. We do not suggest that people who display strong white preferences on the IAT should be barred from serving as judges, nor do we even support using the IAT as a measure of qualification to serve on the bench.[138] The direct link between IAT score and decision making is far too tenuous for such a radical recommendation. And our data shows that judges can overcome these implicit biases at least to some extent and under some circumstances. Rather, knowing a judge’s IAT score might serve two other purposes. First, it might help newly elected or appointed judges understand the extent to which they have implicit biases and alert them to the need to correct for those biases on the job.[139] Second, it might enable the system to provide targeted training about bias to new judges.[140]

Judicial training should not end with new judges, however. Training for sitting judges is also important. Judicial education is common these days, but one problem with it, at least as it exists at this time, is that it is seldom accompanied by any testing of the individual judge’s susceptibility to implicit bias, or by any analysis of the judge’s own decisions, so the judges are less likely to appreciate and internalise the risks of implicit bias.[141] As Timothy Wilson and his colleagues have observed, “people’s default response is to assume that their judgments are uncontaminated”.[142] Surely this is true of judges as well. Moreover, because people are prone to egocentric bias, they readily assume that they are better than average, or that the factors that might induce others to make poor or biased decisions would not affect their own decisions. Our research demonstrates that judges are inclined to make the same sorts of favorable assumptions about their own abilities that non-judges do.[143] Therefore, while education regarding implicit bias as a general matter might be useful, specific training revealing the vulnerabilities of the judges being trained would be more useful.[144]

Another problem with training is that although insight into the direction of a bias frequently can be gained, insight into the magnitude of that bias cannot. One group of psychologists provided the following example:[145]

Consider Ms Green, a partner in a prestigious law firm, who is interviewing candidates for the position of an associate in her firm. When she interviews Mr Jones, a young African-American attorney, she has an immediate negative impression, finding him to be arrogant and lacking the kind of brilliance she looks for in new associates. Ms Green decides that her impression of Mr Jones was accurate and at a meeting of the partners, argues against hiring him. She wonders, however, whether her negative evaluation was influenced by Mr Jones’ race.

The psychologists explained:[146]

Ms Green may know that her impression of Mr Jones is unfairly negative and want to avoid this bias, but have no idea of the extent of the bias. Should she change her evaluation from “Should not be hired” to “Barely acceptable” or to “Best applicant I’ve seen in years”?

This scenario illustrates the problem well. How is one to know if correction is warranted, and if so, how much?[147] In a circumstance like the one depicted above or like any of the circumstances described in the materials included in our study, there is a risk of insufficient correction, unnecessary correction, or even over-correction, resulting in a decision that is distorted as a result of the adjustment, but simply in the opposite direction.[148] Testing might mitigate this problem by helping judges appreciate how much compensation or correction is needed.

The results of our study are thus somewhat surprising in that the white judges’ corrections in the case in which the defendant’s race was explicit seemed to be neither too much nor too little. On average, these judges treated white and black defendants about the same. This result cannot, however, reasonably be taken as meaning that judges correct for the influence of implicit bias perfectly in all cases in which they attempt to do so. We presented only one scenario — other cases might produce over-compensation or under-compensation. And individual judges are apt to vary in terms of their willingness or ability to correct for the influence of unconscious racial bias. Also, the white judges were slightly less harsh on the black defendants. The difference simply failed to rise to the level of statistical significance, as it was small (only six percentage points). Had we collected data on a thousand judges rather than a hundred, we might have begun to observe some over-compensation or under-compensation.

Auditing

The criminal justice system could also implement an auditing program to evaluate the decisions of individual judges in order to determine whether they appear to be influenced by implicit bias. For example, judges’ discretionary determinations, such as bail-setting, sentencing, or child-custody allocation, could be audited periodically to determine whether they exhibit patterns indicative of implicit bias. Similar audits have been proposed as correctives for umpires in Major League Baseball and referees in the National Basketball Association after both groups displayed evidence of racial bias in their judgments.[149]
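One cycle of such an audit might look something like the following sketch. Everything here is an illustrative assumption on our part (the data layout, the 20-case minimum, and the choice of test); a flag would mark a judge for closer review, not establish bias:

    import pandas as pd
    from scipy.stats import mannwhitneyu

    def audit_judges(decisions, alpha=0.05):
        """Flag judges whose discretionary outcomes differ by defendant race.

        Assumed columns: judge_id, defendant_black (1/0), and outcome (a
        numeric discretionary result such as bail amount or sentence length),
        with legally relevant case factors already adjusted for upstream.
        """
        flagged = []
        for judge_id, cases in decisions.groupby("judge_id"):
            black = cases.loc[cases["defendant_black"] == 1, "outcome"]
            white = cases.loc[cases["defendant_black"] == 0, "outcome"]
            if len(black) < 20 or len(white) < 20:
                continue  # too few cases for a meaningful comparison
            _, p_value = mannwhitneyu(black, white, alternative="two-sided")
            if p_value < alpha:
                gap = black.median() - white.median()
                flagged.append((judge_id, p_value, gap))
        return flagged  # candidates for closer review, not proof of bias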

Auditing could provide two benefits. First, it would obviously increase the available data regarding the extent to which bias affects judicial decision making. Second, it could enhance the accountability of judicial decision making.[150] Unfortunately, judges operate in an institutional context that provides little accountability, at least in the sense that they receive little prompt and useful feedback.[151] Existing forms of accountability, such as appellate review or retention elections, primarily focus on a judge’s performance in a particular case, not on the systematic study of long-term patterns in a judge’s performance that might reveal implicit bias.[152]

Altering courtroom practices

In addition to providing training or implementing auditing programs, the criminal justice system could also alter practices in the courtroom to minimise the untoward impact of unconscious bias. For example, the system could expand the use of three-judge courts.[153] Research reveals that improving the diversity of appellate court panels can affect outcomes. One study found that “adding a female judge to the panel more than doubled the probability that a male judge ruled for the plaintiff in sexual harassment cases ... and nearly tripled this probability in sex discrimination cases.”[154] In trial courts, judges typically decide such issues alone, so adopting this mechanism would require major structural changes. Although convening a three-judge trial court was once required by statute when the constitutionality of a state’s statute was at issue,[155] three-judge trial courts are virtually nonexistent today.[156] The inefficiency of having three judges decide cases that one judge might be able to decide nearly as well led to their demise, and this measure might simply be too costly to resurrect.

Another possibility would be to increase the depth of appellate scrutiny, such as by employing de novo review rather than clear error review, in cases in which particular trial court findings of fact might be tainted by implicit bias. For example, there is some evidence that male judges may be less hospitable to sex discrimination claims than they ought to be.[157] If that bias does exist, less deferential appellate review by a diverse panel might offer a partial solution.

Conclusion

Our study contains both bad news and good news about implicit biases among judges. As expected, we found that judges, like the rest of us, possess implicit biases. We also found that these biases have the potential to influence judgments in criminal cases, at least in those circumstances where judges are not guarding against them. On the other hand, we found that the judges managed, for the most part, to avoid the influence of unconscious biases when they were told of the defendant’s race.

The presence of implicit racial bias among judges — even if its impact on actual cases is uncertain — should sound a cautionary note for those involved in the criminal justice system. To prevent implicit biases from influencing actual cases, we have identified several reforms that the criminal justice system could implement, ranging from relatively inexpensive measures, like implementing focused judicial training and testing, to relatively expensive measures, like altering courtroom practices. To render justice blind, as it is supposed to be, these reforms are worth considering.

Appendix A: Materials

Shoplifter Case

You are presiding over a case involving criminal charges against a juvenile, William T. William is a 13-year-old who was arrested for shoplifting in a large, upscale toy store in __ . He has no prior record. You are trying to get a sense of the case and the only facts available to you follow:

According to a store clerk, on Saturday, April 2, at about two o’clock in the afternoon, the clerk observed William putting video games under his shirt. The clerk rang for a security guard, but before the guard arrived, the boy started to leave the store. When the clerk grabbed William, the boy dropped the toys and kicked him in an attempt to escape. A uniformed security guard arrived as the clerk let go of William, and when the guard told the boy to stop, he did.

According to the security guard, when he arrived he observed five items on the floor in front of William. The prices of those items together added up to $90. He said that William told him that he was shopping, and showed him $10 he had brought along with which to make purchases. William claimed that he had used his shirt as a sort of pouch to hold the items he was looking at. William also told the guard he was startled when grabbed by someone from behind, and then tripped, but that he did not kick anyone.

1. In your opinion, without regard to the options actually available in this kind of situation, what would be the most appropriate disposition of this case?

  • Dismiss it with an oral warning

  • Adjourn the case in contemplation of dismissal (assuming William gets in no further trouble)

  • Put William on probation for six months or less

  • Put William on probation for more than six months

  • Commit William to a juvenile detention facility for six months or less

  • Commit William to a juvenile detention facility for more than six months

  • Transfer William to adult court.

2. In your opinion, on a scale of one to seven, how likely is it that William will later commit a crime similar to the one with which he is charged?

very likely           not at all likely
1 2 3 4 5 6 7

3. In your opinion, on a scale of one to seven, how likely is it that William will commit more serious crimes in the future?

very likely           not at all likely
1 2 3 4 5 6 7

Robbery case

You are presiding over a case involving criminal charges against a juvenile, Michael S, who was arrested for armed robbery of a gas station when he was two days shy of his 17th birthday. He has one prior arrest for a fight in the school lunchroom the previous year. You are trying to get a sense of the case and the only facts available to you follow:

According to the gas station clerk, on Friday, March 17, at about seven in the evening, she heard a male voice say, “Don’t look at me, but give me the money”. She kept her eyes down, and as she opened the cash register, the man said, “I could shoot you, don’t think I won’t.” She handed him the drawer’s contents ($267.60) and saw him run out the door with a gun. After he jumped into the passenger side of a car and it left, she called the police.

According to the responding officer, the clerk could not identify the robber, but a customer said he thought he recognised Michael, and gave the officer Michael’s name and address. Michael’s mother was home, and at 9:45 pm, Michael walked in the door, was given Miranda warnings, and waived his rights. He first stated that he had just been hanging around with friends, not doing anything special. After the officer asked who the friends were, Michael admitted that he had walked into the gas station with a gun. He told the officer that he said to the clerk, “Give me the money, please. I don’t want to hurt you.” Michael insisted that the gun was not loaded and that he no longer had it. He said that the money was gone, that he was sorry, and would pay it back. When asked why he did it, Michael said that his friends had dared him, but he would not reveal who those friends were, or to whom the gun belonged.

1. In your opinion, without regard to the options actually available in this kind of situation, what would be the most appropriate disposition of this case?

  • Dismiss it with an oral warning

  • Adjourn the case in contemplation of dismissal (assuming Michael gets in no further trouble)

  • Put Michael on probation for six months or less

  • Put Michael on probation for more than six months

  • Commit Michael to a juvenile detention facility for six months or less

  • Commit Michael to a juvenile detention facility for more than six months

  • Transfer Michael to adult court

2. In your opinion, on a scale of one to seven, how likely is it that Michael will later commit a crime similar to the one with which he is charged?

very likely           not at all likely
1 2 3 4 5 6 7

3. In your opinion, on a scale of one to seven, how likely is it that Michael will commit more serious crimes in the future?

very likely           not at all likely
1 2 3 4 5 6 7

Battery Case

Defendant: Andre Barkley, 6'0", 175 lbs, African American male, 18 years old, student

Alleged Victim: Matthew Clinton, 6'2", 185 lbs, Caucasian male, 16 years old, student

Charge: One count of battery with serious bodily injury

Prosecution

The prosecution claims that Andre Barkley is guilty of battery with serious bodily injury. Barkley was the starting point guard on the high school basketball team, but the team had been struggling, and the coach decided to bench him in favor of a younger, less experienced player named Matthew Clinton. Before the first game after the lineup change, Barkley approached Clinton in the locker room and began yelling at him. Witnesses explain that the frustrated defendant told Clinton, “You aren’t half the player I am, you must be kissing Coach’s ass pretty hard to be starting.”

When other teammates stepped between the two players, Barkley told them to get out of the way. When two other players then grabbed Barkley and tried to restrain him, the defendant threw them off, pushed Clinton into a row of lockers, and ran out of the room, according to prosecution witnesses. As a result of this fall, two of Clinton’s teeth were chipped and he was knocked unconscious. The prosecution claims that Barkley has shown no remorse for his crime, and has even expressed to friends that Clinton “only got what he had coming”.

Defence

The defence claims that Barkley was merely acting in self-defence, and that Clinton’s injuries were accidental. According to an assistant coach, Barkley did not get along with many people on the team and had been the subject of obscene remarks and unfair criticism from many of his teammates throughout the season. Barkley claims that he was afraid for his own safety during the altercation in the locker room and “definitely felt ganged up on”.

Barkley admits he “might have been aggressive towards Matthew and started the whole thing”, but says that he was just frustrated and the argument was “nothing that should have started a big locker room fight or anything”. Barkley claims that when several other players grabbed him from behind for no reason, he tried to break free and must have accidentally knocked into Clinton in the attempt to get out of the locker room. He explained that the reason he never apologised to Clinton in the hospital was that he “didn’t think he’d want to see me”, but Barkley did say he “was truly, truly sorry” that Clinton had been injured.

1. Based on the available evidence, if this were a bench trial, would you convict the defendant?

Yes/No

2. How confident are you that your judgment is correct?

Not at all Confident               Very Confident
1 2 3 4 5 6 7 8 9

Demographic Questions Provided to Judges

1. What is the title of the judicial position you currently hold?

2. How many years have you served as a Judge (in any position)? _ years.

3. Please identify your gender:

Male Female

4. During your judicial career, approximately what percentage of your time has been devoted to the following areas:

  • Criminal cases

  • Civil cases

  • Family law cases

  • Probate or trusts

  • Other

5. Which of the two major political parties in the United States most closely matches your own political beliefs?

The Republican Party The Democratic Party

6. Please identify your race (Check all that apply):

  • White (non-Hispanic)

  • Black or African American

  • Hispanic or Latino

  • Asian

  • Native American or Pacific Islander

  • Other.

Appendix B: IAT procedure

We used seven rounds of trials to produce the IAT score. Rounds one, two, three, five, and six were essentially practice rounds designed to minimise order effects and variation associated with unfamiliarity with the task. The study began with one round in which the participants sorted only black and white faces. In this round the word “White” appeared in the upper left and the word “Black” appeared in the upper right of the screen. In each trial, one of ten faces, five white and five black, appeared in the middle of the screen.[158] The faces appeared at random, although an equal number of white and black faces appeared in the 16 trials.[159]

The instructions before each round informed the judges as to what they would be sorting in the upcoming round. For example, in the first round, the instructions indicated that the judge should press the “E” key (labeled with a red dot) if a white face appeared and the “I” key (also labeled with a red dot) if a black face appeared. The instructions also stated that if the judge pressed the correct key, the next face would appear; if the judge pressed the wrong key, a red “X” would appear. These instructions were similar in all seven rounds of the IAT.[160]

The remaining six rounds were similar to the first, although they varied the stimuli and categories. In the second round, instead of the black and white faces, the computer presented good and bad words. These consisted of eight words with positive associations (joy, love, peace, wonderful, pleasure, friend, laughter, happy) and eight words with negative associations (agony, terrible, horrible, nasty, evil, war, awful, failure). Like the faces, these words were taken from previous work on the IAT. Throughout the trials in the second round, the word “good” remained in the upper-left of the computer screen and the word “bad” remained in the upper-right of the computer screen. The judges were instructed, in a similar fashion to round one, to press the “E” key when a good word appeared in the center of the screen and the “I” key when a bad word appeared in the center of the screen.

The third round combined the tasks in the first two rounds. The words “white or good” appeared in the upper-left of the computer screen and the words “black or bad” appeared in the upper-right of the computer screen. Thus, the task presented both categories in the same spatial locations as they had occupied in the first two rounds. The instructions indicated to the judge that either a white or black face or a good or bad word would appear in the center of the computer screen. The instructions continued that the judges should press the “E” key if either a white face or a good word appeared and the “I” key if either a black face or a bad word appeared. Although the computer selected randomly from the faces and concept words, it presented an equal number of faces and words of both types. We presented the judges with 16 trials of this task.

Round four was identical to round three in every respect except that the computer presented 40 trials, rather than 16.

Round five prepared the judges for the reverse association. To create the reversal, the spatial locations of the good and bad words were reversed. The word “bad” was moved to the left and the word “good” was moved to the right. The fifth round was thus identical to the second round in that the computer presented only the good and bad words, but that the computer presented the words in their new locations. The instructions were also identical to those of round two except that they identified the new locations and corresponding response keys for the words.

The penultimate round paired the good and bad words in their new locations with the black and white labels in their original locations. Thus, the words “white or bad” appeared in the upper left and the words “black or good” appeared in the upper right. The instructions resembled those for rounds three and four. They indicated, however, that judges should press the “E” key if a white face or bad word appeared and the “I” key if a black face or good word appeared. Round six, like the other practice rounds, consisted of 16 trials.

Round seven was identical to round six in every respect except that the computer presented 40 trials, rather than 16. The computer recorded the reaction times between the presentation of the stimuli and the time of the correct response for all judges in all rounds. The computer also recorded which stimuli it presented and whether an error occurred.
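For concreteness, the seven-round design just described can be summarised as a small data structure. The Python sketch below is purely illustrative: the category labels and trial counts are those stated above, except that the text does not state the trial counts for rounds two and five, which we assume matched the other practice rounds; the variable names are ours.

    # Descriptive sketch of the seven-round race IAT described above.
    # Judges pressed "E" for the left-hand category and "I" for the right.
    IAT_ROUNDS = [
        {"round": 1, "left": "White",         "right": "Black",         "trials": 16},  # faces only
        {"round": 2, "left": "good",          "right": "bad",           "trials": 16},  # words only (count assumed)
        {"round": 3, "left": "white or good", "right": "black or bad",  "trials": 16},  # paired practice
        {"round": 4, "left": "white or good", "right": "black or bad",  "trials": 40},  # critical, congruent
        {"round": 5, "left": "bad",           "right": "good",          "trials": 16},  # reversed words (count assumed)
        {"round": 6, "left": "white or bad",  "right": "black or good", "trials": 16},  # paired practice
        {"round": 7, "left": "white or bad",  "right": "black or good", "trials": 40},  # critical, incongruent
    ]

    # For every trial, the computer recorded the stimulus presented, whether
    # an error occurred, and the latency to the correct response.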

Appendix C: IAT scoring

Scoring the IAT requires researchers to make several judgments about the data. It requires deciding which of the seven rounds to use (some studies make use of the practice rounds); how to manage latencies that seem too long or too short; how to assess erroneous responses; how to identify and exclude participants who responded too slowly, too quickly, or made too many errors; whether to standardise the responses; and whether to use every trial in a round (or drop the first two trials, which commonly produce excessively long latencies). Greenwald and his colleagues tested essentially all variations on answers to these issues and produced a scoring method that they believe maximises the correlation between the IAT and observed behavior.[161]

We used two different scoring methods. First, for each judge, we calculated the difference between the average latency in the stereotype-congruent rounds in which the judges sorted white/good versus black/bad and the average latency in the stereotype-incongruent rounds in which the judges sorted white/bad versus black/good. This procedure follows the method that other researchers have used in reporting data from hundreds of thousands of participants collected on the Internet.[162] Hence, we can compare this average score with that of large groups of ordinary adults. (We describe this procedure at greater length below.)

In an exhaustive review of IAT methodology, however, Greenwald and his colleagues concluded that the average difference might not be the best measure of implicit associations.[163] These researchers found that people who are slower on the task produce larger differences in their IAT scores.[164] This tendency confounds the IAT score, as people who are simply less facile with a keyboard will appear to have stronger stereotypic associations. Furthermore, Greenwald and his colleagues also found that the average difference did not correlate as well with people’s decisions and behavior as other scoring methods.[165] After conducting their review, Greenwald and his colleagues identified a preferred scoring method, which we followed to assess the correlation between IAT effects and judges’ decisions.[166] The method essentially uses the mean difference for each participant divided by the standard deviation of that participant’s response latencies, although it includes some variations. (We also describe this procedure at greater length below.)

Mean-Difference IAT score calculation

To calculate the mean-difference IAT score, we largely followed the procedures outlined in Nosek and his colleagues’ report of IAT scores from tens of thousands of people collected through the Internet.[167] We also wanted to compare our results with the more detailed, contemporary Internet data collected and reported on the “Project Implicit” website, which appears to use the same scoring method.[168] Because the data in these studies come from voluntary participants who access the site on the Internet, the authors have adopted a number of techniques for excluding data from participants who may have wandered off during the study or were otherwise not fully engaged with the tasks.[169] While such techniques are less appropriate for our participants, who were engaged in person, we followed the Project Implicit scoring methods to facilitate a comparison.

The authors of the Internet study first adjusted raw latency scores that seemed much slower or faster than those of participants fully engaged with the task. The researchers treated any latency longer than 3000 milliseconds (ms) as 3000 ms, and any latency shorter than 300 ms as 300 ms.[170] The researchers also eliminated the first two trials in all rounds from consideration, having found that these trials often displayed an erratic pattern of long latencies — presumably because participants commonly begin the task, and then pause to get settled in.[171] These researchers also excluded participants who failed to perform to certain criteria. They excluded participants who displayed average latencies greater than 1800 ms in either of the two critical rounds (four or seven), or whose average latency across both critical rounds exceeded 1500 ms.[172] They also excluded participants who produced any critical round in which more than 25% of the latencies were less than 300 ms.[173] Finally, they excluded participants who made more than ten errors in any critical round.[174] These researchers report that these criteria resulted in the exclusion of 15% of their subjects.[175] After these adjustments and exclusions, these researchers calculated the mean difference between the critical stereotype-congruent round and the critical stereotype-incongruent round (each either round four or seven, depending on the order of presentation).[176]

We followed these procedures to calculate the mean IAT score for the judges in our study. We capped latencies greater than 3000 ms at 3000 ms, and raised latencies lower than 300 ms to 300 ms.[177] We also discarded the first two trials in each round from the analysis. We excluded the results of the race IAT from six judges (or 4.5%) who produced either mean latencies greater than 1800 ms in one of the two critical rounds of the race IAT or a mean across both rounds greater than 1500 ms.[178] Similarly, we excluded the results of the gender IAT from ten judges (or 7.5%) who violated one or both of these criteria.[179] Nosek and his colleagues reported that they eliminated 2% of their participants for being too slow,[180] whereas we eliminated more. At the same time, none of the judges in our studies produced more than a 25% error rate in either of the critical rounds in either IAT. By contrast, Nosek and his colleagues eliminated roughly 13% of their participants for having high error rates.[181] The judges were thus slower and more accurate than Nosek and his colleagues’ subjects, and overall, the application of their criteria eliminated fewer judges than their results would have predicted.
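As a concrete illustration of the steps just described, the capping, trimming, and latency-based exclusions might be implemented as follows. This is a minimal sketch, not the software we used; the thresholds are those stated above, the function and variable names are ours, and the error-rate exclusion (which no judge triggered) is omitted.

    import statistics

    def mean_difference_iat(congruent_ms, incongruent_ms):
        """Mean-difference IAT score from the latencies (ms) of the two
        critical rounds; returns None if an exclusion criterion is met."""
        def clean(latencies):
            trimmed = latencies[2:]  # drop the first two trials of the round
            return [min(max(t, 300), 3000) for t in trimmed]  # cap at 300-3000 ms

        con, incon = clean(congruent_ms), clean(incongruent_ms)
        mean_con, mean_incon = statistics.mean(con), statistics.mean(incon)

        # Exclusions: a mean above 1800 ms in either critical round, or a
        # mean above 1500 ms across both critical rounds combined.
        if max(mean_con, mean_incon) > 1800 or statistics.mean(con + incon) > 1500:
            return None

        # Positive scores indicate slower responses on the stereotype-
        # incongruent pairing (that is, a white preference).
        return mean_incon - mean_con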

Unlike Nosek and his colleagues,[182] we did not randomise the order in which we presented the IAT rounds. That is, roughly half of the participants in the Internet sample received the stereotype-congruent round first, while half received the stereotype-incongruent round first. The seven-round IAT is designed to reduce order effects substantially, but nevertheless, they remain. Greenwald and his colleagues report that IAT scores can correlate weakly with the order in which the materials are presented.[183] Randomising the order would have produced a cleaner measure of the IAT effect across all judges, but would have reduced the correlation between the IAT score and behavior.[184] Hence, all of our judges received the materials in the same order. On the race IAT, judges received the stereotype-congruent pairing first (white/good and black/bad) and on the gender IAT, judges received the stereotype-incongruent pairing first (male/humanities and female/science). Our procedure would have tended to increase the IAT score on the race IAT, as compared to the sample of Nosek and his colleagues, and decrease the IAT score on the gender IAT.

By using these procedures, we scored the judges in exactly the same manner as Nosek and his colleagues scored the data that they harvested from the Internet.

Standardised IAT Score Calculation

To calculate the standardised IAT score, we followed the procedures recommended by Greenwald and his colleagues.[185] These researchers designed their methods precisely to improve the reliability and predictive power of their measures.[186] We used the methods that produced the highest correlations between implicit measures and behavioral measures. They differ from the scoring method used to calculate the mean differences. As noted above, we used the Greenwald methodology to collect the IAT scores.[187] Following those scoring procedures, we removed single trials with latencies greater than 10,000 ms (that is, 10 seconds) from the analysis. We otherwise left low and high values in the analysis without adjustment. We made no correction for errors, because our IAT collection methods required the judges to provide the correct response before proceeding, and hence the latency includes the delay that would result from an incorrect answer. Error rates were also low, as noted above. Following Greenwald and his colleagues’ scoring method, we used all of the trials, rather than dropping the first two in each round.

We departed from the method Greenwald and his colleagues endorse, however, in one respect. Those researchers suggested using the two paired practice rounds (rounds three and six) in the analysis.[188] They reported that using these data produced slightly higher correlations between the IAT scores and explicit choices.[189] We found, however, that latencies in the practice rounds were highly erratic. A high percentage of the trials eliminated for being greater than 10,000 ms were in the practice rounds.[190] Even with these observations removed, the average standard deviation in the two practice rounds on the race IAT was over one second (1064 ms), as compared to 596 ms in the trial rounds. This suggested to us that we ought not to use the practice rounds in the analysis. The practice rounds of the gender IAT were more stable; the standard deviation from the practice rounds (724 ms) was much closer to that of the trial rounds (560 ms). Even though the practice rounds in the gender IAT seemed more stable, for consistency, we dropped these as well. Our measure of the IAT effect for purposes of correlating the IAT scores with judges’ decisions was therefore the average difference between the stereotype-congruent round and the stereotype-incongruent round divided by the standard deviation of latencies in both rounds combined. Following Greenwald and his colleagues, we call the measure d'.
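Reduced to code, the standardised score is simply the mean difference divided by the standard deviation of the latencies in the two rounds combined. The following minimal sketch reflects the choices described above; the names are ours.

    import statistics

    def standardised_iat(congruent_ms, incongruent_ms):
        """Standardised IAT score (d') from the two critical rounds.
        All trials are kept (including the first two of each round); only
        single trials over 10,000 ms are removed; no error correction is
        applied, because latencies already include the delay from errors."""
        con = [t for t in congruent_ms if t <= 10_000]
        incon = [t for t in incongruent_ms if t <= 10_000]

        mean_diff = statistics.mean(incon) - statistics.mean(con)
        pooled_sd = statistics.stdev(con + incon)  # SD of both rounds combined
        return mean_diff / pooled_sd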

Because the latencies that we observed seemed slower than those observed in the Internet study, we assessed the correlation between our two IAT measures and the mean latency. The correlation coefficients between the mean differences and the overall latency were 0.305 on the race IAT and 0.361 on the gender IAT. These correlations are high enough to suggest that our judges might appear to have higher IAT scores than other populations simply because they were somewhat slower.[191] The standardised IAT measure using only the trial rounds, however, produced correlations of only 0.046 and 0.002 with the overall mean latencies for the race and gender IATs, respectively. Hence, the d' measure provided a much more reliable measure of the IAT effect than the mean difference.
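The confound check reported in this paragraph is an ordinary correlation between each score and each judge’s overall mean latency. The sketch below uses simulated stand-in data (not our data) merely to illustrate the pattern: a speed-confounded score correlates with overall latency, while a properly standardised score does not.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 128
    mean_latency = rng.normal(850, 150, size=n)                      # overall mean latency (ms)
    mean_diff = 0.3 * (mean_latency - 850) + rng.normal(150, 80, n)  # speed-confounded score
    d_prime = rng.normal(0.4, 0.3, n)                                # standardised score

    print(np.corrcoef(mean_diff, mean_latency)[0, 1])   # sizeable, like our 0.305
    print(np.corrcoef(d_prime, mean_latency)[0, 1])     # near zero, like our 0.046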



[1] © 2009 Jeffrey J Rachlinski, Sheri Lynn Johnson, Andrew J Wistrich, and Chris Guthrie. Individuals and nonprofit institutions may reproduce and distribute copies of this Article in any format, at or below cost, for educational purposes, so long as each copy identifies the author, provides a citation to the Notre Dame Law Review, and includes this provision and copyright notice. Originally published in (2009) Cornell Law Faculty Publications Paper 786 and (2009) 84 Notre Dame L Rev 1203.

[2] Professor of Law, Cornell Law School.

[3] Professor of Law, Cornell Law School.

[4] Magistrate Judge, United States District Court, Central District of California.

[5] Professor of Law, Vanderbilt Law School.

[6] The authors are grateful for the comments and assistance of Ian Ayres, Steve Burbank, Jack Glaser, Tracey George, Tony Greenwald, Matthew Patrick Henry, Reid Hastie, Christine Jolls, Dan Kahan, Jerry Kang, Cass Sunstein, and the participants in workshops at the University of Arizona Law School, Bar Ilan University Law School, Brooklyn Law School, the University of Chicago Law School, Chicago-Kent Law School, Cornell Law School, George Washington University Law School, Harvard Law School, Hebrew University Law School, the University of Illinois School of Law, Notre Dame Law School, Ohio State University Law School, St. Louis University Law School, Syracuse University Law School, Tel-Aviv University Law School, Temple Law School, Villanova Law School, the University of Zurich, the Annual Meeting of the American Law and Economics Association, and the Annual Conference on Empirical Legal Studies.

[7] I Ayres and J Waldfogel, “A market test for race discrimination in bail setting” (1994) 46 Stan L Rev 987 at 992. To calculate this disparity, Ayres and Waldfogel controlled for 11 other variables, but they conceded that they might still be missing one or more omitted variables that might explain the differential. By comparing differences in both bond rates and bail rates, however, they were able to provide even more compelling evidence that the bail rate differences they observed were race-based. See ibid at 993.

[8] D Mustard, “Racial, ethnic, and gender disparities in sentencing: evidence from the US Federal Courts” (2001) 44 JL & Econ 285 at 300.

[9] R Banks et al, “Discrimination and implicit bias in a racially unequal society” (2006) 94 Cal L Rev 1169 at 1175.

[10] See C Jolls and C Sunstein, “The law of implicit bias” (2006) 94 Cal L Rev 969 at 969–970 (providing examples of both explicit and implicit bias).

[11] See P Sniderman and T Piazza, Black pride and black prejudice, Princeton University Press, 2002 at pp 6–8.

[12] A Greenwald and L Hamilton Krieger, “Implicit bias: scientific foundations” (2006) 94 Cal L Rev 945 at 951, 961 (“[E]vidence that implicit attitudes produce discriminatory behavior is already substantial and will continue to accumulate.” (footnote omitted)); K Lane et al, “Implicit social cognition and law” (2007) 3 Ann Rev L & Soc Sci 427 at 433 (calling implicit social cognitions “robust” and “pervasive”).

[13] See J Kang and M Banaji, “Fair measures: a behavioral realist revision of ‘affirmative action’” (2006) 94 Cal L Rev 1063 at 1065 (arguing that implicit bias shows that affirmative action programs are necessary to address “discrimination in the here and now” (emphasis omitted)).

[14] J Kang, “Trojan horses of race” (2005) 118 Harv L Rev 1489 at 1512.

[15] In addition to the Implicit Association Test, which we discuss in detail, researchers have used subliminal priming techniques, see, eg S Graham and B Lowery, “Priming unconscious racial stereotypes about adolescent offenders” (2004) 28 L and Hum Behav 483 at 487–488; reaction-time studies, see, eg Greenwald and Krieger, above n 12 at 950–953 (labeling studies of implicit bias as studies of biases in reaction times); and novel brain-imaging techniques, see, eg, E Phelps et al, “Performance on indirect measures of race evaluation predicts amygdala activation” (2000) 12 J Cognitive Neurosci 729 at 729–730.

[16] A Green et al, “Implicit bias among physicians and its prediction of thrombolysis decisions for black and white patients” (2007) 22 J Gen Internal Med 1231 at 1231–1232.

[17] See Greenwald and Krieger, above n 12 at 952.

[18] See, eg M Orey, “White men can’t help it” Bus WK, 15 May 2006, at 54 (discussing the role of expert witness testimony on “unconscious bias theory” in gender and race employment discrimination cases); D Cole, “Don’t race to judgment” US News and World Rep, 26 Dec 2005/2 Jan 2006, at 90.

[19] See Project Implicit, “About us”, at www.projectimplicit.net/about-us/, accessed 14 July 2021. (“Visitors have completed more than 4.5 million demonstration tests since 1998, currently averaging over 15,000 tests completed each week.”).

[20] Greenwald and Krieger, above n 12 at 952–953 (describing the basic IAT technique).

[21] See Online Psychology Laboratory, Implicit Association Test (Race), at https://opl.apa.org/, accessed 14 July 2021.

[22] ibid.

[23] See B Nosek et al, “Harvesting implicit group attitudes and beliefs from a demonstration web site” (2002) 6 Group Dynamics 101 at 105 (reporting data indicating that white adults taking the IAT strongly favored the white/good versus the black/bad pairing on the IAT).

[24] ibid at 104.

[25] ibid at 105.

[26] ibid.

[27] ibid. Throughout, we adopt the convention that a “strong” bias means a tendency to favor one pairing over another on the IAT by over three-quarters of a standard deviation, a “small” bias means an effect of less than one-quarter of a standard deviation, and a “moderate” effect means an effect that is in between one-quarter and three-quarters of a standard deviation.

[28] ibid.

[29] ibid.

[30] ibid at 110.

[31] See H Arkes and P Tetlock, “Attributions of implicit prejudice, or would Jesse Jackson ‘fail’ the implicit association test?” (2004) 15 Psychol Inquiry 257 at 257–258 (arguing that the IAT does not measure bias or prejudice); M Banaji et al, “No place for nostalgia in science: a response to Arkes and Tetlock” (2004) 15 Psychol Inquiry 279 (responding to the arguments of Arkes and Tetlock).

[32] See J Stroop, “Studies of interference in serial verbal reactions” (1935) 18 J Experimental Psychol 643 at 659–660 (presenting evidence that words colored differently from their semantic meaning are difficult to read).

[33] See Project Implicit, above n 19.

[34] ibid.

[35] See A Greenwald et al, “Understanding and using the Implicit Association Test: I. an improved scoring algorithm” (2003) 85 J Personality & Soc Psychol 197 at 209–211 (discussing mechanisms for reducing order effects); see also A Greenwald and B Nosek, “Health of the Implicit Association Test at Age 3” (2001) 48 Zeitschrift für Experimentelle Psychologie 85 at 87 (“Subject handedness was found to have essentially zero relation to magnitude of the race IAT effect.”).

[36] See, eg, S Bagenstos, “Implicit bias, ‘science’, and antidiscrimination law” (2007) 1 Harv L & Pol’y Rev 477; Greenwald et al, ibid, at 199–200.

[37] A Greenwald et al, “Understanding and using the Implicit Association Test: III. meta-analysis of predictive validity” (2009) 97 J Personality & Soc Psychol 17.

[38] Note that some of the papers Greenwald and his co-authors include in their analysis report multiple studies using independent samples of subjects. ibid.

[39] ibid.

[40] To be precise, the square of the correlation coefficient of 0.24 is 0.0576, which we round up to 6%.

[41] See National Center for State Courts, Examining the work of State Courts: an analysis of 2008 State Court caseloads at p 19 at www.courtstatistics.org/__data/assets/pdf_file/0021/23835/ewsc-2008-online.pdf, accessed 14 July 2021 (providing data for criminal cases entering state courts in 2008).

[42] United States Courts, “Statistical Tables for the Federal Judiciary–June 2007”, 2007, at tbl D, at www.uscourts.gov/statistics-reports/analysis-reports/statistical-tables-federal-judiciary, accessed 14 July 2021 (observing US district courts to have 71,652 and 69,697 cases pending in the 12-month periods ending 31 March 2006 and 2007, respectively).

[43] Kang and Banaji, above n 13, at 1073.

[44] See J Eberhardt et al, “Looking deathworthy: perceived stereotypicality of black defendants predicts capital-sentencing outcomes” (2006) 17 Psychol Sci 383 at 384 (“Defendants whose appearance was perceived as more stereotypically black were more likely to receive a death sentence than defendants whose appearance was perceived as less stereotypically black.”).

[45] See J Glaser and E Knowles, “Implicit motivation to control prejudice” (2008) 44 J Experimental Soc Psychol 164 at 164–165, 170–171.

[46] See B Dunton and R Fazio, “An individual difference measure of motivation to control prejudiced reactions” (1997) 23 Personality & Soc Psychol Bull 316 at 324–326; E Plant and P Devine, “Internal and external motivation to respond without prejudice” (1998) 75 J Personality & Soc Psychol 811 at 824–828.

[47] See J Bargh, “The cognitive monster: the case against the controllability of automatic stereotype effects”, S Chaiken and Y Trope (Eds), Dual-process theories in social psychology, 1999, p 361 at pp 375–378; P Devine et al, “The regulation of explicit and implicit race bias: the role of motivations to respond without prejudice” (2002) 82 J Personality & Soc Psychol 835 at 845–847; J Dovidio et al, “On the nature of prejudice: automatic and controlled processes” (1997) 33 J Experimental Soc Psychol 510 at 535–536; R Fazio et al, “Variability in automatic activation as an unobtrusive measure of racial attitudes: a bona fide pipeline?” (1995) 69 J Personality & Soc Psychol 1013 at 1025–1026.

[48] Green et al, above n 16.

[49] ibid at 1232–1233.

[50] ibid at 1235. The researchers also found that white doctors who express white preferences on the IAT were more likely to diagnose black patients than white patients as having coronary artery disease, based upon the same symptoms: ibid at 1234–1235. Indeed, the doctors offered the appropriate treatment — thrombolysis — to an equal number of black patients as white patients! As the authors rightly point out, this does not mean there was no disparity; among patients who were diagnosed as suffering from coronary artery disease, black patients were less likely to be offered the appropriate treatment. It is at least curious, however, that doctors with implicit white preferences would be more likely to diagnose coronary artery disease for black patients than white patients, but less likely to treat it. The diagnosis disparity runs in the opposite direction of the treatment-for-diagnosis disparity, and ultimately, the two effects actually cancel each other out: ibid at 1236–1237. Of course, if doctors behaved the same way in the real world, black and white patients who presented the same symptoms would be treated in the same way. Thus, though the IAT predicted discriminatory acts, implicit bias does not seem to result in discrimination overall: ibid at 1234–1237. This aspect of the study has been the source of some debate. See J Tierney, “In bias test, shades of gray”, New York Times, 18 Nov 2008, at www.nytimes.com/2008/11/18/science/18tier.html, accessed 14 July 2021. One other recent study also shows no correlation between measures of implicit bias and medical decisions among physicians. See J Sabin et al, “Physician implicit attitudes and stereotypes about race and quality of medical care” (2008) 46 Med Care 678 at 682 (“We did not find a relationship between difference in treatment recommendations by patient race and implicit measures.”).

[51] Green et al, above n 16, at 1235.

[52] ibid at 1237.

[53] Glaser and Knowles, above n 45, at 167–171.

[54] J Correll et al, “The police officer’s dilemma: using ethnicity to disambiguate potentially threatening individuals” (2002) 83 J Personality & Soc Psychol 1314 at 1315–1317.

[55] ibid at 1315–1316.

[56] ibid at 1320.

[57] ibid at 1320–1321; Glaser and Knowles, above n 45, at 168–169.

[58] Glaser and Knowles, ibid at 169–170.

[59] ibid at 171.

[60] R Livingston, “When motivation isn’t enough: evidence of unintentional deliberative discrimination under conditions of response ambiguity”, 2002, 9–10 (unpublished manuscript, on file with the Notre Dame Law Review).

[61] See A Florack et al, “Der Einfluss wahrgenommener Bedrohung auf die Nutzung automatischer Assoziationen bei der Personenbeurteilung” [“The impact of perceived threat on the use of automatic associations in person judgments”] (2001) 32 Zeitschrift für Sozialpsychologie 249.

[62] ibid at 255, tbl 1.

[63] We recognise that we have emphasised disparities concerning black Americans, rather than other races. We have done so for three reasons. First, even though Latinos, Native Americans, and Asian Americans are also targets of racism, both explicit and implicit, in the United States some of the most striking disparities involve black Americans in the legal system. Second, the research on the IAT has emphasised biases concerning black Americans as well. Third, our sample of judges includes a large group of black American judges, but few Latinos, few Asian Americans, and no Native Americans. We thus cannot draw any conclusions about the reactions of judges of these ethnicities. We therefore focus our attention here on biases involving black Americans.

[64] See C Guthrie et al, “Blinking on the Bench: how judges decide cases” (2007) 93 Cornell L Rev 1 at 13 [hereinafter Guthrie et al, “How judges decide”]; C Guthrie et al, “Inside the judicial mind” (2001) 86 Cornell L Rev 777 at 814–815 [hereinafter Guthrie et al, “Judicial mind”]; J Rachlinski et al, “Inside the bankruptcy judge’s mind” (2006) 86 BU L Rev 1227 at 1256–1259; A Wistrich et al, “Can judges ignore inadmissible information? The difficulty of deliberately disregarding” (2005) 153 U Pa L Rev 1251 at 1323–1324.

[65] At two of the conferences, we collected data from judges attending a plenary session. At the third, we collected data from judges attending an optional session.

[66] Their concerns might be justified. Some of our previous work has been reported in the New York Times and the American Bar Association Journal, among other places. See, eg, P Cohen, “Judicial reasoning is all too human”, New York Times, 30 June 2001, at www.nytimes.com/2001/06/30/arts/judicial-reasoning-is-all-too-human.html, accessed 14 July 2021; D Cassens Weiss, “Judges flunk story problem test, showing intuitive decision-making”, ABA Journal, 19 Feb 2008, at www.abajournal.com/search/results/eyJrZXl3b3JkcyI6Ikp1ZGdlcyBGbHVuayBTdG9yeSBQcm9ibGVtIFRlc3QsIFNob3dpbmcgSW50dWl0aXZlIERlY2lzaW9uLU1ha2luZyJ9, accessed 14 July 2021. The latter report leads with the unfortunate headline “Judges flunk story problem test”, which casts the judges in a more negative light than the data warrant. Interest in the present article is sufficiently high that, despite our own efforts to limit its use before it was finalised, it was cited by Weinstein J in a published opinion, United States v Taveras, 424 F Supp 2d 446, 462 (EDNY 2006), and discussed at length in a recent volume of the Annual Review of Law and Social Science, Lane et al, above n 12, at 441–445.

[67] Eighty judges attended the session at which we collected data, but we excluded 10 from our study. We excluded one judge at his or her request. We excluded nine other judges because they failed to provide us with demographic information. We believe that these failures were largely accidental. To complete the demographic page, the judges had to return to the written materials after completing the final IAT, and these nine judges failed to do so. We did not realise that this process would cause problems at our presentation in the eastern jurisdiction, and hence we did not obtain this data. In the subsequent presentations, we made sure that the judges completed the last page as we collected the surveys.

[68] Forty-eight judges attended the session at which we collected the data, but we excluded three from our study. One judge neglected to provide demographic information, and we lost the data for two other judges due to a computer malfunction.

[69] Over 90% of the judges in the eastern jurisdiction attended this conference (although, as noted, we did not obtain data from all of them). Attendance was lower among the western judges; the sample includes roughly half of the judges in their jurisdiction. These judges’ willingness to participate in our study was thus unlikely to have been affected by their interest (or lack thereof) in the content of the material. In fact, the judges were not aware of the subject matter of the talk before the session began. This was not our first presentation to the eastern judges. Three years earlier, we had presented a completely different set of materials to the same educational conference. Some of the results from that earlier session have been published, also without identifying the jurisdiction. Wistrich et al, above n 64, at 1279–1281. Many of the judges were therefore familiar with our methods, although the present study differs from our earlier work. Our prior work dealt largely with judicial reliance on heuristics in making judgments, whereas this research is entirely devoted to the influence of race and gender on judgment. This was our first presentation to the western judges. The regional judges differed from the eastern and western judges in that they opted not only to attend the judicial education conference at which we spoke but also to attend our optional session.

[70] We include these questions below in Appendix A.

[71] The computer tasks were all conducted on laptop computers rented for the purpose of running the experiment. They were all relatively contemporary machines of similar makes. At the eastern and western sessions, all were Hewlett-Packard NX9010; at the regional conference, they were IBM ThinkPads. All had 15-inch screens. The software to run the tasks was designed with a program called Inquisit 2.0, created specifically for measuring implicit associations by a company called Millisecond Software. See Inquisit, www.millisecond.com, accessed 14 July 2021.

[72] The instructions on the survey were as follows:

Many of the points to be discussed at this session are best experienced directly. We therefore ask that before the session starts, you participate in a series of exercises on the laptop computer and evaluate a series of hypothetical cases in the pages that follow. (Participation in all aspects of this exercise is voluntary, of course.) Please do not discuss these materials while you are participating. We shall collect these surveys before the discussion and present the results during the session.

The first part of the exercise consists of a computer task. Please do not begin the task or turn this page until asked to do so.

The instructions on the computer screen were:

JURISDICTION: Judicial Education Conference, DATE

We shall begin by making announcements as to the nature of this exercise.

Please DO NOT BEGIN until after the announcements.

After the announcements, please press the space bar to begin.

[73] Judge Wistrich conducted the introduction at the eastern and western conferences; Professor Rachlinski did it at the regional conference.

[74] We also conducted an IAT related to gender after the race IAT, but do not report those results here.

[75] We also included a scenario in which we manipulated the gender of a target legal actor as the third scenario. We do not report these results here.

[76] The order of the materials was thus as follows: the priming task; the written scenario of the shoplifter; the written scenario of the armed robber; the gender scenario (not reported here); the battery case; the race IAT; the gender IAT (not reported here); and the demographics page.

[77] We analysed the three groups of judges separately, but there were no significant differences between the judges, except as noted below, so we have kept them together throughout the analysis. Similarly, we found no differences between the judges on the basis of gender, political affiliation, or experience. Because previous research on the IAT suggests that Latinos score somewhat closer to black Americans on the IAT we used, we combined the few Latino judges with the black judges for these analyses: Nosek et al, above n 23, at 110 tbl 2. Similarly, we combined the Asian American judges with the white judges.

[78] The exact instructions at the outset of the IAT were as follows:

The remaining computer tasks involve making CATEGORY JUDGMENTS. Once the tasks begin, a word or words describing the CATEGORIES will appear in the upper left and upper right corners of the computer screen. A TARGET word or picture will also be displayed in the center of the screen, which you must assign to one of the two categories.

Please respond AS RAPIDLY AS POSSIBLE, but don’t respond so fast that you make many errors. (Occasional errors are okay.)

An “X” will appear when you make an error. Whenever the “X” appears, correct the mistake by pressing the other key.

[79] For a more detailed account of our IAT procedure, see Appendix B.

[80] See, eg, Nosek et al, above n 23, at 104–105 (reporting average differences in response latencies among large samples of subjects obtained through the Internet).

[81] See Greenwald et al, above n 35, at 209–210 (describing standardised measures). The full account of our scoring methods is included as Appendix C.

[82] The specific statistical result was: t(82) = 4.94, p < .0001. Throughout this article, we reserve the use of the words “significant” and “significantly” for statistical significance.

[83] The specific statistical result was: t(42) = 0.18, p = .86. In conducting this test, we took the effect size among the Internet sample of 0.16 standard deviations to be the “population” effect size among black participants on the Internet, and tested whether our observed difference, with our observed standard deviation, would be likely to be reliably higher or lower than the effect in the Internet data. The priming condition did not appear to affect the judges’ IAT scores. Also, the judges themselves varied somewhat in their IAT scores. White judges in the eastern jurisdiction expressed an average standardised preference of 0.33, compared to 0.48 and 0.55 in the western jurisdiction and the regional conferences, respectively. These differences were marginally significant. Because the black judges in our study were concentrated largely in the eastern jurisdiction, similar tests for variations among these judges would not be reliable.
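In code, the test described in this note is a one-sample t-test of the judges’ standardised scores against the Internet effect size treated as the population value. The sketch below uses simulated stand-in data (not our data); only the form of the test is meant to be illustrative.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    # Simulated stand-in for the black judges' standardised IAT scores;
    # n = 43 matches the reported degrees of freedom (42) plus one.
    scores = rng.normal(0.16, 0.5, size=43)

    t, p = stats.ttest_1samp(scores, popmean=0.16)  # population value = 0.16 SD
    print(f"t({scores.size - 1}) = {t:.2f}, p = {p:.2f}")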

[84] The specific statistical result was: t(84) = 2.26, p = .026. We compared our results to those of the Internet sample reported in Nosek et al, above n 23, at 105. In making this comparison, we took the effect size among the Internet sample of 0.83 standard deviations to be the “population” effect size among white participants on the Internet, and tested whether our observed difference, with our observed standard deviation, would likely be reliably higher or lower than the effect in the Internet data.

[85] We selected data collection and scoring procedures so as to minimise the effects of order of presentation. Greenwald and his fellow authors reported that the effect of order of presentation is less than 1%, using the methods we followed. See Greenwald et al, above n 35, at 210 tbl 2.

[86] See ibid at 200 (“IAT effects will be artificially larger for any subjects who respond slowly.”).

[87] Throughout this article we follow the convention of using the terms “black” and “white” to denote race, as the terms more closely reflect the faces in the IAT, the instructions in the IAT (which refer to black and white), and might more closely reflect how the black judges would describe themselves (although there would be variation on this). When referring to the criminal defendants, however, we use African American and Caucasian, following the references mentioned in the hypothetical cases.

[88] Graham and Lowery, above n 15, at 487–488.

[89] At the beginning of the task, three asterisks appeared in the center of the screen. A 16-character letter string then appeared in one of the four quadrants of the screen. The judges were instructed to press a specific key on the left-hand side of the computer (the “E” key, which was marked with a red dot) when the letter string appeared in one of the quadrants on the left and to press a specific key on the right-hand side of the computer (the “I” key, which was also marked with a red dot) when the letter string appeared in one of the two quadrants on the right. Reminders as to which key to press also remained on the computer screen throughout the first task (that is, “press the ‘E’ key for left” and “press the ‘I’ key for right”). When the judges identified the quadrant correctly, the word “correct” would appear in the center of the screen. When the judges made an error, the word “error” would appear instead. In either case, the three asterisks would then replace the words “correct” or “error” and the task would repeat. The exact instructions the judges saw are below.

Once you begin the first computer task, the screen will go blank, then three asterisks (* * *) will appear in the center. Focus your attention on these. A string of letters will then appear in the upper-right, lower-right, upper-left, or lower-left portion of the computer screen.

If the string appears on the left-hand side (either up or down), press the “E” key.

If the string appears on the right-hand side (either up or down), press the “I” key.

If you correctly identify the position, the screen will flash the word “correct”;

if you identify the wrong position, the screen will flash the word “error”.

The task will then repeat a number of times. Other words may appear with the letter string. Ignore these and try to identify the position of the letters as quickly as possible.

When you are ready, press the space bar to begin the task.

[90] Each trial thus proceeded as follows: the three asterisks would appear in the center of the screen; 1200 milliseconds later (1.2 seconds) one of the prime words (selected at random) would appear in one of the four quadrants (at random as determined by the computer); 153 milliseconds after that, the letter-string would appear over the prime; this would remain until the judge pressed either the “E” or “I” key; then either the word “correct” or “error” would appear in the center (depending upon the judge’s response) and would remain for roughly one second; then the three asterisks would replace the word “correct” or “error”; and the process would repeat. Due to an error in the computer programming, the judges in the eastern conference were only exposed to the subliminal prime for 64 milliseconds, rather than 153 milliseconds.
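To make the sequence concrete, the timeline of a single priming trial can be written out as data. The durations are those stated in this note; the representation itself is ours.

    # Schematic timeline of one priming trial (durations in milliseconds).
    TRIAL_TIMELINE = [
        ("fixation", "* * *", 1200),            # asterisks in the screen center
        ("prime", "<prime word>", 153),         # random quadrant; 64 ms at the
                                                # eastern conference (see above)
        ("target", "<letter string>", None),    # remains until "E" or "I" pressed
        ("feedback", "correct / error", 1000),  # roughly one second
    ]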

[91] Graham and Lowery reported that none of the officers in their study was able to identify the nature of the words being shown to them. Graham and Lowery, above n 15, at 491. We did not ask our judges their assessment of what the words were.

[92] The words came directly from the Graham and Lowery study: graffiti, Harlem, homeboy, jerricurl, minority, mulatto, negro, rap, segregation, basketball, black, Cosby, gospel, hood, Jamaica, roots, afro, Oprah, Islam, Haiti, pimp, dreadlocks, plantation, slum, Tyson, welfare, athlete, ghetto, calypso, reggae, rhythm, soul: ibid at 489 n 5.

[93] These words also came directly from Graham and Lowery: baby, enjoyment, heaven, kindness, summer, sunset, truth, playful, accident, coffin, devil, funeral, horror, mosquito, stress, toothache, warmth, trust, sunrise, rainbow, pleasure, paradise, laughter, birthday, virus, paralysis, loneliness, jealousy, hell, execution, death, agony. Graham and Lowery used neutral words that matched the words associated with black Americans for positive or negative associations: ibid.

[94] Our study differed from that of Graham and Lowery in several ways, any of which might have affected the results. First, Graham and Lowery used 80 trials, rather than the 60 we used. ibid at 489–490. Second, because we ran a large group of judges at the same time, we did not use audible beeps to indicate correct responses. Third, our hypothetical defendants differed. We did not have access to the original materials Graham and Lowery used, and so wrote our own. See fact pattern below at Appendix A. Fourth, we asked fewer questions concerning the hypothetical defendants. Although we do not see how any of these differences would necessarily affect the results, priming tasks can be sensitive to details.

[95] The following appeared on the screen:

Thank you for completing the first computer task.

Now please turn to the written materials.

Please leave this computer on with the screen up.

After you have completed four pages of written materials, please press the space bar to continue with the final computer tasks.

In case a judge accidentally or mistakenly hit the space bar, we added another intervening page before the second computer task, which appeared once the space bar was pressed. It read as follows:

If you have completed the four case summaries, please press the space bar to begin the final computer task.

[96] The location of the crime would reveal the jurisdiction and hence we delete it. The location was an upscale shopping district.

[97] The exact materials for this scenario and all others are included below at Appendix A.

[98] The options were as follows:

1. dismiss it with an oral warning

2. adjourn the case in contemplation of dismissal (assuming William gets in no further trouble)

3. put William on probation for six months or less

4. put William on probation for more than six months

5. commit William to a juvenile detention facility for six months or less

6. commit William to a juvenile detention facility for more than six months

7. transfer William to adult court.

[99] The results were as follows: Question 1, z = 0.51, p = .61; Question 2, z = 0.73, p = .46; Question 3, z = 1.09, p = .28.

[100] To accomplish this analysis, we conducted an ordered logit regression of the judges’ disposition against the priming condition, the judges’ IAT scores, and an interaction of the two. The interaction term reflects the effect of the IAT score on how the prime affected the judge. This term was marginally significant in the model, z = 1.84, p = .07.
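For illustration, a model of this form could be fit with statsmodels’ OrderedModel, as in the following sketch (simulated data; the variable names primed, iat, and disposition are our own, not the study’s):

    # Minimal sketch of an ordered logit with a prime x IAT interaction.
    import numpy as np
    import pandas as pd
    from statsmodels.miscmodels.ordinal_model import OrderedModel

    rng = np.random.default_rng(0)
    n = 100
    df = pd.DataFrame({
        "primed": rng.integers(0, 2, n),  # 1 = black prime, 0 = neutral prime
        "iat": rng.normal(0.4, 0.4, n),   # IAT score (white preference positive)
    })
    # Seven ordered disposition categories, as in the list at n 98.
    disposition = pd.Series(pd.Categorical(rng.integers(1, 8, n), ordered=True))

    exog = df.assign(primed_x_iat=df["primed"] * df["iat"])
    result = OrderedModel(disposition, exog, distr="logit").fit(method="bfgs", disp=False)
    print(result.summary())  # the z and p for primed_x_iat test the interaction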

[101] For the first recidivism question, z = 1.41, p = .16. On the second recidivism question, z = 1.49, p = .14. On these questions, the black judges and the white judges seemed to respond in similar ways. We ran the full model (predictors of prime, race of judge, IAT, and all interactions between these variables) on all three variables as well. Adding the race-of-judge terms and interactions did not produce any significant effects.

[102] The use of an armed robbery breaks somewhat with Graham and Lowery, who had used two simple property crimes. See Graham and Lowery, above n 15, at 490.

[103] The results were as follows: Question 1, z = 0.17, p = .87; Question 2, z = 0.09, p = .93; and Question 3, z = 1.62, p = .11.

[104] Our findings were: z = 1.85, p = .06.

[105] For the first recidivism question, z = 0.62, p = .53; on the second recidivism question, z = 0.54, p = .59. As above, on these questions, the black judges and the white judges seemed to respond in similar ways. We ran the full model (predictors of prime, race of judge, IAT, and all interactions between these variables) on all three variables as well. Adding the race-of-judge terms and interactions did not produce any significant effects.

[106] See Graham and Lowery, above n 15, at 493–494, 496.

[107] ibid. Only police officers predicted that the defendant was more likely to recidivate; parole officers did not show any differences on this question.

[108] S Sommers and P Ellsworth, “White juror bias: an investigation of prejudice against black defendants in the American courtroom” (2001) 7 Psychol Pub Pol’y & L 201 at 216–217. We thank the authors for graciously sending us the materials and giving us permission to use them.

[109] We used the same question to elicit verdicts and confidence ratings as the one Sommers and Ellsworth used: “Based on the available evidence, if this were a bench trial, would you convict the defendant?” Below this were the words “Yes” and “No”. Finally, we asked the judges, “How confident are you that your judgment is correct?” Below this question, the materials presented a nine-point scale, with “1” labeled “Not at all Confident” and “9” labeled “Very Confident”. ibid at 217; see also Appendix A (providing the materials used in our study).

[110] This difference was not statistically significant: Fisher’s exact test, p = .62.

[111] The difference between our results and those obtained by Sommers and Ellsworth is significant: χ2(1) = 6.74, p < .01 (using the expected conviction rates of 70% for Caucasian defendants and 90% for African American defendants, as reported by Sommers and Ellsworth, above n 108, at 217).

[112] The analysis consisted of a logistic regression of the verdict against the race of the defendant, the race of the judge, and the interaction of these two parameters. The interaction was significant, z = 2.12, p = .03, which was the result of the differential treatment of the two defendants by the black judges. The race of the defendant was also significant, z = 2.81, p = .005, indicating that overall, the judges were less likely to convict the African American defendant than the Caucasian defendant.
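A comparable sketch of this logistic regression (again on simulated data, with hypothetical variable names convict, black_defendant, and black_judge):

    # Minimal sketch of the verdict regression with a race x race interaction.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)
    n = 133
    df = pd.DataFrame({
        "convict": rng.integers(0, 2, n),          # 1 = guilty verdict
        "black_defendant": rng.integers(0, 2, n),  # 1 = African American defendant
        "black_judge": rng.integers(0, 2, n),      # 1 = black judge
    })

    # The "*" in the formula adds both main effects and their interaction.
    result = smf.logit("convict ~ black_defendant * black_judge", data=df).fit(disp=False)
    print(result.summary())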

[113] We combined the nine-point confidence measure with the binary outcome to create an eighteen-point scale. In our coding, a “1” corresponded to a judge who was very confident that the defendant should be acquitted, whereas an “18” corresponded to a judge who was very confident that the defendant should be convicted. The average confidence that the judges expressed in the defendant’s guilt was as follows: white judges judging Caucasian defendants, 13.64; white judges judging African American defendants, 12.2; black judges judging Caucasian defendants, 16.08; black judges judging African American defendants, 9.89. Statistical analysis of these results (by ANOVA) produced results consistent with the analysis of the verdicts alone. That is, the judges were significantly more convinced of the Caucasian defendant’s guilt than of the African American defendant’s guilt (F(1, 129) = 15.04, p < .001). This disparity was much more pronounced among black judges (F(1, 129) = 5.84, p < .025).
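One coding consistent with this description (the mapping below is our assumption; the note does not spell out the exact arithmetic) is:

    # Map a verdict plus a 1-9 confidence rating onto the 18-point scale:
    # 1 = very confident acquittal ... 18 = very confident conviction.
    def guilt_scale(convict: bool, confidence: int) -> int:
        assert 1 <= confidence <= 9
        return 9 + confidence if convict else 10 - confidence

    assert guilt_scale(False, 9) == 1   # very confident the defendant should be acquitted
    assert guilt_scale(True, 9) == 18   # very confident the defendant should be convicted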

[114] To accomplish this analysis, we conducted a logistic regression of the judges’ verdict against the priming condition, the judges’ IAT scores, and an interaction of the two. The interaction term reflects the effect of the IAT score on how the race of the defendant affected the judges’ verdict. This term was not significant in the model, z = 1.04, p = .30.

[115] We also replicated this analysis with the eighteen-point confidence ratings. See n 118. Specifically, we regressed the judges’ confidence in the defendant’s guilt against the defendant’s race, the judges’ IAT score, and the interaction between the race and IAT score. As with the verdict itself, this analysis showed that the race of the defendant was significant, t-ratio = 3.49, p < .001, but the interaction between race of defendant and IAT score was not, t-ratio = 1.51, p = .13.

[116] In this analysis, the race of the defendant and the interaction between race of judge and race of the defendant were significant, just as they were in the simpler models. (Race of defendant, z = 1.99, p = .05; interaction between race of the judge and race of the defendant, z = 2.35, p = .02. The interaction of the defendant’s race and IAT score was not significant, z = 1.00, p = .23.)

[117] The result was as follows: z = 2.18, p = .03.

[118] Regressing the eighteen-point confidence rating against the race of the judge, the race of the defendant, the judges’ IAT scores, and all interactions between these variables revealed significant effects for race of the defendant, t-ratio = 2.95, p = .005; a significant interaction of race of the defendant with race of the judge, t-ratio = 2.68, p = .01; and the three-way interaction of race of judge, race of defendant, and IAT score, t-ratio = 2.68, p = .02. The interaction of race of defendant and IAT scores was still not significant in this model, t-ratio = 1.27, p = .20.

[119] The results are as follows: z = 1.15, p = .25.

[120] The results are as follows: z = 1.87, p = .06. Given the high conviction rate of the black judges for the Caucasian defendant, this trend actually meant that they were more likely to convict the African American defendant to the extent that they exhibited greater white preferences on the IAT.

[121] The white judges displayed a greater propensity to convict the Caucasian defendant relative to the African American defendant as the IAT score increased, but the trend did not approach significance, t-ratio = 1.00, p = .40. The black judges showed the opposite trend, which was significant: t-ratio = 2.25, p = .03.

[122] S Carpenter, “Buried prejudice: the bigot in your brain”, Sci Am Mind, May 2008, at 32.

[123] See G Moskowitz and A Salomon, “Preconsciously controlling stereotyping: implicitly activated egalitarian goals prevent the activation of stereotypes” (2000) 18 Soc Cognition 151.

[124] See T Eisenberg and S Johnson, “Implicit racial attitudes of death penalty lawyers” (2004) 53 DePaul L Rev 1539 at 1540 (“One would hope that those who represent capital defendants (or at least African-American capital defendants) would themselves be free of racialised thinking ...”.)

[125] ibid at 1546–1548.

[126] See Sommers and Ellsworth, above n 108, at 217.

[127] See Model Code of Judicial Conduct: Canon 2 at www.americanbar.org/groups/professional_responsibility/publications/model_code_of_judicial_conduct/model_code_of_judicial_conduct_canon_2/, accessed 15 July 2021. (“A judge shall perform the duties of judicial office impartially, competently, and diligently.”).

[128] See, eg American Bar Association, Black letter guidelines for the evaluation of judicial performance, at Guideline 5-2.3, available at www.americanbar.org/content/dam/aba/publications/judicial_division/aba_blackletterguidelines_jpe.pdf, accessed 15 July 2021, (prescribing “[a]bsence of favor or disfavor toward anyone, including but not limited to favor or disfavor based upon race, sex, religion, national origin, disability, age, sexual orientation, or socioeconomic status”).

[129] See Glaser and Knowles, above n 45, at 171.

[130] During our presentation, one of us asked for a show of hands to indicate how many thought we were studying race. Although this was not an ideal way to make the inquiry, and we did not keep a precise count, most of the judges raised their hands.

[131] See, eg, K Abrams, “Black judges and ascriptive group identification”, Norms and the Law, J Drobak ed, 2006, p 208, at p 215 (“The most noteworthy feature of these studies is that they find no consistent, and only a few salient, differences in decision making that correlate with the race of the judge.”).

[132] See Carpenter, above n 122, at 37–38.

[133] These data were collected by us at a conference of New York City administrative law judges in the summer of 2008. As one of the questions, we asked the following:

Relative to the other judges attending this conference, how would you rate yourself on the following:

Avoiding racial bias in making decisions:

  • in the highest quartile (meaning that you are more skilled at this than 75% of the judges attending this conference)

  • in the second highest quartile (meaning that you are more skilled at this than 50% of the judges in this room, but less skilled than 25% of the judges attending this conference)

  • in the second lowest quartile (meaning that you are more skilled at this than 25% of the judges in this room, but less skilled than 50% of the judges attending this conference)

  • in the lowest quartile (meaning that you are less skilled at this than 75% of the judges attending this conference).

[134] Jolls and Sunstein, above n 10, at 988–990; Kang and Banaji, above n 13, at 1105–1108.

[135] See, eg, Kang and Banaji, above n 13, at 1112 (“In Grutter v Bollinger, the court emphasised that student diversity was valuable because it could help ‘break down racial stereotypes’.” (quoting Grutter v Bollinger, 539 US 306, 330 (2003))); see also Kang, above n 14, at 1579–1583 (arguing that public broadcasting should be regulated so as to promote positive images of minorities).

[136] Bureau of Justice Statistics, US Department of Justice, State Court Processing Statistics, Felony Defendants in Large Urban Counties, 2009 — Statistical Tables, 2009, at p 5, available at https://bjs.ojp.gov/library/publications/felony-defendants-large-urban-counties-2009-statistical-tables, accessed 15 July 2021, (stating that an estimated 45% of defendants were black).

[137] See Eisenberg and Johnson, above n 124, at 1553–1556.

[138] Others have made tentative suggestions that the IAT be used as a screening device for certain professions. See, eg, I Ayres, Pervasive Prejudice?, University of Chicago Press, 2001 (“Implicit attitude testing might also itself be used as a criterion for hiring both governmental and nongovernmental actors.”).

[139] Green et al, above n 16, at 1237 (“These findings support the IAT’s value as an educational tool.”).

[140] See ibid (recommending “securely and privately administered IATs to increase physicians’ awareness of unconscious bias”).

[141] See Carpenter, above n 122, at 32.

[142] T Wilson et al, “Mental Contamination and the Debiasing Problem”, Heuristics and Biases, T Gilovich et al eds, 2002, p 185 at p 190.

[143] See Guthrie et al, Judicial Mind, above n 64, at 814–815.

[144] See Green et al, above n 16, at 1237.

[145] Wilson et al, above n 142, at p 185.

[146] ibid at p 187.

[147] See ibid at p 191. (“Three kinds of errors have been found: insufficient correction (debiasing in the direction of accuracy that does not go far enough), unnecessary correction (debiasing when there was no bias to start with), and over-correction (too much debiasing, such that judgments end up biased in the opposite direction).”).

[148] See ibid (suggesting that people’s “corrected judgments might be worse than their uncorrected ones”); see also A Page, “Batson’s blind-spot: unconscious stereotyping and the peremptory challenge” (2005) 85 Boston University Law Review 155 at 239–240 (“One major problem for any correction strategy is determining the magnitude of the correction required. Unfortunately, people are not very good at this determination. Some research suggests that among those who are very motivated to avoid discrimination, over-correction is a common problem. ... A second problem is that a correction strategy appears to require significant cognitive resources ....” (citations omitted)); ibid at 241–242 (“[T]o consciously and willfully regulate one’s own ... evaluations [and] decisions ... requires considerable effort and is relatively slow. Moreover, it appears to require a limited resource that is quickly used up, so conscious self-regulatory acts can only occur sparingly and for a short time.” (omissions in original) (quoting J Bargh and T Chartrand, “The unbearable automaticity of being” (1999) 54 American Psychologist 462 at 476)).

[149] See C Parsons et al, “Strike three: umpires’ demand for discrimination”, Nat’l Bureau of Econ Research, Working Paper Series, Paper No 13665, 2007, pp 24–25 at https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1077091, accessed 19 July 2021; J Price and J Wolfers, “Racial discrimination among NBA referees”, Nat’l Bureau of Econ Research, Working Paper Series, Paper No 13206, 2007, p 30, available at http://ssrn.com/abstract=997562, accessed 19 July 2021.

[150] Accountability improves performance in other contexts, so it likely would do so for judges as well. See J Lerner and P Tetlock, “Accounting for the effects of accountability” (1999) 125 Psychol Bull 255 at 270–271.

[151] See Guthrie et al, How Judges Decide, above n 64, at p 32.

[152] See, eg J Dubofsky, “Judicial performance review: a balance between judicial independence and public accountability” (2007) 34 Fordham Urb L J 315 at 320–322 (explaining that the judicial performance review system in Colorado focuses only on a judge’s performance in a particular case).

[153] See M Solimine, “Congress, Ex parte Young, and the fate of the three-judge District Court” (2008) 70 University of Pittsburgh Law Review 101 at 128–134.

[154] J Peresie, “Female judges matter: gender and collegial decision making in the federal appellate courts” (2005) 114 Yale L J 1759 at 1778.

[155] Notes and comments, “Judicial limitation of three-judge court jurisdiction” (1976) 85 Yale L J 564.

[156] A Hellman, “Legal problems of dividing a state between federal judicial circuits” (1974) 122 University of Pennsylvania Law Review 1188 at 1225.

[157] See Peresie, above n 154, at 1778.

[158] The faces were taken from the Project Implicit website. See B Nosek et al, Project Implicit, Stimulus Materials, 2006, at www.projectimplicit.net/resources/study-materials/, accessed 19 July 2021. They include only the center of the face, with ears, hair, and anything below the chin cropped out. None of the faces has facial hair, eyeglasses, or distinguishing features: ibid (providing faces that can be downloaded under the “race faces” stimulus set).

[159] In this respect we varied from the procedures recommended by Greenwald and his colleagues (see Greenwald et al, above n 35, at 198) by reducing the practice rounds from the 20 trials they suggested to 16. We did this in the interest of saving time. We did retain the 40 trials in the critical rounds. We had more time available in the western jurisdiction, and there increased the length of rounds three and six to 20 trials.
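For orientation, the implied trial counts per round on the standard seven-round IAT (our reconstruction, not the authors’ materials; see also n 187) are:

    # Trials per round in the eastern administration (assumed layout;
    # the western jurisdiction ran rounds 3 and 6 at 20 trials).
    ROUND_TRIALS = {
        1: 16,  # practice: faces only (20 in Greenwald et al's recommendation)
        2: 16,  # practice: words only
        3: 16,  # practice: combined task
        4: 40,  # critical combined round
        5: 16,  # practice: reversed keys
        6: 16,  # practice: reversed combined task
        7: 40,  # critical reversed combined round
    }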

[160] The exact instructions were as follows:

1. 

In the first round, the two CATEGORIES that you are to distinguish are:

BLACK vs WHITE faces.

  • Press the “E” key if the TARGET is a WHITE face.

  • Press the “I” key if the TARGET is a BLACK face.

2. 

Remember that an “X” will appear when you make an error. Whenever the “X” appears, correct the mistake by pressing the other key.

3. 

Please respond AS RAPIDLY AS POSSIBLE, but don’t respond so fast that you make many errors. (Occasional errors are okay.)

4. 

Press the space bar when you are ready to begin.

[161] Greenwald et al, above n 35, at 212–215.

[162] Nosek et al, above n 23, at 103–104.

[163] Greenwald et al, above n 35, at 212–215.

[164] ibid at 201–202.

[165] ibid at 203.

[166] ibid at 214 tbl 4.

[167] Nosek et al, above n 23, at 103–104.

[168] Project Implicit, “Services”, 2002, at www.projectimplicit.net/services/, accessed 19 July 2021.

[169] See Nosek et al, above n 23, at 104.

[170] ibid.

[171] ibid.

[172] ibid.

[173] ibid.

[174] ibid.

[175] ibid.

[176] ibid.

[177] None of the judges provided latencies that were less than 300 ms in either of the two critical rounds measuring the race IAT; two of the judges provided responses that were faster than 300 ms in the gender IAT (one round each). Many more of the judges produced latencies that exceeded 3000 ms. On the race IAT, 58 judges (or 50.4%) produced at least one latency greater than 3000 ms in the stereotype-congruent round (round four). Specifically, in the stereotype-congruent round: 33 judges produced one long latency; 20 produced two; three produced three; and two produced four. In the stereotype-incongruent round on the race IAT (round seven), 68 judges (or 59.1%) produced at least one latency greater than 3000 ms. Specifically, in the stereotype-incongruent round: 33 judges produced one long latency; 12 produced two; 10 produced three; four produced four; two produced five; four produced six; and three produced seven. On the gender IAT, 57 judges (or 49.6%) produced at least one latency greater than 3000 ms in the stereotype-congruent round (round seven). Specifically, in the stereotype-congruent round: 36 judges produced one long latency; seven produced two; nine produced three; three produced four; one produced five; and one produced eight. In the stereotype-incongruent round on the gender IAT (round four), 56 judges (or 48.7%) produced at least one latency greater than 3000 ms. Specifically, in the stereotype-incongruent round: 27 judges produced one long latency; 15 produced two; six produced three; three produced four; two produced five; one produced six; and one produced seven. Note that because some of these long latencies fell into the first two rounds, they are not included in the analysis.
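Schematically, the screening described here uses the conventional 300 ms and 3000 ms bounds (a sketch of ours, not the authors’ code):

    # Count and remove extreme latencies for one round of IAT trials.
    def screen_latencies(latencies_ms):
        """Return (kept, n_below_300, n_above_3000) for a list of trial latencies."""
        n_fast = sum(1 for t in latencies_ms if t < 300)
        n_slow = sum(1 for t in latencies_ms if t > 3000)
        kept = [t for t in latencies_ms if 300 <= t <= 3000]
        return kept, n_fast, n_slow

    kept, n_fast, n_slow = screen_latencies([650, 820, 3400, 290, 1500])
    print(n_fast, n_slow, kept)  # 1 1 [650, 820, 1500]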

[178] One of the judges violated both criteria. We calculated both means after excluding the first two rounds.

[179] Four judges violated both criteria.

[180] Nosek et al, above n 23, at 104.

[181] ibid.

[182] ibid.

[183] Greenwald et al, above n 35, at 210 tbl 2, report the effect of order with a correlation coefficient, rather than a mean or percent difference. They report that the correlation varies with the IAT, noting that the gender IAT that we used here produces a higher correlation between order and IAT score than do other IATs. They report correlations as high as 0.29 (depending upon the scoring method), which would mean that order can account for up to 10% of the IAT score: ibid. By contrast, the race IAT that we used produces small correlations with order, ranging from 0.002 to 0.054; thus, order accounts for, at most, one-quarter of 1% of the IAT score. The order effects seem to vary with context, and hence we cannot be certain of the extent of the influence of order on our materials.

[184] Had we randomised the order, each judge’s IAT score would have varied with the order to some extent. This would have introduced some variation to the IAT score that would inherently reduce the correlation we observed across all judges. Our measure of the IAT score across all judges would have been more reliable had we randomised, but the IAT score for the individual judges would have been less consistent, thereby interfering with the correlation.

[185] Greenwald et al, above n 35, at 199–200.

[186] ibid.

[187] In the eastern and western samples we reduced the number of trials in the practice rounds (rounds 1, 2, 3, 5, and 6) from 20 to 16, so as to save time.

[188] Greenwald et al, above n 35, at 213.

[189] ibid at 214–215.

[190] In the race IAT, 29 out of the 33 instances in which judges produced latency scores of greater than 10,000 ms on a trial (or 87.9%) occurred during the practice rounds. In the gender IAT, the two instances in which judges exhibited trials that exceeded 10,000 ms occurred in the target round.

[191] Note that these correlations used all judges, with no exclusions for speed, did not bound the data between 300 and 3000 ms, and did not exclude the first two rounds, as we did for calculating the mean differences.