Need help

ATTACHED FILE(S)
HIRING
The Problem with Using
Personality Tests for Hiring
by Whitney Martin
AUGUST 27, 2014
A decade ago, researchers discovered something that should have opened eyes and raised red flags
in the business world.
Sara Rynes, Amy Colbert, and Kenneth Brown conducted a study in 2002 to determine whether the
beliefs of HR professionals were consistent with established research findings on the effectiveness
of various HR practices. They surveyed 1,000 Society for Human Resource Management (SHRM)
members — HR Managers, Directors, and VPs — with an average of 14 years’ experience.
The results? The area of greatest disconnect was in staffing — one of the lynchpins of HR. This was
particularly prevalent in the area of hiring assessments, where more than 50% of respondents were
unfamiliar with prevailing research findings.
http://www.cebma.org/wp-content/uploads/Rynes-et-al-HR-Professionals-belief-about-effective-human-resource-practices-HRM-2002.pdf
Several studies since have explored why these research findings have seemingly failed to transfer to
HR practitioners. Among the causes are that HR professionals often don’t have time to read
the latest research; that the research itself is often presented with technically complex language and data;
and that the prospect of introducing an entirely new screening measure is daunting from multiple
angles.
At the same time, anyone who has ever been responsible for hiring, much less managing, employees
knows that there is a wide variation in worker performance levels across jobs. Therefore, it is critical
for organizations to understand what differences among individuals systematically affect job
performance so that the candidates with the greatest probability of success can be hired.
So what are the most effective screening measures?
Extensive research has been done on the ability of various hiring methods and measures to actually
predict job performance. A seminal work in this area is Frank Schmidt’s meta-analysis of a century’s
worth of workplace productivity data, first published in 1998 and recently updated. The table below
shows the predictive validity of some commonly used selection practices, sorted from most
effective to least effective, according to his latest analysis that was shared at the Personnel Testing
Council of Metropolitan Washington chapter meeting this past November:
So if your hiring process relies primarily on interviews, reference checks, and personality tests, you
are choosing to use a process that is significantly less effective than it could be if more effective
measures were incorporated.
[Table: predictive validity of commonly used selection practices — https://hbr.org/resources/images/article_assets/2014/08/themosteffective.gif]
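As an illustration, "predictive validity" here is simply the correlation between scores on a selection measure and a later measure of job performance. The sketch below computes that figure with invented numbers; nothing in it comes from Schmidt's data.

```python
# Illustrative sketch only: predictive validity as a correlation between a
# selection-measure score and later job performance. All numbers are invented.
import numpy as np

# Hypothetical data: one assessment score and a later performance rating per hire.
test_scores = np.array([62, 71, 55, 80, 67, 90, 48, 73, 85, 60], dtype=float)
performance = np.array([3.1, 3.4, 2.8, 4.0, 3.3, 4.4, 2.5, 3.6, 4.1, 3.0])

# Pearson correlation = predictive validity of this (hypothetical) measure.
validity = np.corrcoef(test_scores, performance)[0, 1]
print(f"Predictive validity (r) = {validity:.2f}")

# Squaring r gives the share of performance variance the measure explains,
# which is why a validity of .3 versus .5 is a bigger practical gap than it looks.
print(f"Variance in performance explained = {validity**2:.0%}")
```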
ANALYTICS
A Guide to Solving Social
Problems with Machine Learning
by Jon Kleinberg, Jens Ludwig, and Sendhil Mullainathan
DECEMBER 08, 2016
It’s Sunday night. You’re the deputy mayor of a big city. You sit down to watch a movie and ask
Netflix for help. (“Will I like Birdemic? Ishtar? Zoolander 2?”) The Netflix recommendation algorithm
predicts what movie you’d like by mining data on millions of previous movie-watchers using
sophisticated machine learning tools. And then the next day you go to work and every one of your
agencies will make hiring decisions with little idea of which candidates would be good workers;
community college students will be largely left to their own devices to decide which courses are too
hard or too easy for them; and your social service system will implement a reactive rather than
preventive approach to homelessness because they don’t believe it’s possible to forecast which
families will wind up on the streets.
You’d love to move your city’s use of predictive analytics into the 21st century, or at least into the
20th century. But how? You just hired a pair of 24-year-old computer programmers to run your data
science team. They’re great with data. But should they be the ones to decide which problems are
amenable to these tools? Or to decide what success looks like? You’re also not reassured by the
vendors the city interacts with. They’re always trying to up-sell you the very latest predictive tool.
Decisions about how these tools are used seem too important for you to outsource, but raise a host
of new issues that are difficult to understand.
This mix of enthusiasm and trepidation over the
potential social impact of machine learning is not
unique to local government or even to
government: non-profits and social entrepreneurs
share it as well. The enthusiasm is well-placed.
For the right type of problem, there are enormous
gains to be made from using these tools. But so is
the trepidation: as with all new “products,” there is potential for misuse. How can we maximize the
benefits while minimizing the harm?
In applying these tools the last few years, we have focused on exactly this question. We have learned
that some of the most important challenges fall within the cracks between the discipline that builds
algorithms (computer science) and the disciplines that typically work on solving policy problems
(such as economics and statistics). As a result, few of these key challenges are even on anyone’s
radar screen. The good news is that many of these challenges, once recognized, are fairly
straightforward to solve.
We have distilled what we have learned into a “buyer’s guide.” It is aimed at anyone who wants to
use data science to create social good, but is unsure how to proceed.
How machine learning can improve public policy
First things first: There is always a new “new thing.” Especially in the social sector. Are these
machine learning tools really worth paying attention to?
Yes. That’s what we’ve concluded from our own proof-of-concept project, applying machine
learning to a dataset of over one million bond court cases (in joint work with Himabindu Lakkaraju
and Jure Leskovec of Stanford University). Shortly after arrest, a judge has to decide: will the
defendant await their legal fate at home? Or must they wait in jail? This is no small question. A
typical jail stay is between two and three months. In making this life-changing decision, by law, the
judge has to make a prediction: if released, will the defendant return for their court appearance, or
will they skip court? And will they potentially commit further crimes?
We find that there is considerable room to improve on judges’ predictions. Our estimates show that
if we made pre-trial release decisions using our algorithm’s predictions of risk instead of relying on
judge intuition, we could reduce crimes committed by released defendants by up to 25% without
having to jail any additional people. Or, without increasing the crime rate at all, we could jail up to
42% fewer people. With 12 million people arrested every year in the U.S., this type of tool could let
us reduce jail populations by up to several hundred thousand people. And this sort of intervention
is relatively cheap. Compared to investing millions (or billions) of dollars into more social programs
or police, the cost of statistically analyzing administrative datasets that already exist is next-to-
nothing. Plus, unlike many other proposals to improve society, machine learning tools are easily
scaled.
By now, policymakers are used to hearing claims like this in sales pitches, and it is appropriate
for them to respond with some skepticism. One reason it’s hard to be a good buyer of machine learning
solutions is that there are so many overstated claims. It’s not that people are intentionally misstating
the results from their algorithms. In fact, applying a known machine learning algorithm to a dataset
is often the most straightforward part of these projects. The part that’s much more difficult, and the
reason we struggled with our own bail project for several years, is accurately evaluating the potential
impact of any new algorithm on policy outcomes. We hope the rest of this article, which draws on
our own experience applying machine learning to policy problems, will help you better evaluate
these sales pitches and make you a critical buyer as well.
Look for policy problems that hinge on prediction
Our bail experience suggests that thoughtful application of machine learning to policy can create
very large gains. But sometimes these tools are sold like snake oil, as if they can solve every problem.
Machine learning excels at predicting things. It can inform decisions that hinge on a prediction, and
where the thing to be predicted is clear and measurable.
For Netflix, the decision is what movie to watch. Netflix mines data on large numbers of users to try
to figure out which people have prior viewing histories that are similar to yours, and then it
recommends to you movies that these people have liked. For our application to pre-trial bail
decisions, the algorithm tries to find past defendants who are like the one currently in court, and
then uses the crime rates of these similar defendants as the basis for its prediction.
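As a minimal sketch of that idea (not the authors' actual model), a nearest-neighbors classifier makes the "similar past defendants" logic concrete; the features, data, and choice of k below are illustrative assumptions only.

```python
# Sketch of "find similar past cases and use their outcomes" with k-nearest neighbors.
# NOT the authors' model; features, data, and hyperparameters are invented.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical historical data: [age, prior_arrests, prior_failures_to_appear]
past_defendants = np.array([
    [22, 0, 0], [35, 4, 2], [41, 1, 0], [19, 2, 1], [30, 6, 3], [27, 0, 0],
])
# 1 = committed a crime / skipped court after release, 0 = did not.
past_outcomes = np.array([0, 1, 0, 1, 1, 0])

model = KNeighborsClassifier(n_neighbors=3)
model.fit(past_defendants, past_outcomes)

# Current defendant in court: the predicted probability is simply the failure
# rate among the most similar past defendants.
current = np.array([[26, 1, 0]])
risk = model.predict_proba(current)[0, 1]
print(f"Estimated risk of failure if released: {risk:.2f}")
```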
If a decision is being made that already depends on a prediction, why not help inform this decision
with more accurate predictions? The law already requires bond court judges to make pre-trial
release decisions based on their predictions of defendant risk. Decades of behavioral economics and
social psychology teach us that people will have trouble making accurate predictions about this risk
– because it requires things we’re not always good at, like thinking probabilistically, making
attributions, and drawing inferences. The algorithm makes the same predictions judges are already
making, but better.
But many social-sector decisions do not hinge on a prediction. Sometimes we are asking whether
some new policy or program works – that is, questions that hinge on understanding the causal effect
of something on the world. The way to answer those questions is not through machine learning
prediction methods. We instead need tools for causation, like randomized experiments. In addition,
just because something is predictable, that doesn’t mean we are comfortable having our decision
depend on that prediction. For example we might reasonably be uncomfortable denying welfare to
someone who was eligible at the time they applied just because we predict they have a high
likelihood to fail to abide by the program’s job-search requirements or fail a drug test in the future.
Make sure you’re comfortable with the outcome you’re predicting
Algorithms are most helpful when applied to problems where there is not only a large history of past
cases to learn from but also a clear outcome that can be measured, since measuring the outcome
concretely is a necessary prerequisite to predicting. But a prediction algorithm, on its own, will
focus relentlessly on predicting the outcome you provide as accurately as possible at the expense of
everything else. This creates a danger: if you care about other outcomes too, they will be ignored. So
even if the algorithm does well on the outcome you told it to focus on, it may do worse on the other
outcomes you care about but didn’t tell it to predict.
This concern came up repeatedly in our own work on bail decisions. We trained our algorithms to
predict the overall crime rate for the defendants eligible for bail. Such an algorithm treats every
crime as equal. But what if judges (not unreasonably) put disproportionate weight on whether a
defendant engages in a very serious violent crime like murder, rape, or robbery? It might look like
the algorithm’s predictions lead to “better outcomes” when we look at overall rates of crime. But
the algorithm’s release rule might actually be doing worse than the judges with respect to serious
violent crimes specifically. The possibility of this happening doesn’t mean algorithms can’t still be
useful. In bail, it turns out that different forms of crime are correlated enough so that an algorithm
trained on just one type of crime winds up out-predicting judges on almost every measure of
criminality we could construct, including violent crime. The point is that the outcome you select for
your algorithm will define it. So you need to think carefully about what that outcome is and what
else it might be leaving out.
Check for bias
Another serious example of this principle is the role of race in algorithms. There is the possibility
that any new system for making predictions and decisions might exacerbate racial disparities,
especially in policy domains like criminal justice. Caution is merited: the underlying data used to
train an algorithm may be biased, reflecting a history of discrimination. And data scientists may
sometimes inadvertently report misleading performance measures for their algorithms. We should
take seriously the concern about whether algorithms might perpetuate disadvantage, no matter
what the other benefits.
Ultimately, though, this is an empirical question. In our bail project, we found that the algorithm can
actually reduce race disparities in the jail population. In other words, we can reduce crime, jail
populations and racial bias – all at the same time – with the help of algorithms.
This is not some lucky happenstance. An appropriate first benchmark for evaluating the effect of
using algorithms is the existing system – the predictions and decisions already being made by
humans. In the case of bail, we know from decades of research that those human predictions can be
biased. Algorithms have a form of neutrality that the human mind struggles to obtain, at least within
their narrow area of focus. It is entirely possible—as we saw—for algorithms to serve as a force for
equity. We ought to pair our caution with hope.
The lesson here is that if the ultimate outcome you care about is hard to measure, or involves a hard-
to-define combination of outcomes, then the problem is probably not a good fit for machine
learning. Consider a problem that looks like bail: Sentencing. Like bail, sentencing of people who
have been found guilty depends partly on recidivism risk. But sentencing also depends on things
like society’s sense of retribution, mercy, and redemption, which cannot be directly measured. We
intentionally focused our work on bail rather than sentencing because it represents a point in the
criminal justice system where the law explicitly asks narrowly for a prediction. Even if there is a
measurable single outcome, you’ll want to think about the other important factors that aren’t
encapsulated in that outcome – like we did with race in the case of bail – and work with your data
scientists to create a plan to test your algorithm for potential bias along those dimensions.
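One simple form such a test plan can take, sketched with invented data and column names (this is not the authors' analysis): compare, group by group, how often the tool flags people and how often it flags people who did not go on to reoffend.

```python
# Sketch of a basic disparity check; columns, threshold, and data are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "group":      ["A", "A", "A", "B", "B", "B", "B", "A"],
    "pred_risk":  [0.9, 0.2, 0.4, 0.8, 0.3, 0.7, 0.1, 0.6],
    "reoffended": [1,   0,   0,   1,   0,   0,   0,   1 ],  # observed outcome
})
df["flagged"] = df["pred_risk"] >= 0.5  # would the rule recommend detention?

# Compare, by group: how often people are flagged, and how often people who did
# not reoffend are flagged anyway (a simple false-positive-rate comparison).
for name, g in df.groupby("group"):
    flag_rate = g["flagged"].mean()
    fpr = g.loc[g["reoffended"] == 0, "flagged"].mean()
    print(f"group {name}: flag rate {flag_rate:.2f}, false-positive rate {fpr:.2f}")
```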
Verify your algorithm in an experiment on data it hasn’t seen
Once we have selected the right outcome, a final potential pitfall stems from how we measure
success. For machine learning to be useful for policy, it must accurately predict “out-of-sample.”
That means it should be trained on one set of data, then tested on a dataset it hasn’t seen before. So
when you give data to a vendor to build a tool, withhold a subset of it. Then when the vendor comes
back with a finished algorithm, you can perform an independent test using your “hold out” sample.
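A minimal sketch of that holdout discipline, with stand-in data and a stand-in model (the names and numbers are assumptions, not anything from the bail project):

```python
# Keep a holdout the vendor never sees, then score the delivered model on it.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                                   # hypothetical case features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1000) > 0).astype(int)

# Withhold 30%. Only X_vendor / y_vendor would ever be shared.
X_vendor, X_holdout, y_vendor, y_holdout = train_test_split(
    X, y, test_size=0.3, random_state=42
)

vendor_model = LogisticRegression().fit(X_vendor, y_vendor)      # stand-in for the delivered tool

# Independent out-of-sample check on data the vendor never saw.
auc = roc_auc_score(y_holdout, vendor_model.predict_proba(X_holdout)[:, 1])
print(f"Out-of-sample AUC on the holdout: {auc:.2f}")
```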
An even more fundamental problem is that current approaches in the field typically focus on
performance measures that, for many applications, are inherently flawed. Current practice is to
report how well one’s algorithm predicts only among those cases where we can observe the
outcome. In the bail application this means our algorithm can only use data on those defendants
who were released by the judges, because we only have a label providing the correct answer to
whether the defendant commits a crime or not for defendants judges chose to release. What about
defendants that judges chose not to release? The available data cannot tell us whether they would
have reoffended or not.
This makes it hard to evaluate whether any new machine learning tool can actually improve
outcomes relative to the existing decision-making system —in this case, judges. If some new
machine learning-based release rule wants to release someone the judges jailed, we can’t observe
their “label”, so how do we know what would happen if we actually released them?
This is not merely a problem of academic interest. Imagine that judges have access to information
about defendants that the algorithm does not, such as whether family members show up at court to
support them. To take a simplified, extreme example, suppose the judge is particularly accurate in
using this extra information and can apply it to perfectly predict whether young defendants re-
offend or not. Therefore the judges release only those young people who are at zero risk for re-
offending. The algorithm only gets to see the data for those young people who got released – the
ones who never re-offend. Such an algorithm would essentially conclude that the judge is making a
serious mistake in jailing so many youthful defendants (since none of the ones in its dataset go on to
commit crimes). The algorithm would recommend that we release far more youthful defendants.
The algorithm would be wrong. It could inadvertently make the world worse off as a result.
In short, the fact that an algorithm predicts well on the part of the test data where we can observe
labels doesn’t necessarily mean it will make good predictions in the real world. The best way to solve
this problem is to do a randomized controlled trial of the sort that is common in medicine. Then we
could directly compare whether bail decisions made using machine learning lead to better outcomes
than those made on comparable cases using the current system of judicial decision-making. But
even before we reach that stage, we need to make sure the tool is promising enough to ethically
justify testing it in the field. In our bail case, much of the effort went into finding a “natural
experiment” to evaluate the tool.
Our natural experiment built on two insights. First, within jurisdictional boundaries, it’s essentially
random which judges hear which cases. Second, judges are quite different in how lenient they are.
This lets us measure how good judges are at selecting additional defendants to jail. How much crime
reduction does a judge with a 70% release rate produce compared to a judge with an 80% release
rate? We can also use these data to ask how good an algorithm would be at selecting additional
defendants to jail. If we took the caseload of an 80% release rate judge and used our algorithm to
pick an additional 10% of defendants to jail, would we be able to achieve a lower crime rate than
what the 70% release rate judge gets? That “human versus machine” comparison doesn’t get tripped
up by missing labels for defendants the judges jailed but the algorithm wants to release, because we
are only asking the algorithm to recommend additional detentions (not releases). It’s a comparison
that relies only on labels we already have in the data, and it confirms that the algorithm’s predictions
do indeed lead to better outcomes than those of the judges.
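The logic of that comparison can be sketched in a few lines. Everything below is invented for illustration (the data, the column names, the risk scores, and the benchmark); it is not the authors' estimator, only the shape of the argument.

```python
# Simplified sketch: take a lenient judge's released caseload, let a risk score
# pick additional people to detain, and compare the implied crime rate with a
# stricter judge's. All numbers and names are invented.
import pandas as pd

# Defendants released by a lenient (80% release rate) judge, with observed outcomes.
released = pd.DataFrame({
    "risk_score": [0.05, 0.10, 0.90, 0.30, 0.80, 0.20, 0.60, 0.15],
    "committed_crime": [0, 0, 1, 0, 1, 0, 1, 0],
})

# Ask the algorithm to detain an additional 25% of these releases: the highest-risk ones.
n_extra_detentions = int(0.25 * len(released))
still_released = released.sort_values("risk_score").iloc[: len(released) - n_extra_detentions]

algo_crime_rate = still_released["committed_crime"].mean()
stricter_judge_crime_rate = 0.25   # hypothetical benchmark from a comparable caseload

print(f"Crime rate if algorithm picks extra detentions: {algo_crime_rate:.2f}")
print(f"Crime rate under the stricter judge:            {stricter_judge_crime_rate:.2f}")
# Because the algorithm only removes people the lenient judge had released, every
# outcome needed for the comparison is already observed; no labels are missing.
```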
It can be misguided, and sometimes outright harmful, to adopt and scale up new predictive tools
when they’ve only been evaluated on cases from historical data with labels, rather than evaluated
based on their effect on the key policy decision of interest. Smart users might go so far as to refuse to
use any prediction tool that does not take this evaluation challenge more seriously.
Remember there’s still a lot we don’t know
While machine learning is now widely used in commercial applications, using these tools to solve
policy problems is relatively new. There is still a great deal that we don’t yet know but will need to
figure out moving forward.
Perhaps the most important example of this is how to combine human judgment and algorithmic
judgment to make the best possible policy decisions. In the domain of policy, it is hard to imagine
moving to a world in which the algorithms actually make the decisions; we expect that they will
instead be used as decision aids.
For algorithms to add value, we need people to actually use them; that is, to pay attention to them in
at least some cases. It is often claimed that in order for people to be willing to use an algorithm, they
need to be able to really understand how it works. Maybe. But how many of us know how our cars
work, or our iPhones, or pacemakers? How many of us would trade performance for
understandability in our own lives by, say, giving up our current automobile with its mystifying
internal combustion engine for Fred Flintstone’s car?
The flip side is that policymakers need to know when they should override the algorithm. For people
to know when to override, they need to understand their comparative advantage over the algorithm
– and vice versa. The algorithm can look at millions of cases from the past and tell us what happens,
on average. But often it’s only the human who can see the extenuating circumstance in a given case,
since it may be based on factors not captured in the data on which the algorithm was trained. As
with any new task, people will be bad at this in the beginning. While they should get better over
time, there would be great social value in understanding more about how to accelerate this learning
curve.
Pair caution with hope
A time traveler going back to the dawn of the 20th century would arrive with dire warnings. One
invention was about to do a great deal of harm. It would become one of the biggest causes of death—
and for some age groups the biggest cause of death. It would exacerbate inequalities, because those
who could afford it would be able to access more jobs and live more comfortably. It would change
the face of the planet we live on, affecting the physical landscape, polluting the environment and
contributing to climate change.
The time traveler does not want these warnings to create a hasty panic that completely prevents the
development of automobile transportation. Instead, she wants these warnings to help people skip
ahead a few steps and follow a safer path: to focus on inventions that make cars less dangerous, to
build cities that allow for easy public transport, and to focus on low emissions vehicles.
A time traveler from the future talking to us today may arrive with similar warnings about machine
learning and encourage a similar approach. She might encourage the spread of machine learning to
help solve the most challenging social problems in order to improve the lives of many. She would
also remind us to be mindful, and to wear our seatbelts.
Jon Kleinberg is a professor of computer science at Cornell University and the coauthor of the
textbooks Algorithm Design (with Éva Tardos) and Networks, Crowds, and Markets (with David
Easley).
Jens Ludwig is the McCormick Foundation Professor of Social Service Administration, Law and Public Policy at the
University of Chicago.
Sendhil Mullainathan is a professor of economics at Harvard University and the coauthor (with Eldar Shafir) of
Scarcity: Why Having Too Little Means So Much.
Sunday Review
The Utter Uselessness of Job Interviews
Gray Matter
By JASON DANA, APRIL 8, 2017
A friend of mine once had a curious experience with a job interview. Excited about
the possible position, she arrived five minutes early and was immediately ushered
into the interview by the receptionist. Following an amicable discussion with a panel
of interviewers, she was offered the job.
Afterward, one of the interviewers remarked how impressed she was that my
friend could be so composed after showing up 25 minutes late to the interview. As it
turned out, my friend had been told the wrong start time by half an hour; she had
remained composed because she did not know she was late.
My friend is not the type of person who would have remained cool had she
known she was late, but the interviewers reached the opposite conclusion. Of course,
they also could have concluded that her calm reflected a flippant attitude, which is
also not a trait of hers. Either way, they would have been wrong to assume that her
behavior in the interview was indicative of her future performance at the job.
This is a widespread problem. Employers like to use free-form, unstructured
interviews in an attempt to “get to know” a job candidate. Such interviews are also
increasingly popular with admissions officers at universities looking to move away
from test scores and other standardized measures of student quality. But as in my
friend’s case, interviewers typically form strong but unwarranted impressions about
interviewees, often revealing more about themselves than the candidates.
People who study personnel psychology have long understood this. In 1979, for
example, the Texas Legislature required the University of Texas Medical School at
Houston to increase its incoming class size by 50 students late in the season. The
additional 50 students that the school admitted had reached the interview phase of
the application process but initially, following their interviews, were rejected. A team
of researchers later found that these students did just as well as their other
classmates in terms of attrition, academic performance, clinical performance (which
involves rapport with patients and supervisors) and honors earned. The judgment of
the interviewers, in other words, added nothing of relevance to the admissions
process.
Research that my colleagues and I have conducted
(http://journal.sjdm.org/12/121130a/jdm121130a.pdf) shows that the problem with
interviews is worse than irrelevance: They can be harmful, undercutting the impact
of other, more valuable information about interviewees.
In one experiment, we had student subjects interview other students and then
predict their grade point averages for the following semester. The prediction was to
be based on the interview, the student’s course schedule and his or her past G.P.A.
(We explained that past G.P.A. was historically the best predictor of future grades at
their school.) In addition to predicting the G.P.A. of the interviewee, our subjects
also predicted the performance of a student they did not meet, based only on that
student’s course schedule and past G.P.A.
In the end, our subjects’ G.P.A. predictions were significantly more accurate for
the students they did not meet. The interviews had been counterproductive.
It gets worse. Unbeknown to our subjects, we had instructed some of the
interviewees to respond randomly to their questions. Though many of our
interviewers were allowed to ask any questions they wanted, some were told to ask
only yes/no or this/that questions. In half of these interviews, the interviewees were
instructed to answer honestly. But in the other half, the interviewees were instructed
to answer randomly. Specifically, they were told to note the first letter of each of the
last two words of any question, and to see which category, A-M or N-Z, each letter
fell into. If both letters were in the same category, the interviewee answered “yes” or
took the “this” option; if the letters were in different categories, the interviewee
answered “no” or took the “that” option.
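For concreteness, the answering rule the interviewees followed can be written out as a short program; this is only an illustration of the rule as described here, and the function name and example questions are invented.

```python
# The "random answer" rule from the experiment, spelled out in code for concreteness.
def random_style_answer(question: str) -> str:
    """Answer a yes/no or this/that question using the rule the interviewees followed."""
    words = question.rstrip("?!. ").split()
    last_two_initials = [w[0].upper() for w in words[-2:]]
    # Category A-M vs. N-Z for each initial letter.
    categories = ["A-M" if c <= "M" else "N-Z" for c in last_two_initials]
    # Same category -> "yes"/"this"; different categories -> "no"/"that".
    return "yes (or 'this')" if categories[0] == categories[1] else "no (or 'that')"

print(random_style_answer("Do you like big challenges?"))               # 'big' (A-M), 'challenges' (A-M) -> yes
print(random_style_answer("Do you prefer working alone or in teams?"))  # 'in' (A-M), 'teams' (N-Z) -> no
```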
Strikingly, not one interviewer reported noticing that he or she was conducting
a random interview. More striking still, the students who conducted random
interviews rated the degree to which they “got to know” the interviewee slightly
higher on average than those who conducted honest interviews.
The key psychological insight here is that people have no trouble turning any
information into a coherent narrative. This is true when, as in the case of my friend,
the information (i.e., her tardiness) is incorrect. And this is true, as in our
experiments, when the information is random. People can’t help seeing signals, even
in noise.
There was a final twist in our experiment. We explained what we had done, and
what our findings were, to another group of student subjects. Then we asked them to
rank the information they would like to have when making a G.P.A. prediction:
honest interviews, random interviews, or no interviews at all. They most often
ranked no interview last. In other words, a majority felt they would rather base their
predictions on an interview they knew to be random than to have to base their
predictions on background information alone.
So great is people’s confidence in their ability to glean valuable information
from a face-to-face conversation that they feel they can do so even if they know they
are not being dealt with squarely. But they are wrong.
What can be done? One option is to structure interviews so that all candidates
receive the same questions, a procedure that has been shown to make interviews
more reliable and modestly more predictive of job success. Alternatively, you can use
interviews to test job-related skills, rather than idly chatting or asking personal
questions.
Realistically, unstructured interviews aren’t going away anytime soon. Until
then, we should be humble about the likelihood that our impressions will provide a
reliable guide to a candidate’s future performance.
Jason Dana is an assistant professor of management and marketing at the Yale School
of Management.
A version of this op-ed appears in print on April 9, 2017, on Page SR6 of the New York edition with the
headline: Against Job Interviews.
8 The process-performance paradox in
expert judgment
How can experts know so much and predict so badly?
COLIN F. CAMERER AND ERIC J. JOHNSON
I. INTRODUCTION
A mysterious fatal disease strikes a large minority of the population.
The disease is incurable, but an expensive drug can keep victims alive. Con-
gress decides that the drug should be given to those whose lives can be
extended longest, which only a few specialists can predict. The experts work
around the clock searching for a cure; allocating the drug is a new chore they
would rather avoid.
In research on decision making there are two views about such experts. The
views suggest different technologies for modeling experts’ decisions so that
they can do productive research rather than make predictions. One view,
which emerges from behavioral research on decision making, is skeptical
about the experts. Data suggest that a wide range of experts like our hypotheti-
cal specialists are not much better predictors than less expert physicians, or
interns. Furthermore, this view suggests a simple technology for replacing
experts – a simple linear regression model (perhaps using medical judgments
as inputs). The regression does not mimic the thought process of an expert,
but it probably makes more accurate predictions than an expert does.
The second view, stemming from research in cognitive science, suggests that
expertise is a rare skill that develops only after much instruction, practice, and
experience. The cognition of experts is more sophisticated than that of nov-
ices; this sophistication is presumed to produce better predictions. This view
suggests a model that strives to mimic the decision policies of experts – an
“expert (or knowledge-based) system” containing lists of rules experts use in
judging longevity. An expert system tries to match, not exceed, the perfor-
mance of the expert it represents.
In this chapter we describe and integrate these two perspectives. Integra-
tion comes from realizing that the behavioral and cognitive science ap-
proaches have different goals: Whereas behavioral decision theory empha-
sizes the performance of experts, cognitive science usually emphasizes differ-
ences in experts’ processes (E. Johnson, 1988).
A few caveats are appropriate. Our review is selective; it is meant to empha-
size the differences between expert performance and process. The generic
decision-making task we describe usually consists of repeated predictions,
based on the same set of observable variables, about a complicated outcome –
graduate school success, financial performance, health – that is rather unpre-
dictable. For the sake of brevity, we shall not discuss other important tasks
such as probability estimation or revision, inference, categorization, or trade-
offs among attributes, costs, and benefits.
The literature we review is indirectly related to the well-known “heuristics
and biases” approach (e.g., Kahneman, Slovic, & Tversky, 1982). Our theme
is that experts know a lot but predict poorly. Perhaps their knowledge is
biased, if it comes from judgment heuristics or they use heuristics in applying
it. We can only speculate about this possibility (as we do later, in a few places)
until further research draws the connection more clearly.
For our purposes, an expert is a person who is experienced at making
predictions in a domain and has some professional or social credentials. The
experts described here are no slouches: They are psychologists, doctors, aca-
demics, accountants, gamblers, and parole officers who are intelligent, well
paid, and often proud. We draw no special distinction between them and
extraordinary experts, or experts acclaimed by peers (cf. Shanteau, 1988). We
suspect that our general conclusions would apply to more elite populations of
experts, 1 but clearly there have been too few studies of these populations.
The chapter is organized as follows: In section 2 we review what we cur-
rently know about how well experts perform decision tasks, then in section 3
we review recent work on expert decision processes. Section 4 integrates the
views described in sections 2 and 3. Then we examine the implications of this
work for decision research and for the study of expertise in general.
2. PERFORMANCE OF EXPERTS
Most of the research in the behavioral decision-making approach to
expertise has been organized around performance of experts. A natural mea-
sure of expert performance is predictive accuracy; later, we discuss other
aspects. Modern research on expert accuracy emanates from Sarbin (1944),
who drew an analogy between clinical reasoning and statistical (or “actuar-
ial”) judgment. His data, and the influential book by Meehl (1954), estab-
lished that in many clinical prediction tasks experts were less accurate than
simple formulas based on observable variables. As Dawes and Corrigan
(1974, p. 97) wrote, “the statistical analysis was thought to provide a floor to
which the judgment of the experienced clinician could be compared. The floor
turned out to be a ceiling.”
1 While presenting a research seminar discussing the application of linear models, Robyn Dawes
reported Einhorn’s (1972) classic finding that three experts’ judgments of Hodgkin’s disease
severity were uncorrelated with actual severity (measured by how long patients lived). One
seminar participant asked Dawes what would happen if a certain famous physician were studied.
The questioner was sure that Dr. So-and-so makes accurate judgments. Dawes called Einhorn;
the famous doctor turned out to be subject 2.
2.1. A language for quantitative studies of performance
In many studies, linear regression techniques are used to construct
statistical models of expert judgments (and to improve those judgments) and
distinguish components of judgment accuracy and error.² These techniques
are worth reviewing briefly because they provide a useful language for discuss-
ing accuracy and its components.
A subject’s judgment (denoted Y_s) depends on a set of informational cues
(denoted X_1, ..., X_n). The cues could be measured objectively (college
grades) or subjectively by experts (evaluating letters of recommendation).
The actual environmental outcome (or “criterion,” denoted Y_e) is also as-
sumed to be a function of the same cues.
In the comparisons to be described, several kinds of regressions are com-
monly used. One such regression, the “actuarial” model, predicts outcomes Y_e
based on observable cues X_i. The model naturally separates Y_e into a predict-
able component Ŷ_e, a linear combination³ of cues weighted by regression
coefficients b_{i,e}, and an unpredictable error component z_e. That is,

    Y_e = \sum_i b_{i,e} X_i + z_e = \hat{Y}_e + z_e    (actuarial model)    (1)
Figure 8.1 illustrates these relationships, as well as others that we shall discuss
subsequently.
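As a minimal sketch of the actuarial model in equation (1), the code below fits a linear model to simulated cues and outcomes and compares its accuracy with a simulated "expert." Every number is invented; nothing corresponds to the studies reviewed in this chapter, and the in-sample fit here ignores the cross-validation issue discussed in footnote 4.

```python
# Sketch of equation (1) on simulated data: outcome = linear combination of cues + error.
import numpy as np

rng = np.random.default_rng(1)
n, k = 200, 3
X = rng.normal(size=(n, k))                                            # cues X_1..X_k
Y_e = X @ np.array([0.6, 0.3, 0.1]) + rng.normal(scale=1.0, size=n)    # outcome

# A simulated expert: uses the cues imperfectly and adds noise of their own.
Y_s = X @ np.array([0.5, 0.5, 0.0]) + rng.normal(scale=1.5, size=n)

# Actuarial model: regress the outcome on the cues to get the predictable component.
A = np.column_stack([np.ones(n), X])
b_e, *_ = np.linalg.lstsq(A, Y_e, rcond=None)
Y_e_hat = A @ b_e

R_e = np.corrcoef(Y_e_hat, Y_e)[0, 1]   # model accuracy
r_a = np.corrcoef(Y_s, Y_e)[0, 1]       # expert "achievement"
print(f"R_e (actuarial model) = {R_e:.2f},  r_a (expert) = {r_a:.2f}")
```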
2.2. Experts versus actuarial models
The initial studies compared expert judgments with those of actuarial
models. That is, the correlation between the expert judgment Y_s and the
outcome Y_e (often denoted r_a, for “achievement”) was compared with the
correlation between the model’s predicted outcome Ŷ_e and the actual outcome
Y_e (denoted R_e).⁴
Meehl (1954) reviewed about two dozen studies. Cross-validated actuarial
models outpredicted clinical judgment (i.e., R_e was greater than r_a) in all but
one study. Now there have been about a hundred studies; experts did better in
only a handful of them (mostly medical tasks in which well-developed theory
outpredicted limited statistical experience; see Dawes, Faust, & Meehl,
2 Many regression studies use the general “lens model” proposed by Egon Brunswik (1952) and
extended by Hammond (1955) and others. The lens model shows the interconnection between
two systems: an ecology or environment, and a person making judgments. The notation in the
text is mostly lens-model terminology.
3 Although the functions relating cues to the judgment and the outcome can be of any form, linear
relationships are most often used, because they explain judgments and outcomes surprisingly
well, even when outcomes are known to be nonlinear functions of the cues (Dawes & Corrigan,
1974).
4 The correlation between the actuarial-model prediction and the outcome Y_e is the square root of
the regression R², and is denoted R_e. A more practical measure of actuarial-model accuracy is
the “cross-validated” correlation, when regression weights derived on one sample are used to
predict a new sample of Y_e values.
[Figure 8.1. A quantitative language for describing decision performance. Diagram labels include Y_s, the predictions by the expert.]
1989). The studies have covered many different tasks – university admissions,
recidivism or violence of criminals, clinical pathology, medical diagnosis, fi-
nancial investment, sports, weather forecasting. Thirty years after his book
was published, Meehl (1986, p. 373) suggested that “there is no controversy in
social science that shows such a large body of qualitatively diverse studies
coming out so uniformly in the same direction.”
2.3. Experts versus improper models
Despite their superiority to clinical judgment, actuarial models are
difficult to use because the outcome Y_e must be measured, to provide the raw
data for deriving regression weights. It can be costly or time-consuming to
measure outcomes (for recidivism or medical diagnosis), or definitions of
outcomes can be ambiguous (What is “success” for a Ph.D.?). And past
outcomes must be used to fit cross-validated regression weights to predict
current outcomes, which makes models vulnerable to changes in true coeffi-
cients over time. Therefore, “improper”⁵ models – which derive regression
weights without using Y_e – might be more useful and nearly as accurate as
proper actuarial models.
In one improper method, regression weights are derived from the Y_s judg-
ments themselves; then cues are weighted by the derived weights and
summed. This procedure amounts to separating the overall expert judgment
Y_s into two components, a modeled component Ŷ_s and a residual component
z_s, and using only the modeled component Ŷ_s as a prediction.⁶ That is,

    Y_s = \sum_i b_{i,s} X_i + z_s = \hat{Y}_s + z_s    (2)

If the discarded residual z_s is mostly random error, the modeled component Ŷ_s
will correlate more highly with the outcome than will the overall judgment
Y_s. (In standard terminology, the correlation between Ŷ_s and Y_e, denoted r_m,
will be higher than r_a.)
This method is called “bootstrapping” because it can improve judgments
without any outcome information: It pulls experts up by their bootstraps.
Bowman (1963) first showed that bootstrapping improved judgments in pro-
duction scheduling; similar improvements were found by Goldberg (1970) in
clinical predictions based on MMPI scores⁷ and by Dawes (1971) in graduate
admissions. A cross-study comparison showed that bootstrapping works very
generally, but usually adds only a small increment to predictive accuracy
(Camerer, 1981a). Table 8.1 shows some of those results. Accuracy can be
usefully dissected with the lens-model equation, an identity relating several
interesting correlations. Einhorn’s (1974) version of the equation states

    r_a = r_m R_s + r_z \sqrt{1 - R_s^2}    (3)

where R_s² is the bootstrapping-model R² (how closely the judge resembles the
linear model), and r_z is the correlation between bootstrapping-model residuals
z_s and outcomes Y_e (the “residual validity”). If the residuals z_s represent only
random error in weighing and combining the cues, r_z will be close to zero. In
this case, r_m will certainly be larger than r_a, and because R_s < 1, bootstrapping
will improve judgments. But even if r_z is greater than zero (presumably be-
cause residuals contain some information that is correlated with outcomes),
bootstrapping works unless

    r_z > r_m \left( \frac{1 - R_s}{1 + R_s} \right)^{1/2}    (4)

For R_s = .6 (a reasonable value; see Table 8.1), residual validity r_z must be
about half as large as model accuracy for experts to outperform their own
bootstrapping models. This rarely occurs.
5 By contrast, actuarial models often are called “optimal linear models,” because by definition no
linear combination of the cues can predict Y_e more accurately.
6 Of course, such an explanation is “paramorphic” (Hoffman, 1960): It describes judgments in a
purely statistical way, as if experts were weighing and combining cues in their heads; the process
they use might be quite different. However, Einhorn, Kleinmuntz, and Kleinmuntz (1979) argued
persuasively that the paramorphic regression approach might capture process indirectly.
7 Because suggested Minnesota Multiphasic Personality Inventory (MMPI) cutoffs were origi-
nally created by statistical analysis, it may seem unsurprising that a statistical model beats a
judge who tries to mimic it. But the model combines scores linearly, whereas judges typically use
various scores in configural nonlinear combinations.
When there are not many judgments, compared with the number of vari-
ables, the regression weights in a bootstrapping model cannot be estimated
reliably. Then one can simply weight the cues equally⁸ and add them up.
Dawes and Corrigan (1974) showed that equal weights worked remarkably
well in several empirical comparisons (the accuracies of some of these are
shown in the column r_ew in Table 8.1). Simulations show that equal weighting
generally works as well as least squares estimation of weights unless there are
twenty times as many observations as predictors (Einhorn & Hogarth, 1975).
As Dawes and Corrigan (1974) put it, “the whole trick is to decide what
variables to look at and then to know how to add” (p. 105).
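A companion sketch of bootstrapping (equation 2) and equal weighting, again on simulated data; the correlations it prints are illustrative only and do not reproduce any entry in Table 8.1.

```python
# Sketch of bootstrapping and equal weighting on simulated data.
import numpy as np

rng = np.random.default_rng(2)
n, k = 200, 3
X = rng.normal(size=(n, k))                                            # cues
Y_e = X @ np.array([0.6, 0.3, 0.1]) + rng.normal(size=n)               # outcomes
Y_s = X @ np.array([0.5, 0.5, 0.0]) + rng.normal(scale=1.5, size=n)    # expert judgments

def fit_values(design, target):
    """Least-squares fit with intercept; returns fitted values."""
    A = np.column_stack([np.ones(len(design)), design])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return A @ coef

Y_s_hat = fit_values(X, Y_s)                 # bootstrapping model of the judge (eq. 2)
z_s = Y_s - Y_s_hat                          # residual component
Y_ew = (X / X.std(axis=0)).sum(axis=1)       # equal weights on standardized cues

r_a  = np.corrcoef(Y_s,     Y_e)[0, 1]   # judge
r_m  = np.corrcoef(Y_s_hat, Y_e)[0, 1]   # bootstrapping model
r_z  = np.corrcoef(z_s,     Y_e)[0, 1]   # residual validity
r_ew = np.corrcoef(Y_ew,    Y_e)[0, 1]   # equal-weight model
print(f"r_a={r_a:.2f}  r_m={r_m:.2f}  r_z={r_z:.2f}  r_ew={r_ew:.2f}")
```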
2.4. Training and experience: experts versus novices
Studies have shown that expert judgments are less accurate than
those of statistical models of varying sophistication. Two other useful compari-
sons are those between experts and novices and between experienced and
inexperienced experts.
Garb (1989) reviewed more than fifty comparisons of judgments by clinical
psychologists and novices. The comparisons suggest that (academic) training
helps but additional experience does not. Trained clinicians and graduate
students were more accurate than novices (typically untrained students, or
secretaries) in using the MMPI to judge personality disorders. Students did
better and better with each year of graduate training. The effect of training
was not large (novices might classify 28% correctly, and experts 40%), but it
existed in many studies. Training, however, generally did not help in interpret-
ing projective tests (drawings, Rorschach inkblots, and sentence-completion
tests); using such tests, clinical psychologists probably are no more accurate
than auto mechanics or insurance salesmen.
Training has some effects on accuracy, but experience has almost none. In
judging personality and neurophysiological disorders, for example, clinicians
do no better than advanced graduate students. Among experts with varying
amounts of experience, the correlations between amount of clinical experience
and accuracy are roughly zero. Libby and Frederick (1989) found that experi-
ence improved the accuracy of auditors’ explanations of audit errors only
slightly (although even inexperienced auditors were better than students).
In medical judgments too, training helps, but experience does not. Gustaf-
8 Of course, variables must be standardized by dividing them by their sample standard deviations.
Otherwise, a variable with a wide range would account for more than its share of the variation in
the equally weighted sum.
Table 8.1. Examples of regression-study results

                                                 Mean accuracy of:
Study                    Prediction task         Model fit,  Judge,  Bootstrapping  Bootstrapping   Equal-weight  Actuarial
                                                 R_s         r_a     model, r_m     residuals, r_z  model, r_ew   model, R_e (a)
Goldberg (1970)          Psychosis vs. neurosis  .77         .28     .31            .07             .34           .45
Dawes (1971)             Ph.D. admissions        .78         .19     .25            .01             .48           .38
Einhorn (1972)           Disease severity        .41         .01     .13            .06             n.a.          .35
Libby (1976) (b)         Bankruptcy              .79         .50     .53            .13             n.a.          .67
Wiggins & Kohen (1971)   Grades                  .85         .33     .50            .01             .60           .57

(a) All are cross-validated R_e except Einhorn (1972) and Libby (1976).
(b) Figures cited are recalculations by Goldberg (1976).
Source: Adapted from Camerer (1981a) and Dawes & Corrigan (1974).
son (1963) found no difference between residents and surgeons in predicting
the length of hospital stay after surgery. Kundel and LaFollette (1972) re-
ported that novices and first-year medical students were unable to detect
lesions from radiographs of abnormal lungs, but fourth-year students (who
had had some training in radiography) were as good as full-time radiologists.
These tasks usually have a rather low performance ceiling. Graduate train-
ing may provide all the experience one requires to approach the ceiling. But
the myth that additional experience helps is persistent. One of the psychology
professors who recently revised the MMPI said that “anybody who can count
can score it [the MMPI], but it takes expertise to interpret it.” (Philadelphia
Inquirer, 1989). Yet Goldberg’s (1970) data suggest that the only expertise
required is the ability to add scores with a hand calculator or paper and pencil.
If a small amount of training can make a person as accurate as an experi-
enced clinical psychologist or doctor, as the data imply, then lightly trained
paraprofessionals could replace heavily trained experts for many routine kinds
of diagnoses. Citing Shortliffe, Buchanan, and Feigenbaum (1979), Garb
(1989) suggested that “intelligent high school graduates, selected in large part
because of poise and warmth of personality, can provide competent medical
care for a limited range of problems when guided by protocols after only 4 to 8
weeks of training.”
It is conceivable that outstanding experts are more accurate than models
and graduate students in some tasks. For instance, in Goldberg’s (1959) study
of organic brain damage diagnoses, a well-known expert (who worked very
slowly) was right 83% of the time, whereas other Ph.D. clinical psychologists
got 65% right. Whether such extraordinary expertise is a reliable phenome-
non or a statistical fluke is a matter for further research.
2.5. Expert calibration
Whereas experts may predict less accurately than models, and only
slightly more accurately than novices, they seem to have better self-insight
about the accuracy of their predictions. Such self-insight is called “calibra-
tion.” Most people are poorly calibrated, offering erroneous reports of the
quality of their predictions, and these reports systematically err in the direc-
tion of overconfidence: When they say a class of events is 80% likely, those
events occur less than 80% of the time (Lichtenstein, Fischhoff, & Phillips,
1977). There is some evidence that experts are less overconfident than nov-
ices. For instance, Levenberg (1975) had subjects look at “kinetic family
drawings” to detect whether the children who drew them were normal. The
results were, typically, a small victory for training: Psychologists and secretar-
ies got 66% and 61% right, respectively (a coin flip would get half right). Of
these cases about which subjects were “positively certain,” the psychologists
and secretaries got 76% and 59% right, respectively. The psychologists were
better calibrated than novices – they used the phrase “positively certain”
more cautiously (and appropriately) – but they were still overconfident.
Better calibration of experts has also been found in some other studies
(Garb, 1989). Expert calibration is better than novice calibration in bridge
(Keren, in press), but not in blackjack (Wagenaar & Keren, 1985). Doctors’
judgments of pneumonia and skull fracture are badly calibrated (Christensen-
Szalanski & Bushyhead, 1981; DeSmet, Fryback, & Thornbury, 1979).
Weather forecasters are extremely well calibrated (Murphy & Winkler, 1977).
Experiments with novices showed that training improved calibration, reduc-
ing extreme overconfidence in estimating probabilities and numerical quanti-
ties (Lichtenstein et al., 1977).
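A toy illustration of what a calibration check computes, with invented numbers: group predictions by stated confidence and compare each group's hit rate; overconfidence shows up as hit rates below the stated levels.

```python
# Toy calibration check (invented numbers).
import numpy as np

stated_confidence = np.array([0.6, 0.6, 0.8, 0.8, 0.8, 0.8, 1.0, 1.0, 1.0, 1.0])
correct           = np.array([1,   0,   1,   1,   0,   0,   1,   1,   0,   1  ])

for level in np.unique(stated_confidence):
    hits = correct[stated_confidence == level]
    print(f"said {level:.0%} sure -> right {hits.mean():.0%} of the time (n={hits.size})")
```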
2.6. Summary: expert performance
The depressing conclusion from these studies is that expert judgments
in most clinical and medical domains are no more accurate than those of
lightly trained novices. (We know of no comparable reviews of other domains,
but we suspect that experts are equally unimpressive in most aesthetic, com-
mercial, and physical judgments.) And expert judgments have been worse
than those of the simplest statistical models in virtually all domains that have
been studied. Experts are sometimes less overconfident than novices, but not
always.
3. EXPERT DECISION PROCESSES
The picture of expert performance painted by behavioral decision theo-
rists is unflattering. Why are experts predicting so badly? We know that many
experts have special cognitive and memory skills (Chase & Simon, 1973; Erics-
son & Polson, 1988; Larkin, McDermott, Simon, & Simon, 1980). Do expert
decision-makers have similar strategies and skill? If so, why don’t they perform
better? Three kinds of evidence help answer these questions: process analyses
of expert judgments, indirect analyses using regression models, and laboratory
studies in which subjects become “artificial experts” in a simple domain.
3.1. Direct evidence: process analyses of experts
The rules and cues experts use can be discovered by using process
tracing techniques – protocol analysis and monitoring of information acquisi-
tion. Such studies have yielded consistent conclusions across a diverse set of
domains.
Search is contingent. If people think like a regression model, weighting cues
and adding them, then cue search will be simple- the same variables will be
examined, in the same sequence, in every case. Novices behave that way. But
experts have a more active pattern of contingent search: Subsets of variables
are considered in each case, in different sequences. Differences between nov-
ice and expert searches have been found in studies of financial analysts
(Bouman, 1980; E. Johnson, 1988), auditors (Bedard & Mock, 1989), gradu-
ate admissions (E. Johnson, 1980), neurologists (Kleinmuntz, 1968), and phy-
sicians (Elstein, Shulman, & Sprafka, 1978; P. Johnson, Hassebrock, Duran,
& Muller, 1982).
Experts search less. A common finding in studies of expert cognition is that
information processing is less costly for experts than for novices. For example,
expert waiters (Ericsson & Chase, 1981) and chess players (Chase & Simon,
1973) have exceptional memory skills. Their memory allows more efficient
encoding of task-specific information; if they wanted to, experts could search
and sift cheaply through more information. But empirical studies show that
experts use less information than novices, rather than more, in auditing
(Bedard, 1989; Bedard & Mock, 1989), financial analysis (Bouman, 1980; E.
Johnson, 1988), and product choice (Bettman & Park, 1980; Brucks, 1985; E.
Johnson & Russo, 1984).
Experts use more knowledge. Experts often search contingently, for limited
sets of variables, because they know a great deal about their domains
(Bouman, 1980; Elstein et al., 1978; Libby & Frederick, 1989). Experts per-
form a kind of diagnostic reasoning, matching the cues in a specific case to
prototypes in a casual brand of hypothesis testing. Search is contingent be-
cause different sets of cues are required for each hypothesis test. Search is
limited because only a small set of cues are relevant to a particular hypothesis.
3.2. Indirect evidence: dissecting residuals
The linear regression models described in section 2 provide a simple
way to partition expert judgment into components. The bootstrapped judg-
ment is a linear combination of observed cues; the residual is everything else.
By dissecting the residual statistically, we can learn how the decision process
experts use deviates from the simple linear combination of cues. It deviates in
three ways.
Experts often use configural choice rules. In configural rules, the impact of one
variable depends on the values of other variables. An example is found in
clinical lore on interpretation of the MMPI. Both formal instruction and verbal
protocols of experienced clinicians give rules that note the state of more than
one variable. A nice example is given by an early rule-based system constructed
by Kleinmuntz (1968) using clinicians’ verbal protocols. Many of the rules in
the system reflect such configural reasoning: “Call maladjusted if Pa ≥ 70 unless
Mt < 6 and K > 65.” Because linear regression models weight each cue indepen-
dently, configural rules will not be captured by the linear form, and the effects
of configural judgment will be reflected in the regression residual.
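The contrast can be made concrete with a small sketch (invented weights and placeholder cutoffs, not clinical guidance): a configural rule lets one cue switch the verdict depending on the others, while an additive score always weights each cue the same way.

```python
# Invented illustration of a configural rule versus an additive linear score.
def configural_rule(pa: float, mt: float, k: float) -> str:
    # "Call maladjusted if Pa is elevated, unless Mt is low AND K is high."
    if pa >= 70 and not (mt < 6 and k > 65):
        return "maladjusted"
    return "adjusted"

def linear_rule(pa: float, mt: float, k: float) -> str:
    # An additive score weights each cue the same way regardless of the others.
    score = 0.8 * pa + 0.5 * mt - 0.4 * k          # illustrative weights
    return "maladjusted" if score > 50 else "adjusted"

# The first two cases share the same Pa, yet the configural verdict flips with Mt;
# the linear score can only shift additively.
for case in [(75, 4, 70), (75, 10, 70), (60, 4, 40)]:
    print(case, "configural:", configural_rule(*case), "| linear:", linear_rule(*case))
```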
Experts use “broken-leg cues.” Cues that are rare but highly diagnostic often
are called broken-leg cues, from an example cited by Meehl (1954; pp. 24-
25): A clinician is trying to predict whether or not Professor A will go to the
movies on a given night. A regression model predicts that the professor will
go, but the clinician knows that the professor recently broke his leg. The cue
“broken leg” probably will get no weight in a regression model of past cases,
because broken legs are rare. 9 But the clinician can confidently predict that
the professor will not go to the movies. The clinician’s recognition of the
broken-leg cue, which is missing from the regression model, will be captured
by the residual. Note that while the frequency of any one broken-leg cue is
rare, in “the mass of cases, there may be many (different) rare kinds of
factors” (Meehl, 1954, p. 25).
Note how the use of configural rules and broken-leg cues is consistent with
the process data described in section 3. To use configural rules, experts must
search for different sets of cues in different sequences. Experts also can use
their knowledge about cue diagnosticity to focus on a limited number of highly
diagnostic broken-leg cues. For example, in E. Johnson’s (1988) study of
financial analysts, experts were much more accurate than novices because
they could interpret the impact of news events similar to broken-leg cues.
Experts weight cues inconsistently and make errors in combining them. When
experts do combine cues linearly, any inconsistencies in weighting cues, and
errors in adding them, will be reflected in the regression residual. Thus, if
experts use configura! rules and broken-leg cues, their effects will be con-
tained in the residuals of a linear bootstrapping model. The residuals also
contain inconsistencies and error. By comparing residual variance and test-
retest reliability, Camerer (1981b) estimated that only about 40% of the vari-
ance in residuals was error,10 and 60% was systematic use of configural rules
and broken-leg cues. (Those fractions were remarkably consistent across dif-
ferent studies.) The empirical correlation between residuals and outcomes,
however, averaged only about .05 (Camerer, 1981a) over a wider range of
studies. Experts are using configural rules and broken-leg cues systematically,
but these rules and cues are not highly correlated with outcomes. Of course, there may be
some domains in which residuals are more valid.11
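One way to read the 40/60 split, offered here for concreteness under standard classical test-theory assumptions (not necessarily the exact derivation used in Camerer, 1981b): the test-retest reliability of the residuals estimates the share of residual variance that is systematic rather than error.

\[
\sigma^2_{\mathrm{residual}} = \sigma^2_{\mathrm{systematic}} + \sigma^2_{\mathrm{error}},
\qquad
\rho_{\mathrm{retest}} \approx \frac{\sigma^2_{\mathrm{systematic}}}{\sigma^2_{\mathrm{residual}}} \approx .60
\quad\Longrightarrow\quad
\sigma^2_{\mathrm{error}} \approx .40\,\sigma^2_{\mathrm{residual}}.
\]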
3.3. Artificial experts
A final kind of process evidence comes from “artificial experts,” sub-
jects who spend much time in an experimental environment trying to induce
accurate judgmental rules. A lot of this research belongs to the tradition
9 Unless a broken leg has occurred in the sample used to derive regression weights, the cue
“broken leg” will not vary and will get no regression weight.
10 These data correct the presumption in the early bootstrapping literature (e.g., Dawes, 1971;
Goldberg, 1970) that residuals were entirely human error.
11 A recent study with sales forecasters showed a higher r,, around .2 (Blattberg & Hoch, 1990).
Even though their residuals were quite accurate, the best forecasters only did about as we!~ as
the linear model. In a choice between models and experts, models will win, but a rnr;:chamcal
combination of the two is better still: Adding bootstrapping residuals to an actuarial model
increased predictive accuracy by about 10%.
of multiple-cue probability learning (MCPL) experiments that stretches back
decades, with the pessimistic conclusion that rule induction is difficult, particu-
larly when outcomes have random error. We shall give three more recent
examples that combine process analysis with a rule induction task.
Several studies have used protocol analysis to determine what it is that
artificial experts have learned. Perhaps the most ambitious attempts to study
extended learning in complex environments were Klayman’s studies of cue
discovery (Klayman, 1988; Klayman & Ha, 1985): Subjects looked at a com-
plex computer display consisting of geometric shapes that affected the dis-
tance traveled by ray traces from one point on the display to another. The true
rule for travel distance was determined by a complex linear model consisting
of seven factors that varied in salience in the display. None of Klayman’s
subjects induced the correct rule over 14 half-hour sessions, but their perfor-
mances improved steadily. Some improvement came from discovering correct
cues (subjects correctly identified only 2.83 of 7 cues, on average). Subjects
who systematically experimented, by varying one cue and holding others
fixed, learned faster and better than others. Because the cues varied greatly in
how much they affected distance, it was important to weight them differently,
but more than four-fifths of the rules stated by subjects did not contain any
numerical elements (such as weights) at all. In sum, cue discovery played a
clear role in developing expertise in this task, but learning about the relative
importance of cues did not.
In a study by Meyer (1987), subjects learned which attributes of a hypotheti-
cal metal alloy led to increases in its hardness. As in Klayman’s study, subjects
continued to learn rules over a long period of time. The true rule for hardness
(which was controlled by the experimenter) was linear, but most subjects
induced configural rules. Subjects still made fairly accurate predictions, be-
cause the true linear rule could be mimicked by nonlinear rules. Learning
(better performance) consisted of adding more elaborate and baroque configu-
ral rules, rather than inducing the true linear relationships.
In a study by Camerer (1981b), subjects tried to predict simulated wheat-
price changes that depended on two variables and a large interaction between
them (i.e., the true rule was configural). Subjects did learn to use the interac-
tion in their judgments, but with so much error that a linear bootstrapping
model that omitted the interaction was more accurate. Similarly, in E. John-
son’s (1988) financial-analyst study, even though expert analysts used highly
diagnostic news events, their judgments were inferior to those of a simple
linear model.
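A toy simulation captures the flavor of this result; the data-generating rule, noise levels, and sample size below are arbitrary choices, not the parameters of the wheat-price task.

import numpy as np

rng = np.random.default_rng(2)
n = 2000
x1, x2 = rng.normal(size=(2, n))
outcome = x1 + x2 + x1 * x2 + rng.normal(size=n)   # the true rule is configural

# An "expert" who really does use the interaction, but noisily.
expert = x1 + x2 + x1 * x2 + rng.normal(scale=3.0, size=n)

# A linear bootstrapping model of the expert that omits the interaction.
X = np.column_stack([np.ones(n), x1, x2])
w, *_ = np.linalg.lstsq(X, expert, rcond=None)
bootstrap = X @ w

print(np.corrcoef(expert, outcome)[0, 1])      # noisy expert vs. outcome
print(np.corrcoef(bootstrap, outcome)[0, 1])   # linear bootstrap vs. outcome

With these particular noise settings the bootstrap's correlation with outcomes comes out higher than the expert's, for the reason given in the text: averaging out the expert's error gains more than omitting the interaction loses.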
3.4. Summary: expert decision processes
Studies of decision processes indicate that expert decision makers are
like experts in other domains: They know more and use their knowledge to
guide search for small subsets of information, which differ with each case.
Residuals from bootstrapping models and learning experiments also show that
experts use configural rules and cues not captured by linear models (but these
are not always predictive). The process evidence indicates that experts know
more, but what they know does not enable them to outpredict simple statisti-
cal rules. Why not?
4. RECONCILING THE PERFORMANCE AND PROCESS
VIEWS OF EXPERTISE
One explanation for the process-performance paradox is that predic-
tion is only one task that experts must perform; they may do better on other
tasks. Later we shall consider this explanation further. Another explanation is
that experts are quick to develop configural rules that often are inaccurate,
but they keep these rules or switch to equally poor ones. (The same may be
true of broken-leg cues.) This argument raises three questions, which we
address in turn: Why do experts develop configural rules? Why are configural
rules often inaccurate? Why do inaccurate configural rules persist?
4.1. Why do experts develop configural rules?
Configural rules are easier. Consider two common classes of configural rules,
conjunctive (hire Hope for the faculty if she has glowing letters of recommen-
dation, good grades, and an interesting thesis) and disjunctive (draft Michael
for the basketball team if he can play guard or forward or center extremely
well). Configural rules are easy because they bypass the need to trade off
different cues (Are recommendations better predictors than grades?), avoid-
ing the cumbersome weighting and combination of information. Therefore,
configural rules take much less effort than optimal rules and can yield nearly
optimal choices (E. Johnson & Payne, 1985).12
Besides avoiding difficult trade-offs, configural rules require only a simple
categorization of cue values. With conjunctive and disjunctive rules, one need
only know whether or not a cue is above a cutoff; attention can be allocated
economically to categorize the values of many cues crudely, rather than catego-
rizing only one or two cues precisely.
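The two rule families are easy to state as code, as in the sketch below; the cue names follow the chapter's hiring and drafting examples, while the cutoffs and weights are arbitrary.

def conjunctive_hire(letters, grades, thesis, cutoff=7):
    # Hire only if every cue clears a crude cutoff; no trade-offs required.
    return letters >= cutoff and grades >= cutoff and thesis >= cutoff

def disjunctive_draft(guard, forward, center, cutoff=9):
    # Draft if any single skill is outstanding.
    return guard >= cutoff or forward >= cutoff or center >= cutoff

def weighted_screen(letters, grades, thesis, weights=(0.5, 0.3, 0.2), cutoff=6.5):
    # The linear alternative: each cue must be measured precisely and traded off.
    return weights[0] * letters + weights[1] * grades + weights[2] * thesis >= cutoff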
Prior theory often suggests configural rules. In his study of wheat prices,
Camerer (1981b) found that subjects could learn of the existence of a large
configural interaction only when cue labels suggested the interaction a priori.
Similarly, cue labels may cause subjects to learn configural rules where they
are inappropriate, as in Meyer’s (1987) study of alloy hardness. These prior
beliefs about cue-outcome correlations often will be influenced by the “repre-
sentativeness” (Tversky & Kahneman, 1982) of cues to outcomes; the repre-
sentativeness heuristic will sometimes cause errors.
12 Configural rules are especially useful for narrowing a large set of choices to a subset of
candidates for further consideration.
Besides their cognitive ease and prior suggestion, complex configural rules
are easy to learn because it is easy to weave a causal narrative around a
configural theory. These coherent narratives cement a dependence between
variables that is easy to express but may overweight these “causal” cues, at the
cost of ignoring others. Linear combinations yield no such coherence. Meehl
(1954) provides the following example from clinical psychology, describing
the case of a woman who was ambivalent toward her husband. One night the
woman came home from a movie alone. Then:
Entering the bedroom, she was terrified to see, for a fraction of a second, a
large black bird (“a raven, I guess”) perched on her pillow next to her
husband’s head. . . . She recalls “vaguely, some poem we read in high
school.” (p. 39)
Meehl hypothesized that the woman’s vision was a fantasy, based on the poem
“The Raven” by Edgar Allan Poe: “The [woman’s] fantasy is that like Poe’s
Lenore, she will die or at least go away and leave him [the husband] alone.”
Meehl was using a configural rule that gave more weight to the raven vision
because the woman knew the Poe poem. A linear rule, simply weighting the
dummy variables “raven” and “knowledge of Poe,” yields a narrative that is
much clumsier than Meehl’s compelling analysis. Yet such a model might well
pay attention to other factors, such as the woman’s age, education, and so
forth, which might also help explain her ambivalence.
Configural rules can emerge naturally from trying to explain past cases. People
learn by trying to fit increasingly sophisticated general rules to previous cases
(Brehmer, 1980; Meyer, 1987). Complicated configural rules offer plenty of
explanatory flexibility. For example, a 6-variable model permits 15 two-way
interactions, and a 10-variable model allows 45 interactions.13 In sports, for
instance, statistics are so plentiful and refined that it is easy to construct subtle
“configuralities” when global rules fail. Bucky Dent was an average New York
Yankee infielder, except in the World Series, where he played “above his
head,” hitting much better than predicted by his overall average. (The vari-
able “Dent” was not highly predictive of success, but adding the interaction
“Dent” x “Series” was.)14 Because people are reluctant to accept the possibil-
ity of random error (Einhorn, 1986), increasingly complicated configural ex-
planations are born.
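The counting behind these figures (spelled out in footnote 13) is simply the number of variable pairs:

\[
\binom{k}{2} = \frac{k(k-1)}{2}, \qquad k = 6 \Rightarrow 15, \qquad k = 10 \Rightarrow 45.
\]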
Inventing special cases is an important mechanism for learning in more
13 A linear model with k cues has only k degrees of freedom, but the k variables offer k(k - 1)/2
multiplicative two-variable interactions (and lots of higher-order interactions).
14 We cannot determine whether Dent was truly better in the World Series or just lucky in a
limited number of Series appearances. Yet his success in “big games” obviously influenced the
Yankees’ owner, George Steinbrenner (who has not otherwise distinguished himself as an
expert decision-maker). He named Dent manager of the Yankees shortly after this conference
was held, citing his ability as a player “to come through when it mattered.” Dent was later fired
49 games into the season (18 wins, 31 losses), and the Yankees had the worst record in Major
League baseball at the time.
deterministic environments, where it can be quite effective. The tendency of
decision-makers to build special-case rules mirrors more adaptive processes of
induction (e.g., Holland, Holyoak, Nisbett, & Thagard, 1986, chapter 3, esp.
pp. 88-89) that can lead to increased accuracy. As Holland and associates
pointed out, however, the validity of these mechanisms rests on the ability to
check each specialization on many cases. In noisy domains like the ones we
are discussing, there are few replications. It was unlikely, for example, that
Dent would appear in many World Series, and even if he did, other “unique”
circumstances (opposing pitching, injuries, etc.) could always yield further
“explanatory” factors.
In sum, configural rules are appealing because they are easy to use, have
plausible causal explanations, and offer many degrees of freedom to fit data.
Despite these advantages, configural rules have a downside, as detailed in
the next section.
4.2. Why are configural rules often inaccurate?
One reason configural rules may be inaccurate is that whereas they are
induced under specific and often rare conditions, they may well be applied to a
larger set of cases. When people induce such rules from observation, they will often
be overgeneralizing from a small sample (expecting the sample to be more
“representative” of the population than it is; Tversky & Kahneman, 1982). This
is illustrated by a verbal protocol recorded by a physician who was chair of a
hospital’s admissions committee for house staff, interns, and residents. Seeing
an applicant from Wayne State who had very high board scores, the doctor
recalled a promising applicant from the same school who had perfect board
scores. Unfortunately, after being admitted, the prior aspirant had done poorly
and left the program. The physician recalled this case and applied it to the new
one: “We have to be quite careful with people from Wayne State with very high
board scores. . . . We have had problems in the past.”
Configural rules may also be wrong because the implicit theories that under-
lie them are wrong. A large literature on “illusory correlation” contains many
examples of variables that are thought to be correlated with outcomes (because
they are similar) but are not. For example, most clinicians and novices think
that people who see male features or androgynous figures in Rorschach ink-
blots are more likely to be homosexual. They are not (Chapman & Chapman,
1967, 1969). A successful portfolio manager we know refused to buy stock in
firms run by overweight CEOs, believing that control of one’s weight and
control of a firm are correlated. Because variables that are only illusorily corre-
lated with outcomes are likely to be used by both novices and experts, the small
novice-expert difference suggests that illusory correlations may be common.
Configural rules are also likely to be unrobust to small errors, or “brittle.”15
15 Although the robustness of linear models is well established, we know of no analogous work on
the unrobustness of configural rules.
Linear models are extremely robust; they fit nonlinear data remarkably well
(Yntema & Torgerson, 1961). That is why omitting a configural interaction
from a bootstrapping model does not greatly reduce the accuracy of the
model.16 In contrast, we suspect that small errors in measurement may have
great impacts on configural rules. For example, the conjunctive rule “require
good grades and test scores” will lead to mistakes if a test score is not a
predictor of success or if the cutoff for “good grades” is wrong; the linear rule
that weights grades and scores and combines them is less vulnerable to either
error.
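A small simulation of the grades-and-test-scores example makes the brittleness claim concrete; it is a caricature with invented numbers, not evidence from any cited study. Here the test score is assumed to be non-predictive, the grade cutoff is slightly misplaced, and accuracy is simply the fraction of cases classified correctly.

import numpy as np

rng = np.random.default_rng(3)
n = 5000
grades = rng.normal(size=n)
test = rng.normal(size=n)                    # assume the test is not predictive
success = grades + rng.normal(scale=0.5, size=n) > 0

conjunctive = (grades > 0.1) & (test > 0.0)  # "good grades AND good scores"
linear = grades + 0.2 * test > 0.1           # small (mistaken) weight on the test

print((conjunctive == success).mean())       # the conjunctive rule pays fully for both errors
print((linear == success).mean())            # the linear rule degrades gracefully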
4.3. Why do inaccurate configural rules persist?
One of the main lessons of decision research is that feedback is crucial
for learning. Inaccurate configural rules may persist because experts who get
slow, infrequent, or unclear feedback will not learn that their rules are wrong.
When feedback must be sought, inaccurate rules may persist because people
tend to search instinctively for evidence that will confirm prior theories
(Klayman & Ha, 1985). Even when feedback is naturally provided, rather
than sought, confirming evidence is more retrievable or “available” than
disconfirming evidence (Tversky & Kahneman, 1973). The disproportionate
search and recall of confirming instances will sustain experts’ faith in inaccu-
rate configural rules. Even when evidence does disconfirm a particular rule,
we suspect that the natural tendencies to construct such rules (catalogued
earlier) will cause experts to refine their rules rather than discard them.
4.4. Nonpredictive functions of expertise
The thinking of experts is rich with subtle distinctions, novel catego-
ries, and complicated configural rules for making predictions. We have given
several reasons why such categories and rules might arise, and persist even if
they are inaccurate. Our arguments provide one possible explanation why
knowledgeable experts, paradoxically, are no better at making predictions
than novices and simple models.
Another explanation is that the knowledge that experts acquire as they
learn may not be useful for making better predictions about important long-
range outcomes, but it may be useful for other purposes. Experts are indis-
pensable for measuring variables (Sawyer, 1966) and discovering new ones
(E. Johnson, 1988).
Furthermore, as experts learn, they may be able to make more kinds of
predictions, even if they are no more accurate; we speculate that they mistake
their increasing fertility for increasing accuracy. Taxi drivers know lots of
alternative routes when they see traffic on the Schuylkill Expressway (cf.
16 Linear models are robust to nonlinearities provided the relationship between each predictor and
outcome has the same direction for any values of the other predictors (although the relationship’s
magnitude will vary). This property is sometimes called “conditional monotonicity.”
Chase, 1983), and they probably can predict their speeds on those alternative
routes better than a novice can. But can the experts predict whether there will
be heavy traffic on the expressway better than a statistical model can (using
time of day, day of week, and weather, for example)? We doubt it.
There are also many social benefits of expertise that people can provide
better than models can. Models can make occasional large mistakes that
experts, having common sense, would know to avoid (Shanteau, 1988).17
Experts can explain themselves better, and people usually feel that an expert’s
intuitive judgments are fairer than those of a model (cf. Dawes, 1971).
Some of these attitudes toward experts stem from the myth that experts are
accurate predictors, or the hope that an expert will never err.18 Many of these
social benefits should disappear with time, if people learn that models are
better; until then, experts have an advantage. (Large corporations have
learned: They use models in scoring credit risks, adjusting insurance claims,
and other activities where decisions are routine and cost savings are large.
Consumers do think that such rules are unfair, but the cost savings overwhelm
their objections.)
5. IMPLICATIONS FOR UNDERSTANDING
EXPERT DECISION MAKING
Our review produces a consistent, if depressing, picture of expert
decision-makers. They are successful at generating hypotheses and inducing
complex decision rules. The result is a more efficient search of the available
information directed by goals and aided by the experts’ superior store of
knowledge. Unfortunately, their knowledge and rules have little impact on
experts’ performance. Sometimes experts are more accurate than novices
(though not always), but they are rarely better than simple statistical models.
An inescapable conclusion of this research is that experts do some things
well and others poorly. Sawyer (1966) found that expert measurement of cues,
and statistical combination of them, worked better than expert combination
or statistical measurement. Techniques that combine experts’ judgments
about configural and broken-leg cues with actuarial models might improve
performance especially well (Blattberg & Hoch, 1990; E. Johnson, 1988).
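In code, such combinations are almost trivial; the sketch below shows the two forms mentioned here: an equal-weight average in the spirit of Blattberg and Hoch's "50% model + 50% manager," and adding the expert's bootstrapping residual to an actuarial forecast. The function names and the 50/50 weight are illustrative choices, not prescriptions from those papers.

import numpy as np

def average_model_and_expert(model_pred, expert_pred, w=0.5):
    # Equal-weight blend of an actuarial forecast and an expert forecast.
    return w * np.asarray(model_pred) + (1 - w) * np.asarray(expert_pred)

def add_expert_residual(actuarial_pred, expert_residual):
    # Add the expert's bootstrapping residual (which carries configural and
    # broken-leg information, plus error) to an actuarial prediction.
    return np.asarray(actuarial_pred) + np.asarray(expert_residual)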
Of course, expert performance relative to models depends critically on the
17 This possibility has been stressed by Ken Hammond in discussions of analytical versus intuitive
judgment (e.g., Hammond, Hamm, Grassia, & Pearson, 1987). For example, most of the
unorthodox moves generated by the leading backgammon computer program (which beat a
world champion in 1979) are stupid mistakes an expert would catch; a few are brilliant moves
that might not occur to an expert.
18 A model necessarily errs, by fixing regression coefficients and ignoring many variables. It
“accepts error to make less error” (Einhorn, 1986). An expert, by changing regression coeffi-
cients and selecting variables, conceivably could be right every time. This difference is made
dramatic by a medical example. A statistician developed a simple linear model to make routine
diagnoses. Its features were printed on a card doctors could carry around; the card showed
several cues and how to add them. Doctors wouldn’t use it because they couldn’t defend it in
the inevitable lawsuits that would result after the model would have made a mistake.
task and the importance of configural and broken-leg cues. There may be
tasks in which experts beat models, but it is hard to think of examples. In
pricing antiques, classic cars, or unusual real estate (e.g., houses over $5
million), there may be many broken-leg cues that give experts an advantage,
but a model including the expert-rated cue “special features” may also do
well.
Tasks involving pattern recognition, like judging the prospective taste of
gourmet recipes or the beauty of faces or paintings, seem to involve many
configural rules that favor experts. But if one adds expert-rated cues like
“consistency” (in recipes) or “symmetry” (in faces) to linear models, the
experts’ configural edge may disappear.
Another class of highly configural tasks includes those in which variable
weights change across subsamples or stages. For instance, one should play the
beginning and end of a backgammon or chess game differently. A model that
picks moves by evaluating position features, weighting them with fixed
weights, and combining them linearly will lose to an expert who implicitly
changes weights. But a model that could shift weights during the game could
possibly beat an expert, and one did: Berliner’s (1980) backgammon program
beat the 1979 world champion.
There is an important need to provide clearer boundaries for this dismal
picture of expert judgment. To what extent, we ask ourselves, does the picture
provided by this review apply to the other domains discussed in this volume?
Providing a crisp answer to this question is difficult, because few of these
domains provide explicit comparisons between experts and linear models.
Without such a set of comparisons, identifying domains in which experts will
do well is speculation.
We have already suggested that some domains are inherently richer in
broken-leg and configural cues. The presence of these cues provides the op-
portunity for better performance but does not necessarily guarantee it. In
addition, the presence of feedback and the lack of noise have been suggested
as important variables in determining the performances of both experts and
expert systems (Carroll, 1987). Finally, Shanteau (1988) has suggested that
“good” experts are those in whom the underlying body of knowledge is more
developed, providing examples such as soil and livestock judgment.
6. IMPLICATIONS FOR THE STUDY OF EXPERTISE
Expertise should be identified by comparison to some standard of
performance. Random and novice performances make for natural compari-
sons. The linear-model literature suggests that simple statistical models pro-
vide another, demanding comparison.
The results from studies of expert decision making have had surprisingly
little effect on the study of expertise, even in related tasks. For instance,
simple linear models do quite well in medical-judgment tasks such as the
hypothetical task discussed at the beginning of this chapter. Yet most of the
work in aiding diagnosis has been aimed at developing expert systems that can
mimic human expert performance, not exceed or improve upon it.
Expert systems may predict less accurately than simple models because the
systems are too much like experts. The main lesson from the regression-model
literature is that large numbers of configural rules, which knowledge engi-
neers take as evidence of expertise, do not necessarily make good predictions;
simple linear combinations of variables (measured by experts) are better in
many tasks.
A somewhat ironic contrast between rule-based systems and linear models
has occurred in recent developments in connectionist models. Whereas these
models generally represent a relatively low level of cognitive activity, there
are some marked similarities to the noncognitive “paramorphic” regression
models we have discussed. In many realizations, a connectionist network is a
set of units with associated weights that specify constraints on how the units
combine the input received. The network generates weights that will maxi-
mize the goodness of fit of the system to the outcomes it observes in training
(Rumelhart, McClelland, & PDP Research Group, 1986).
In a single-layer system, each unit receives its input directly from the envi-
ronment. Thus, these systems appear almost isomorphic to simple regres-
sions, producing a model that takes environmental cues and combines them,
in a linear fashion, to provide the best fit to the outcomes. Much like regres-
sions, we would expect simple, single-layer networks to make surprisingly
good predictions under uncertainty (Jordan, 1986; Rumelhart et al., 1986).
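A minimal illustration of the single-layer point, with invented data: a single linear unit trained by gradient descent on squared error ends up at essentially the ordinary least-squares weights. This is a sketch of the equivalence, not code from the connectionist literature.

import numpy as np

rng = np.random.default_rng(4)
X = np.column_stack([np.ones(300), rng.normal(size=(300, 3))])
y = X @ np.array([0.2, 1.0, -0.5, 0.3]) + rng.normal(scale=0.5, size=300)

w = np.zeros(X.shape[1])
for _ in range(5000):                     # gradient descent on mean squared error
    w -= 0.1 * (X.T @ (X @ w - y)) / len(y)

w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w, w_ols, atol=1e-3))   # True: the unit has learned the regression weights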
More complex, multilayer systems allow for the incorporation of patterns of
cues, which resemble the configural cues reported by experts. As with human
experts, we suspect that such hidden units in these more complex systems will
not add much to predictive validity in many of the domains we have discussed.
The parallel between regression models and connectionist networks is pro-
vocative and represents an opportunity for bringing together two quite diver-
gent paradigms.
Finally, we note that this chapter stands in strong contrast to the chapters
that surround it: Our experts, while sharing many signs of superior expert
processing demonstrated in other domains, do not show superior perfor-
mance. The contrast suggests some closing notes. First, the history of the
study of expert decision making raises concerns about how experts are to be
identified. Being revered as an expert practitioner is not enough. Care should
be given to assessing actual performance. Second, the case study of decision
making may say something about the development of expertise in general and
the degree to which task characteristics promote or prevent the development
of superior performance. Experts fail when their cognitive abilities are badly
matched to environmental demands.
In this chapter we have tried to isolate the characteristics of decision tasks
that (1) generate such poor performance, (2) allow experts to believe that
they are doing well, and (3) allow us to believe in them. We hope that the
contrast between these conditions and those provided by other domains may
contribute to a broader, more informed view of expertise, accounting for
experts’ failures as well as their successes.
ACKNOWLEDGMENTS
The authors contributed equally; the order of authors’ names is purely
alphabetical. We thank Helmut Jungermann, as well as Anders Ericsson,
Jaqui Smith, and the other participants at the Study of Expertise conference in
Berlin, 25-28 June 1989, at the Max Planck Institute for Human Develop-
ment and Education, for many helpful comments. Preparation of this chapter
was supported by a grant from the Office of Naval Research and by NSF grant
SES 88-09299.
REFERENCES
Bedard, J. (1989). Expertise in auditing: Myth or reality? Accounting, Organizations
and Society, 14, 113-131.
Bedard, J., & Mock, T. J. (1989). Expert and novice problem-solving behavior in
audit planning: An experimental study. Unpublished paper, University of South-
ern California.
Berliner, H. J. (1980). Backgammon computer program beats world champion. Artifi-
cial Intelligence, 14, 205-220.
Bettman, J. B., & Park C. W. (1980). Effects of prior knowledge, exposure and phrase
of choice process on consumer decision processes. Journal of Consumer Research,
17, 234-248.
Blattberg, R. C., & Hoch, S. J. (1990). Database models and managerial intuition:
50% database+ 50% manager. Management Science, 36, 887-899.
Bouman, M. J. (1980). Application of information-processing and decision-making
research, I. In G. R. Ungson & D. N. Braunstein (Eds.), Decision making: An
Interdisciplinary inquiry (pp. 129-167). Boston: Kent Publishing.
Bowman, E. H. (1963). Consistency and optimality in management decision making.
Management Science, 10, 310-321.
Brehmer, B. (1980). In one word: Not from experience. Acta Psychologica, 45,
223-241.
Brucks, M. (1985). The effects of product class knowledge on information search
behavior. Journal of Consumer Research, 12, 1-16.
Brunswik, E. (1952). The conceptual framework of psychology. University of Chicago
Press.
Camerer, C. F. (1981a). The validity and utility of expert judgment. Unpublished
Ph.D. dissertation, Center for Decision Research, University of Chicago Gradu-
ate School of Business.
Camerer, C. F. (1981b). General conditions for the success of bootstrapping models.
Organizational Behavior and Human Performance, 27, 411-422.
Carroll, B. (1987). Expert systems for clinical diagnosis: Are they worth the effort?
Behavioral Science, 32, 274-292.
Chapman, L. J., & Chapman, J. P. (1967). Genesis of popular but erroneous psychodi-
agnostic observations. Journal of Abnormal Psychology, 73, 193-204.
Chapman, L. J., & Chapman, J. P. (1969). Illusory correlation as an obstacle to the use
of valid psychodiagnostic signs. Journal of Abnormal Psychology, 46, 271-280.
Chase, W. G. (1983). Spatial representations of taxi drivers. In D. R. Rogers & J. A.
Sloboda (Eds.), Acquisition of symbolic skills (pp. 391-405). New York: Plenum.
Chase, W. G., & Simon, H. A. (1973). Perception in chess. Cognitive Psychology, 4,
55-81.
Christensen-Szalanski, J. J. J., & Bushyhead, J. B. (1981). Physicians’ use of probabilis-
tic information in a real clinical setting. Journal of Experimental Psychology:
Human Perception and Performance, 7, 928-935.
Dawes, R. M. (1971). A case study of graduate admissions: Application of three
principles of human decision making. American Psychologist, 26, 180-188.
Dawes, R. M., & Corrigan, B. (1974). Linear models in decision making. Psychologi-
cal Bulletin, 81, 97.
Dawes, R. M., Faust, D., & Meehl, P. E. (1989). Clinical versus actuarial judgment.
Science, 243, 1668-1674.
DeSmet, A. A., Fryback, D. G., & Thornbury, J. R. (1979). A second look at the
utility of radiographic skull examination for trauma. American Journal of Radiol-
ogy, 132, 95-99.
Einhorn, H. J. (1974). Expert judgment: Some necessary conditions and an example.
Journal of Applied Psychology, 59, 562-571.
Einhorn, H. J. (1972). Expert measurement and mechanical combination. Organiza-
tional Behavior and Human Performance, 7, 86-106.
Einhorn, H. J. (1986). Accepting error to make less error. Journal of Personality
Assessment, 50, 387-395.
Einhorn, H. J., & Hogarth, R. M. (1975). Unit weighting schemas for decision mak-
ing. Organizational Behavior and Human Performance, 13, 171-192.
Einhorn, H. J., Kleinmuntz, D. N., & Kleinmuntz, B. (1979). Linear regression and
process tracing models of judgment. Psychological Review, 86, 465-485.
Elstein, A. S., Shulman, A. S., & Sprafka, S. A. (1978). Medical problem solving: An
analysis of clinical reasoning. Cambridge, MA: Harvard University Press.
Ericsson, K. A., & Chase, W. G. (1981). Exceptional memory. American Scientist,
70(6), 607-615.
Ericsson, K. A., & Polson, P. G. (1988). An experimental analysis of the mechanisms
of a memory skill. Journal of Experimental Psychology: Learning, Memory, and
Cognition, 14, 305-316.
Ericsson, K. A., & Simon, H. A. (1987). Verbal reports as data. Psychological Review,
87, 215-251.
Garb, H. N. (1989). Clinical judgment, clinical training, and professional experience.
Psychological Bulletin, 105, 387-396.
Goldberg, L. R. (1959). The effectiveness of clinicians’ judgments: The diagnosis of
organic brain damage from the Bender-Gestalt test. Journal of Consulting Psychol-
ogy, 23, 25-33.
Goldberg, L. R. (1968). Simple models or simple processes? American Psychologist, 23, 483-496.
Goldberg, L. R. (1970). Man versus model of man: A rationale, plus some evidence, for a method of improving on clinical inferences. Psychological Bulletin, 73, 422-432.
Gustafson, J. E. (1963). The computer for use in private practice. In Proceedings of Fifth IBM Medical Symposium, pp. 101-111. White Plains, NY: IBM Technical Publication Division.
Hammond, K. R. (1955). Probabilistic functioning and the clinical method. Psychological Review, 62, 255-262.
Hammond, K. R. (1987). Toward a unified approach to the study of expert judgment. In J. Mumpower, L. Phillips, O. Renn, & V. R. R. Uppuluri (Eds.), NATO ASI Series F: Computer & Systems Sciences: Vol. 35, Expert judgment and expert systems (pp. 1-16). Berlin: Springer-Verlag.
Hammond, K. R., Hamm, R. M., Grassia, J., & Pearson, T. (1987). Direct comparison of the efficacy of intuitive and analytical cognition in expert judgment. IEEE Transactions on Systems, Man, and Cybernetics, SMC-17, 753-770.
Hoffman, P. J. (1960). The paramorphic representation of clinical judgment. Psychological Bulletin, 57, 116-131.
Holland, J. H., Holyoak, K. J., Nisbett, R. E., & Thagard, P. R. (1986). Induction: Processes of inference, learning, and discovery. Cambridge, MA: MIT Press.
Johnson, E. J. (1980). Expertise in admissions judgment. Unpublished doctoral dissertation, Carnegie-Mellon University.
Johnson, E. J. (1988). Expertise and decision under uncertainty: Performance and process. In M. T. H. Chi, R. Glaser, & M. J. Farr (Eds.), The nature of expertise (pp. 209-228). Hillsdale, NJ: Erlbaum.
Johnson, E. J., & Payne, J. (1985). Effort and accuracy in choice. Management Science, 31, 395-414.
Johnson, E. J., & Russo, J. E. (1984). Product familiarity and learning new information. Journal of Consumer Research, 11, 542-550.
Johnson, P. E., Hassebrock, F., Duran, A. S., & Moller, J. (1982). Multimethod study of clinical judgment. Organizational Behavior and Human Performance, 30, 201-230.
Jordan, M. I. (1986). An introduction to linear algebra in parallel distributed processing. In D. Rumelhart, J. McClelland, & PDP Research Group (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition: Vol. 1. Foundations (pp. 365-422). Cambridge, MA: MIT Press.
Kahneman, D., Slovic, P., & Tversky, A. (1982). Judgment under uncertainty: Heuristics and biases. Cambridge University Press.
Keren, G. B. (1987). Facing uncertainty in the game of bridge: A calibration study. Organizational Behavior and Human Decision Processes, 139, 98-114.
Klayman, J. (1988). Cue discovery in probabilistic environments: Uncertainty and experimentation. Journal of Experimental Psychology: Learning, Memory, and Cognition, 14, 317-330.
Klayman, J., & Ha, Y. (1985). Confirmation, disconfirmation, and information in hypothesis testing. Psychological Review, 94, 211-228.
Kleinmuntz, B. (1968). Formal representation of human judgment. New York: Wiley.
Kundel, H. L., & LaFollette, P. S. (1972). Visual search patterns and experience with radiological images. Radiology, 103, 523-528.
Larkin, J., McDermott, J., Simon, D. P., & Simon, H. A. (1980). Expert and novice performance in solving physics problems. Science, 208, 1335-1342.
Levenberg, S. B. (1975). Professional training, psychodiagnostic skill, and kinetic family drawings. Journal of Personality Assessment, 39, 389-393.
Libby, R. (1976). Man versus model of man: Some conflicting evidence. Organizational Behavior and Human Performance, 16, 1-12.
Libby, R., & Frederick, D. M. (1989, February). Expertise and the ability to explain audit findings (University of Michigan Cognitive Science and Machine Intelligence Laboratory Technical Report No. 21).
Lichtenstein, S., Fischhoff, B., & Phillips, L. D. (1977). Calibration of probabilities: The state of the art. In H. Jungermann & G. de Zeeuw (Eds.), Decision making and change in human affairs. Amsterdam: D. Reidel.
Meehl, P. E. (1954). Clinical versus statistical prediction: A theoretical analysis and a review of the evidence. Minneapolis: University of Minnesota Press.
Meehl, P. E. (1986). Causes and effects of my disturbing little book. Journal of Personality Assessment, 50, 370-375.
Meyer, R. J. (1987). The learning of multiattribute judgment policies. Journal of Consumer Research, 14, 155-173.
Murphy, A. H., & Winkler, R. L. (1977). Can weather forecasters formulate reliable probability forecasts of precipitation and temperature? National Weather Digest, 2, 2-9.
Philadelphia Inquirer. (1989, August 15). Personality test gets revamped for the '80s, pp. 1-D, 3-D.
Rumelhart, D., McClelland, J., & PDP Research Group (Eds.). (1986). Parallel distributed processing: Explorations in the microstructure of cognition: Vol. 1. Foundations. Cambridge, MA: MIT Press.
Sarbin, T. R. (1944). The logic of prediction in psychology. Psychological Review, 51, 210-228.
Sawyer, J. (1966). Measurement and prediction, clinical and statistical. Psychological Bulletin, 66, 178-200.
Shanteau, J. (1988). Psychological characteristics and strategies of expert decision makers. Acta Psychologica, 68, 203-215.
Shortliffe, E. H., Buchanan, B. G., & Feigenbaum, E. A. (1979). Knowledge engineering for medical decision making: A review of computer-based decision aids. Proceedings of the IEEE, 67, 1207-1224.
Tversky, A., & Kahneman, D. (1973). Availability: A heuristic for judging frequency and probability. Cognitive Psychology, 4, 207-232.
Tversky, A., & Kahneman, D. (1982). Judgments of and by representativeness. In D. Kahneman, P. Slovic, & A. Tversky (Eds.), Judgment under uncertainty: Heuristics and biases (pp. 84-98). Cambridge: Cambridge University Press.
Voss, J. F., & Post, T. A. (1988). On the solving of ill-structured problems. In M. T. H. Chi, R. Glaser, & M. J. Farr (Eds.), The nature of expertise (pp. 261-285). Hillsdale, NJ: Erlbaum.
Wagenaar, W. A., & Keren, G. B. (1985). Calibration of probability assessments by professional blackjack dealers, statistical experts, and lay people. Organizational Behavior and Human Decision Processes, 36, 406-416.
Wiggins, N., & Kohen, E. S. (1971). Man vs. model of man revisited: The forecasting of graduate school success. Journal of Personality and Social Psychology, 19, 100-106.
Yntema, D. B., & Torgerson, W. J. (1961). Man-computer cooperation in decisions requiring common sense. IRE Transactions of the Professional Group on Human Factors in Electronics, 2, 20-26.

Tips for mastering the write-ups:
There rarely exist right answers to these questions. That’s what makes the prompts interesting, useful, and fun (we hope). Good write-ups will always reflect a solid understanding of the material but more importantly you should be able to apply the concepts to the prompt.
This means that you should not provide definitions and examples from the reading, but instead figure out what concepts are relevant and how they apply to this business situation. The following are a few tangible, specific tips based on years of grading write-ups. I offer them to you in roughly decreasing order of how frustrating their violations are to a grader.

1. Don’t regurgitate the reading. You never need to waste space including definitions from the reading. Write as if your audience not only has read the assigned materials but also knows them well. When necessary, cite a concept as briefly as possible. The fact that you’ve done the reading should be revealed to us by your thinking, NOT by some quotation.

2. Start quickly and end abruptly. For these short write-ups, introductions, background, and conclusions are entirely unnecessary. Even worse, they take away space that is better used in other ways. We don’t expect these things to read like English essays. Nor are we strangers to why you’re writing in the first place. Treat it like an email to a colleague and jump right in.

3. Choose specific over abstract. Precision is good. It’s good for communication, and it’s good for sharpening thinking. When you feel yourself getting fuzzy, think to yourself: I need an example. We love examples. Make it real.

4. Be realistic. There is nothing more irritating than a cute suggestion (for example, of how an organization might mitigate a particular bias) that works theoretically but is utterly infeasible in the real world. Perhaps the best criterion is to ask yourself if you’d be willing to sit in a manager’s office advocating his or her use of your recommendation.

5. Less is more. Believe it or not, a common mistake is to include too many ideas: not because too many ideas is itself bad, but because these ideas, as intriguing, tantalizing, and, yes, right as they might be, are often too poorly developed. Don’t make this mistake! We’re not impressed with laundry lists. It’s much better to write about a few things really well.

Oh, and have fun! This is an opportunity to be creative (the risk-reward tradeoff for creativity is very attractive). A student who is thoughtful and having fun when writing these is generally going to do pretty well. And get more out of it. Thanks!
