The Dunning-Kruger Effect is Autocorrelation

Have you heard of the ‘Dunning-Kruger effect’? It’s the (apparent) tendency for unskilled people to overestimate their competence. Discovered in 1999 by psychologists Justin Kruger and David Dunning, the effect has since become famous.

And you can see why.

It’s the kind of idea that is too juicy to not be true. Everyone ‘knows’ that idiots tend to be unaware of their own idiocy. Or as John Cleese puts it:

If you’re very very stupid, how can you possibly realize that you’re very very stupid?

Of course, psychologists have been careful to make sure that the evidence replicates. But sure enough, every time you look for it, the Dunning-Kruger effect leaps out of the data. So it would seem that everything’s on sound footing.

Except there’s a problem.

The Dunning-Kruger effect also emerges from data in which it shouldn’t. For instance, if you carefully craft random data so that it does not contain a Dunning-Kruger effect, you will still find the effect. The reason turns out to be embarrassingly simple: the Dunning-Kruger effect has nothing to do with human psychology.[1] It is a statistical artifact, a stunning example of autocorrelation.

What is autocorrelation?

Autocorrelation occurs when you correlate a variable with itself. For instance, if I measure the height of 10 people, I’ll find that each person’s height correlates perfectly with itself. If this sounds like circular reasoning, that’s because it is. Autocorrelation is the statistical equivalent of stating that 5 = 5.

When framed this way, the idea of autocorrelation sounds absurd. No competent scientist would correlate a variable with itself. And that’s true for the pure form of autocorrelation. But what if a variable gets mixed into both sides of an equation, where it is forgotten? In that case, autocorrelation is more difficult to spot.

Here’s an example. Suppose I am working with two variables, x and y. I find that these variables are completely uncorrelated, as shown in the left panel of Figure 1. So far so good.

Figure 1: Generating autocorrelation. The left panel plots the random variables x and y, which are uncorrelated. The right panel shows how this non-correlation can be transformed into an autocorrelation. We define a variable called z, which is correlated strongly with x. The problem is that z happens to be the sum x + y. So we are correlating x with itself. The variable y adds statistical noise.

Next, I start to play with the data. After a bit of manipulation, I come up with a quantity that I call z. I save my work and forget about it. Months later, my colleague revisits my dataset and discovers that z strongly correlates with x (Figure 1, right). We’ve discovered something interesting!

Actually, we’ve discovered autocorrelation. You see, unbeknownst to my colleague, I’ve defined the variable z to be the sum of x + y. As a result, when we correlate z with x, we are actually correlating x with itself. (The variable y comes along for the ride, providing statistical noise.) That’s how autocorrelation happens — forgetting that you’ve got the same variable on both sides of a correlation.

The Dunning-Kruger effect

Now that you understand autocorrelation, let’s talk about the Dunning-Kruger effect. Much like the example in Figure 1, the Dunning-Kruger effect amounts to autocorrelation. But instead of lurking within a relabeled variable, the Dunning-Kruger autocorrelation hides beneath a deceptive chart.[2]

Let’s have a look.

In 1999, Dunning and Kruger reported the results of a simple experiment. They got a bunch of people to complete a skills test. (Actually, Dunning and Kruger used several tests, but that’s irrelevant for my discussion.) Then they asked each person to assess their own ability. What Dunning and Kruger (thought they) found was that the people who did poorly on the skills test also tended to overestimate their ability. That’s the ‘Dunning-Kruger effect’.

Dunning and Kruger visualized their results as shown in Figure 2. It’s a simple chart that draws the eye to the difference between two curves. On the horizontal axis, Dunning and Kruger have placed people into four groups (quartiles) according to their test scores. In the plot, the two lines show the results within each group. The grey line indicates people’s average results on the skills test. The black line indicates their average ‘perceived ability’. Clearly, people who scored poorly on the skills test are overconfident in their abilities. (Or so it appears.)

Figure 2: The Dunning-Kruger chart. From Dunning and Kruger (1999). This figure shows how Dunning and Kruger reported their original findings. Dunning and Kruger gave a skills test to individuals, and also asked each person to estimate their ability. Dunning and Kruger then placed people into four groups based on their ranked test scores. This figure contrasts the (average) percentile of the ‘actual test score’ within each group (grey line) with the (average) percentile of ‘perceived ability’. The Dunning-Kruger ‘effect’ is the difference between the two curves — the (apparent) fact that unskilled people overestimate their ability.

On its own, the Dunning-Kruger chart seems convincing. Add in the fact that Dunning and Kruger are excellent writers, and you have the recipe for a hit paper. On that note, I recommend that you read their article, because it reminds us that good rhetoric is not the same as good science.

Deconstructing Dunning-Kruger

Now that you’ve seen the Dunning-Kruger chart, let’s show how it hides autocorrelation. To make things clear, I’ll annotate the chart as we go.

We’ll start with the horizontal axis. In the Dunning-Kruger chart, the horizontal axis is ‘categorical’, meaning it shows ‘categories’ rather than numerical values. Of course, there’s nothing wrong with plotting categories. But in this case, the categories are actually numerical. Dunning and Kruger take people’s test scores and place them into 4 ranked groups. (Statisticians call these groups ‘quartiles’.)

What this ranking means is that the horizontal axis effectively plots test score. Let’s call this score x.

Figure 3: Deconstructing the Dunning-Kruger chart. In the Dunning-Kruger chart, the horizontal axis ranks ‘actual test score’, which I’ll call x.

Next, let’s look at the vertical axis, which is marked ‘percentile’. What this means is that instead of plotting actual test scores, Dunning and Kruger plot the score’s ranking on a 100-point scale.[3]

Now let’s look at the curves. The line labeled ‘actual test score’ plots the average percentile of each quartile’s test score (a mouthful, I know). Things seem fine, until we realize that Dunning and Kruger are essentially plotting test score (x) against itself.[4] Noticing this fact, let’s relabel the grey line. It effectively plots x vs. x.

Figure 4: Deconstructing the Dunning-Kruger chart. In the Dunning-Kruger chart, the line marked ‘actual test score’ is plotting test score (x) against itself. In my notation, that’s x vs. x.
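To see why this line carries no information about the data, here is a minimal Python sketch. The helper names `percentile_ranks` and `quartile_means` are hypothetical, the two score distributions are made up, and I use a midpoint convention for percentile ranks (an assumption; other conventions shift the values slightly).

```python
import random
from statistics import mean

def percentile_ranks(values):
    """Rank each value on a 0-100 scale (midpoint convention; ties not handled)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for position, i in enumerate(order):
        ranks[i] = 100 * (position + 0.5) / len(values)
    return ranks

def quartile_means(percentiles):
    """Average percentile within each quartile of the same ranking."""
    groups = [[], [], [], []]
    for p in percentiles:
        groups[min(int(p // 25), 3)].append(p)
    return [mean(g) for g in groups]

random.seed(2)  # arbitrary seed, only for repeatability
n = 1000

# Two very different, made-up score distributions.
uniform_scores = [random.uniform(0, 100) for _ in range(n)]
skewed_scores = [random.expovariate(1.0) for _ in range(n)]

uniform_line = quartile_means(percentile_ranks(uniform_scores))
skewed_line = quartile_means(percentile_ranks(skewed_scores))

print([round(v, 1) for v in uniform_line])
print([round(v, 1) for v in skewed_line])
```

Whatever scores go in, the ‘actual test score’ line comes out at roughly 12.5, 37.5, 62.5 and 87.5, the midpoints of the four quartiles.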

Moving on, let’s look at the line labeled ‘perceived ability’. This line measures the average percentile for each group’s self assessment. Let’s call this self-assessment y. Recalling that we’ve labeled ‘actual test score’ as x, we see that the black line plots y vs. x.

Figure 5: Deconstructing the Dunning-Kruger chart. In the Dunning-Kruger chart, the line marked ‘perceived ability’ is plotting ‘perceived ability’ y against actual test score x.

So far, nothing jumps out as obviously wrong. Yes, it’s a bit weird to plot x vs. x. But Dunning and Kruger are not claiming that this line alone is important. What’s important is the difference between the two lines (‘perceived ability’ vs. ‘actual test score’). It’s in this difference that the autocorrelation appears.

In mathematical terms, a ‘difference’ means ‘subtract’. So by showing us two diverging lines, Dunning and Kruger are (implicitly) asking us to subtract one from the other: take ‘perceived ability’ and subtract ‘actual test score’. In my notation, that corresponds to y – x.

Figure 6: Deconstructing the Dunning-Kruger chart. To interpret the Dunning-Kruger chart, we (implicitly) look at the difference between the two curves. That corresponds to taking ‘perceived ability’ and subtracting from it ‘actual test score’. In my notation, that difference is y – x (indicated by the double-headed arrow). When we judge this difference as a function of the horizontal axis, we are implicitly comparing y – x to x. Since x is on both sides of the comparison, the result will be an autocorrelation.

Subtracting y – x seems fine, until we realize that we’re supposed to interpret this difference as a function of the horizontal axis. But the horizontal axis plots test score x. So we are (implicitly) asked to compare y – x to x:

(y – x) ~ x

Do you see the problem? We’re comparing x with the negative version of itself. That is textbook autocorrelation. It means that we can throw random numbers into x and y — numbers which could not possibly contain the Dunning-Kruger effect — and yet out the other end, the effect will still emerge.
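The size of this built-in correlation can be worked out directly. As a sketch, assume x and y are independent (so their covariance is zero) with standard deviations σ_x and σ_y. These symbols are my own notation, not Dunning and Kruger’s.

```latex
\begin{aligned}
\operatorname{Cov}(y - x,\, x) &= \operatorname{Cov}(y, x) - \operatorname{Var}(x) = -\sigma_x^2 \\
\operatorname{Var}(y - x) &= \sigma_y^2 + \sigma_x^2 \\
\operatorname{Corr}(y - x,\, x) &= \frac{-\sigma_x^2}{\sigma_x \sqrt{\sigma_x^2 + \sigma_y^2}}
  = -\frac{\sigma_x}{\sqrt{\sigma_x^2 + \sigma_y^2}}
\end{aligned}
```

When the two variables have the same spread, this comes to −1/√2, or about −0.71, even though y contains no information about x.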

Replicating Dunning-Kruger

To be honest, I’m not particularly convinced by the analytic arguments above. It’s only by using real data that I can understand the problem with the Dunning-Kruger effect. So let’s have a look at some real numbers.

Suppose we are psychologists who get a big grant to replicate the Dunning-Kruger experiment. We recruit 1000 people, give them each a skills test, and ask them to report a self-assessment. When the results are in, we have a look at the data.

It doesn’t look good.

When we plot individuals’ test score against their self assessment, the data appear completely random. Figure 7 shows the pattern. It seems that people of all abilities are equally terrible at predicting their skill. There is no hint of a Dunning-Kruger effect.

Figure 7: A failed replication. This figure shows the results of a thought experiment in which we try to replicate the Dunning-Kruger effect. We get 1000 people to take a skills test and to estimate their own ability. Here, we plot the raw data. Each point represents an individual’s result, with ‘actual test score’ on the horizontal axis, and ‘self assessment’ on the vertical axis. There is no hint of a Dunning-Kruger effect.

After looking at our raw data, we’re worried that we did something wrong. Many other researchers have replicated the Dunning-Kruger effect. Did we make a mistake in our experiment?

Unfortunately, we can’t collect more data. (We’ve run out of money.) But we can play with the analysis. A colleague suggests that instead of plotting the raw data, we calculate each person’s ‘self-assessment error’. This error is the difference between a person’s self assessment and their test score. Perhaps this assessment error relates to actual test score?

We run the numbers and, to our amazement, find an enormous effect. Figure 8 shows the results. It seems that unskilled people are massively overconfident, while skilled people are overly modest.

(Our lab tech points out that the correlation is surprisingly tight, almost as if the numbers were picked by hand. But we push this observation out of mind and forge ahead.)

Figure 8: Maybe the experiment was successful? Using the raw data from Figure 7, this figure calculates the ‘self-assessment error’, the difference between an individual’s self assessment and their actual test score. This assessment error (vertical axis) correlates strongly with actual test score (horizontal axis).

Buoyed by our success in Figure 8, we decide that the results may not be ‘bad’ after all. So we throw the data into the Dunning-Kruger chart to see what happens. We find that despite our misgivings about the data, the Dunning-Kruger effect was there all along. In fact, as Figure 9 shows, our effect is even bigger than the original (from Figure 2).

Figure 9: Recovering Dunning and Kruger. Despite the apparent lack of effect in our raw data (Figure 7), when we plug this data into the Dunning-Kruger chart, we get a massive effect. People who are unskilled over-estimate their abilities. And people who are skilled are too modest.

Things fall apart

Pleased with our successful replication, we start to write up our results. Then things fall apart. Riddled with guilt, our data curator comes clean: he lost the data from our experiment and, in a fit of panic, replaced it with random numbers. Our results, he confides, are based on statistical noise.

Devastated, we return to our data to make sense of what went wrong. If we have been working with random numbers, how could we possibly have replicated the Dunning-Kruger effect? To figure out what happened, we drop the pretense that we’re working with psychological data. We relabel our charts in terms of abstract variables x and y. By doing so, we discover that our apparent ‘effect’ is actually autocorrelation.

Figure 10 breaks it down. Our dataset is comprised of statistical noise — two random variables, x and y, that are completely unrelated (Figure 10A). When we calculated the ‘self-assessment error’, we took the difference between y and x. Unsurprisingly, we find that this difference correlates with x (Figure 10B). But that’s because x is autocorrelating with itself. Finally, we break down the Dunning-Kruger chart and realize that it too is based on autocorrelation (Figure 10C). It asks us to interpret the difference between y and x as a function of x. It’s the autocorrelation from panel B, wrapped in a more deceptive veneer.

Figure 10: Dropping the psychological pretense. This figure repeats the analysis shown in Figures 7–9, but drops the pretense that we’re dealing with human psychology. We’re working with random variables x and y that are drawn from a uniform distribution. Panel A shows that the variables are completely uncorrelated. Panel B shows that when we plot y – x against x, we get a strong correlation. But that’s because we have correlated x with itself. In panel C, we input these variables into the Dunning-Kruger chart. Again, the apparent effect amounts to autocorrelation: interpreting y – x as a function of x.
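Panel C can be sketched the same way. The block below is an illustration under stated assumptions rather than the code behind Figure 10: the helper `percentile_ranks` is hypothetical, percentile ranks use a midpoint convention, and x and y are uniform random numbers. It groups people into quartiles by x and averages the percentile of both x and y within each group.

```python
import random
from statistics import mean

def percentile_ranks(values):
    """Rank each value on a 0-100 scale (midpoint convention; ties not handled)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for position, i in enumerate(order):
        ranks[i] = 100 * (position + 0.5) / len(values)
    return ranks

random.seed(3)  # arbitrary seed, only for repeatability
n = 1000

x = [random.uniform(0, 100) for _ in range(n)]  # 'actual test score'
y = [random.uniform(0, 100) for _ in range(n)]  # 'perceived ability'

x_pct = percentile_ranks(x)
y_pct = percentile_ranks(y)

# Group people into quartiles by actual score, then average both percentiles.
actual_line, perceived_line = [], []
for q in range(4):
    members = [i for i in range(n) if min(int(x_pct[i] // 25), 3) == q]
    actual_line.append(mean(x_pct[i] for i in members))
    perceived_line.append(mean(y_pct[i] for i in members))

for q in range(4):
    gap = perceived_line[q] - actual_line[q]
    print(f"quartile {q + 1}: actual {actual_line[q]:5.1f}, "
          f"perceived {perceived_line[q]:5.1f}, gap {gap:+6.1f}")
```

The ‘perceived ability’ averages sit near 50 in every quartile, while the ‘actual test score’ averages climb from about 12.5 to about 87.5. The gap between the two lines is the ‘effect’.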

The point of this story is to illustrate that the Dunning-Kruger effect has nothing to do with human psychology. It is a statistical artifact — an example of autocorrelation hiding in plain sight.

What’s interesting is how long it took for researchers to realize the flaw in Dunning and Kruger’s analysis. Dunning and Kruger published their results in 1999. But it took until 2016 for the mistake to be fully understood. To my knowledge, Edward Nuhfer and colleagues were the first to exhaustively debunk the Dunning-Kruger effect. (See their joint papers in 2016 and 2017.) In 2020, Gilles Gignac and Marcin Zajenkowski published a similar critique.

Once you read these critiques, it becomes painfully obvious that the Dunning-Kruger effect is a statistical artifact. But to date, very few people know this fact. Collectively, the three critique papers have about 90 times fewer citations than the original Dunning-Kruger article.[5] So it appears that most scientists still think that the Dunning-Kruger effect is a robust aspect of human psychology.[6]

No sign of Dunning-Kruger

The problem with the Dunning-Kruger chart is that it violates a fundamental principle in statistics. If you’re going to correlate two sets of data, they must be measured independently. In the Dunning-Kruger chart, this principle gets violated. The chart mixes test score into both axes, giving rise to autocorrelation.

Realizing this mistake, Edward Nuhfer and colleagues asked an interesting question: what happens to the Dunning-Kruger effect if it is measured in a way that is statistically valid? According to Nuhfer’s evidence, the answer is that the effect disappears.

Figure 11 shows their results. What’s important here is that people’s ‘skill’ is measured independently from their test performance and self assessment. To measure ‘skill’, Nuhfer groups individuals by their education level, shown on the horizontal axis. The vertical axis then plots the error in people’s self assessment. Each point represents an individual.

Figure 11: A statistically valid test of the Dunning-Kruger effect. This figure shows Nuhfer and colleagues’ 2017 test of the Dunning-Kruger effect. Similar to Figure 8, this chart plots people’s skill against their error in self assessment. But unlike Figure 8, here the variables are statistically independent. The horizontal axis measures skill using academic rank. The vertical axis measures self-assessment error as follows. Nuhfer takes a person’s score on the SLCI test (science literacy concept inventory test) and subtracts it from the person’s self assessment, called KSSLCI (knowledge survey of the SLCI test). Each black point indicates the self-assessment error of an individual. Green bubbles indicate means within each group, with the associated confidence interval. The fact that the green bubbles overlap the zero-effect line indicates that within each group, the averages are not statistically different from 0. In other words, there is no evidence for a Dunning-Kruger effect.

If the Dunning-Kruger effect were present, it would show up in Figure 11 as a downward trend in the data (similar to the trend in Figure 8). Such a trend would indicate that unskilled people overestimate their ability, and that this overestimate decreases with skill. Looking at Figure 11, there is no hint of a trend. Instead, the average assessment error (indicated by the green bubbles) hovers around zero. In other words, assessment bias is trivially small.

Although there is no hint of a Dunning-Kruger effect, Figure 11 does show an interesting pattern. Moving from left to right, the spread in self-assessment error tends to decrease with more education. In other words, professors are generally better at assessing their ability than are freshmen. That makes sense. Notice, though, that this increasing accuracy is different from the Dunning-Kruger effect, which is about systematic bias in the average assessment. No such bias exists in Nuhfer’s data.

Unskilled and unaware of it

Mistakes happen. So in that sense, we should not fault Dunning and Kruger for having erred. However, there is a delightful irony to the circumstances of their blunder. Here are two Ivy League professors[7] arguing that unskilled people have a ‘dual burden’: not only are unskilled people ‘incompetent’ … they are unaware of their own incompetence.

The irony is that the situation is actually reversed. In their seminal paper, Dunning and Kruger are the ones broadcasting their (statistical) incompetence by mistaking autocorrelation for a psychological effect. In this light, the paper’s title may still be appropriate. It’s just that it was the authors (not the test subjects) who were ‘unskilled and unaware of it’.




This work is licensed under a Creative Commons Attribution 4.0 License. You can use/share it any way you want, provided you attribute it to me (Blair Fix) and link to Economics from the Top Down.


Notes

Cover image: Nevit Dilmen, altered.

  1. The Dunning-Kruger effect tells us nothing about the people it purports to measure. But it does tell us about the psychology of social scientists, who apparently struggle with statistics.

  2. It seems clear that Dunning and Kruger didn’t mean to be deceptive. Instead, it appears that they fooled themselves (and many others). On that note, I’m ashamed to say that I read Dunning and Kruger’s paper a few years ago and didn’t spot anything wrong. It was only after reading Jonathan Jarry’s blog post that I clued in. That’s embarrassing, because a major theme of this blog has been me pointing out how economists appeal to autocorrelation when they test their theories of value. (Examples here, here, here, here, and here.) I take solace in the fact that many scientists were similarly hoodwinked by the Dunning-Kruger chart.

  3. The conversion to percentiles introduces a second bias (in addition to the problem of autocorrelation). By definition, percentiles have a floor (0) and a ceiling (100), and are uniformly distributed between these bounds. If you are close to the floor, it is impossible for you to underestimate your rank. Therefore, the ‘unskilled’ will appear overconfident. And if you are close to the ceiling, you cannot overestimate your rank. Therefore, the ‘skilled’ will appear too modest. See Nuhfer et al. (2016) for more details.

  4. In technical terms, Dunning and Kruger are plotting two different forms of ranking against each other: test-score ‘percentile’ against test-score ‘quartile’. What is not obvious is that this type of plot is data independent. By definition, each quartile contains 25 percentiles whose average corresponds to the midpoint of the quartile. The consequence of this truism is that the line labeled ‘actual test score’ tells us (paradoxically) nothing about people’s actual test score.

  5. According to Google Scholar, the three critique papers (Nuhfer 2016, 2017 and Gignac and Zajenkowski 2020) have 88 citations collectively. In contrast, Dunning and Kruger (1999) has 7893 citations.

  6. The slow dissemination of ‘debunkings’ is a common problem in science. Even when the original (flawed) papers are retracted, they often continue to accumulate citations. And then there’s the fact that critique papers are rarely published in the same journal that hosted the original paper. So a flawed article in Nature is likely to be debunked in a more obscure journal. This asymmetry is partially why I’m writing about the Dunning-Kruger effect here. I think the critique raised by Nuhfer et al. (and Gignac and Zajenkowski) deserves to be well known.

  7. When Dunning and Kruger published their 1999 paper, they both worked at Cornell University.

Further reading

Gignac, G. E., & Zajenkowski, M. (2020). The Dunning-Kruger effect is (mostly) a statistical artefact: Valid approaches to testing the hypothesis with individual differences data. Intelligence, 80, 101449.

Kruger, J., & Dunning, D. (1999). Unskilled and unaware of it: How difficulties in recognizing one’s own incompetence lead to inflated self-assessments. Journal of Personality and Social Psychology, 77(6), 1121.

Nuhfer, E., Cogan, C., Fleisher, S., Gaze, E., & Wirth, K. (2016). Random number simulations reveal how random noise affects the measurements and graphical portrayals of self-assessed competency. Numeracy: Advancing Education in Quantitative Literacy, 9(1).

Nuhfer, E., Fleisher, S., Cogan, C., Wirth, K., & Gaze, E. (2017). How random noise and a graphical convention subverted behavioral scientists’ explanations of self-assessment data: Numeracy underlies better alternatives. Numeracy: Advancing Education in Quantitative Literacy, 10(1).

49 comments

  1. The argument here is terrible. Without DK, you would expect y=x+noise. That means y-x=noise. And noise does not correlate with x. In other words your synthetic experiment where y=noise does not make sense. In particular, for low scores, you would /indeed/ have participants over-estimating their score and for high scores they would indeed under-estimate it, since the average perceived score under your model is constant.

    • Perhaps you should read the Gignac and Zajenkowski paper, where they show that a Dunning-Kruger effect emerges in a model where (just as you say), y = x + noise. The only condition where a DK effect will not emerge is if y = x, and there is no statistical noise. That’s because the noise is the apparent effect.
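      One way to see this is with a quick simulation. The sketch below is my own setup, not the procedure from the Gignac and Zajenkowski paper: it assumes normally distributed scores, zero-mean normal noise with the same spread, a hypothetical `percentile_ranks` helper, and the percentile-and-quartile comparison from the Dunning-Kruger chart.

```python
import random
from statistics import mean

def percentile_ranks(values):
    """Rank each value on a 0-100 scale (midpoint convention; ties not handled)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for position, i in enumerate(order):
        ranks[i] = 100 * (position + 0.5) / len(values)
    return ranks

random.seed(4)  # arbitrary seed, only for repeatability
n = 10_000

x = [random.gauss(0, 1) for _ in range(n)]  # 'actual' score
y = [xi + random.gauss(0, 1) for xi in x]   # self assessment = score + zero-mean noise

x_pct = percentile_ranks(x)
y_pct = percentile_ranks(y)

# Compare perceived and actual percentile in the bottom and top quartiles.
bottom = [i for i in range(n) if x_pct[i] < 25]
top = [i for i in range(n) if x_pct[i] >= 75]

bottom_gap = mean(y_pct[i] - x_pct[i] for i in bottom)  # positive: 'overconfident'
top_gap = mean(y_pct[i] - x_pct[i] for i in top)        # negative: 'too modest'

print(f"bottom quartile gap: {bottom_gap:+.1f}")
print(f"top quartile gap:    {top_gap:+.1f}")
```

      The noise is unbiased by construction, yet the bottom quartile appears to overestimate and the top quartile appears to underestimate.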

      • When I mentioned “+noise”, I assumed a “0-mean i.i.d random variable”. The paper by Gignac and Zajenkowski says that this model cannot be correct, since for extreme actual scores you necessarily have non-symmetric noise (e.g., if you have an actual score of 0, a zero-mean noise would mean some people would self-assess their score as some negative number, which is not possible). So you end up with a noise with positive mean at the low-end spectrum of scores, and of negative mean at the high-end spectrum. This creates a spurious Dunning Kruger effect which they account for, and that’s not at all what you refer to in your blog. In your blog, you claim for instance that your Fig 7 shows a pool of (virtual) participants who have no Dunning Kruger effect because they all reply randomly to their test whatever their actual skill level is. So, in your pool of participants, the unskilled ones rate themselves randomly with an average of 50, and the skilled ones rate themselves randomly with an average of 50. And that exactly shows a very extreme Dunning Kruger effect: the unskilled participants have rated themselves exactly like the experts on average, although they should have rated them much lower !
        Now you claim that DK will not emerge unless there is no noise and y=x. That’s only true if you do not account for heteroscedasticity like Gignac and Zajenkowski do. But that’s not what someone reading your blog will understand after reading your article.

      • Nicolas,

        There are many ways to look at the problem with the Dunning-Kruger effect. ‘Autocorrelation’ is one that I chose to write about. Another (related) way to think about it is in terms of the lack of ‘statistical independence’, which is what both Nuhfer papers discuss (at great length). I challenge you to read those papers and still maintain that nothing’s wrong with Dunning and Kruger’s analysis. I don’t think it can be done.

        Now to your point. You observe that the Dunning-Kruger effect appears in my simulated data:

        in your pool of participants, the unskilled ones rate themselves randomly with an average of 50, and the skilled ones rate themselves randomly with an average of 50. And that exactly shows a very extreme Dunning Kruger effect: the unskilled participants have rated themselves exactly like the experts on average, although they should have rated them much lower!

        This is an extremely common point of confusion that cuts to the heart of why the Dunning-Kruger analysis is flawed. You are drawing an inference from data that is statistically dependent. That leads to flawed conclusions about the real world.

        You are restating what my charts (Figures 8 and 9) already show, namely that the Dunning-Kruger effect appears in the data … when we use the Dunning-Kruger analysis. I said as much in the article. The problem is that this perceived effect is spurious. It leads you to conclude that unskilled people systematically overestimate their ability. But that is a flawed conclusion, because the data came from a model in which neither skill nor self-assessment ability exist.

        Let’s break it down.

        The model

        I have created a model in which there is no such thing as ‘skill’. In the model, a person’s ‘skill’ (test score) is drawn randomly from a uniform distribution. So on one ‘test’ the person might score 100%. On the next test, they might score 5%. And so on. All scores are equally likely. Therefore ‘skill’ does not exist in any meaningful sense.

        Similarly, in this model ‘self awareness’ does not exist, for the same reasons. A person’s self assessment is a random number between 0 and 100. No one has any idea of their skill. Nor, for that matter, do they have ‘skill’ to begin with.

        The consequence of this model is that any relation between the person’s test score and their perceived ability arises purely by accident. That’s because, by design, this is a model in which skill does not exist, nor does self perception.

        Now here is the problem. Using the technique that Dunning and Kruger created, we do find an effect — a very big one. As shown in Figure 8 and 9, we find that ‘unskilled’ people over-estimate their competence, and ‘skilled’ people under-estimate their competence.

        No one disputes that we find this ‘effect’ in the data using Dunning and Kruger’s method. What we dispute is whether this apparent effect is the correct inference about human psychology. In this case, it is manifestly false. To belabor the point, that’s because we’ve designed the model so that skill and self-perception do not exist. So there is no effect to find, yet we found one anyway.

        What went wrong?

        Well, the problem is that the Dunning-Kruger method violates the principle of statistical independence, as I discussed at great length in the article. But since you don’t seem to like my arguments, let’s look at things differently.

        When we violate statistical independence, we conflate the ranking of specific data with a generalized effect. Let’s illustrate. The consequence of my model is that on average, people peg their self-assessment at 50%. That’s just the way a uniform distribution works.

        Now, for any particular sample of data, some people will score highly on the skills test (say, in the top quartile). These people, on average, will have ‘under-estimated’ their ability. But — and this is crucial — if we ran the test again, we would not find these people in the top quartile. Why? Because in my model, ‘skill’ does not exist. It’s a random number between 0 and 100. So keeping the same self assessment, but now with another measurement of ‘skill’, we find that the same people who previously ‘underestimated’ their skill (on average) now behave completely differently. That contradicts our previous inference (which is not surprising, because the inference is wrong).

        The crucial mistake in your reasoning, Nicolas, is that you cannot draw an inference from data that are statistically dependent, as is the case with the Dunning-Kruger chart. That is the key point in the Nuhfer papers, so I urge you to read them.

        If we want to make a valid inference, we must measure ‘self-assessment error’ independently from ‘skill’, as shown in Figure 11. If we were to do this for my model, we would find (just like Nuhfer) no effect. Here’s the protocol.

        Step 1: sample individuals’ test score and self-assessment (x and y)

        Step 2: calculate each person’s self-assessment error (y – x)

        Step 3: resample each person’s test score (let’s call this x-prime)

        Step 4: see if self-assessment error correlates with the resampled test score: (y – x) ~ x-prime

        You will find no correlation — no Dunning-Kruger effect because x-prime will be completely different from x. Again, that’s because the model assumes that skill does not exist.

        To wrap things up, the problem with Dunning and Kruger’s method (as I’ve just shown) is that you can use it to infer something that is manifestly false. In this case, we take a model in which neither skill nor self-assessment ability exist, and conclude that both exist, and that ‘unskilled’ people over-estimate their ability.

        That’s just bad science.

      • (Oh, you could also have no DK effect when y=x and a noise model with 0-mean and a variance that is 0 at the extremes and varies with actual scores. 0 noise may not be very realistic, but at extreme values of the spectrum, the noise should be much much smaller than for medium actual scores, in such a way that statistical tests would fail at detecting any DK effect when using a moderate number of participants)

  2. Interesting that you did not include this study which confirms DK in political situations. https://onlinelibrary.wiley.com/doi/abs/10.1111/pops.12490

    This raises a far more interesting point than a thumbs-up/thumbs-down evaluation. Working with and teaching consumer behavior, we know that people respond to expectations around them while fearing being shunned by their sub-culture for not rising to expectations. These influences seem likely to be the ones which drive people to assume they have higher knowledge than they do.

    Having spoken with many of these people about politics, it’s interesting that their confidence in their own beliefs, despite very low awareness of how government, politics, the economy, or society works, has all the appearance of being chosen to support their beliefs OR to ensure they are not shunned by their co-believers.

    Your article is concerning because you have overstated the certainty of DK’s conclusion: “people hold overly favorable views of their abilities in many social and intellectual domains” [Kruger, J., & Dunning, D. (1999)] THEN overstated your debunking, apparently hoping to kill the idea forever even though we see it quite often in practice.

    This truth is likely NOT nomothetic (universal truth as with the general goal of economists these days) but quite likely idiographic (individual, situation dependent) like most behavior — depending on the situation, background, etc.

    After all, the same people I meet who over-estimate their knowledge of economics would NEVER over-estimate their knowledge of cars or computers.

    As to the clever arguments put forth here, you seem to want to muddy the waters rather than comprehend and move society forward. Is it most useful to debunk DK? I don’t think so — it really does exist as is confirmed by some studies while other studies don’t find it. All of that points to the idiographic nature of the issue. And that means debunking it is a disservice to society.

    What we need is INSIGHT: an understanding of where and when it has an effect.

    • Hi Doug,

      Thanks for the comments. About the Anson paper, it uses the same flawed method described in my post, so it doesn’t constitute ‘evidence’ for a Dunning-Kruger effect.

      As to the rest of your comments, I’m not ruling out that a Dunning-Kruger effect could be part of human psychology. What I’m saying is that the evidence presented by Dunning and Kruger is critically flawed to the point that it should be ignored. And when people have looked for a DK effect in a way that is statistically sound, they have not found one.

      I agree with you that individual behavior is idiosyncratic. And that seems to be why there is no universal DK effect. But what made the DK effect popular was the idea that it was universal.

      As to the clever arguments put forth here, you seem to want to muddy the waters rather than comprehend and move society forward.

      I find this statement odd. In science, pointing out a flaw in an analysis is not ‘muddying the waters’. It is part of doing science … searching for the truth. Scientific progress depends on debunking ideas that are false. Crucially, this type of debunking is an insight in itself.

  3. Well, speaking of autocorrelation, it was the reason I was skeptical about one of your recent articles, «In Search of Sabotage»: https://economicsfromthetopdown.com/2022/03/11/in-search-of-sabotage/

    In this article, you used a «power index» which was computed from a stock price index and the average wage, except that, as «national income data is more widely available than average wage data», you used national income per capita as a proxy for the wage. This first step looks strange to me because, if I’m not wrong, national income includes capital income, which seems to me to be highly correlated with the stock price itself.

    Further, in the same article, you look for a correlation between this power index and the top 1% share of income. Here again, this looks strange to me, as I think it is well established that the richer you are, the more of your income is capital income.

    Didn’t it look like an autocorrelation matter to you?

    • It’s important to distinguish ‘autocorrelation’ from real-world ‘connection’. ‘Autocorrelation’ involves correlating the same data with itself. A real-world connection means that you have good reason to suspect that two sets of data (which are measured independently) should correlate.

      Regarding the power index and income inequality, the second is correct. If I earn income purely from owning stock, my income will rise when stocks rise. That said, when the World Inequality database measures income, they don’t measure stock prices. They measure realized income from capital gains. Now obviously stock prices and capital gains are connected. But they are not the same thing. They are measured independently, and that’s what matters.

      So here’s what I think you’re saying. If rich people own stocks and stock prices rise relative to wages, then we expect that the income share of the top 1% will increase. I completely agree. This is a model of why the two sets of data should correlate.

      What’s important (regarding autocorrelation) is that they don’t have to correlate. For instance, we could imagine a world in which everyone owned the same amount of stock. In this world, the power index would not correlate with income inequality, because everyone would benefit from rising stock prices. This just happens to not be the world we live in.

  4. Should someone update the Wikipedia article on the Dunning Kruger effect, which does not currently mention autocorrelation at all?

    • Yes, that would be a good idea. My guess, though, is that given how popular the Dunning-Kruger effect has become, any Wikipedia edits may be quickly undone. But I guess it’s worth a try.

  5. The underestimating and overestimating part of DK is clearly an illusion. When your score is very high, there are not many higher scores left, so you have less room to wrongly estimate it higher than to wrongly estimate it lower. The same goes for when your score is very low: less room to wrongly estimate it lower and more room to wrongly estimate it higher. That is simply forced by the available room for error.

    But still, comparing Figure 2 and Figure 9, we can say that the DK charts show people at all skill levels overestimating themselves just a little bit more than random error would imply, and that for more skilled people there is less room left for overestimating.

    Indeed there is no clear correlation between over or underestimating by less or more skills. It is all determined by room left for error.

  6. But the actual score and perceived score are different from one another. There’s no correlation and they’re not using the same source.

  7. Hi Blair, I really appreciate the article and the effort you make with your site. Perhaps you could clear something up for me:

    Isn’t the whole point of DK that less skilled people tend to have less accurate self-assessments (specifically by overestimating of course) while more skilled people tend to have more accurate self-assessments?

    You seem to be working under the premise that DK is simply claiming that less skilled people tend to overestimate their abilities and more skilled people tend to underestimate their abilities. And as you rightfully point out, such an exact scenario would already hold true in a world where everyone had equally zero ability to make self-assessments (i.e. they make them completely at random). You replicated this through uniformly distributed random data and showed how y – x is a large positive for less skilled people and a large negative for more skilled people in Figure 9. This shows how that mere fact alone doesn’t provide evidence of any difference in the self-assessment abilities of less skilled people and more skilled people. That is definitely true.

    But the Figure 9 random data is NOT the same as the actual DK data in Figure 2. Yes, they both feature y – x being positive for less skilled people and negative for more skilled people. But the key difference here lies in the size, the absolute value of y – x. While it is more or less equally large in Q1 and Q4 in Figure 9, it is notably smaller in Q4 (more skilled people) in Figure 2. Yes they are still underestimating themselves, but they don’t do so as badly as how the less skilled people overestimate themselves. Now that is a notable difference not found in random data.

    That suggests that more skilled people tend to have more accurate self-assessments than less skilled people. It still holds true to say that less skilled people tend to be overconfident while more skilled people tend to assess themselves more accurately. I think that’s the notable finding of DK.

    • Hi Josen,

      Yes, you are correct that the uniform distribution does not exactly reproduce the original evidence for the DK effect. The papers that I referenced show how you can do it better.

      First, you assume that ‘assessment’ scales exactly with ‘skill’. (So you’re assuming no Dunning-Kruger effect.) In my notation, this means y scales with x. Then you add a bunch of noise to the relation. The more noise you add, the bigger the apparent DK effect. The Nuhfer papers break this down nicely.

      Now to your point about the size of overestimating vs underestimating. The point is that the way the DK chart works, you cannot make any statistically valid inferences. So it just doesn’t matter what the chart claims to show. The data are not independent. So if you want to test the DK effect, you need a different method, like the one that Nuhfer uses.

      Now, what Nuhfer found was that there is a trend in accuracy, as I noted in the article. More educated people are more accurate at assessing their skill. But this is not the DK effect, which posited a systemic bias in assessment (i.e. over-estimating). I don’t see any good evidence for such a bias. People of all education levels over-estimate their skill as much as they under-estimate it.

    • While the autocorrelation argument is intuitive and intriguing, I do not believe that it is sufficient to debunk DK. In fact, the argument you make seems to be a bit too black and white (DK is perfect or complete nonsense).

      To illustrate my point – regression was first observed in the context of height. I can objectively see whether I’m shorter or taller than others – shouldn’t short people then, on average, estimate their height as correctly as tall people, so that if you plot average actual and estimated height, respectively, against quartile of actual height, the two diagonal lines are identical for a large sample? DK’s point is not that these two lines would show perfect autocorrelation – it’s about the ABSENCE of perfect autocorrelation.

      It seems to matter a lot how easy it should be for people to objectively assess their skill. At school, when the teams were picked, I and my friend ALWAYS were the last ones to get picked, so we objectively knew. Yes, if skill assessment is subjective, it is true that for very low skilled workers the average estimate of skill must be better than the actual as long as there is some error – but for the line of estimated skills to be flat, it means, as you say, that people are completely unable to rate their skills – I find that highly unlikely for many types of activities, and if it was true, it would be an important insight indeed. So if DK essentially say that people’s skill estimates are completely random, it merely is an implication that poorly skilled people overestimate. Even if that is mathematically obvious, it is still an important insight because it means that low skilled workers would not perceive a need to increase their skills. That is exactly what Socrates said: “I know that I don’t know, and therefore I know more than the citizens of Athens who don’t know that they don’t know.” And the latter got so upset that they wanted Socrates to die… Are DK having the same fate here? Criticizing the DK effect because it’s “autocorrelation” to me seems a bit like criticizing 1+1=2 because it’s obvious – DK’s work was published precisely because the editors did not feel that it’s obvious. Newton also is rightly revered as a great scientist even though gravity arguably states the obvious that babies empirically discover at age one… Psychology is not about winning the Nobel prize for mind boggling physics but about making the inner workings of the mind transparent so that we can better understand why people behave in certain ways – and how we might help them to make better choices for themselves -, and I still believe that the insight offered by DK is highly relevant and useful in that respect…

      • Hi Tobias,

        If you read the Nuhfer papers referenced in the article, you’ll see that the DK effect does emerge when skill and self-assessment are designed to scale equally. The only way it does not emerge is if there is no noise in the data. The more noise you add, the larger the DK effect. Why? Because the effect is noise.

        Criticizing the DK effect because it’s “autocorrelation” to me seems a bit like criticizing 1+1=2 because it’s obvious – DK’s work was published precisely because the editors did not feel that it’s obvious. Newton also is rightly revered as a great scientist even though gravity arguably states the obvious that babies empirically discover at age one

        I don’t understand your point. Are you saying the complexity of the human mind is as ‘obvious’ as simple math?

        Well, in this case, the DK effect is simple math, because the effect discovered by Dunning and Kruger has nothing to do with psychology.

        And regarding Newton, you’re just plain wrong. What Newton did was provide a mathematical framework for describing falling objects. And he was able to connect the projectile motion of objects on Earth to the motions of the planets in space. Yep … pretty ‘obvious’.

      • You are quick in telling others that they’re “plain wrong”! How does a child catch a ball? Because of gravity, the ball flies in a curve – the human brain actually approximates in split seconds movements that would take a considerable amount of time to predict using formulas. And regarding DK, you still don’t get my point: Nuhfer says that the effect IS noise. My read of DK is that there should not be ANY noise – so the fact that there is considerable noise (visible through considerable “autocorrelation”) IS the big news of the DK effect. Do you know the joke of a mathematician and a physicist observing one person enter a house and two people coming out? When asked what happened, the physicist says “obviously a measurement error.” Says the mathematician: “I don’t know about this, but I think if one more person would go into the house, it would be empty again…” Similarly, a statistician and a psychologist looking at the same phenomenon would actually perceive very different things as noteworthy. It should also be obvious that people often have diverging perceptions of an event but most people don’t allow the possibility that other people might perceive something very different from their own perception; there’s a whole industry of marriage counselors making a living because of that… 😉 I don’t mind what you think, we don’t need to go to marriage therapy, I just thought that I could be of service to point out that DK want to tell us something which is still relevant and important for our own benefit, namely that overconfidence in our own skills can and does harm us…

      • We’ll have to agree to disagree, largely because we don’t see eye-to-eye about what science does. If you cannot differentiate an ‘effect’ from statistical noise, there is no effect. That’s how science works.

  8. An intuitive explanation: There will be a random effect that some people test below their actual skill, and some above. Because of the direction of the error, someone who tests below actual skill but estimates that skill accurately will be more likely to be on the left half of the chart and reported as overconfident. Similarly, someone who tests better than actual skill is more likely to be on the right and reported as underconfident. The effect is driven by testing error.
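    A small simulation can make this explanation concrete. It is a sketch under invented assumptions: true skill is normally distributed, the test adds independent normal error, and every person's self-assessment equals their true skill exactly. All the numbers (means, spreads, sample size) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=2)      # arbitrary seed
n = 20_000                               # hypothetical sample size

skill = rng.normal(50, 10, n)            # true skill (never observed directly)
test = skill + rng.normal(0, 10, n)      # test score = skill + testing error
estimate = skill.copy()                  # self-assessment equals true skill exactly

# Group people by quartile of the observed test score, as a DK chart does
edges = np.quantile(test, [0.25, 0.5, 0.75])
quartile = np.digitize(test, edges)      # 0 = bottom quartile, 3 = top quartile

gap = estimate - test                    # apparent over- (+) or under- (-) confidence
gap_by_quartile = [gap[quartile == q].mean() for q in range(4)]

for q, g in enumerate(gap_by_quartile, start=1):
    print(f"Q{q}: mean (estimate - test) = {g:+.1f}")
```

    Even though nobody in this toy population misjudges their skill, the bottom test-score quartile shows a positive gap (apparent overconfidence) and the top quartile a negative one, purely because of testing error.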

  9. Hi,

    Thanks for the post, interesting read. So it seems as though DK’s analysis (or perhaps research approach altogether) is flawed. Let’s also assume Nuhfer’s analysis suggests that there is weak-to-no evidence for the DK effect in that particular context of reasoning. It is still possible that the DK effect could occur in other contexts, no? For instance, when it comes to things people interact with regularly, but which operate according to complex processes (e.g., a toilet), people have an illusion of explanatory depth (Fernbach et al., 2013). That is, the average person thinks they can explain how a toilet works better than they actually can. In contrast, experts (i.e., plumbers) are more accurate in predicting their ability to explain how a toilet works (can’t access the paper right now but I believe that is the general idea of the illusion of explanatory depth).

    I am curious: 1) are you okay with the idea that the DK effect occurs predictably in certain contexts? 2) how would you design your research to test the DK in such a context? 3) supposing you DID find support for the DK effect in your research design, and that I analyzed that data in the incorrect way that DK do, would my graph look much different from theirs? As a meager social scientist, I struggle with statistics so I am having trouble answering #3 : )

    • Hi Luke,

      Thanks for the interesting questions. My thoughts:

      1. are you okay with the idea that the DK effect occurs predictably in certain contexts?

      All I can say is that virtually every test of the DK effect uses the flawed method discussed here. So it doesn’t count as evidence.

      2. how would you design your research to test the DK in such a context?

      To test the effect in a valid way, you need to measure skill twice for each person. We know from standardized tests like IQ that individual test scores can vary quite a bit. (We shouldn’t kid ourselves that one test identifies ‘Skill’.) When you’ve got the two tests, you take one of them and measure the self-assessment error. Then you plot this error against the other test score. Finally, you look for an effect.

      Also, you need to look at raw test scores, not percentiles. (Using percentiles creates floor/ceiling effects.) And even better, you need to design the test so that it is unbounded. (IQ tests are like this. There is no upper possible score.)

      If you do all that, and you find that unskilled people over-estimate their ability in a way that is statistically significant, then you have evidence for the Dunning-Kruger effect.

      3. supposing you DID find support for the DK effect in your research design, and that I analyzed that data in the incorrect way that DK do, would my graph look much different from theirs?

      It really depends on the particular data. The hard truth is that a DK-type plot is an illegitimate way to look for a statistical effect. You should just never use it … full stop.
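      The two-test design described in the answer to question 2 can be illustrated with a toy simulation. This is a sketch, not an analysis of real data: it assumes a latent skill on an unbounded raw scale, two independent noisy tests of that skill, and a self-assessment that is unbiased (so no DK effect is built in). The names `test_a`, `test_b` and all numeric values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=3)        # arbitrary seed
n = 10_000                                 # hypothetical sample size

skill = rng.normal(100, 15, n)             # latent skill, raw (unbounded) scale
test_a = skill + rng.normal(0, 10, n)      # first measurement of skill
test_b = skill + rng.normal(0, 10, n)      # second, independent measurement
estimate = skill + rng.normal(0, 10, n)    # unbiased self-assessment

error = estimate - test_a                  # self-assessment error uses test A only

r_flawed = np.corrcoef(error, test_a)[0, 1]  # DK-style: error vs the same test
r_valid = np.corrcoef(error, test_b)[0, 1]   # error vs the independent test

print(f"error ~ test A (same measurement)        : r = {r_flawed:.2f}")
print(f"error ~ test B (independent measurement) : r = {r_valid:.2f}")
```

      With these assumptions, comparing the error to the same test it was computed from gives a clearly negative correlation, while comparing it to the second test gives a correlation near zero, which is the correct answer for a population with no bias.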

  10. “To my knowledge, Edward Nuhfer and colleagues were the first to exhaustively debunk the Dunning-Kruger effect. (See their joint papers in 2016 and 2017.)”

    Here are two papers (by psychologists) that raise similar critiques, and that predate the papers you cite:

    Krueger, J., & Mueller, R. A. (2002). Unskilled, unaware, or both? The better-than-average heuristic and statistical regression predict errors in estimates of own performance. Journal of personality and social psychology.
    Burson, K. A., Larrick, R. P., & Klayman, J. (2006). Skilled or unskilled, but still unaware of it: how perceptions of difficulty drive miscalibration in relative comparisons. Journal of personality and social psychology.

    Also I don’t think the random noise/mean reversion critique can explain the totality of Dunning and Kruger’s data. If I recall correctly, they also show that low skill participants are both more miscalibrated and display worse judgment discrimination (i.e., within-item prediction-outcome correlations). Only the miscalibration part can be explained purely by random noise/mean reversion.

    Also this paper conducts a large scale replication that is able to directly test the DK effect against a random error model, and they find empirical support for the former:

    Jansen, R.A., Rafferty, A.N. & Griffiths, T.L. A rational model of the Dunning–Kruger effect supports insensitivity to evidence in low performers. Nature Human Behaviour 5, 756–763 (2021). https://doi.org/10.1038/s41562-021-01057-0

    • Hi David,

      Thanks for these papers. Yes, the idea that ‘regression towards the mean’ creates the Dunning-Kruger effect is an old idea. However, what I think is unique (and new) in the Nuhfer papers is the argument about statistical dependence: the fact that it is statistically illegitimate to compare self-assessment error to test score, because the two are statistically dependent.

      The only solution, pointed out by Nuhfer, is to measure ability twice, and use one of the measures to calculate self-assessment error.

      Looking at the Jansen paper, I see a few problems. First, they are still using the DK chart convention. Let’s not mince words — this convention is an illegitimate way to study an effect. It’s got to go.

      Second, Jansen appears to be unaware of the Nuhfer critique (they don’t cite Nuhfer). And that shows in the methods. They make no attempt to separate the measurement of skill from the measurement of self-assessment error. So it’s just the same old problem.

      I’m not going to believe any evidence unless it measures skill independently from self-assessment error.

  11. I thought autocorrelation referred to the correlation of a random process with a shifted or delayed copy of itself.

    In addition, it looks like your artificial example does not really display a DK effect, because the test taker’s score estimate does not get more accurate for higher-scoring individuals. Instead, because you have chosen uniform noise for the score estimates, your graph of Y-vs-X shows Y as a roughly horizontal line, NOT something that begins to track f(x)=x as you move to the right.

    • Hi Scott,

      Technically ‘autocorrelation’ just means ‘self correlation’, so any instance of a variable correlating with itself. But you are correct that the most popular application of autocorrelation is in time series analysis.

      That’s probably because it’s applicable to the stock market. If you can find that a stock price time series correlates with a delayed copy of itself, you’ve essentially found a pattern that might happen again in the future (i.e. something you can bet on).

      The general version of autocorrelation that I review here is less discussed, mostly because it is a trivial error that you want to avoid.

      To your other point, you are correct that the uniform noise doesn’t exactly reproduce the original DK effect. To do that, you can assume a null hypothesis where assessment scales with test score, but then add various degrees of noise. In general, the more noise you add, the larger the apparent DK effect. See the Nuhfer papers for details.
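      Here is a rough sketch of that noise argument. It is not Nuhfer's code; it assumes that assessment equals test score plus normally distributed noise, and that both are then converted to percentile ranks, as in a DK-style chart. The helper names `percentile_rank` and `bottom_quartile_gap`, the noise levels, and the sample size are all made up for illustration.

```python
import numpy as np

def percentile_rank(values):
    """Convert raw values to percentile ranks between 0 and 100."""
    ranks = values.argsort().argsort()
    return 100 * ranks / (len(values) - 1)

def bottom_quartile_gap(noise_sd, n=20_000, seed=4):
    """Mean (assessment percentile - score percentile) in the bottom score quartile."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0, 100, n)              # test score
    y = x + rng.normal(0, noise_sd, n)      # null: assessment tracks score, plus noise
    x_pct, y_pct = percentile_rank(x), percentile_rank(y)
    bottom = x_pct < 25                     # lowest-scoring quarter of people
    return (y_pct[bottom] - x_pct[bottom]).mean()

gaps = {sd: bottom_quartile_gap(sd) for sd in (0, 10, 30, 90)}
for sd, gap in gaps.items():
    print(f"noise sd = {sd:>2}: bottom-quartile gap = {gap:+.1f} percentile points")
```

      With zero noise the gap is exactly zero. As the noise grows, the lowest scorers appear to over-estimate themselves by more and more percentile points, even though the model contains no bias at any skill level.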

  12. So yeah, the autocorrelation in D-K is pretty clear.

    But it seems like the failure to replicate the D-K effect in Edward Nuhfer’s study cited here, if I’m understanding it, is not exactly persuasive. I’m out of my area of expertise here so tell me if I’m wrong, but it just seems like level of educational attainment is a rather poor proxy for skill level. A lot of freshmen might have had varying levels of preparation for scientific literacy in primary school, such that some of them might be better prepared than even some upperclassmen and some of them might be novices. And everyone knows about the slacker upperclassmen who just take a test to get a C and then quickly forget everything they learn, and so don’t really have the relevant skills. There’s also the psychological fact that a lot of university students studying a subject are particularly likely to suffer from imposter syndrome, or generally be more aware of gaps in their knowledge than average, and so might tend to estimate their skill level differently relative to the general population.

  15. You “replicate” the Dunning-Kruger effect with two variables x and y that are completely unrelated. That’s the point: Dunning-Kruger shows that there is a little relation between the self-assessment of individuals and their actual skills. That is what produces a “tendency for unskilled people to overestimate their competence”.

    That is, a strong positive correlation between estimated and actual skills would be a null result for Dunning-Kruger (i.e. people assess their skills correctly); a zero or negative correlation between estimated and actual skills is a positive result for Dunning-Kruger.

    • I think that is a misreading of Dunning and Kruger’s claim. It’s a tall order to claim that a model which is based on uniform random numbers delivers an ‘effect’. If it does, then the concept of an ‘effect’ loses most of its meaning.

  16. Why do you think researchers loved this D-K paper so much? Is it because it seemed to confirm what people believe anyway?

    • Think about it – ‘researchers’, and the people who read the articles about the asserted effect – see themselves on the clear-thinking and virtuous ‘if anything, underestimate their own abilities’ side of the graph. It can be phrased as ‘when you’re first learning, you overestimate yourself,’ but the opportunity to look down your nose at others is just too sweet to read it that way. I’m guessing that most of the people who nod their heads after reading an article on D-K would connect it with the Darwin Award – oh, those stupid people. It’s called motivated reasoning.

  17. Hi Blair!

    Thanks for doing such an enlightening post and for joining in the still limited written evidence on the internet that addresses the systematic problems in DK analyses. I’ve been quite excited by the growing trend of increasing numbers of independent folks getting to discuss historical problems in data manipulation and premature understanding of statistics in the social sciences.

    Being a social scientist, I have always been somewhat bothered by the propensity to torture the data with positive/magical thinking (or pressure?) among many of my field mates. That is, until they manage to find some traces of effect that would benefit their initial assumptions.

    I believe, to counterbalance the negative cultural weight that humanities are used to carry in academic circles in terms of seriousness in paradigms and methodological quality.

    Although I don’t want to discuss statistics here, from a methodological point of view two things catch my attention in the original article. And when people in the circle have historically discussed the findings without looking at the evidence.

    1) Intentions. While it is increasingly accepted that trends are inherently contextual and that it is very difficult to establish general laws in the field of human behaviour (which is always framed), the flop of DK analysis further demonstrates the fear social scientists have of assuming that their analyses will always be conditional on context. We can hardly assume general laws through a specific dataset, measured through one variable. And then we ask ourselves why choose the self-assessment of knowledge and no other variable, if not other less biased variables. Sobriety and honesty about the design of experiments is much more interesting for science than academic narcissism. The DKs effect smells of pars pro toto, in addition to being filled with floor and ceiling effects.

    2) More mathematics, less modelling. Before proceeding with the analysis, we must be very careful in order to better understand the nature of the variables, the scales they mobilise and to what extent the quantification of human phenomena is risky. It requires more thought about how scales and structures are defined. That modelling attempts with a high risk of containing autocorrelation, spurious effects and mediating variables need to be addressed more effectively, as an alternative to getting carried away in the process of data collection.

    I am particularly concerned about the willingness of some areas to carry out analyses with categorical thresholds, orders and aggregations of quantities. But it is all the more difficult to discuss as we deal with the pride of researchers, who probably profit from the statistical and “quasi-experimental” camouflage of their publications. Tell a social scientist that their assumptions about the data have some bias, and they shall come around to entangle truisms about the philosophy of science.

    Lastly, thank you very much for writing this and for sharing other texts that do the same. From now on, I intend to use this example more and more in lectures, classes and reports, trying to raise awareness against methodological wishful thinking in research. Perhaps I can even translate that article into Spanish and get the message out. Cheers!

    • Hi L.S.

      What a thoughtful response. Yes, you are welcome to translate this piece. All of my writing is licensed in the Creative Commons.

  18. Thank you very much for your article.

    It would be interesting to see a graph with the original data plotting the test scores and the self-assessment.

  19. It seems like an obvious fix is to divide the test data X into 2 sets (even/odd test questions, e.g.). Then the even questions are used to compute the Xodd vs Xeven tautology curve, and the odd questions are used to plot the Xodd vs Y curve of actual score vs expected score. In the case of random test answers, the Xeven vs Xodd curve becomes flat (even questions are uncorrelated with odd questions), and the spurious correlation between X and (Y-X) goes away, leaving a null DK effect.

    In the case of a real test measuring genuine ability, the odd and even answers should be correlated, the Xeven vs Xodd curve becomes slanted again, and the graph measures a true DK effect.
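    One way to try out this split-half idea is with simulated answer sheets. The sketch below makes several assumptions that are not in the comment: a 40-question test, self-assessments drawn at random in both cases, and the error computed from the even half and compared against the odd half. The function `split_half` and every parameter value are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=5)    # arbitrary seed
n_people, n_questions = 5_000, 40      # hypothetical test size

def split_half(p_correct):
    """Simulate answers; return percent scores on odd and even questions."""
    answers = rng.random((n_people, n_questions)) < p_correct[:, None]
    x_odd = 100 * answers[:, 0::2].mean(axis=1)    # questions 1, 3, 5, ...
    x_even = 100 * answers[:, 1::2].mean(axis=1)   # questions 2, 4, 6, ...
    return x_odd, x_even

cases = {
    # everyone has the same 50% chance on every question
    "random answers": np.full(n_people, 0.5),
    # each person has their own chance of answering correctly
    "genuine ability": rng.uniform(0.2, 0.9, n_people),
}

results = {}
for name, p in cases.items():
    x_odd, x_even = split_half(p)
    y = rng.uniform(0, 100, n_people)      # self-assessment unrelated to performance
    error = y - x_even                     # error computed from the even half only
    results[name] = (
        np.corrcoef(x_odd, x_even)[0, 1],  # do the two halves agree?
        np.corrcoef(error, x_even)[0, 1],  # same-half comparison (DK-style)
        np.corrcoef(error, x_odd)[0, 1],   # cross-half comparison
    )

for name, (halves, same, cross) in results.items():
    print(f"{name}: halves r = {halves:.2f}, same-half r = {same:.2f}, cross-half r = {cross:.2f}")
```

    With random answers, the two halves are uncorrelated and the cross-half correlation sits near zero, while the same-half comparison still shows a spurious negative correlation. With genuine ability and self-assessments that ignore it, the halves agree and a negative cross-half correlation remains, so the split no longer reports an artifact of reusing the same numbers.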
