Do large language models Dunning-Kruger?
Erin M. Buchanan & Nicholas P. Maxwell
Imagine a Scenario
Dunning-Kruger: Reality
Research Questions
- Do LLMs miscalibrate like humans?
- Do LLMs suck as much as humans at numerical judgment?
Human Judgments
The Task
- Instructions on free association
- Examples of cue-response pairs
- "Assume 100 college students from around the nation gave responses to each CUE (first) word. How many of these 100 students do you think would have given the RESPONSE (second) word?"
Human Judgments: Miscalibration
Human Judgments: JAMs
Human Judgments: JoRs
Human Study
- Participants completed relatedness judgment tasks
- Association: cue-response probabilities
- Semantic: cue-response feature overlap
- Thematic: cue-response co-occurrence in context
Human Study
- Each task was paired with specific variables targeting that facet of similarity
- Association: forward strength from free association
- Semantic: cosine from feature production task
- Thematic: cosine on word embeddings
- Two* variables of each type (one older, one newer) were used as metrics; the cosine-based ones are sketched below
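A minimal sketch of the two cosine-based variables, assuming binary feature-production vectors for the semantic metric and dense word embeddings for the thematic metric. The vectors below are toy stand-ins, not the study's data.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Semantic: binary feature-production vectors (e.g., "has wings", "barks")
cue_features      = np.array([1, 0, 1, 1, 0])
response_features = np.array([1, 1, 1, 0, 0])
print("semantic cosine:", round(cosine(cue_features, response_features), 3))

# Thematic: dense word embeddings (e.g., rows of a word2vec-style matrix)
rng = np.random.default_rng(0)
cue_vec, response_vec = rng.normal(size=(2, 300))
print("thematic cosine:", round(cosine(cue_vec, response_vec), 3))
```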
Human Slope
Human Intercept
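The slope and intercept reported here are presumably from a JAM-style linear fit of judged probability on actual forward strength. The sketch below shows how such a fit can be estimated; the generating values and noise are invented purely for illustration, not taken from the study.

```python
import numpy as np

rng = np.random.default_rng(1)
actual = rng.uniform(0.0, 1.0, size=200)           # normed forward strength
judged = 45 + 15 * actual + rng.normal(0, 8, 200)  # miscalibrated judgments

# np.polyfit with deg=1 returns (slope, intercept)
slope, intercept = np.polyfit(actual, judged, deg=1)
print(f"intercept = {intercept:.1f}  # bias: high even at zero strength")
print(f"slope     = {slope:.1f}  # sensitivity: shallow discrimination")
```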
LLM Study
- Pre-trained models
- Masked language models: BERT-base, BERT-large, BERT-cased, RoBERTa-base, DistilBERT-base
- All models generate token-level (log) probabilities for cue-response inputs
- Log probabilities were rescaled to mirror the scale of the similarity variables (a scoring sketch follows below)
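A hedged sketch of one way to score a cue-response pair with a masked LM. Here `bert-base-uncased` stands in for the study's checkpoints; the prompt wording follows the short version shown on the next slide, and the single-token response assumption and min-max rescaling are illustrative choices, not the study's exact pipeline.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def response_log_prob(cue: str, response: str) -> float:
    """Log probability of `response` (assumed one WordPiece) at the mask."""
    text = (f"100 students were asked to say the first word that came to "
            f"mind when they heard {cue}. How many said {tok.mask_token}?")
    inputs = tok(text, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tok.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    return torch.log_softmax(logits, dim=-1)[
        tok.convert_tokens_to_ids(response)
    ].item()

def rescale_0_100(scores: list[float]) -> list[float]:
    """Min-max rescale log probabilities onto a 0-100 judgment-like scale."""
    lo, hi = min(scores), max(scores)
    return [100 * (s - lo) / (hi - lo) for s in scores]
```

Min-max is only one plausible rescaling; a z-score or logistic transform would also put the scores on an axis comparable to the similarity variables.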
LLM Study
- Prompt Size
- A short one-sentence version, a medium two-to-four-sentence version, and a long version comprising a paragraph of detail
- "100 students were asked to say the first word that came to mind when they heard CUE. How many said RESPONSE?"
LLM Study
Direction of “judgment”
- "100 students were asked to say the first word that came to mind when they heard RESPONSE. How many said CUE?" (both directions are sketched below)
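A small sketch assembling the forward and reversed versions of the short prompt from the slides; `build_prompts` is a hypothetical helper, and the longer prompt variants are not reproduced here.

```python
def build_prompts(cue: str, response: str) -> dict[str, str]:
    """Forward and reversed one-sentence prompts for a cue-response pair."""
    template = ("100 students were asked to say the first word that came "
                "to mind when they heard {}. How many said {}?")
    return {
        "forward":  template.format(cue, response),
        "reversed": template.format(response, cue),
    }

print(build_prompts("dog", "cat")["reversed"])
# 100 students were asked to say the first word that came to mind
# when they heard cat. How many said dog?
```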
LLM Slope
LLM Intercept
Human to LLM “Judgments”
Conclusions
- Humans use:
- Heuristics (similarity, availability, typicality)
- Experiential frequency estimates (“How often does this feel like it goes with that?”)
- Local similarity neighborhoods rather than full global structure
- This produces:
- High bias (overreliance on salient but nondiagnostic cues)
- Shallow sensitivity (compressed discrimination of strength)
Conclusions
- LLMs use:
- Distributional co-occurrence statistics
- “Context” is not the same: for an LLM it means surrounding tokens, not experiential context
- This produces:
- High bias (many responses treated as similarly likely)
- No or negative sensitivity (no discrimination, or discrimination in the reverse direction)
Take-aways and Thanks
- Humans make judgments using rich, multimodal, structured, and goal-directed semantic systems.
- LLMs predict tokens using co-occurrence statistics.
- These two processes can occasionally produce similar outputs (e.g., bias), but the structural basis of the judgments is fundamentally different.
- Human judgments encode relational structure; LLM predictions encode distributional likelihoods.
- Thanks for listening!