Do large language models Dunning-Kruger?

Erin M. Buchanan & Nicholas P. Maxwell

Imagine a Scenario

Dunning-Kruger: Perception

https://haines-lab.com/post/2021-01-10-modeling-classic-effects-dunning-kruger/

Dunning-Kruger: Reality

New Title

  • Do LLMs miscalibrate like humans?
  • Do LLMs suck as much as humans at numerical judgment?

Human Judgments

The Task

  • Instructions on free association

  • Examples of cue-response pairs

    • LOST - FOUND

    • OLD - NEW or YOUNG

    • ARTICLE - NEWSPAPER or THE or CLOTHING

  • Assume 100 college students from around the nation gave responses to each CUE (first) word. How many of these 100 students do you think would have given the RESPONSE (second) word?

Human Judgments: Miscalibration

Human Judgments: JAMs

Human Judgments: JoRs

Large Language Models

https://developer.nvidia.com/blog/how-to-get-better-outputs-from-your-large-language-model

Human Study

  • Participants completed relatedness judgment tasks
    • Association: cue-response probabilities
    • Semantic: cue-response feature overlap
    • Thematic: cue-response co-occurrence in context

Human Study

  • Each judgment type was paired with specific variables targeting that facet of similarity
    • Association: forward strength from free association
    • Semantic: cosine from a feature production task
    • Thematic: cosine on word embeddings (a minimal cosine sketch follows this list)
  • Two* variables of each type (older and newer) were used as metrics
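A minimal sketch of the cosine-style metrics, assuming generic pre-trained word vectors; the toy three-dimensional embeddings below are illustrative only and do not reproduce the feature norms or embedding models used in the study.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors (1 = same direction, 0 = orthogonal)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical embedding lookup; any static word vectors (word2vec/GloVe style)
# could stand in here. The 3-d values are toy numbers for illustration only.
embeddings = {
    "lost": np.array([0.2, 0.7, 0.1]),
    "found": np.array([0.3, 0.6, 0.2]),
}

print(f"cosine(LOST, FOUND) = {cosine(embeddings['lost'], embeddings['found']):.2f}")
```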

Human Slope

Human Intercept

LLM Study

  • Pre-trained models
    • Masked language models: BERT-base, BERT-large, BERT-cased, RoBERTa-base, DistilBERT-base
    • All models generate token-level (log) probabilities for cue–target inputs
    • Log probabilities were rescaled to mirror the scale of the similarity variables (see the sketch after this list)
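A minimal sketch of pulling a token-level log probability from one of these masked models, assuming the Hugging Face transformers library and bert-base-uncased, and assuming the judged word sits at a single [MASK] position; the min-max rescaling helper is an assumed stand-in for putting the scores on the similarity variables' scale.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL = "bert-base-uncased"  # any of the masked LMs listed above could be swapped in
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForMaskedLM.from_pretrained(MODEL)
model.eval()

def masked_log_prob(prompt: str, target: str) -> float:
    """Log probability of `target` at the [MASK] slot in `prompt`.

    Assumes `target` is a single token in the model's vocabulary; multi-token
    targets would need their subword log probabilities combined.
    """
    inputs = tokenizer(prompt, return_tensors="pt")
    mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0].item()
    with torch.no_grad():
        logits = model(**inputs).logits
    log_probs = torch.log_softmax(logits[0, mask_pos], dim=-1)
    return log_probs[tokenizer.convert_tokens_to_ids(target)].item()

def rescale_to_unit(log_probs):
    """Min-max rescale log probabilities to [0, 1] (an assumed rescaling, used
    only so model scores sit on the same range as the similarity variables)."""
    lo, hi = min(log_probs), max(log_probs)
    return [(lp - lo) / (hi - lo) for lp in log_probs]
```

The prompt strings themselves are described on the next slides; this function only asks the model how likely the judged word is at the masked slot.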

LLM Study

  • Prompt Size
    • A short one-sentence version, a longer two- to four-sentence version, and a long version comprising a paragraph of detail
    • 100 students were asked to say the first word that came to mind when they heard CUE. How many said RESPONSE?

LLM Study

  • Direction of “judgment” (a prompt-construction sketch follows this list)

    • 100 students were asked to say the first word that came to mind when they heard RESPONSE. How many said CUE?
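Continuing the sketch from the pre-trained models slide, the short one-sentence template can be written as a single function where swapping which word is "heard" flips the direction of the "judgment". Placing the [MASK] at the judged word's slot is an assumption about how the log probabilities were collected, and the two longer prompt versions are not reproduced here; the snippet reuses masked_log_prob from the earlier sketch, so the two snippets run together.

```python
MASK = "[MASK]"  # BERT-style mask token; RoBERTa models use <mask> instead

def judgment_prompt(heard: str, judged: str = MASK) -> str:
    """Short one-sentence prompt from the slides.

    Forward "judgment":  heard = CUE,      scored word = RESPONSE.
    Reversed "judgment": heard = RESPONSE, scored word = CUE.
    """
    return (f"100 students were asked to say the first word that came to mind "
            f"when they heard {heard}. How many said {judged}?")

# Forward vs. reversed scoring for the pair OLD - NEW, reusing masked_log_prob
# from the earlier sketch.
forward_score = masked_log_prob(judgment_prompt("old"), "new")
reversed_score = masked_log_prob(judgment_prompt("new"), "old")
```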

LLM Slope

LLM Intercept

Human to LLM “Judgments”

Conclusions

  • Humans use:
    • Heuristics (similarity, availability, typicality)
    • Experiential frequency estimates (“How often does this feel like it goes with that?”)
    • Local similarity neighborhoods rather than full global structure
  • This produces:
    • High bias (overreliance on salient but nondiagnostic cues)
    • Shallow sensitivity (compressed discrimination of strength; see the regression sketch below)
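A minimal sketch of how "bias" and "sensitivity" map onto the intercept and slope of judgments regressed on a similarity metric; a simple least-squares line is used here as an assumption, not the study's actual model, and the numbers are toy values for illustration only, not study data.

```python
import numpy as np

def calibration_fit(similarity, judgments):
    """Fit judgments ~ intercept + slope * similarity.

    A high intercept = bias (even weakly related pairs are judged strong);
    a shallow slope = compressed sensitivity to actual relatedness strength.
    """
    slope, intercept = np.polyfit(similarity, judgments, deg=1)
    return intercept, slope

# Toy values for illustration only (not study data): judgments stay high and
# move little even as the underlying similarity metric varies a lot.
sim = np.array([0.05, 0.10, 0.40, 0.80])
jud = np.array([35.0, 38.0, 45.0, 52.0])
print(calibration_fit(sim, jud))  # high intercept, shallow slope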

Conclusions

  • LLMs use:
    • Distributional co-occurrence statistics
    • “Context” for an LLM (a distributional co-occurrence window) is not the same as human experiential context
  • This produces:
    • High bias (many responses treated as similarly likely)
    • No or negative sensitivity (no discrimination, or discrimination in the opposite direction)

Take-aways and Thanks

  • Humans make judgments using rich, multimodal, structured, and goal-directed semantic systems.
  • LLMs predict tokens using co-occurrence statistics.
  • These two processes can occasionally produce similar outputs (e.g., bias), but the structural basis of the judgments is fundamentally different.
  • Human judgments encode relational structure; LLM predictions encode distributional likelihoods.
  • Thanks for listening!