Can language models really understand their own understanding?
The article "Language Models Fail to Introspect About Their Knowledge of Language" by Siyuan Song, Jennifer Hu, and Kyle Mahowald critically investigates whether large language models (LLMs) possess the ability to introspect—that is, to access and report on their own internal linguistic knowledge. This concept, long debated in both cognitive science and AI, has recently gained traction with some researchers suggesting LLMs can demonstrate self-awareness by answering questions about their own outputs. Such a capacity, if confirmed, would have significant implications for how we evaluate models' grammatical understanding and their safety in real-world deployment.
To rigorously test this idea, the authors evaluate 21 open-source LLMs across two core tasks: grammaticality judgments and word prediction. These are both domains where the model’s “true” knowledge can be assessed via direct measurements of string probabilities. In both tasks, the authors compare the model’s metalinguistic responses (i.e., prompted judgments) to its internal probabilistic preferences. If a model can introspect, its prompted answers should closely align with its own string probability measurements—and more so than with those of any other model, even highly similar ones.
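To make the two kinds of measurement concrete, here is a minimal sketch using the Hugging Face transformers library. The model name, minimal pair, and prompt wording are illustrative assumptions, not the paper's exact materials or procedure.

```python
# Sketch of "direct" vs. "metalinguistic" measurement for a grammaticality minimal pair.
# Model, sentences, and prompt are placeholders; the paper evaluates 21 open-source LLMs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for any open-source causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def string_logprob(text: str) -> float:
    """Direct measurement: total log-probability the model assigns to a string."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean negative log-likelihood over the predicted tokens
    return -out.loss.item() * (ids.shape[1] - 1)

grammatical = "The keys to the cabinet are on the table."
ungrammatical = "The keys to the cabinet is on the table."
direct_prefers_good = string_logprob(grammatical) > string_logprob(ungrammatical)

# Metalinguistic measurement: ask the model to judge the same pair in a prompt,
# then compare the probabilities it assigns to the two answer options.
prompt = (
    "Which sentence is grammatically acceptable?\n"
    f"1. {grammatical}\n"
    f"2. {ungrammatical}\n"
    "Answer:"
)
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]
id_1 = tokenizer(" 1", add_special_tokens=False).input_ids[0]
id_2 = tokenizer(" 2", add_special_tokens=False).input_ids[0]
prompted_prefers_good = (next_token_logits[id_1] > next_token_logits[id_2]).item()

# Introspection would predict that these two preferences agree item by item.
print("direct:", direct_prefers_good, "| prompted:", prompted_prefers_good)
```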
The authors report no compelling evidence of introspection in the LLMs tested. Instead, they find that models that are inherently more similar to one another (for example, differing only in their random seed at initialization) show stronger correlations between their metalinguistically and directly measured behaviors, with no evidence of "privileged" self-access when the model answering the prompt is the same model whose probabilities are being measured.
What the authors found was striking. Although LLMs performed well on both the grammaticality judgment and word prediction tasks when evaluated by standard metrics, they failed to demonstrate what the authors define as "introspection." There was no evidence that models have privileged access to their own internal states; rather, the alignment between metalinguistic and direct responses could be fully explained by general model similarity. In fact, the study shows that a model's metalinguistic responses correlate just as strongly with the direct measurements of a highly similar model as with its own.
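As a toy illustration of that comparison (using synthetic placeholder scores, not the paper's data), the sketch below contrasts a model's "self" alignment with its alignment to a highly similar model; privileged self-access would require the first correlation to systematically exceed the second.

```python
# Toy illustration of the self- vs. cross-model alignment test (synthetic data only).
import numpy as np

rng = np.random.default_rng(0)
n_items = 200

# Hypothetical per-item scores, e.g. log-probability differences on minimal pairs.
direct_A = rng.normal(size=n_items)                        # model A, measured directly
direct_B = direct_A + rng.normal(scale=0.3, size=n_items)  # a highly similar model B
meta_A = direct_A + rng.normal(scale=0.5, size=n_items)    # model A, via prompting

# Privileged self-access predicts corr(meta_A, direct_A) > corr(meta_A, direct_B)
# beyond what general model similarity explains; the paper finds no such advantage.
self_alignment = np.corrcoef(meta_A, direct_A)[0, 1]
cross_alignment = np.corrcoef(meta_A, direct_B)[0, 1]
print(f"self-alignment: {self_alignment:.2f}  cross-model alignment: {cross_alignment:.2f}")
```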
This research challenges previous claims that LLMs are capable of self-awareness or introspection. By providing a robust and controlled framework, the authors demonstrate that metalinguistic prompting can elicit genuine information about a model's linguistic knowledge, but it does not constitute introspective access. This distinction matters, especially for those evaluating linguistic capacities or ethical boundaries for AI systems. The authors caution against overinterpreting prompted responses as signs of deeper cognitive capabilities, and they call for more rigor before attributing introspective or metacognitive traits to LLMs.