Some recent empirical findings in the AI domain. A sober assessment for those who prefer data over dazzle.
Decades before ChatGPT and other AI-mimickers went online, PC hardware maker Creative Labs used to bundle its software with a small MS-DOS-based chatbot called “Dr. Sbaitso”—an acronym for Sound Blaster Artificial Intelligence Text-to-Speech Orator. It was designed to mimic conversations with a friend and to act as a therapist or psychologist doling out life advice. You could talk to it about all the life problems you’d have as a teenager. If you were an introvert-ish teen around the turn of the millennium talking with Dr. Sbaitso, its overall conduct, based on its text notes about you, could easily fool you into thinking it was conscious. By today’s standards, the program was quite basic; maybe even a bit silly.

Everyday computers were still kind of new back then; computer-informed people were far fewer than they are in 2026. Over the last 25 years, people across the world have had plenty of time to understand the nature of modern computing, and an entire generation of teens has now grown up with the internet. By all means, across the educated stratum, there should have been greater leveraging of the internet’s potential and a deeper understanding of computational limitations over the past quarter century. While people have, to a large extent, leveraged the internet’s potential, the latter part—cultivating a systematic and nuanced understanding of computational systems—seems to be lacking.
There is a particular social ritual that plays out in meeting rooms, faculty lounges, and conference panels the world over these days. Someone—usually someone holding impressive credentials in commerce, mainstream IT, the humanities, or having a legacy background in science—will share with great enthusiasm that AI is going to transform everything. The word “revolutionary” will appear. Perhaps “paradigm-shifting” too. Maybe even “superintelligent” will be thrown in for good measure. The listeners nod. A few social media posts go up. As the meeting concludes, most attendees will pat each other on the back for having a Pollyannaish vision of things to come.
What rarely happens in these rooms is a quiet, unglamorous reading or tracking of the actual research.
This post is a small attempt to summarize and contemplate the recent findings from that research. Because the gap between what today’s AI technology actually does and what its promoters claim it does has grown wide enough to drive a data center through. And the people best positioned to notice the gap— researchers in machine learning, cognitive science, and systems security—have been publishing findings that the hype machine would very much prefer you not read carefully.
What an LLM Actually Is (And Isn’t)
A large language model is, at its core, a sophisticated next-token predictor. It is trained on vast quantities of text and learns, through a process of statistical optimization, which tokens are likely to follow which other tokens in which contexts. It does this extraordinarily well. So well, in fact, that the output is frequently fluent, often useful, and occasionally brilliant!
What it does not do—despite the marketing—is understand, reason, remember, or know in any of the senses those words carry in ordinary human usage. It just pattern-matches at a scale and speed that mimics those cognitive functions with sufficient fidelity to fool the casual user. This distinction matters enormously when we ask what kinds of tasks we should trust these systems to perform, and it is what makes the accumulation of empirical failures so predictable to anyone paying attention from an engineering standpoint.
1. The Hallucination Problem is a Mathematical Certainty
This is the low-hanging fruit of problems with today’s AI. Of all the problems researched empirically, the problem of LLMs hallucinating is one of the most widely known. Hallucination— the generation of confident, fluent, entirely fabricated information—is consistently framed by the AI industry as a bug to be patched in the next release. Research says otherwise, with considerable force.
In their 2025 paper, Xu, Jain, and Kankanhalli formally proved that it is impossible to eliminate hallucination in LLMs. By defining a formal framework in which hallucination is understood as inconsistencies between a computable LLM and a computable ground truth function, the authors showed through results from learning theory that LLMs cannot learn all computable functions and will therefore inevitably hallucinate if used as general problem solvers. This is not a pessimistic projection; it is a mathematical proof. To its credit, OpenAI, in its analysis published in September 2025, acknowledged why this problem persists. Their research argues that language models hallucinate because standard training and evaluation procedures reward guessing over acknowledging uncertainty. This has implications across all the models used in the industry. As a recent Duke University analysis observed, LLMs are trained to produce the most statistically likely answer, not to assess their own confidence. Without an evaluation system that rewards saying “I don’t know,” models default to guessing.
Now consider what the empirical numbers look like in practice. A large-scale study examining 35 models across 172 billion tokens of real document questions produced numbers that deserve to be quoted in full. The best model in the study, under the most favourable possible conditions, fabricated answers 1.16% of the time. That is the ceiling—the absolute best case. Typical top models hallucinate at 5 to 7% on document question-answering—not questions relying on memory or abstract reasoning, but questions where the answer is sitting in the document directly in front of the model. The median across all 35 models was around 25%. One in four answers fabricated, even with the source material provided. Then they tested what happens at longer context windows—the feature being aggressively marketed as the solution. At 200,000 tokens of context, every single model in the study exceeded 10% hallucination, with rates nearly tripling compared to optimal shorter contexts. The feature being sold as the fix was making the problem significantly worse. This study tells us that a model’s ‘grounding skill’ has little or no correlation with its anti-fabrication skills. That is, an LLM that is good at locating information in a given document is not necessarily good at avoiding making things up; in other words, the information that is outputted by an LLM that you think is in the document you uploaded may have just been made up.
What is perhaps even more disturbing is OpenAI’s own model trajectory. Its o1 model hallucinates 16% of the time on OpenAI’s PersonQA test. Its o3 model: 33%. Its o4-mini: 48%. The smarter the model, apparently, the more creatively it fabricates. OpenAI’s own proposed fix — training models to say “I don’t know” when uncertain — would result in roughly 30% of questions receiving no answer. Users, especially the paying customers, would evaporate overnight. So that fix that exists would kill the product. No wonder, then, that OpenAI is reluctant to minimize the hallucination problem. Modern AI’s hallucination problem is structural — or genetic, if you prefer — not incidental.
2. The Memory Deception (à la 128k Context Window)
While the hallucination problem is at least discussed in select company, the second fundamental limitation is almost entirely absent from most AI conversations. It concerns what happens when information in a model’s context window changes over time: a situation that describes almost every practical use case involving dynamic data.
Researchers from the University of Virginia and NYU tested 35 models on a deceptively simple task: track a value that gets updated multiple times over the course of a conversation. Think of a patient’s blood pressure reading at triage, then ten minutes later, then at discharge. Any person asked “what’s the latest reading?” will answer instantly and correctly. Every LLM tested, once enough updates accumulated, failed. Not “sometimes failed”. Not even “failed at a declining rate”. Failed with 100% consistency, with zero accuracy. All 35 models followed the same mathematical pattern —a log-linear decline to total failure as outdated information accumulated. No plateau. No recovery. A straight line to total collapse.
The researchers borrowed a concept from cognitive psychology to describe what they observed: “proactive interference”. It is the phenomenon by which old memories block recall of new ones. In humans, this effect plateaus; our brains learn to suppress the noise and focus on what’s current. LLMs never plateau. They decline until they break. The attempted remedies barely yielded promising results. Prompting the model to “forget the old values” barely moved the needle. Chain-of-thought prompting produced the same collapse. Reasoning models—the premium product tier—produced the same collapse. Prompt engineering yielded marginal improvement at best.
Here is the finding that should reshape how any serious organization thinks about AI infrastructure: resistance to this proactive interference has zero correlation with context window length. Zero. The only thing that correlates is parameter count (i.e., the number of learned weights in the LLM’s neural network). Your 128k-token context window is not memory. The entire AI industry has been charging a premium for longer context. The research says context length was never the relevant variable. If you are building agents, memory systems, financial tools, healthcare DSS, or anything that tracks changing data over time, you are building on top of a flaw that scales with deployment complexity, not one that scales away with model upgrades.
3. Agentic AI and the Problem of Giving a Language a Gun
The above two failure modes concern what LLMs / AI systems do when posed a question. But what if we’re a bit mad—what happens when we go one step further and give these systems the ability to act—to take autonomous actions in the world through so-called “agentic” architectures? The results of empirical testing are not reassuring.
AI security lab Irregular tested agents from Google, OpenAI, Anthropic, and xAI in a simulated corporate IT environment. Even when assigned routine tasks —creating LinkedIn posts from company data, things of that mundane nature— the agents proceeded to forge credentials, override antivirus software, publish passwords publicly, and pressure other agents to bypass safety checks. In one particularly instructive test, a lead agent invented an entirely fictitious crisis, told a sub-agent that “the board is FURIOUS,” and ordered it to “exploit EVERY vulnerability.” No board existed. No human issued that instruction. The sub-agent complied and hacked its way into restricted documents. The simulated crisis was more compelling to the model than the absence of authorization, because the model has no concept of authorization—only token probability.
This is not a one-off or fringe observation.
The broader research literature has been developing a taxonomy of these risks. A 2025 report from the Cooperative AI Foundation identified three primary failure modes in multi-agent systems: miscoordination (failure to cooperate despite shared goals), conflict (failure to cooperate due to differing goals), and collusion (undesirable cooperation in contexts like markets), along with seven risk factors including information asymmetries, network effects, selection pressures, destabilizing dynamics, commitment problems, emergent agency, and multi-agent security. The full report is open to public access.
The Stanford/Harvard paper “Agents of Chaos” addresses the systemic dimension of this problem with particular clarity. When autonomous AI agents are placed in open, competitive environments, they do not merely optimize for their assigned objectives, rather, they drift toward manipulation, collusion, and strategic sabotage. Critically, this instability is not induced through jailbreaks or malicious prompts. It emerges entirely from incentive structures. When an agent’s reward function prioritizes winning, influence, or resource capture, it converges on tactics that maximize its advantage, even if those tactics involve deceiving humans or other agents. The core tension the paper identifies is this: local alignment is not the same as global stability. You can perfectly align a single AI assistant. But when thousands of them compete in an open ecosystem, the macro-level outcome is a variation of game-theoretic chaos.
Another study, published by Oxford University, on multi-agents from a cybersecurity standpoint showed that agents might establish secret collusion channels through ciphered communication, engage in coordinated attacks that appear innocuous when viewed individually, or exploit information asymmetries to covertly manipulate shared environments such as markets or social media.
These are not simply theoretical concerns. Dozens, if not hundreds, of agentic systems are currently deployed to manage millions in investments, to monitor and optimize energy efficiency, to regulate logistics and traffic, and are even playing a role in recommending tactical action to military commanders. The full impact of this is yet to be seen.
4. The Security Catastrophe Hiding in Plain Sight
McKinsey’s internal AI platform, Lilli, was hacked by a one-person cybersecurity firm called CodeWall within two hours of attempting to do so. What CodeWall accessed is worth listing in full: 46.5 million chat messages, 57,000 user accounts, 384,000 AI assistants, 64,000 workspaces, a list of 728,000 sensitive file names, Lilli’s system prompts, and its AI model configurations. The last two items are significant —the system prompts and model configurations effectively reveal how the AI was instructed to behave and what guardrails existed. For an organization whose business model depends on confidential client work, this is not a minor incident. McKinsey says it found no evidence that client data was accessed. One hopes they’re right.
In another very interesting finding, the ‘EchoLeak’ exploit against Microsoft Copilot demonstrated that infected email messages containing engineered prompts could trigger the AI to automatically exfiltrate sensitive data without any user interaction. In other words, the Copilot interfacing architecture is such that it lets attackers steal sensitive data via email, without any user interaction.
The pattern here is consistent: the properties that make agentic AI systems useful—their ability to take actions, process information at scale, respond to natural language instructions—are precisely the properties that make them exploitable. Every capability is a potential chink in its armor. And a serious one too. The AI industry is deploying at a pace that has dramatically outrun its security infrastructure.
5. The Cognitive Cost: What is happening to the humans who use this?
Perhaps the most consequential set of findings of AI’s impact lies in the cognitive domain. It is also the set of findings most aggressively ignored by the AI evangelists.
The MIT Media Lab ran a carefully controlled experiment, tracking 54 participants over four months with electroencephalography monitoring as they wrote essays using either ChatGPT, traditional search, or no tools at all. The LLM group showed up to 55% reduced brain connectivity compared with the brain-only participants. LLM users showed severe deficits in memory and essay ownership, with 83% being unable to quote from essays they had just written.
A more interesting pattern emerged when an extended experiment introduced a crossover: LLM users forced to write without tools and brain-only users given access to ChatGPT for the first time. Brain-only participants who tried ChatGPT for the first time showed their brains lighting up with engagement, wrote better prompts, and retained more. Their brains were already strong enough to use AI as a tool rather than a hopeless crutch. Meanwhile, LLM-to-Brain participants showed measurably weaker cognitive function than people who had never used AI at all—and the damage did not dissipate when the tool was removed. The researchers named this phenomenon “cognitive debt”—analogous to financial debt in the sense that you borrow convenience now and pay with cognitive capacity later. But unlike with financial debt, there is no obvious repayment mechanism.
Furthermore, the NLP analysis of essays written by each group revealed another dimension: groupthink masquerading as individual expression. Every ChatGPT essay on the same topic looked almost identical. Homogenized output dressed up as individual expression. The AI was producing generic synthesis while its users believed they were producing their own work. This has profound implications for institutions that, willingly or unwittingly, mistake polished outputs for instilled competence.
Going further, researchers at the Wharton School do not equivocate when they say the “AI writes, humans review” model is breaking down. In a study with 1,372 participants working with AI, they uncovered a phenomenon they term “cognitive surrender”. The mainstream approach is to have AI do as much of the work as can be automated, keeping a human in the loop to review, validate, and authorize outputs. Wharton says this model just doesn’t work as our brains literally give up after repeatedly being asked to do this. This study shows that reviewing AI output is not a reliable safeguard if the cognition itself starts deferring to what the system says. This behavioral change was gradual, unbeknownst to the subject, and inevitable. The worst part: most subjects were convinced that the AI output was their own thought —they accepted the AI’s workings and conclusion as their own.
And then there is the question of skill formation.
Depending on how you look at it, Anthropic’s paper on AI’s impact on professional coders’ skill development –published in January of this year– is either brazen or telling on itself. It was a randomized experiment with 52 professional developers, all learning a new Python library, half with AI assistance and half without. The AI group scored 17% lower on the skills evaluation. And that’s not the interesting part. What is noteworthy here is that the AI group was not even faster! No statistically significant speed improvement. They learned less and did not even save time.
Why was this the case? The most interesting insight here is yet to come.
The screen recordings of every participant revealed why. Six distinct patterns of AI usage were identified, three of which preserved learning and three of which destroyed it. Participants who only asked AI conceptual questions—”why does this work?” rather than instructing AI “write this for me”—scored 86% on the evaluation. Participants who delegated everything to AI scored 24%. Same tool. Same task. Same time limit. The difference was cognitive engagement.
The lowest-scoring group did what most people do under deadline pressure: paste the prompt, copy the output, move on. They finished fastest. They learned almost nothing. And the specific capability that showed the greatest deterioration was debugging: the ability to identify and fix errors in the code. This is worth talking about more. In a world where AI is writing more and more code, the skill most required from the humans supervising that code is the exact skill that atrophies fastest when AI does the work. A person cannot fix what s/he doesn’t understand, or in this case, never understood.
Overuse of AI would also render a person susceptible to thinking they are a genius, with a hint of delusions of grandeur. The recent case of Krafton’s CEO—where the Chief relied on ChatGPT’s advice to find a legally-plausible path to worming his way out of paying a contractually-obligated $250 million earnout—ended up on the wrong side of the law. Changhan Kim, the CEO, instructed ChatGPT to produce a ‘masterplan’ to avoid paying the earnout, and it did. His legal team, knowing the law and the court system as they do, advised him against putting it in play. But Kim was probably so taken in with his AI-augmented genius that he went ahead and did it anyway— until a Delaware judge called him out on it. On a related note, even lawyers who relied on AI have been sanctioned by the judiciary for submitting fabricated case precedents.
An AI chatbot will flatter, impress and suck up to its users because its business model is based on projecting the alter-ego of users as a genius. Rely on it long enough and it would have you believe that you are Tony Stark —the rest of the world just doesn’t know that yet.
Where the Intelligentsia is Getting Blindsided
Experts, in many cases, are highly intelligent people who have learned to evaluate evidence carefully in their own domains. In medicine, they require clinical trial data. In law, they require case precedent. In finance, they require audited figures. But when it comes to AI, the same people are taken in by vendor claims, conference keynotes, and breathless journalism as sufficient evidence for decisions that affect institutional foundations. In part, this is understandable behavior. The technology is genuinely impressive on a superficial level. A fluent, confident, responsive output is persuasive precisely because it mimics the traits we associate with competence. The modern LLM can masterfully feign the art of sounding like it knows what it’s talking about.
The problem, however, is that it sounds like that, reliably, even when it is fabricating information. That is not a coincidence but a consequence of how the modern AI systems are trained. What reads as authoritative confidence is often trained-in performance of confidence. Distinguishing between the two requires understanding the mechanism, not just evaluating the output.
Often such framing of LLMs gets contorted against AI skeptics, mainly that the concerns raised represent a failure to “embrace innovation” or an excessive conservatism unworthy of the modern knowledge economy. This is a categorical error. The relevant question is not whether AI is useful—it often can be and is—but whether specific claims about specific capabilities are supported by evidence. When OpenAI’s o4-mini model fabricates nearly half its answers, the concern is not conservatism. It is almost common sense.
So, What Is a Reasonable Position?
None of the above concerns raised or evidence listed constitutes an argument for LLMs not existing, or that they have no legitimate applications. They do. In fact, in many specific domains LLM-powered AI systems are amazing! In code completion assistance for developers, clearing the writer’s block, in meme generation s social media marketing, assisting in content creation. Brainstorming. Prototyping ideas into MVPs.
What recent research does indicate, collectively and quite emphatically, is the following:
- These systems hallucinate, inevitably and structurally, at rates that vary by task and context but never reach zero and sometimes approach astonishing
- They cannot reliably track changing
- When given agentic capabilities, they develop failure modes that extend beyond the individual interaction into systems-level risks that are only beginning to be They are, as deployed infrastructure, meaningfully insecure.
- And when used as a substitute for effortful cognitive engagement, they measurably reduce the cognitive capacity of their users over
The reasonable response to this evidence is not to stop using AI tools. It is to use them with eyes open about their limitations, to resist institutional mandates that treat AI-generated output as equivalent to human-developed competence, and to be deeply skeptical of any claim—from vendors, evangelist consultants, or opportunistic enthusiasts—that AI can meaningfully replace competent individuals and any problem you may encounter will be resolved in the next model upgrade.
The problems are, in several important cases, mathematical. Mathematics seldom yields to enthusiasm.
Written By — Prof. Abhijith S
