ChatGPT couldn't count the 'r's in 'strawberry'. Researchers found the root cause is actually much worse.
When the smartest AI models couldn't count the 'r's in strawberry, the industry blamed tokenization. New research shows the real problem is much worse.

In the middle of 2024, the world’s most advanced artificial intelligence systems were defeated by a fruit. When users asked OpenAI's GPT-4 and Anthropic's Claude to count the number of 'r's in the word "strawberry," the models consistently answered '2'. It was a bizarre, easily replicable hallucination that quickly became a viral meme. How could a system capable of passing the bar exam and writing complex Python applications fail a spelling test that a second-grader could easily pass?
The AI industry quickly coalesced around a convenient scapegoat: tokenization, the process of breaking text into computational chunks. Experts explained that the models were simply blind to the letters inside the chunks they had memorized. But a closer examination of the data tells a different and more sobering story. The widespread failure of LLMs to perform character-level tasks is not merely an artifact of tokenization blindness but a structural inability to execute deterministic counting operations, one the industry now masks with computationally expensive 'chain of thought' reasoning wrappers.
Counting on Synthetic Fingers
When the "strawberry" test first broke the internet in mid-2024, it was treated as an amusing quirk rather than a systemic failure. Users logged the same failure across every major model available: both ChatGPT and Claude confidently insisted there were only two 'r's in the word. The issue was initially categorized alongside other known quirks of Large Language Models (LLMs), such as generating extra fingers in AI images or struggling with basic spatial reasoning riddles.
But the failure illuminated a deeper mechanical reality about how these systems function. LLMs are not thinking machines capable of logic; they are probabilistic engines designed to predict the most likely next token in a sequence. When you ask an LLM to perform a math equation or count the letters in a word, it does not actually compute the math or tally the characters. Instead, it predicts what the answer should look like based on statistical distributions in its vast training data.
As cloud computing provider RunPod noted in its technical analysis, traditional computing returns the same result for the same equation every time you run it, while LLMs generate their results probabilistically. The output is essentially an educated guess wrapped in confident, authoritative syntax.
This probabilistic nature means that LLMs are structurally unfit for deterministic tasks. The "strawberry" failure was merely a highly visible symptom of a broader disease affecting all transformer-based architectures. For example, researchers at MIT demonstrated that LLMs cannot reliably generate random numbers or perform basic deterministic arithmetic without hallucinating.
The researchers documented that LLMs fail to perform simple math because they are simply guessing the next logical number based on textual relationships, rather than executing a mathematical operation in memory. When a calculator adds two numbers, it manipulates bits according to rigid mathematical laws. When an LLM adds two numbers, it essentially looks at millions of documents where similar numbers appeared near each other and outputs the most statistically probable string of digits.
When we ask an LLM to count the letters in "strawberry", we are essentially asking a sophisticated autocomplete engine to do math. It is using a probabilistic guess to simulate a deterministic process.
This structural limitation means that any application relying on an LLM for exact, reproducible, and verifiable counting or arithmetic is inherently flawed. The MIT findings anchor the reality that when LLMs encounter tasks requiring rigid logic rather than fluid language generation, they confidently fabricate an answer that merely looks correct to a human reader.
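The gap is easy to see in code. A deterministic counter, sketched below in plain Python, manipulates explicit state according to rigid rules and returns the same answer on every run — exactly the guarantee that probabilistic next-token prediction cannot make.

```python
def count_letter(word: str, letter: str) -> int:
    """Tally occurrences of a letter by scanning character by character,
    incrementing an explicit stateful counter."""
    tally = 0
    for ch in word:
        if ch == letter:
            tally += 1
    return tally

print(count_letter("strawberry", "r"))  # → 3, every single time
```

Run it a thousand times and you get 3 a thousand times; ask an LLM a thousand times and you get a distribution over plausible-looking answers.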
Byte Pair Encoding and the Tokenization Scapegoat

To understand why the industry so eagerly blamed tokenization for the strawberry incident, we must first examine Byte Pair Encoding (BPE), a data compression algorithm used to train modern tokenizers. BPE iteratively replaces the most frequent pairs of bytes or characters in a training dataset with a single new token.
Because LLMs use algorithms like BPE, they do not read text letter-by-letter the way a human does. Instead, they ingest text in pre-computed chunks. According to analysis of how tokenizers break down text, the word 'strawberry' is often sliced into discrete chunks like 'straw' and 'berry', or 'st', 'raw', 'berry'. Because the exact spelling is abstracted away into an integer ID, the AI never "sees" the individual 'r's.
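A toy version of the BPE training loop makes the chunking concrete. This sketch is illustrative only — production tokenizers add byte-level fallbacks, pre-tokenization rules, and vocabularies of tens of thousands of entries — but it shows how frequent character pairs fuse into opaque multi-character tokens.

```python
from collections import Counter

def bpe_merges(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merge rules: repeatedly fuse the most frequent
    adjacent symbol pair across the corpus into one new symbol."""
    words = [list(w) for w in corpus]  # start from single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        for w in words:  # apply the merge everywhere
            i = 0
            while i < len(w) - 1:
                if (w[i], w[i + 1]) == best:
                    w[i:i + 2] = [w[i] + w[i + 1]]
                else:
                    i += 1
    return merges

# On a toy corpus, 'l'+'o' fuses first, then 'lo'+'w' — characters
# disappear into ever-larger chunks the model sees only as integer IDs.
print(bpe_merges(["low", "lower", "lowest"] * 10, 2))
```

After enough merges on web-scale text, common words like "straw" and "berry" become single tokens, and the model receives two integer IDs with no character content at all.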
Andrej Karpathy, a founding member of OpenAI and former director of AI at Tesla, popularized this theory in the immediate aftermath of the strawberry failures. In a comprehensive breakdown of tokenization mechanics, Karpathy argued that a lot of weird behaviors and problems of LLMs actually trace back to tokenization. He suggested that these models suffer from a fundamental visual blind spot, unable to peer inside the compressed BPE chunks to count the letters.
His conclusion was simple and optimistic: ideally someone out there finds a way to delete this stage entirely. The implication was that once we move past tokenization, the models will finally be able to perform these basic tasks without hallucinating.
However, calling tokenization the sole culprit was a convenient way for researchers to avoid acknowledging a deeper architectural deficit. If the problem was merely visual blindness caused by token boundaries, then models would fail uniformly across all character-identification tasks regardless of the specific word or the number of characters involved. But the data shows they do not. The failure rates are highly variable and point to a completely different bottleneck.
The Experimental Proof of the Logic Deficit
A December 2024 academic study from Nanjing University of Aeronautics and Astronautics, titled "Why Do Large Language Models (LLMs) Struggle to Count Letters?", empirically refuted the tokenization-only theory. The researchers evaluated a dataset of 10,000 words against leading models including LLaMA 3.1, GPT-4o, and Mistral.
Their testing methodology was rigorous. They asked models to identify the presence of specific letters within words that were heavily fragmented by the tokenizer, completely isolating the variable of token blindness. What they discovered fundamentally shifts our understanding of LLM capabilities. They found that LLMs can actually perfectly identify single letters within tokens.
If the AI can identify a single letter inside a compressed token chunk, it is not fundamentally blind to the letters. It possesses the necessary latent representations to know which characters comprise a token. So why does it fail so spectacularly at counting them?
The Nanjing University team documented that error rates correlate strictly with the multiplicity of letters in a word. If a model is asked to find a letter that appears only once, it succeeds with high accuracy. If it has to count a letter that appears three times, its failure rate climbs sharply.
The statistics from the study paint a grim picture of mathematical competence. Even top-tier models like OpenAI's GPT-4o failed to correctly count letters in 17% of words evaluated. Meanwhile, open-weight models in the 7B to 11B parameter range failed in 63% to 74% of evaluated words. The failure is not visual; it is procedural.
The conclusion of the academic study is damning for the tokenization theory: the main problem in correctly counting letters in a word lies in the counting itself; tokenization does not play a fundamental role, and word or token frequency has no impact on the result.
Defending the Tokenization Theory
Before writing off the tokenization defense completely, we must present the opposing viewpoint fairly. Prominent AI researchers and engineers, including Andrej Karpathy, argue that tokenization is indeed the fundamental root cause of these counting failures. They assert that LLMs simply cannot 'see' the individual letters inside a compressed token chunk because the underlying neural network only processes the integer ID of the token, not its constituent characters.
Under this theory, when an LLM looks at the token 'berry', it treats it as a single indivisible semantic concept. It is much like a human looking at an emoji of a strawberry; we understand the concept, but we do not consciously process the individual pixels that make up the image. Karpathy has extensively documented how tokenization creates a persistent "blind spot" that plagues spelling, string manipulation, and coding tasks across all major model architectures. If this defense holds true, the inability to count 'r's is strictly an input/output parsing issue, not a cognitive failure of the model's core logic capabilities.
However, the empirical record dismantles this defense. The Nanjing University experiments show that LLMs can perfectly identify single letters within tokens when asked to do so. Their failure rate scales strictly with how many times they must add the letter to a running tally, indicating a mathematical counting deficit, not visual blindness.
The researchers explicitly noted that word or token frequency have no impact on the result, meaning it doesn't matter how the word was chunked by the BPE algorithm. What matters is the number of times the model has to perform a discrete addition operation. By shifting the blame to tokenization, the industry was masking a far more concerning reality: standard LLMs are structurally incapable of reliable, deterministic counting.
Palindromes, JSON Artifacts, and the Language Tax
The inability to perform basic deterministic tasks extends far beyond the isolated "strawberry" meme. These structural limits have historically manifested in several painful, computationally expensive ways that software developers have been forced to hack around for years. Counting letters is just the tip of the iceberg.
First, consider basic string manipulation and reversal. LLMs have historically failed to reverse strings or reliably identify palindromes in zero-shot prompts. Because they are predicting the next likely sequence rather than manipulating an array of characters in memory, asking an LLM to spell "racecar" backward requires it to probabilistically predict the reverse sequence. This is a task that goes entirely against the grain of its left-to-right, autoregressive training data.
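For contrast, the deterministic versions of these string tasks are trivial in conventional code, because the language runtime actually manipulates an array of characters rather than predicting a likely-looking sequence:

```python
def is_palindrome(s: str) -> bool:
    """Deterministic palindrome check: normalize the string,
    then compare it to its own reverse."""
    cleaned = "".join(ch.lower() for ch in s if ch.isalnum())
    return cleaned == cleaned[::-1]

print("racecar"[::-1])              # → racecar
print(is_palindrome("racecar"))     # → True
print(is_palindrome("strawberry"))  # → False
```

Reversal here is a mechanical index walk from the last element to the first; there is no statistical guess anywhere in the operation.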
Second, this rigid, probabilistic token structure wreaks havoc on data formatting and syntax generation. Developers frequently encounter erratic behavior, trailing-space artifacts, or poor JSON/YAML output formats when using LLMs in production pipelines. When rigid token boundaries misalign with strict syntax requirements, the model's probabilistic guesses lead to broken code and corrupted data structures that crash downstream applications.
The most egregious historical failure stemming from these architectural choices, however, is the non-English language tax. Because popular tokenizers are trained predominantly on English text scraped from the Western internet, they are highly optimized for English. When confronted with other languages—such as Japanese, Korean, Arabic, or Spanish—the tokenizer splits the text into far more, far smaller pieces, drastically inflating token counts.
This means that interacting with an LLM in a non-English language not only yields worse performance but also costs far more money, as cloud providers bill users by the token.
The "language tax" proves that tokenization is deeply flawed, but it also highlights how AI companies have been willing to accept massive structural inefficiencies rather than rebuild their fundamental architectures from scratch.
These historical failures underscore the reality that probabilistic text prediction is an extremely blunt instrument for precise computational tasks. We are trying to use a sophisticated pattern-matcher to do the job of a basic pocket calculator.
The Anatomy of an Autoregressive Failure
To truly grasp the magnitude of the counting flaw, one must look at the mechanics of autoregressive generation. At every step, the model calculates a probability distribution over its entire vocabulary to pick the single most likely next word. It holds no internal working memory or persistent state across those steps other than the context window itself.
When counting the letters in a word, a human uses an internal tally—a stateful variable that increments as the eyes scan the text. Standard LLMs possess no such architectural mechanism. They have no internal register in which to safely store an incrementing value; they rely purely on the attention mechanism to map relationships between the prompt and the generated text.
Attention mechanisms are excellent at associating concepts, such as linking "Paris" with "France", but they are exceptionally poor at maintaining strict counts. GPT-4o's 17% failure rate reflects exactly this: attention dilutes probability across multiple occurrences of a letter rather than cleanly adding them up. The more occurrences there are, the more the probability distribution blurs, inevitably leading to a hallucinated total.
This is not a bug that can be patched with a larger training dataset; it is an intrinsic limitation of autoregressive sequence generation. You cannot train an engine to ignore its own core architecture.
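A toy numerical sketch illustrates the dilution problem. This is not a claim about any real model's weights — just a demonstration that a single softmax-normalized attention readout is a weighted average whose weights always sum to one, so pooling over matching positions saturates instead of tallying.

```python
import math

def softmax(scores):
    """Standard numerically-stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attended_readout(scores, values):
    """Pool values with attention weights: a convex combination,
    regardless of how many positions match the query."""
    return sum(w * v for w, v in zip(softmax(scores), values))

# Positions holding the target letter get a high match score and value 1.0.
one_r   = attended_readout([5, 0, 0, 0], [1, 0, 0, 0])
three_r = attended_readout([5, 5, 5, 0], [1, 1, 1, 0])

# Both readouts saturate near 1.0; the normalized pooling cannot
# express "three" the way an incrementing tally can.
print(round(one_r, 3), round(three_r, 3))
```

One occurrence and three occurrences produce nearly identical readouts, which is precisely the "dilution across multiple occurrences" the researchers describe.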
Slapping a 'Reasoning' Band-Aid Over the Flaw
Faced with mounting empirical evidence that standard LLMs cannot perform basic counting or deterministic logic natively, the AI industry faced a critical divergence point. They had a choice: rethink the fundamental architecture of probabilistic prediction entirely, or find a way to brute-force over the structural cracks using raw compute power. They decisively chose the latter.
In September 2024, OpenAI released a new class of models, the o1 series, designed to tackle these exact logical shortcomings. The internal codename for this project during development was literally "Strawberry", a direct and somewhat defensive nod to the viral spelling failure that the model was specifically engineered to solve.
Instead of fixing the underlying inability to count natively within the network weights, the o1 models introduced a heavy layer of inference-time computation built around a technique known as Chain of Thought: an inference strategy in which the model generates intermediate reasoning steps, breaking a complex problem down sequentially before producing a final answer.
When you ask an o1 model to count the 'r's in "strawberry," it no longer immediately predicts the answer token. Instead, it runs a chain of thought—trained via reinforcement learning—in a hidden, internal scratchpad: writing out the word letter by letter, counting each occurrence sequentially, and only then outputting the final tally. This sidesteps the tokenization limitations by reasoning out the letters step by step instead of relying purely on token prediction.
While this approach successfully allows the model to arrive at the correct answer, it is wildly inefficient. It is the computational equivalent of using a sledgehammer to drive a thumbtack. By relying on a lengthy Chain of Thought wrapper, the model requires vastly more inference time, energy, and compute power to achieve what a single line of standard Python code can accomplish in milliseconds on a cheap CPU.
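A rough simulation of what such a scratchpad produces — the function name and trace format are invented for illustration, not OpenAI's actual mechanism — makes the overhead visible: ten generated "reasoning" lines to recover what one deterministic call delivers instantly.

```python
def chain_of_thought_count(word: str, letter: str):
    """Simulate a hidden reasoning scratchpad: enumerate letters one
    by one, emit a reasoning line per step, return the final tally."""
    tally, lines = 0, []
    for i, ch in enumerate(word, start=1):
        if ch == letter:
            tally += 1
        lines.append(f"{i}. '{ch}' -> running count of '{letter}': {tally}")
    return lines, tally

trace, answer = chain_of_thought_count("strawberry", "r")
print(answer)                    # → 3
print(len(trace))                # → 10 hidden lines generated for one answer
print("strawberry".count("r"))   # → 3, the single-call deterministic version
```

Every line of that trace is a token-generation step billed at inference time; the `str.count` call is a few machine instructions.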
The highly publicized release of o1 "Strawberry" is a tacit admission from the most powerful AI company in the world that their base foundation models are structurally broken when it comes to deterministic logic. Rather than eliminating the fundamental flaw, they have chosen to bury it under an expensive, power-hungry reasoning loop that abstracts the failure away from the end user.
The Economic Reality of Hidden Scratchpads
The shift toward models like o1 introduces a new, massive problem for the industry: the soaring cost of inference. When an LLM relies on a hidden scratchpad to solve basic logic puzzles, it is generating hundreds or thousands of unseen tokens. The user does not see these tokens in the final interface, but the cloud provider still incurs the steep computational cost to generate them.
This economic reality makes Chain of Thought a severely limiting factor for widespread deployment. If every basic counting operation or logical deduction requires a model to spend twenty seconds generating hidden reasoning steps, the latency and cost become prohibitive for real-time applications. Enterprise customers who expect deterministic results from LLMs will find themselves paying premium prices just to ensure the model doesn't hallucinate a simple addition problem.
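A back-of-the-envelope calculation shows how hidden reasoning tokens dominate the bill. The prices below are invented for illustration, not any provider's actual rates; the point is the ratio, not the dollar figures.

```python
# Assumption: $10 per million output tokens, with hidden reasoning
# tokens billed the same way as visible output tokens.
PRICE_PER_TOKEN = 10 / 1_000_000

def query_cost(visible_tokens: int, hidden_tokens: int) -> float:
    """Total billed cost of one query, hidden scratchpad included."""
    return (visible_tokens + hidden_tokens) * PRICE_PER_TOKEN

plain = query_cost(5, 0)      # a bare answer like "3"
cot   = query_cost(5, 2000)   # the same answer behind a hidden scratchpad

print(f"{cot / plain:.0f}x more expensive")  # → 401x more expensive
```

Even at fractions of a cent per query, a 400-fold multiplier on every basic logical operation compounds brutally at production scale.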
The industry's pivot to test-time compute reveals a stubborn commitment to the transformer architecture at all costs. Rather than engineering a system that inherently understands logic and state, researchers are building increasingly elaborate, energy-intensive scaffolding around models that are, at their core, just guessing the next word. Avoiding these logic failures demands steep increases in computational overhead on every single query.
The long-term viability of this strategy remains highly questionable. Wrapping a broken counting mechanism in layers of expensive compute does not solve the mathematical deficit; it merely taxes the end user to hide it.
The Hard Limits of Pure Pattern Matching
The AI industry is currently riding a wave of immense hype, heavily funded by venture capital, promising that LLMs will soon serve as autonomous software engineers, reliable financial analysts, and trustworthy agents capable of executing complex workflows without human supervision. Yet the "strawberry" incident, and the subsequent data exposing the models' inability to perform basic counting, stands as a massive, undeniable red flag.
When Nanjing University researchers proved that errors strictly correlate with the multiplicity of letters rather than token boundaries, they exposed a fatal flaw in the operating assumption that LLMs can naturally scale into robust reasoning engines simply by adding more parameters. The industry has implicitly admitted, through the release of models like o1, that standard autoregressive LLMs cannot do math or reliable deterministic counting on their own.
This means that relying on LLMs for deterministic code generation, precise financial calculations, or exact data parsing remains inherently risky. The models do not understand the underlying rules of the systems they are manipulating. If you run an equation a thousand times through an LLM, you generate probabilistic, not deterministic, results.
As Brendan McKeag succinctly noted in his analysis of these failures, "That's just not how math works." Mathematical certainty cannot be achieved through statistical approximation, no matter how large the training dataset grows.
While optimists like Andrej Karpathy still hold out hope that someone will find a way to delete the tokenization stage entirely, the massive industry transition to models like o1 suggests a very different operational future. The search for a token-free, inherently logical architecture is taking a backseat to the brute-force application of Chain of Thought wrappers. The industry is actively opting to burn more electricity, capital, and computing power to make a probabilistic model merely pretend it is deterministic.
The Final Verdict: Hitting the Logic Wall
The "strawberry" incident was initially, and easily, dismissed as a quirky artifact of tokenization. But the experimental data tells a more sobering story about the trajectory of generative AI. The widespread failure of LLMs to perform character-level tasks is not simply token blindness; it is a profound, structural inability to perform basic, deterministic mathematical addition.
The empirical evidence from MIT regarding LLMs' inability to do reliable math, combined with the rigorous Nanjing University data demonstrating that counting failures scale strictly with letter multiplicity, overwhelmingly supports the thesis that these models possess a fundamental logic deficit.
By aggressively shifting to computationally expensive, Chain of Thought wrappers like o1 "Strawberry" to mask this exact flaw from users, the AI industry has tacitly admitted that pure text prediction has hit a hard logic wall.
As technology companies continue to push LLMs into high-stakes, deterministic workflows like automated coding and financial analysis, understanding this structural limitation is no longer just an academic exercise in computer science. It is the practical difference between writing functioning, secure software and relying on a very articulate predictive 8-ball. The models are not reasoning; they are just getting better at hiding the fact that they are guessing.