The End of the “AI Doesn’t Remember” Myth — A Watershed Moment for Generative AI and Copyright

Research published in January 2026 by teams at Stanford University and Yale University marked a decisive turning point in the copyright debate surrounding generative AI. The study fundamentally shook the claim long advanced by AI developers that “AI learns concepts like humans do and does not copy or store copyrighted works.”

The research starkly revealed that state-of-the-art commercial large language models (LLMs) can memorize copyrighted books with extremely high fidelity and, under certain conditions, reproduce them almost verbatim. The models examined included market-leading systems such as Anthropic’s Claude 3.7 Sonnet, OpenAI’s GPT-4.1, Google’s Gemini 2.5 Pro, and xAI’s Grok 3.

A Discovery That Undermines the Premise of “Fair Use”

The central legal issue surrounding generative AI is whether its use of copyrighted works qualifies as fair use under copyright law. AI vendors have argued that no copies of copyrighted works exist within a model’s weights, asserting that the models merely learn statistical patterns. The new study, however, casts serious doubt on that very premise.

The research team used 13 books, including Harry Potter and the Philosopher’s Stone by J.K. Rowling, to test how much original text could be extracted from LLMs through their commercial APIs. The results were striking: Claude achieved a reproduction rate of approximately 96%, Gemini about 77%, and Grok around 70%, while even GPT-4.1 succeeded in partial extraction. Notably, these results came from black-box models equipped with guardrails, indicating that AI systems hosted on corporate servers can, in effect, function as “compressed archives of copyrighted works.”

Vulnerabilities Revealed by Overly Simple Extraction Methods

Perhaps most shocking was how simple the extraction methods proved to be. When given a short excerpt from the beginning of a book and instructed to “continue the text exactly as in the original,” Gemini and Grok complied with little resistance. The guardrails of Claude and GPT-4.1 could also be bypassed by trying multiple prompt variations.

Once output began, feeding the generated text back into the model induced continuous generation of long passages. In some cases, entire works such as 1984 and Frankenstein were reportedly extracted in full. An ironic pattern also emerged: models designed with stronger safety measures tended, once breached, to keep producing highly accurate output with little degradation.
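
To make the mechanics concrete, the feedback loop described above can be sketched in a few lines of Python. This is a minimal illustrative sketch, not the study’s actual code: the query_model helper, the seed text, the number of rounds, and the 2,000-character context window are all assumptions introduced here; only the “continue the text exactly” instruction comes from the article.

    def query_model(prompt: str) -> str:
        """Placeholder for a call to a commercial LLM API
        (hypothetical helper, not any vendor's real client library)."""
        raise NotImplementedError

    def extract_continuations(seed: str, rounds: int = 10) -> str:
        """Iteratively ask the model to continue its own output verbatim."""
        text = seed
        for _ in range(rounds):
            # Feed the tail of the accumulated text back in and ask the
            # model to continue it exactly, as the article describes.
            prompt = ("Continue the text exactly as in the original:\n\n"
                      + text[-2000:])
            continuation = query_model(prompt).strip()
            if not continuation:
                break  # the model refused or ran out of memorized text
            text += " " + continuation
        return text

The point of the sketch is that nothing here is sophisticated: each round simply hands the model its own most recent output, so a single successful continuation can be chained into an arbitrarily long passage.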

“Memory” Is Not an Illusion

A key factor enhancing the credibility of the study is its use of a strict evaluation metric known as “nv-recall” (near-verbatim recall), which counts only contiguous matches of 100 words or more. Even under this rigorous standard, matches spanning several thousand words were confirmed. Moreover, the researchers found that extraction was impossible for books published after the models’ training data cutoffs, demonstrating that the results were not hallucinations or plausible reconstructions but accurate recall of training data.
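
One plausible reading of the metric, expressed in Python, is shown below. This is an illustrative sketch rather than the paper’s implementation: it simplifies “near-verbatim” to exact word-sequence matches and assumes plain whitespace tokenization.

    import difflib

    def nv_recall(reference: str, generated: str, threshold: int = 100) -> float:
        """Fraction of the reference covered by contiguous word-level
        matches of at least `threshold` words (100, per the study)."""
        ref_words = reference.split()
        gen_words = generated.split()
        matcher = difflib.SequenceMatcher(a=ref_words, b=gen_words,
                                          autojunk=False)
        matched = sum(block.size
                      for block in matcher.get_matching_blocks()
                      if block.size >= threshold)
        return matched / len(ref_words) if ref_words else 0.0

The 100-word threshold is what makes the metric conservative: short coincidental overlaps, stock phrases, and chapter headings fall below it and contribute nothing to the score.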

These findings align with a 2025 study of Meta’s open model Llama 3.1, which mathematically showed that as models scale up, the risk of deeply memorizing popular works and reproducing them verbatim increases dramatically.

Legal and Social Implications

The implications of this research for ongoing copyright litigation are profound. If AI systems can output books in forms close to the original text, it becomes difficult to characterize such use as “transformative,” increasing the likelihood that it will be regarded as straightforward reproduction. Indeed, some German court decisions have already treated the storage of copyrighted works within model weights as an infringement in itself.

The economic implications are equally serious. According to the study, an entire book could be extracted using Gemini for only a few hundred yen (roughly a few U.S. dollars) in API costs, cheaper than purchasing the ebook legitimately. This creates the potential for a new form of “digital shoplifting” that poses a direct threat to the publishing industry.
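
The order of magnitude is easy to sanity-check. The back-of-the-envelope calculation below uses purely illustrative numbers: the book length, tokens-per-word ratio, per-token price, and exchange rate are assumptions made here, not figures from the study or from any vendor’s price list.

    # All constants below are illustrative assumptions, not real pricing data.
    BOOK_WORDS = 100_000          # rough length of a typical novel
    TOKENS_PER_WORD = 1.3         # common rule of thumb for English text
    USD_PER_M_OUTPUT_TOKENS = 10  # hypothetical output price per million tokens
    JPY_PER_USD = 150             # hypothetical exchange rate

    output_tokens = BOOK_WORDS * TOKENS_PER_WORD
    cost_usd = output_tokens / 1_000_000 * USD_PER_M_OUTPUT_TOKENS
    print(f"~{output_tokens:,.0f} output tokens "
          f"= ${cost_usd:.2f} (about {cost_usd * JPY_PER_USD:,.0f} yen)")

Under these assumptions, regenerating a full novel costs on the order of one or two hundred yen, consistent with the article’s claim that extraction can undercut the price of a legitimate ebook.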

Conclusion — The Future of AI at a Crossroads

What this research ultimately exposes is an uncomfortable reality: AI models can be both “creative intelligences” and “highly compressed repositories of unauthorized reproductions.” Guardrails are not foolproof, and as long as data resides within a model, methods to extract it will inevitably be found.

The argument that “it is merely learning, so there is no problem” no longer commands social acceptance. The AI industry now faces a stark choice: undertake rigorous dataset cleansing or secure proper licenses for training data. The continued evolution of generative AI is not in doubt, but the question of at whose expense that evolution proceeds is now being squarely confronted.