In recent years, debate over copyright issues surrounding generative AI has intensified rapidly. Questions such as “Is it permissible to use copyrighted works as training data?” and “If an output resembles an existing work, who bears responsibility?” are repeatedly discussed in news media and on social platforms.
Tech blogger Jason Willems offers an intriguing perspective on these debates. His argument is that generative AI has not so much created entirely new problems as it has dismantled the long-standing, implicit assumption of “human scale” on which copyright law has relied — thereby exposing the ambiguity that was always there.
This article uses that insight as a starting point to organize the relationship between generative AI and copyright.
The “Drawing Sonic” Example and the Legal Gray Zone
Willems offers the example of “drawing Sonic the Hedgehog at home.”
If you draw a picture of Sonic the Hedgehog at home purely for personal enjoyment, it is unlikely, in practical terms, to become a legal issue. However, if you post that drawing on social media, it could be interpreted as publicly distributing an unauthorized derivative work.
The key point here is that copyright enforcement has never operated in strict black-and-white terms. Instead, it has long functioned through a combination of tacit tolerance, customary practice, and prosecutorial discretion. As long as individuals created small amounts of derivative works as a hobby, rights holders often had little incentive to pursue legal action.
In other words, copyright law has functioned — albeit with built-in ambiguity — on the assumption that humans occasionally handle limited quantities of creative material.
Generative AI Has Destroyed the “Scale” Assumption
Generative AI fundamentally destabilizes this assumption.
AI systems can generate content at near-zero marginal cost and at massive scale. In theory, tens or hundreds of thousands of images or texts can be produced in a single day — orders of magnitude beyond what a human hobbyist could create.
What once remained a manageable gray zone within human-scale activity has expanded into a matter directly tied to large-scale economic interests. As a result, the tacitly tolerated system can no longer withstand the pressure.
The Core Question: At Which Stage Should Regulation Occur?
To clarify the debate, Willems suggests focusing on where enforcement should take place. Broadly, there are three stages:
- The training stage
- The generation stage
- The distribution stage
Regulating the Training Stage
One seemingly straightforward proposal is to prohibit training on datasets that include copyrighted works.
However, the internet contains vast amounts of legally published commentary, reporting, parody, and other content that references copyrighted works and may qualify as fair use.
Even if a model were trained solely on such lawful materials, it is difficult to rule out the possibility that a character’s appearance or distinctive traits might still be reflected in the model’s outputs. Moreover, given the immense size of training datasets, it is not realistically feasible to prove after the fact precisely what was learned and from which source.
In short, defining and verifying “completely clean training” is extraordinarily difficult.
Regulating the Generation Stage
Another approach is to regulate outputs at the point of generation.
Here, however, the issue of intent becomes central. Did the user explicitly instruct the model to imitate a specific existing work, or did the similarity arise incidentally from a vague prompt? Mechanically distinguishing between the two is extremely challenging.
It is possible to create lists of prohibited terms or implement filtering systems, but these measures often devolve into cat-and-mouse games and raise concerns about overreach and chilling effects.
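The cat-and-mouse dynamic is easy to see in miniature. Below is a minimal sketch of a prohibited-term prompt filter (the term list and function names are hypothetical, for illustration only): exact-match keyword blocking catches the obvious prompt but is trivially evaded by misspellings or paraphrase, which is exactly why such lists tend to escalate rather than settle the problem.

```python
# Minimal sketch of a prohibited-term prompt filter.
# The blocked-term list is a hypothetical example, not any vendor's real list.

BLOCKED_TERMS = {"sonic the hedgehog"}

def is_blocked(prompt: str) -> bool:
    """Return True if the prompt contains any blocked term verbatim."""
    lowered = prompt.lower()
    return any(term in lowered for term in BLOCKED_TERMS)

print(is_blocked("draw Sonic the Hedgehog"))          # caught: exact match
print(is_blocked("draw s0nic the hedgeh0g"))          # missed: trivial misspelling
print(is_blocked("a fast blue video-game hedgehog"))  # missed: paraphrase
```

Each evasion prompts a broader rule, and each broader rule risks blocking legitimate prompts, which is the overreach and chilling-effect concern noted above.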
Furthermore, statutory damages in copyright law were designed under the assumption that humans occasionally infringe. In a context where AI can generate content at massive scale, damages could theoretically accumulate to astronomical levels, potentially destabilizing the legal framework itself.
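A back-of-the-envelope calculation shows the scale problem. Under U.S. law (17 U.S.C. § 504(c)), statutory damages run from $750 to $30,000 per infringed work, up to $150,000 for willful infringement. The generation volume below is a hypothetical assumption, not a measured figure:

```python
# Illustrative arithmetic only, not legal analysis.
# US statutory minimum per infringed work (17 U.S.C. § 504(c)): $750.
MIN_STATUTORY = 750

# Hypothetical: a model emitting 100,000 infringing outputs in one day.
outputs_per_day = 100_000

exposure = MIN_STATUTORY * outputs_per_day
print(f"${exposure:,}")  # $75,000,000 of theoretical exposure per day
```

Even at the statutory minimum, one day of machine-scale output yields tens of millions of dollars in theoretical liability, a figure the statute's human-scale drafters never contemplated.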
Assigning Responsibility at the Distribution Stage
Traditionally, copyright enforcement has been strongest at the point of distribution or public release.
If AI-generated content is created privately and never shared, it generally does not substitute for market goods or damage brand value. However, once published on platforms such as YouTube or social media, the situation changes.
Yet assigning responsibility at this stage imposes substantial burdens on platforms, which would need to handle enormous volumes of AI-generated content. This pressure often shifts the focus back to model developers — returning the debate to the question of who ultimately bears responsibility.
Borders and the Emergence of a Two-Tier System
Even if strict regulation is implemented in one country, AI models are distributed across borders. If, for example, the United States were to impose stringent rules, usage might shift toward overseas providers or open-source alternatives.
This could lead to a bifurcated ecosystem: commercially compliant AI emphasizing safety and regulation, alongside loosely regulated “wild” AI systems. In such a scenario, domestic firms might lose competitiveness without the underlying problem being resolved.
The Difficulty of a Blanket Licensing Alternative
Willems also mentions compensation schemes or comprehensive licensing systems for AI companies as a possible alternative.
However, designing such a system raises complex questions: Who should be compensated? How much? At which stage? Administrative costs could be enormous, and ensuring transparency and fairness would introduce further complications.
At present, there appears to be no simple, decisive solution.
The Shifting Assumption of the “Fixed Work”
Perhaps Willems’s most significant insight is that the very nature of content itself is changing.
If a news article is no longer a single fixed page but instead becomes dynamically generated — varying in length and tone for each reader — when does the “work” become fixed? Which version should be archived? Which version would serve as the basis for proving infringement?
Copyright law was built on the premise of fixed expressions. Generative AI, however, is transforming content into an experience generated in real time. This shift may represent a more fundamental challenge than debates about training data or similarity.
Reactions on Hacker News
The issue has also been debated on Hacker News.
Some have pointed out that technologists who were previously critical of copyright now appear concerned about infringement in the context of AI. In response, others argue that this is not a change in principle; rather, it reflects concern that when large corporations either invoke or disregard copyright, the societal consequences are substantial.
Some commentators draw parallels to Google’s book-scanning project (Google Books), noting that the tech industry has historically expanded legal interpretations to push boundaries. Generative AI, they suggest, may simply be the next extension of that pattern.
Conclusion: What Is Really Being Reconsidered?
The questions “Can copyrighted works be used for training?” and “Who is responsible if outputs resemble existing works?” are undeniably important.
Yet from Willems’s perspective, these may be transitional debates — attempts to apply old frameworks to a rapidly changing reality.
Generative AI may not be breaking copyright itself. Instead, it may be dismantling foundational assumptions: human-scale activity, fixed works, and limited distribution.
If that is the case, what is required is not piecemeal fixes to individual issues, but a fundamental redesign of how creation and distribution function in the digital age.
