Reflections on Meta’s Secret Meeting and Copyright Issues in AI Training

Recently, The New York Times reported that Meta executives and lawyers had discussed using copyright-protected content for AI training despite the litigation risks. The report highlights the immense demand for AI training data and the ethical and legal challenges surrounding its acquisition.

Training AI models requires vast amounts of data. For example, OpenAI’s GPT-3 was trained on roughly 300 billion tokens drawn from sources such as crawled web pages, books, and Wikipedia, and some more recent models have reportedly been trained on several trillion tokens. Such data is crucial for improving AI performance. However, the anticipated scarcity of high-quality data is pushing companies toward increasingly aggressive collection methods.

According to The New York Times, Ahmad Al-Dahle, Meta’s Vice President for Generative AI, considered using copyrighted content without authorization out of concern that Meta was falling behind OpenAI. The options discussed included paying a flat licensing fee per new book and acquiring a major publisher to amass data. Participants also debated hiring contractors in Africa to summarize copyrighted works without permission, and some argued for absorbing more works despite the litigation risks. One attorney raised ethical concerns about taking intellectual property from artists, but the objection was met with silence.

In light of this situation, we need to consider carefully how to balance intellectual property protection with AI development. Unauthorized use of copyrighted content infringes on creators’ rights and can undermine their motivation to create. On the other hand, advances in AI technology can greatly benefit society, and those advances depend on access to large amounts of training data.

Addressing this dilemma requires revisiting legal frameworks and establishing new rules. Possible measures include developing a licensing system that compensates copyright holders fairly when their content is used for AI training, and setting guidelines that make data collection practices transparent.

In conclusion, the question of how data is collected for AI development is becoming increasingly significant. We must explore legal solutions to this problem that protect creators while supporting the healthy development of AI technology.