Considerations on OpenAI’s Use of YouTube Transcription Data: Between Copyright and AI Training

Recently, The New York Times reported that OpenAI trained GPT-4, the latest model behind its chat AI service, on transcripts of 1 million hours of YouTube videos. The report once again raises questions about where the data used for AI training comes from.

YouTube CEO Neal Mohan stated in an interview with Bloomberg that “using YouTube videos and their transcriptions for AI training violates the service’s terms of use,” and the statement has fueled further debate since the report appeared. Operating and training AI does require vast amounts of power and data, and the material needed to make models smarter is in constant short supply.

According to The New York Times report, the source is a member of an OpenAI team that includes OpenAI President Greg Brockman, and claimed to have been involved in collecting YouTube videos. If true, this would inevitably raise issues of copyright and privacy.

Companies must handle the data they use for AI training cautiously, with both copyright and privacy in mind. At the same time, language models have an insatiable appetite for text, and developers are under constant pressure to find new sources of it.

OpenAI and Google are currently direct rivals, with ChatGPT competing against Gemini. If the report by The New York Times is true, it could lead to significant legal issues between the two companies. Because downloading content uploaded to YouTube without permission, or using it for other purposes, is prohibited, the alleged use could potentially constitute copyright infringement.

In an interview with The Wall Street Journal, OpenAI CTO Mira Murati avoided saying definitively whether YouTube videos were used to train the video generation AI Sora. Meanwhile, in response to the recent report, a Google spokesperson described it as “an uncertain report.”

The progress of AI technology is remarkable and its benefits are immense, but it also brings ethical and legal challenges. Issues of copyright and privacy in particular are becoming increasingly important topics of discussion. Companies now face growing demands for transparency and legal compliance in how they handle the data used for AI training.

What this issue asks us to reconsider is the balance between technological advancement and ethics. While maximizing the potential of AI development, we should aim for a future in which legal frameworks are respected and society as a whole can enjoy the benefits of technology with confidence.