OpenAI's GPT-4 trained on scraped YouTube videos?

OpenAI, the company known for AI tools like DALL-E and ChatGPT, seems to have taken a shortcut once again, this time while training its latest model, GPT-4. According to a New York Times report, OpenAI may have violated YouTube’s rules by using its own speech recognition software, Whisper, to transcribe more than a million hours of videos. Unlike the 3,000-word transcripts, these appear to have been fed directly into GPT-4’s training data.

This raises ethical concerns. The size of the training data set and the technology used to generate it likely improved the model's performance, yet the developers did not receive explicit approval from the YouTube creators whose content was used; the process went ahead regardless. For comparison, Google, which owns YouTube, has also used YouTube content to train its Gemini AI models, although those uses reportedly had permission.

This case shows how closely AI research is tied to questions of copyright. Concerns tend to arise whenever tech giants release innovative AI technologies without putting ethics into practice. OpenAI's approach could erode online trust, in addition to causing legal or reputational harm to creators.