It was inevitable. Nearly all publicly available data on the web is being used to train AI models by the AI giants, with the help of third-party dataset generators and without any permission. Proof News reveals that even your YouTube videos are being used for that purpose, without your consent.
Everybody’s work is being exposed to AI dataset generators
As stated by Proof News: “Apple, Nvidia, Anthropic, and other big tech companies used thousands of swiped YouTube videos to train AI. Creators claim their videos were used without their knowledge”. The research site claims that tech companies are turning to controversial tactics to feed their data-hungry artificial intelligence models, vacuuming up books, websites, photos, and social media posts, often unbeknownst to the creators.
Shrouded in secrecy
Proof News adds that AI companies are generally secretive about their sources of training data, but its investigation found that some of the wealthiest AI companies in the world have used material from thousands of YouTube videos to train AI. Companies did so despite YouTube’s rules against harvesting materials from the platform without permission. “Our investigation found that subtitles from 173,536 YouTube videos, siphoned from more than 48,000 channels, were used by Silicon Valley heavyweights, including Anthropic, Nvidia, Apple, and Salesforce,” the site adds. Proof News also found material from YouTube megastars, including MrBeast (289 million subscribers, two videos taken for training), Marques Brownlee (19 million subscribers, seven videos taken), Jacksepticeye (nearly 31 million subscribers, 377 videos taken), and PewDiePie (111 million subscribers, 337 videos taken). Some of the material used to train AI also promoted conspiracies such as the flat-Earth theory.
Layers on layers of data: Irreversible process
It’s important to emphasize that training on this data is effectively irreversible. Once a model has been trained, every parameter reflects contributions from many pieces of data at once. In practice, an AI model cannot simply “untrain” a specific video: its influence is entangled with everything else the model learned, and removing it would mean retraining from scratch.
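A toy illustration of why this is so (a minimal sketch, nowhere near how production models are actually trained): gradient updates from every training example flow into the same shared weights, so one example’s contribution cannot be cleanly subtracted afterwards — you can only retrain without it and get a different model.

```python
def train(data, lr=0.1, steps=200):
    """Fit y = w * x by plain gradient descent on squared error."""
    w = 0.0
    for _ in range(steps):
        for x, y in data:
            grad = 2 * (w * x - y) * x  # gradient of (w*x - y)^2
            w -= lr * grad
    return w

# Two "sources" (think: subtitles from two different videos)
# nudge the very same weight during training.
combined = [(1.0, 2.0), (2.0, 3.0)]
w_all = train(combined)

# Retraining without the second source yields a different weight.
# There is no cheap operation that recovers it from w_all alone.
w_without = train([(1.0, 2.0)])
print(abs(w_all - w_without) > 0.01)  # True: contributions are entangled
```

Scale that single weight up to billions of parameters and the point stands: deleting one creator’s videos from a finished model is not a lookup-and-remove operation.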
YouTube and Sora
Moreover, OpenAI executives have repeatedly declined to publicly answer questions about whether the company used YouTube videos to train its AI product Sora, which creates videos from text prompts. Earlier this year, a reporter with The Wall Street Journal put the question to Mira Murati, OpenAI’s chief technology officer. “I’m actually not sure about that,” Murati replied. That non-answer speaks for itself. So next time you upload a video of your Porsche to YouTube, be aware that Sora may well be trained on it, without your consent. According to those dataset generators, the use of YouTube content for AI training purposes qualifies as “fair use.” Yeah, you heard right: AI companies think that taking your videos to train on them is fair use. Oh, and without any compensation, meaning you get nothing for it. I have a question: WHERE ARE THE LAWYERS?
Possible solution: Marking and money!
First, YouTube needs to address this ASAP by notifying creators and explaining to them when their videos have been used for training without permission. Second, videos used for training should be marked, as should AI-generated imagery: every AI video must be labeled “Made by AI.” Third, creators should be compensated twice: by the dataset generators, and by the AI giants who train on those datasets. Only when these “voluntary” dataset generators have to pay money will they understand the consequences of harming the creators who feed them. It’s time to stop this circus. Take an example from Blackmagic.