The dataset contains transcripts of videos uploaded to YouTube by some of the most popular content creators on the network.
A new investigation by Proof News has found that some of the world’s most prominent technology companies trained their artificial intelligence (AI) models on a dataset containing transcripts of more than 173,000 YouTube videos, without permission. The dataset was created by EleutherAI, a nonprofit organization, and includes transcripts of videos from more than 48,000 YouTube channels. It was used by a number of companies, including Apple, NVIDIA, and Anthropic. The findings highlight an uncomfortable truth about AI: the technology is built largely on the backs of data taken from creators without their knowledge or compensation.
The dataset does not contain any videos or images from YouTube. It does, however, include video transcripts from some of the platform’s most popular creators, such as Marques Brownlee and MrBeast, as well as from major news publishers, such as The New York Times, the BBC, and ABC News. Subtitles from newtechmania’s own videos are also part of the dataset.
Brownlee wrote on X that Apple has sourced data for its AI from several companies. “One of them scraped tons of data/transcripts from YouTube videos, including mine,” he wrote, adding that “this is going to be an evolving problem for a long time.”
Apple has sourced data for their AI from several companies
One of them scraped tons of data/transcripts from YouTube videos, including mine
Apple technically avoids "fault" here because they're not the ones scraping
But this is going to be an evolving problem for a long time https://t.co/U93riaeSlY
Marques Brownlee (@MKBHD), July 16, 2024
According to a Google spokesperson, earlier statements by YouTube CEO Neal Mohan, who said that companies using YouTube’s data to train AI models would be violating the platform’s terms of service, still stand. Apple, NVIDIA, Anthropic, and EleutherAI did not respond to newtechmania’s requests for comment.
Until now, AI companies have not been forthcoming about the data they use to train their models. Earlier this month, artists and photographers criticized Apple for failing to disclose the source of the training data for Apple Intelligence, the company’s own take on generative AI, which will be available on millions of Apple devices this year.
YouTube in particular, as the world’s largest collection of videos, is a treasure trove of not only transcripts but also audio, video, and images, which makes it an appealing source of data for training AI models. Earlier this year, Mira Murati, OpenAI’s chief technology officer, avoided answering questions from The Wall Street Journal about whether the company used YouTube videos to train Sora, its upcoming AI video generation tool. “I’m not going to go into the details of the data that was used, but it was data that was licensed or publicly available,” Murati said at the time. Sundar Pichai, the CEO of Alphabet, has also said that companies using data from YouTube to train their AI models would be violating the platform’s terms of service.
If you want to check whether subtitles from your own YouTube videos, or from your favorite channels, are included in the dataset, head over to Proof News’ lookup tool.