Apple, Nvidia, Anthropic Accused Of Using YouTube Videos To Train AI Models Without Creators' Consent: 'This Is Going to Be An Evolving Problem For A Long Time,' Says MKBHD

Apple Inc. AAPL has been accused of using videos from YouTube, a subsidiary of Alphabet Inc. GOOGL, GOOG, to train its AI models without the creators’ consent.

What Happened: Tech YouTuber Marques Brownlee, also known as MKBHD, took to social media to voice his concerns about Apple’s use of YouTube content for AI training.

Brownlee revealed that Apple sourced data from various companies, one of which scraped data and transcripts from YouTube videos, including his own. Apple technically avoids fault because it did not do the scraping itself, but the issue is likely to persist, Brownlee noted.

“Apple technically avoids ‘fault’ here because they’re not the ones scraping. But this is going to be an evolving problem for a long time,” Brownlee wrote.

Apple has sourced data for their AI from several companies

One of them scraped tons of data/transcripts from YouTube videos, including mine

Apple technically avoids "fault" here because they're not the ones scraping

But this is going to be an evolving problem for a long time https://t.co/U93riaeSlY
— Marques Brownlee (@MKBHD) July 16, 2024

MKBHD wrote in another post, “Fun fact, I pay a service (by the minute) for more accurate transcriptions of my own videos, which I then upload to YouTube’s back-end. So companies that scrape transcripts are stealing paid work in more than one way. Not great.”


9to5Mac’s report, which Brownlee shared, disclosed that several tech giants, including Apple, trained their AI models using subtitle files downloaded by a third party from over 170,000 videos. This data included transcripts of videos from creators like Brownlee, MrBeast, PewDiePie, Stephen Colbert, John Oliver, and Jimmy Kimmel.

A Proof News investigation revealed that EleutherAI’s dataset, known as the Pile, was used by major companies like NVIDIA Corp. NVDA and Salesforce Inc. CRM for AI training.

Companies pursued this practice despite YouTube's regulations prohibiting the unauthorized harvesting of materials from the platform.

Apple, Nvidia, Google, and Anthropic did not immediately respond to Benzinga's request for comment.

Why It Matters: The issue of unauthorized content scraping for AI training has been a growing concern in the tech industry. Recently, OpenAI and Anthropic were reported to be ignoring web scraping rules, stirring controversy. These companies have allegedly bypassed the robots.txt protocol, a convention that lets website owners tell automated crawlers which parts of a site they may access. Compliance with it is voluntary, so it cannot by itself stop a crawler that chooses to ignore it.
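
For readers unfamiliar with robots.txt, the following is a minimal sketch of how a well-behaved crawler might consult the file before fetching a page, using Python's standard `urllib.robotparser` module. The robots.txt content, the crawler names (`ExampleAIBot`, `OtherBot`), and the example.com URLs are hypothetical and chosen only for illustration; they do not describe any company's actual crawler or any site's actual rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
# It blocks one named crawler everywhere and blocks all
# other crawlers from a single directory.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Report whether the rules allow this crawler to fetch the URL.

    This only reports what the file says; nothing here can force
    a crawler to obey the answer.
    """
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
for agent, url in [
    ("ExampleAIBot", "https://example.com/videos/1"),
    ("OtherBot", "https://example.com/videos/1"),
    ("OtherBot", "https://example.com/private/notes"),
]:
    print(agent, url, may_fetch(parser, agent, url))
```

The check runs entirely on the crawler's side, which is why a crawler that skips it, or disregards the result, faces no technical barrier from robots.txt alone.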

In response to such practices, Reddit Inc. RDDT recently updated its platform to block automated content scraping. This policy change led to a nearly 9% surge in Reddit’s stock value, highlighting the market’s sensitivity to data privacy issues.

Earlier, Meta Platforms Inc. META also faced challenges with data scraping, which led to legal actions against a Chinese company. This incident underscores the widespread nature of the problem across various social media platforms.

Additionally, Elon Musk has cited AI scraping as a reason for implementing tweet paywalls on X, Inc. (formerly Twitter Inc.). Users now need an account to read tweets, and those who wish to view more than 600 posts per day must pay for Twitter Blue access.


This story was generated using Benzinga Neuro and edited by Kaustubh Bagalkote
