OpenAI And Anthropic Allegedly Ignore Web Scraping Rules, Stirring Controversy

Two of the world’s leading AI startups, OpenAI and Anthropic, are reportedly disregarding requests from media publishers to cease scraping their web content for free model training data.

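What Happened: OpenAI and Anthropic are reportedly either ignoring or bypassing robots.txt, a long-standing web convention that tells automated crawlers which parts of a site they should not access, Business Insider reported. Compliance with robots.txt is voluntary, so the file only works when crawlers choose to honor it.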
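For readers unfamiliar with the mechanism, the sketch below shows how a well-behaved crawler might consult robots.txt before fetching a page, using Python's standard `urllib.robotparser` module. The robots.txt contents, the example.com URL, the `is_allowed` helper and the crawler names used here are illustrative assumptions, not details taken from the TollBit findings or from any publisher's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve. The user-agent names
# below are illustrative assumptions for this sketch.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network request is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/news/story.html"  # placeholder URL
    for agent in ("GPTBot", "ClaudeBot", "SomeOtherBot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, url))
```

Nothing in this check is enforced by the web server: a crawler that skips it, or that identifies itself under a different user-agent name, can still request the page, which is the kind of behavior the report alleges.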

This has been brought to light by TollBit, a startup aiming to facilitate paid licensing deals between publishers and AI companies.

Despite public statements from OpenAI and Anthropic that they respect robots.txt and blocks to their specific web crawlers, TollBit’s findings suggest otherwise.

A Forbes report indicated that Perplexity AI, which is backed by Nvidia Corp. and Amazon.com Inc. founder Jeff Bezos, is also disregarding the instructions in publishers’ robots.txt files, as OpenAI and Anthropic are alleged to be doing.

Microsoft Corp.-backed OpenAI, the company behind the popular chatbot ChatGPT, has previously struck content-access deals with publishers including Axel Springer and News Corp. Separately, the U.S. Copyright Office is expected to update its guidance on AI and copyright later this year.

Why It Matters: The alleged actions of OpenAI and Anthropic are in line with a broader trend of AI companies seeking high-quality data for their models. That demand has reportedly led some companies to disregard established web conventions such as robots.txt, sparking controversy within the AI and publishing industries.

Earlier in May, OpenAI made headlines for its multiyear partnership with News Corp, which granted OpenAI access to the media company’s news content. This move was seen as a significant step in the AI industry’s quest for high-quality training data.

However, OpenAI’s alleged disregard for robots.txt and similar rules raises questions about the ethical and legal implications of using web content for AI training data, especially content that is under copyright or owned by creators.

Disclaimer: This content was partially produced with the help of Benzinga Neuro and was reviewed and published by Benzinga editors.

Photo courtesy: Shutterstock
