ChatGPT-parent OpenAI’s assertion that it was “impossible” to train leading AI models without using copyrighted materials has been challenged by a group of French researchers and a U.S. startup.
What Happened: OpenAI, a prominent player in the AI industry, had previously stated that it was impossible to train advanced AI models without using copyrighted materials. That stance has been widely accepted across the AI community, even as the practice of training on copyrighted material has drawn a series of lawsuits alleging infringement.
However, two announcements made earlier this week point the other way.
A group of researchers backed by the French government has released what is believed to be the largest AI training dataset composed entirely of public-domain text. Separately, the non-profit Fairly Trained has awarded its first certification for a large language model built without copyright infringement to 273 Ventures, a Chicago-based legal tech consultancy startup, challenging the industry's norm, reported Wired.
Ed Newton-Rex, CEO of Fairly Trained, stated, “There's no fundamental reason why someone couldn't train an LLM fairly.” The non-profit offers certification to companies that can prove their AI models were trained on data they own, have licensed, or that is in the public domain.
273 Ventures’ large language model, KL3M, was developed using a curated training dataset of legal, financial, and regulatory documents.
Jillian Bommarito, co-founder of 273 Ventures, explained, “Our test is to see if it is even possible.” The company built its own training dataset, the Kelvin Legal DataPack, which includes thousands of legal documents reviewed for compliance with copyright law.
Why It Matters: This development challenges the prevailing industry norm of using copyrighted materials to train AI models. It also aligns with global efforts to regulate AI data usage.
In January 2024, OpenAI told U.K. lawmakers that creating services like ChatGPT would be “impossible” without the ability to use copyrighted works, reported The Telegraph.
In 2023, China proposed a blacklist of sources that cannot be used for training generative AI models, including censored content on the Chinese internet. Meanwhile, India has taken measures to allow only trusted AI models to access its datasets, aiming to counter global data misuse.
Earlier this month, Elon Musk expressed concerns about the data OpenAI used for its AI model Sora, after an interview with the company's CTO, Mira Murati, raised questions about the ChatGPT maker's data sourcing strategies.
Disclaimer: This content was partially produced with the help of Benzinga Neuro and was reviewed and published by Benzinga editors.