OpenAI states that it is "impossible" to create ChatGPT without copyrighted content.

OpenAI, just weeks after being sued by the New York Times for allegedly copying and using "millions" of copyrighted news articles to train its large language models, told the House of Lords' Select Committee on Communications and Digital (via The Guardian) that its systems could not have been built without copyrighted material.

The large language models (LLMs) underlying AI systems like OpenAI's chatbot ChatGPT ingest vast amounts of data from online sources to "learn" how language works. That becomes problematic when the sources are copyrighted. The Times lawsuit, for example, states that Microsoft and OpenAI are "attempting to free-ride on the Times' massive investment in journalism by using it to build substitutive products without permission or payment."

The Times is not alone in objecting to this approach: a group of 17 writers, including John Grisham and George R.R. Martin, sued OpenAI in 2023, accusing it of "systematic theft on a mass scale."

In its submission to the House of Lords, OpenAI did not deny using copyrighted material, arguing instead that it is all fair use and that there is no alternative anyway. The company said: "It is impossible to train today's leading AI models without using copyrighted material, since copyright today covers virtually every sort of human expression, including blog posts, photographs, forum posts, scraps of software code, government documents, and so on.

"Limiting training data to public domain books and drawings created more than 100 years ago may yield interesting experiments, but will not provide AI systems that meet the needs of today's citizens.

I don't find this a particularly compelling argument. If I were arrested for robbing a bank and told the police it was the only way to get the money I needed, I doubt that would count for much with them. That is admittedly a simplistic view, and it is possible that an OpenAI attorney could successfully argue that the unauthorized use of copyrighted material to train LLMs falls within the scope of fair use. To my ears, though, the justification for using copyrighted material without the original authors' permission ultimately boils down to "but we really, really wanted to."

Fair use is central to OpenAI's position that its use of copyrighted works does not actually break any rules. In its submission to the House of Lords, OpenAI states that it "complies with the requirements of all applicable laws, including copyright law," and the response it published today addresses the point in more depth.

"Training AI models using publicly available Internet materials is fair use, as supported by longstanding and widely accepted case law," OpenAI wrote.

"We see this principle as fair for creators, necessary for innovators, and critical to U.S. competitiveness.

"The principle that AI model training qualifies as fair use is supported by a wide range of academics, library associations, civil society organizations, startups, major U.S. companies, creators, and authors who recently submitted comments to the U.S. Copyright Office. Other regions and countries, including the European Union, Japan, Singapore, and Israel, also have laws permitting training models with copyrighted content.

OpenAI also took a hard-line stance against the New York Times lawsuit, essentially accusing the Times of ambushing it during partnership negotiations. Perhaps taking a page from Twitter, which accused Media Matters of manufacturing "inorganic combinations of ads and content" to make ads from major advertisers appear next to pro-Nazi posts, OpenAI claimed that a central element of the Times' complaint rests on manipulated prompts, often including lengthy excerpts of articles, designed to get its model to regurgitate the Times' content. "Even when using such prompts, our models don't typically behave the way The New York Times insinuates, which suggests they either instructed the model to regurgitate or cherry-picked their examples from many attempts," OpenAI wrote.

In its submission to the House of Lords, OpenAI also says it "continues to develop additional mechanisms to allow rights holders to opt out of training" and is pursuing licensing deals such as the one it signed with the Associated Press in 2023. But to me, that sounds like an "ask forgiveness rather than permission" approach: since OpenAI is scraping this material anyway, publishers and the press would be better off signing some kind of deal before the courts rule that AI companies can do whatever they want.
