ViciosGames

A lawsuit filed by The New York Times in Manhattan federal court last week alleges that defendants Microsoft and OpenAI used millions of its articles to train and create their large language models (LLMs) and other products. The Times is seeking damages on the scale of billions of dollars, but no specific figure has been disclosed.

If The Times prevails, however, it stands to receive a very large payout.

"The law does not recognize systematic and competitive infringement such as that committed by the defendants," the official complaint (PDF warning) states.

"This lawsuit seeks to hold defendants liable for billions of dollars in statutory and actual damages owed for the illegal copying and use of the Times' own valuable works."

The lawsuit states that The New York Times had been negotiating with the defendants "for months" and was trying to reach an agreement "in accordance with its history of working productively with large technology platforms to permit use of their content in new digital products." According to the court documents, the aim was to obtain fair value for The Times' contributions, given the emphasis placed on its content during training, to "promote the continuation of a healthy news ecosystem in a responsible manner that benefits society and supports a well-informed public," and to "support the development of GenAI technology."

Meanwhile, a statement from OpenAI spokesperson Lindsey Held, quoted in The New York Times article itself, said that the company believes the negotiations were constructive and is "surprised and disappointed" by the lawsuit.

"We are hopeful, as we have done with many other publishers, that we can find a mutually beneficial way to work together," they are quoted as saying.

One of the most intriguing parts of the lawsuit, and one that undoubtedly infuriated The Times, is that OpenAI appears to have placed particular emphasis on the publisher's content during LLM training.

In particular, GPT-3's training used nearly 210,000 proprietary New York Times URLs in one of its key datasets, one weighted as high quality. The lawsuit states that this represents 1.23% of all sources in that dataset.

Moreover, the largest and most heavily weighted dataset used to train GPT-3 contains "at least 16 million unique records of Times content across News, Cooking, Wirecutter, and The Athletic."

Furthermore, OpenAI itself states that the datasets it considers to be of the highest quality are sampled more frequently during model training. The complaint argues that, by OpenAI's own admission, "High quality content, including content from The Times, was more important and valuable for training the GPT model compared to content obtained from other lower quality sources."

This is not the first lawsuit against OpenAI over copyright infringement in LLM training. Seventeen authors, including George R.R. Martin and John Grisham, have filed suit against the company for "massive systematic plagiarism." The generative AI image maker Stability AI, creator of Stable Diffusion, also faces a lawsuit filed by Getty for using the company's images to train its models, The Times noted.

And this is unlikely to be the last lawsuit against an AI maker. Given AI companies' seeming reluctance to address copyright infringement and fair compensation for the material used to train their multi-billion dollar products, legal proceedings are likely to be one of the few ways to rein them in.

NY Times Sues OpenAI and Microsoft: "We Are Owed Billions of Dollars for Piracy and Use of the Times' Own Valuable Works"
