Thu Nov 21 2024

OpenAI Accidentally Deleted Evidence That Could Be Relevant to the NY Times' Copyright Lawsuit Against It

In a document filed with the court, lawyers for The NY Times and Daily News argue that OpenAI accidentally deleted evidence that could be relevant to their case against the company.

The New York Times and Daily News are suing OpenAI, accusing the company of using their content to train its artificial intelligence models without proper authorization. Their lawyers now say that OpenAI engineers accidentally deleted data that could be relevant to the case. This fall, OpenAI agreed to provide two virtual machines so that the publishers' lawyers could search the AI training datasets for their protected content. (Virtual machines are software-based computers that run within another computer's operating system and are commonly used for testing, data backup, and running applications.)

In a letter submitted to the U.S. District Court for the Southern District of New York, the publishers' lawyers reported that they and their hired experts have spent more than 150 hours since November 1 searching OpenAI's training data. On November 14, however, OpenAI engineers erased all of the publishers' search data that had been stored on one of the virtual machines.

OpenAI attempted to recover the information and, while it had some success, the folder structure and file names were "irretrievably lost," meaning that the recovered data "cannot be used to determine where the copied articles from the publishers were used in building OpenAI's models," according to the document. Representatives of the publications said they have been forced to redo much of their work, at a significant cost in time and computing resources. The letter notes that the legal team and experts only recently learned the recovered data was unusable, and that an entire week of work must now be repeated.

The plaintiffs' legal team says it has no reason to believe the deletion was intentional, but emphasizes that the episode demonstrates that OpenAI "is in the best position to search its own datasets" for potentially infringing content, using its own tools. A spokesperson for OpenAI declined to comment on the matter.

In various cases, OpenAI has maintained that training its models on publicly available data, including articles from The Times and Daily News, constitutes fair use. In other words, in building models like GPT-4, which learn from billions of examples of books, essays, and other works, OpenAI argues it does not need to license or pay for those examples, even when it generates revenue from the resulting models. Nevertheless, it has struck licensing agreements with a growing number of publishers, including the Associated Press, Axel Springer, the Financial Times, Dotdash Meredith, and News Corp. Although OpenAI has not made the terms of these agreements public, it has been reported that Dotdash receives at least $16 million a year. The company has neither confirmed nor denied that it used copyrighted works to train its artificial intelligence systems without proper authorization.