Mark Zuckerberg authorized Meta's Llama team to train on copyrighted works, court filing claims.
Meta's CEO, Mark Zuckerberg, has authorized the company's Llama team to conduct training using copyrighted documents, according to a recent court filing.
The plaintiffs' lawyers in a copyright case against Meta have claimed that the company's CEO, Mark Zuckerberg, authorized the team responsible for the Llama artificial intelligence models to use a dataset that includes pirated e-books and articles for training. The case, Kadrey v. Meta, is one of many against tech companies developing artificial intelligence, in which they are accused of training models using copyrighted works without proper permission.
Defendants in these cases, including Meta, have invoked the fair use doctrine, which allows the use of protected works to create something new as long as the result is sufficiently transformative. However, many content creators dispute this defense.
In recently unredacted documents submitted to the U.S. District Court for the Northern District of California, the plaintiffs, who include bestselling authors Sarah Silverman and Ta-Nehisi Coates, recount testimony Meta gave last year, in which it was revealed that Zuckerberg approved the use of a dataset known as LibGen for training Llama models. LibGen, which describes itself as a "link aggregator," provides access to protected works from publishers like Cengage Learning, Macmillan Learning, McGraw Hill, and Pearson Education. The platform has been sued numerous times and has faced shutdown orders and multimillion-dollar fines for copyright infringement.
Citing Meta's testimony, the plaintiffs' lawyers assert that Zuckerberg authorized the use of LibGen despite concerns from Meta's AI executive team and others within the company. The documents quote Meta employees referring to LibGen as a "dataset we know is pirated" and warning that its use "could undermine [Meta's negotiating position with regulators]." Additionally, a memorandum addressed to decision-makers at Meta AI indicated that, after "the escalation to MZ," the Meta AI team received approval to use LibGen. ("MZ" is a fairly obvious abbreviation for Mark Zuckerberg.)
These details appear to align with previous reports indicating that Meta had taken shortcuts to amass data for its AI. At one point, the company hired contractors in Africa to summarize books and considered acquiring the publisher Simon & Schuster. Meta executives ultimately concluded that negotiating licenses would take too long and opted instead to rely on the fair use argument.
Furthermore, the latest allegations suggest that Meta may have attempted to conceal its alleged infringement by removing attribution from the LibGen data. According to the plaintiffs' lawyers, a Meta engineer, Nikolay Bashlykov, wrote a script to remove copyright information from the e-books on LibGen. Separately, it is alleged that Meta also deleted copyright markers from scientific journal articles and "source metadata" in the training data used for Llama.
The new filing states that, during depositions, Meta revealed that it used torrenting to download LibGen, which raised concerns among some of the company's research engineers. Torrenting requires the downloader to simultaneously "seed," or upload, the files they are attempting to obtain. The plaintiffs' lawyers argue that, by torrenting LibGen and thus helping to distribute its content, Meta engaged in an additional form of copyright infringement.
The plaintiffs' lawyers also accuse Meta of downplaying the number of files it uploaded. They claim that Ahmad Al-Dahle, Meta's head of generative AI, "shortened the path" for the use of LibGen torrents, disregarding Bashlykov's reservations about the potential legal implications. The plaintiffs' lawyers wrote that "if Meta had purchased the plaintiffs' works at a bookstore or borrowed them from a library and trained its Llama models without a license, it would have committed copyright infringement."
The case against Meta is still developing. For now, it concerns only the earlier Llama models and not the more recent versions, and the court could rule in Meta's favor if it accepts the company's fair use argument. However, the allegations do not paint the company in a favorable light. The judge presiding over the case, Vince Chhabria, noted that Meta's request to redact parts of the filing seemed aimed more at avoiding bad publicity than at protecting sensitive business information.