By Mayank Chhaya
The New York Times’s lawsuit against Microsoft and OpenAI for copyright infringement illustrates a remarkable lack of sagacity among leading tech players about the importance of owning original content themselves.
At the onset of research and development in artificial intelligence (AI), there should have been clear recognition among tech companies of the need to either create or own original content to train their large-language models (LLMs). It should have been anticipated that training those models would inevitably run into copyright infringement challenges, given that AI tech companies do not own any original content themselves.
The Times appears to be on firm ground when it claims in its complaint that “Defendants seek to free-ride on The Times’s massive investment in its journalism,” even as it accuses the defendants of “using The Times’s content without payment to create products that substitute for The Times and steal audiences away from it.”
It is undeniable that, when it comes to training their models on current affairs, AI companies such as OpenAI depend heavily on material produced by leading news organizations such as The New York Times. The paper invests large amounts in its journalists within the United States and around the world to offer firsthand news and perspectives. These efforts require a great deal of persistent journalistic expertise, often acquired over decades, coupled with professional credibility.
The Times’s claim is also underscored by the fact that while the paper’s market capitalization, after 172 years of existence, stands at $7.58 billion as of this month, OpenAI, which was founded on December 11, 2015, is now valued by investors at over $80 billion. At least some of that value has accrued from high-quality content created by news organizations like the Times.
The paper points out in its complaint, “Independent journalism is vital to our democracy. It is also increasingly rare and valuable. For more than 170 years, The Times has given the world deeply reported, expert, independent journalism. Times journalists go where the story is, often at great risk and cost, to inform the public about important and pressing issues. They bear witness to conflict and disasters, provide accountability for the use of power, and illuminate truths that would otherwise go unseen. Their essential work is made possible through the efforts of a large and expensive organization that provides legal, security, and operational support, as well as editors who ensure their journalism meets the highest standards of accuracy and fairness.”
“Defendants’ generative artificial intelligence (“GenAI”) tools rely on large-language models (“LLMs”) that were built by copying and using millions of The Times’s copyrighted news articles, in-depth investigations, opinion pieces, reviews, how-to guides, and more. While Defendants engaged in widescale copying from many sources, they gave Times content particular emphasis when building their LLMs—revealing a preference that recognizes the value of those works,” the complaint says.
The 69-page complaint, filed in Federal District Court in Manhattan, has the potential to fundamentally define the extent of AI systems’ ability to offer accurate and comprehensive information. Without access to original material such as that provided by the Times, OpenAI and other such services would find it impossible to create a sophisticated AI resource unless they themselves held comparably massive proprietary material of their own.
The Times has argued that Microsoft and OpenAI have “refused to recognize” the protection offered by copyright law. “Powered by LLMs containing copies of Times content, defendants’ GenAI tools can generate output that recites Times content verbatim, closely summarizes it, and mimics its expressive style, as demonstrated by scores of examples,” it has said.
Inevitably, the complaint also points out how deploying Times-trained LLMs has boosted Microsoft’s market cap. “Using the valuable intellectual property of others in these ways without paying for it has been extremely lucrative for Defendants. Microsoft’s deployment of Times-trained LLMs throughout its product line helped boost its market capitalization by a trillion dollars in the past year alone. And OpenAI’s release of ChatGPT has driven its valuation to as high as $90 billion,” it says.
With that as the backdrop, it is extraordinary that AI companies such as OpenAI do not appear to have paid much attention to the question of what they would do when it came to drawing on published material to train their LLMs. It is not clear whether any major corporation, in conjunction with developing AI, had plans to acquire news organizations with a proven record of credibility.
“The Times objected after it discovered that Defendants were using Times content without permission to develop their models and tools. For months, The Times has attempted to reach a negotiated agreement with Defendants, in accordance with its history of working productively with large technology platforms to permit the use of its content in new digital products (including the news products developed by Google, Meta, and Apple). The Times’s goal during these negotiations was to ensure it received fair value for the use of its content, facilitate the continuation of a healthy news ecosystem, and help develop GenAI technology in a responsible way that benefits society and supports a well-informed public,” the paper says in its complaint.
It also makes an important point about the tech companies’ claim of fair use. “These negotiations have not led to a resolution. Publicly, Defendants insist that their conduct is protected as “fair use” because their unlicensed use of copyrighted content to train GenAI models serves a new “transformative” purpose. But there is nothing “transformative” about using The Times’s content without payment to create products that substitute for The Times and steal audiences away from it. Because the outputs of Defendants’ GenAI models compete with and closely mimic the inputs used to train them, copying Times works for that purpose is not fair use.”
The lawsuit also has the potential to significantly slow the rapid evolution of AI seen in recent months, to the extent that it may even affect AI companies’ business viability.
The Times story about the lawsuit, by Michael M. Grynbaum and Ryan Mac, quoted Lindsey Held, an OpenAI spokeswoman, as saying that, while the company was “surprised and disappointed” by the lawsuit, it had been “moving forward constructively” in conversations with The Times.
She said, “We respect the rights of content creators and owners and are committed to working with them to ensure they benefit from A.I. technology and new revenue models. We’re hopeful that we will find a mutually beneficial way to work together, as we are doing with many other publishers.”
Microsoft did not comment on the lawsuit.