OpenAI Wins First Round in Copyright Lawsuit vs. Raw Story & AlterNet Media
The case alleged OpenAI violated the DMCA by removing CMI from articles used to train ChatGPT
OpenAI recently achieved a preliminary legal victory in a copyright lawsuit brought by news organizations Raw Story Media Inc. and AlterNet Media Inc.
The case, filed in February 2024, focused on allegations that OpenAI violated the Digital Millennium Copyright Act (DMCA) by removing Copyright Management Information (CMI) from thousands of articles used to train its AI model, ChatGPT.
While the court ultimately dismissed the case, the decision leaves room for potential further action.
Lawsuit Details: OpenAI vs. Raw Story & AlterNet
What is CMI?
CMI, or Copyright Management Information, includes essential details that help identify and manage ownership rights for a work.
CMI typically includes author names, article titles, copyright notices, and licensing information.
Removing CMI without authorization can make it harder to trace or correctly attribute content, which is why it’s protected under the DMCA.
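For illustration only, the CMI accompanying a single news article might look something like the following; the field names and values here are hypothetical:

```python
# Hypothetical example of CMI fields that might accompany a news article.
article_cmi = {
    "title": "Example Headline",
    "author": "Jane Reporter",
    "publication": "Example Media Inc.",
    "copyright_notice": "© 2024 Example Media Inc. All rights reserved.",
    "license": "Reprint requires written permission",
    "canonical_url": "https://example.com/example-headline",
}
```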
Raw Story and AlterNet alleged that OpenAI improperly removed CMI from their articles during the training of ChatGPT.
They argued that OpenAI’s removal of CMI violated Section 1202(b)(1) of the DMCA, which prohibits intentionally removing or altering CMI while knowing, or having reason to know, that doing so will facilitate or conceal copyright infringement.
The plaintiffs sought legal relief based on the concern that their content might be reproduced or reflected in ChatGPT outputs without attribution or compensation.
Court’s Ruling: Dismissal on November 7, 2024
On November 7, 2024, U.S. District Judge Colleen McMahon dismissed the case in favor of OpenAI.
The court cited two primary reasons for this decision:
Lack of Concrete Harm
The court ruled that the plaintiffs did not demonstrate a “cognizable injury” from OpenAI’s alleged removal of CMI. Judge McMahon noted that without evidence of unauthorized dissemination or financial loss, the claim of harm was “too abstract” to meet the DMCA’s standards.
This sets a high bar for future cases, implying that plaintiffs would need specific evidence of harm beyond the mere removal of CMI.
Speculative Claims of Future Harm
Judge McMahon found the plaintiffs’ arguments about potential future harm speculative.
The court concluded that the risk of ChatGPT reproducing articles verbatim was “remote,” and the plaintiffs failed to prove a “substantial risk” of direct plagiarism.
The court suggested that without tangible examples of copyright infringement, claims about future violations were not sufficient grounds for legal remedy.
Judge’s Key Observation
Judge McMahon indicated that the core issue might not be the removal of CMI, but rather the use of the articles without compensation.
This observation highlights a potentially different concern about AI training practices and raises questions about compensation for content creators.
However, the DMCA provision cited by the plaintiffs does not directly address compensation, which was beyond the court’s purview in this case.
Likelihood of Case Continuation
While the case was dismissed, the dismissal was without prejudice, meaning the plaintiffs retain the option to amend and refile their complaint.
Matt Topic, the plaintiffs’ attorney, expressed confidence that they could address the court’s concerns in a revised complaint, suggesting a high likelihood that the case could continue. If refiled, the plaintiffs would likely focus on:
Demonstrating Specific Harm: Future arguments may include examples of financial losses or unauthorized dissemination directly attributed to OpenAI’s actions.
Showing Evidence of Plagiarism: Plaintiffs would need to prove that ChatGPT’s responses contain substantial portions of their copyrighted work, showing a concrete risk of reproduction without attribution.
Broader Implications & Future Lawsuits
This ruling has significant implications for similar lawsuits that OpenAI and other AI developers face from media outlets like The New York Times and The Intercept.
Judge McMahon’s decision highlights the difficulty of proving direct injury from CMI removal alone, which could shape the outcomes of similar cases.
However, the ruling leaves room for potential compensation-based claims, which may open new paths for copyright protection in the AI industry.
Estimated Litigation Costs
Specific litigation costs for this case have not been disclosed.
Factors impacting the cost likely include:
Duration and Complexity: Since the case was dismissed early, litigation costs were likely lower than they would have been in a more protracted case.
Attorney Fees: Costs also vary based on the rates of the law firms involved and the complexity of the arguments.
Early-stage dismissals typically incur lower costs, but without official disclosures, precise figures remain unknown.
How Raw Story & AlterNet Media Suspected OpenAI Used Their Data
In February 2024, Raw Story Media and AlterNet Media filed a lawsuit against OpenAI, suspecting that their content had been used to train ChatGPT.
Unlike other similar lawsuits, the plaintiffs did not provide specific examples of stories they claimed were copied.
Instead, they based their claims on "recreations"—outputs from ChatGPT that, in their view, suggested their content had been used in the training data.
Legal Strategy & Approach
Raw Story and AlterNet took a unique approach compared to traditional copyright lawsuits:
DMCA Violations: Rather than arguing copyright infringement, the lawsuit focused on violations of the Digital Millennium Copyright Act (DMCA) concerning the removal of Copyright Management Information (CMI)—such as author names, titles, and copyright notices.
Evidence Basis: Their claims were rooted in an “extensive review” of publicly available information, which they argued indicated “thousands” of their copyrighted works had been used in OpenAI’s datasets.
Key Distinctions from Other Cases
In contrast to cases like that of The New York Times, where millions of specific articles were identified, the approach by Raw Story and AlterNet:
Lacked Concrete Evidence: They did not provide specific examples of their content within ChatGPT’s training data.
Focus on CMI: The primary concern was the alleged removal of copyright information, rather than the actual reproduction of content itself.
Impact of the Lack of Specific Evidence
This absence of concrete proof may have contributed to the case’s dismissal.
Judge Colleen McMahon noted that the plaintiffs failed to demonstrate any “concrete injury” resulting from OpenAI’s actions, emphasizing that without tangible examples or evidence of content misuse, the alleged harm was too abstract to support their claim.
Can OpenAI just hide the fact that it trained its AI on websites & media sources (assuming doing so were illegal)?
Yes, OpenAI and similar companies could technically employ several methods to make it difficult to determine the exact sources used in training, even if specific copyrighted data was part of the dataset.
Here are some ways this could be done, and why, despite these methods, suspicion could still arise:
1. Aggregation and Data Mixing
When models are trained on massive, aggregated datasets sourced from across the internet, the contribution of any individual source becomes diluted.
This aggregation makes it challenging to pinpoint individual sources, as the model doesn’t "remember" documents in full but learns patterns, phrasing, and general knowledge distributed across many sources.
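A minimal sketch of what such aggregation could look like in a data pipeline, using hypothetical source names; the point is that provenance labels are discarded before training:

```python
import random

# Hypothetical sources feeding one training corpus; names are illustrative only.
sources = {
    "news_site_a": ["article text 1...", "article text 2..."],
    "news_site_b": ["article text 3..."],
    "web_crawl": ["page text 4...", "page text 5...", "page text 6..."],
}

# Flatten everything into a single list of documents, dropping the source labels,
# then shuffle so documents from different outlets are interleaved.
corpus = [doc for docs in sources.values() for doc in docs]
random.shuffle(corpus)

# Downstream training only sees an undifferentiated stream of text, which is why
# pinpointing any individual source from the corpus alone is difficult.
```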
2. Data Anonymization
OpenAI could preprocess training data to strip metadata, such as author names, publication titles, and copyright notices.
This would make it difficult to trace outputs back to specific sources.
It’s a common technique to anonymize data to avoid direct associations with copyrighted material.
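A minimal sketch of what stripping CMI-like lines could look like; the regular expressions below are illustrative assumptions, not OpenAI’s actual preprocessing:

```python
import re

# Illustrative patterns for lines that typically carry CMI (bylines, copyright notices).
CMI_PATTERNS = [
    re.compile(r"^By\s+[A-Z][\w.'-]+(?:\s+[A-Z][\w.'-]+)*\s*$", re.MULTILINE),          # "By Jane Reporter"
    re.compile(r"^\s*(?:©|\(c\)|copyright)\s+\d{4}.*$", re.IGNORECASE | re.MULTILINE),  # copyright notices
    re.compile(r"^\s*all rights reserved\.?\s*$", re.IGNORECASE | re.MULTILINE),
]

def strip_cmi(text: str) -> str:
    """Remove lines that look like Copyright Management Information."""
    for pattern in CMI_PATTERNS:
        text = pattern.sub("", text)
    return re.sub(r"\n\s*\n", "\n", text).strip()  # collapse the blank lines left behind

article = "Example Headline\nBy Jane Reporter\n© 2024 Example Media Inc.\nBody of the article..."
print(strip_cmi(article))  # -> "Example Headline\nBody of the article..."
```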
3. Transformation of Text Data
Some companies could apply transformations, such as rephrasing or paraphrasing, to reduce verbatim text within datasets.
Even if original data were part of the model's training, the AI would be less likely to output responses that exactly replicate any one source.
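One way to quantify this, sketched below with made-up example strings, is to measure how many of the original's word n-grams survive verbatim after a rewriting pass; the paraphrasing itself would be done by some separate tool or model, which is omitted here:

```python
# Helpers to measure how much verbatim n-gram overlap remains after a rewriting pass.
def ngram_set(text: str, n: int = 5) -> set:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def verbatim_overlap(original: str, transformed: str, n: int = 5) -> float:
    """Fraction of the original's n-grams that appear unchanged in the transformed text."""
    original_ngrams = ngram_set(original, n)
    if not original_ngrams:
        return 0.0
    return len(original_ngrams & ngram_set(transformed, n)) / len(original_ngrams)

original = "The city council voted on Tuesday to approve the new transit budget."
rewritten = "On Tuesday the council approved a new budget for public transit."
print(verbatim_overlap(original, rewritten))  # near 0.0 -> little verbatim carry-over
```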
4. Parameter Constraints and Filtering
AI companies can also control the model’s behavior by setting constraints to avoid generating responses that too closely mimic training data.
For example, they could filter high-risk outputs that too closely resemble known publications.
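A minimal sketch of such an output filter, assuming a small reference set and an arbitrary similarity threshold (both are assumptions, not a known OpenAI mechanism):

```python
from difflib import SequenceMatcher

# Hypothetical reference set of known published texts the operator wants to avoid echoing.
REFERENCE_ARTICLES = [
    "Full text of a known published article would go here...",
]

def too_similar(candidate: str, references: list[str], threshold: float = 0.8) -> bool:
    """Return True if the candidate output is near-verbatim to any reference text."""
    return any(
        SequenceMatcher(None, candidate.lower(), ref.lower()).ratio() >= threshold
        for ref in references
    )

def filtered_response(candidate: str) -> str:
    # Withhold responses that reproduce a known publication too closely.
    if too_similar(candidate, REFERENCE_ARTICLES):
        return "[response withheld: too close to a known publication]"
    return candidate
```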
Overall: Despite these methods, suspicion could still arise if the AI consistently reflects the language style, structure, or specific phrasing characteristic of certain sources.
Moreover, as models become larger and handle enormous datasets, it’s hard to guarantee that distinctive patterns from prominent sources won’t unintentionally surface.
Additionally:
Textual Similarity Detection: If an AI generates outputs that are particularly similar to existing articles in tone, structure, or content specificity, journalists and legal teams can sometimes identify overlaps. Techniques for detecting textual similarity (see the sketch after this list) can make even anonymized or transformed text identifiable.
Legal Scrutiny and Transparency: The legal push for transparency about training data sources may force companies to disclose more information, especially for datasets involving substantial copyrighted material.
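As a rough illustration of the similarity-detection point above, a publisher could compare a chatbot output against its own archive using TF-IDF cosine similarity; the snippet below assumes scikit-learn is installed, and the texts and threshold are placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder texts: the publisher's archive and one AI-generated output to check.
publisher_articles = [
    "Original article text from the publisher's archive...",
    "Another original article text...",
]
ai_output = "Text generated by the chatbot that the publisher wants to check..."

vectorizer = TfidfVectorizer(ngram_range=(1, 3))        # compare phrases, not just single words
matrix = vectorizer.fit_transform(publisher_articles + [ai_output])
scores = cosine_similarity(matrix[-1], matrix[:-1])[0]  # AI output vs. each archived article

for article, score in zip(publisher_articles, scores):
    if score > 0.5:  # threshold chosen arbitrarily for illustration
        print(f"Possible overlap (score {score:.2f}): {article[:60]}...")
```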
Thus, while OpenAI can make it difficult to detect specific sources, the scale of its data use means there are often indirect signs that can prompt copyright holders to investigate.
Is it legal to use content from websites & media sources to train AI models?
The legal status of using articles for AI training is still a gray area, and it largely depends on how courts interpret "fair use" under copyright law, which hasn’t been fully defined for AI training cases.
1. Fair Use Doctrine
Fair use is a legal principle that allows limited use of copyrighted material without permission for purposes like commentary, criticism, research, and transformative uses. If using articles to train an AI like ChatGPT is deemed "transformative" (changing the purpose or nature of the original work), it might be covered under fair use.
The argument for fair use in AI training is that models don’t store or replicate entire articles but learn patterns, language structure, and information that are used in novel ways when generating responses. Proponents argue that this transformation is comparable to a person reading and learning without direct reproduction.
2. Commercial Use vs. Transformative Purpose
Opponents of unlicensed data use argue that even if AI outputs don’t directly copy text, the model is trained to perform tasks that generate commercial value, which makes it different from casual or academic reading.
Supporters counter that, because the AI doesn’t sell or reproduce specific articles, the practice is closer to using information to generate novel outputs than to reselling original content. This is similar to how a person might learn from a textbook and then apply that knowledge in various ways without copying it verbatim.
3. Lack of Precedent in Courts
Since AI training involves the complex interaction of vast amounts of data, courts haven’t yet solidified a stance on whether this is a "fair use" or a violation of copyright. Current cases, like those against OpenAI, could set important precedents.
Courts may need to determine whether using copyrighted articles for AI training represents a “transformative use” or whether it oversteps into direct copyright infringement because the AI generates commercial outputs.
4. Public Perception and Legal Risk
While some compare AI training to a form of fair use, akin to reading and learning, others argue it’s different because it scales up that process commercially, potentially impacting content creators by decreasing the need for original writing.
Pending lawsuits could push AI companies to either negotiate licensing agreements or clarify fair use boundaries, especially as copyright holders push back against unlicensed use of their content.
The Legality of Using Copyrighted Content to Train AIs
The legality of using copyrighted content for AI training remains unsettled, and upcoming legal rulings could reshape AI training practices.
If courts decide it’s akin to fair use, AI companies could continue without licensing.
But if they rule it infringes copyright, companies like OpenAI might need to license content for training, redefining industry standards for training large language models.