
The Intellectual Property Dilemma in AI Training

Generative AI technologies have evolved rapidly, ushering in an era of unparalleled creativity, opportunity, and challenge. As text generators such as ChatGPT and image generators like Midjourney and DALL·E revolutionize industries including publishing, filmmaking, marketing, education, and software development, a controversial and still largely unresolved legal and ethical issue lingers beneath the surface of these innovations: the use of copyrighted and proprietary content to train AI models.

This article examines the inherent tension between IP holders and AI companies. It discusses the legal, ethical, and commercial problems with using protected works for AI training, surveys recent lawsuits and rulings, and highlights international differences in approach. As litigation multiplies and public opinion continues to evolve, an essential question remains: can AI innovation continue without violating creators’ rights, or are we approaching an inevitable legal confrontation?

The Mechanics of AI Training and the Use of IP

At the base of generative AI is machine learning (ML), and more specifically a subset of ML called deep learning. Large language models (LLMs) and other generative systems require vast datasets of text such as books and articles, often alongside multimedia data like images, audio, and video. This material is fed into neural networks that learn linguistic patterns, image features, and structural representations.
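The learning step can be illustrated with a toy example. The sketch below is a minimal illustration in plain Python, not any vendor’s actual pipeline: a bigram table that records which word tends to follow which. Real LLMs learn vastly richer patterns, but the principle of absorbing statistical structure from ingested text is the same.

```python
# Toy sketch (illustrative only): a bigram model that learns which word
# tends to follow which, the simplest form of the statistical
# pattern-learning that underlies LLM training.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate".split()  # stand-in for scraped text

# Count how often each word follows each preceding word.
follows: defaultdict[str, Counter] = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict(word: str) -> str:
    """Return the word most often seen after `word` during training."""
    return follows[word].most_common(1)[0][0]

print(predict("the"))  # -> "cat", a pattern absorbed from the training text
```

Disputes arise because the corpus in real systems is not a nursery rhyme but, allegedly, millions of books, articles, and images that someone owns.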

The controversy centers on the kinds of datasets being used. Companies maintain that some of their data is publicly available or properly licensed. However, mounting evidence shows that many companies scrape content indiscriminately from the internet, including a significant amount of copyrighted or otherwise protected material. These datasets often include articles from news sites where journalists create original IP, research from scientific journals published without the authors’ consent, content from subscription-only platforms behind paywalls, and books from pirated repositories.

AI developers rarely disclose their training datasets in full, citing proprietary concerns. This lack of transparency fuels the suspicion and anger of content creators who see their own work, or their styles, reflected in AI outputs without credit or compensation.

Fair Use and the Legal Dynamic in the United States

The U.S. copyright regime includes a doctrine called “fair use,” which allows unlicensed uses of copyrighted content in certain cases. Courts consider a number of factors in determining whether a use is fair:
– The purpose and character of the use (commercial/educational, transformative/derivative).
– The nature of the copyrighted work.
– The amount and substantiality of the portion used.
– The effect of the use upon the market for or value of the original work.

AI developers and their proponents argue that training constitutes fair use because models do not reproduce the original content but analyze patterns to create new text or images; on this view, training is quintessentially transformative. Critics counter that the scale and commercial nature of these models, together with the market harm they cause, tip the balance against fair use.

Discussion of Noteworthy U.S. Lawsuits

Several U.S. lawsuits have galvanized the legal debate on these issues:

Silverman et al. v. Meta Platforms (2023–2025)

In 2023, authors Sarah Silverman, Christopher Golden, and Richard Kadrey sued Meta, alleging that its LLaMA AI model had been trained on pirated copies of their books from LibGen, a notorious repository of unauthorized digital content. Internal Meta documents later revealed that the company had knowingly used data pipelines that ingested pirated material.

In June 2025, U.S. District Judge Vince Chhabria granted Meta summary judgment because the authors had not shown evidence of direct market harm. While acknowledging that LLaMA’s training was likely transformative, the court emphasized that the ruling did not confer blanket immunity on AI developers. Chhabria stressed that future cases presenting clearer evidence of commercial damage could reach a different outcome.

Authors v. Anthropic (Claude model)

Anthropic, the developer of Claude, was also sued for using a shared library of pirated books. Judge William Alsup found that although the act of training itself might be fair use, acquiring millions of infringing works and storing them in an organized library violated copyright law. He allowed the case to proceed to trial on remedies, taking a more cautious approach in his rulings than the court did in the Meta case.

Meta and Unauthorized Pornography

A different lawsuit against Meta made headlines in 2025 when it came to light that the company had allegedly used unauthorized adult films, including content from producers such as Tushy and Vixen, to train its AI on erotic imagery and depictions of the human body. The plaintiffs claim that Meta seeded and downloaded adult videos through torrenting and used them to construct datasets for its visual models.

This case raises additional legal and ethical questions:

  • Whether adult content creators have the right to control derivative outputs generated from their work.
  • Whether companies can be held accountable for training on sexually explicit material without clear licensing or consent.
  • What reputational risks arise from the non-consensual use of such sensitive material.

The New York Times v. OpenAI (2023–Present)

The New York Times sued OpenAI in late 2023, alleging that ChatGPT’s training data relied on the Times’s articles without permission. The Times provided examples of GPT-4 and GPT-4o generating near-verbatim reproductions of its articles. Unlike earlier lawsuits, this one focuses on the competitive risk posed by AI’s ability to summarize and paraphrase news content. The case remains ongoing but will likely determine whether scraped journalistic material qualifies as fair use and whether AI-generated outputs that closely resemble the originals constitute infringement.

Global Legal and Regulatory Approaches

Beyond the U.S., countries have pursued a variety of approaches to this issue.

European Union: The EU AI Act and the Digital Services Act set an explicit framework for the development of AI models. Under the Text and Data Mining (TDM) exception, AI companies may use copyrighted works unless creators opt out through machine-readable metadata. This opt-out mechanism gives creators control and has ignited calls for similar frameworks in other countries; a minimal example of a crawler honoring such an opt-out appears after this overview.

United Kingdom: The UK government initially signaled support for a TDM exception, but after an immediate backlash from creators it reversed course. AI developers must now obtain explicit licenses for any commercial data mining they conduct.

India: Legal measures are still being developed. In late 2024, ANI (Asian News International) sued OpenAI for using its news articles without permission. The case underscored that this issue plays out globally.

Canada: Canadian authors and publishers have also joined U.S. lawsuits, alleging that OpenAI scraped their digital publications. Canada’s fair dealing doctrine is typically narrower than U.S. fair use.
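Machine-readable opt-outs of the kind the EU contemplates already exist in embryonic form: some AI crawlers, such as OpenAI’s GPTBot, publish user-agent names that site owners can block in robots.txt. The sketch below is a minimal illustration with a placeholder domain, showing how a compliant crawler would check for such an opt-out before fetching a page.

```python
# Minimal sketch of a crawler honoring a robots.txt opt-out before fetching.
# "GPTBot" is OpenAI's published crawler name; example.com is a placeholder.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # hypothetical target site
rp.read()  # fetch and parse the site's robots.txt

page = "https://example.com/articles/some-story.html"
if rp.can_fetch("GPTBot", page):
    print("no opt-out found; page may be crawled")
else:
    print("creator opted out; skip this page")
```

Whether such voluntary conventions are honored in practice, and whether they shift the burden unfairly onto creators, remains a central point of contention.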

Ethical and Social Implications

Beyond the courtroom, the intellectual property question raises significant ethical and societal concerns.

Compensation for Creators: Authors, artists, journalists, and filmmakers complain that AI companies are profiting from their work, developing billion-dollar tools on the backs of unpaid creative labor.

Market Substitution: AI can replicate unique styles, for example mimicking a particular artist’s signature aesthetic on demand. This opens the possibility of creators being displaced altogether.

Cultural Erasure: AI-generated content produced without attribution can erase the creative contributions of creators from marginalized and underrepresented communities.

Consent and Privacy: Training AI on adult content and personal data raises concerns about exploitation, data misuse, and reputational harm.

Commercial Risk and Business Model Changes

For AI companies, the mounting backlash poses reputational, legal, and financial risks. Under increased public scrutiny, some companies are starting to change course:

Licensing Agreements: OpenAI recently signed contracts with multiple publishers and news companies to license their content.

Disclosure of Data Sources: Policymakers, creators, and watchdog groups increasingly urge AI companies to disclose the exact sources of their training datasets.

Synthetic Data: Some developers are investigating the use of synthetic or simulation-based data to reduce copyright risk.

However, these approaches also come with trade-offs. Licensing can be expensive and cumbersome. Synthetic data can be less varied or realistic. And transparency may expose companies to even more lawsuits.

Towards a Balanced Framework: Possible Solutions

Balancing the promotion of innovation with the protection of creators’ rights will require creative solutions:

Opt-Out Registries: A global registry where creators can list content they do not want used as AI training material.

Compulsory Licensing: Regulators could implement compulsory licensing schemes that compensate creators through royalties, similar to how the music industry handles licensing; a simple pro-rata split is sketched after this list.

Fair Use Guidelines: Courts and legislators could define the boundaries of transformative use for AI training.

Dataset Audits: Regulatory agencies could establish independent audits to confirm that AI models conform to licensing requirements.

Open Data Standards: Common standards could encourage AI developers to train their models only on content licensed under permissive or open terms.
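To make the compulsory-licensing idea concrete, here is a deliberately simplified sketch of a pro-rata royalty split, loosely analogous to how music collecting societies divide a fixed pool by usage share. Every name and figure is a hypothetical assumption, not part of any actual scheme.

```python
# Hypothetical pro-rata royalty split for a compulsory licensing pool.
# All rights-holder names and figures are illustrative assumptions.
royalty_pool = 1_000_000.00  # total licensing fees collected from AI developers

# Number of each rights holder's works detected in the training corpus:
usage_counts = {
    "news_publishers": 400_000,
    "book_authors": 350_000,
    "photographers": 250_000,
}

total_uses = sum(usage_counts.values())
for holder, count in usage_counts.items():
    payout = royalty_pool * count / total_uses
    print(f"{holder}: ${payout:,.2f}")
```

Any real scheme would need reliable usage measurement, which is precisely what dataset audits and disclosure requirements aim to provide.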

The Road Ahead

The debate over copyright infringement in AI training is not just a legal matter; it is a defining ethical issue for the future of technology. As generative AI becomes more powerful, the urgency of addressing creators’ rights intensifies. Recent U.S. rulings have recognized the transformative nature of AI training, but they also signal that courts may impose future restrictions and penalties, particularly when plaintiffs demonstrate market harm or when companies use clearly pirated source material.

The Meta porn case reveals how deep this problem runs and how broad its implications may be. Stakeholders increasingly contest the line between permissible use and exploitation, from adult content to children’s books.

If courts, legislatures, and companies cannot forge a workable framework, the tech industry may find itself mired in litigation, regulatory intervention, and public backlash.

The path forward must include transparency, fair compensation, ethical foresight, and a willingness to compromise. Only then can AI development proceed in a way that truly respects and uplifts human creativity, rather than appropriating it without acknowledgment or reward.
