AI and the problems with copyright
“A copyright is a type of intellectual property that gives its owner the exclusive legal right to copy, distribute, adapt, display, and perform a creative work.”
What is copyright infringement?
“As a general matter, copyright infringement occurs when a copyrighted work is reproduced, distributed, performed, publicly displayed, or made into a derivative work without the permission of the copyright owner.” – US Copyright Office
“Public availability does not mean free to use. Copyright and licensing still applies to works published online.” – Harvard University
Recent rulings have determined that AI training on unauthorized works does not fall under fair use.
USA: https://www.pbwt.com/publications/ai-training-using-copyrighted-works-ruled-not-fair-use
EU: Copyright opt-out: specifically, the act extends an opt-out mechanism originally intended for text and data mining (TDM) to AI training. This means copyright holders can choose to prevent their work from being used in AI training, even if it is publicly available.
Why this is the correct ruling:
To build current AI models, you need to download the works locally in order to turn the images into embeddings and train the model. This is simply copyright infringement: copying is a reproduction without permission. Loading a work into RAM or a browser cache does not create an infringing copy, because that is the use the owner of the work has approved (assuming the work wasn’t stolen and uploaded to the internet by someone else). But current AI pipelines, for efficiency, copy the works into local storage.
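The local-copy step can be made concrete with a minimal Python sketch. The function name and cache layout here are illustrative assumptions, not any real pipeline’s API; the point is only that before any embedding or training happens, the fetched bytes are typically written to a durable local store so they can be decoded, preprocessed, and revisited across training epochs.

```python
import hashlib
from pathlib import Path

def ingest_for_training(image_bytes: bytes, cache_dir: Path) -> Path:
    """Persist fetched work to a local training cache (illustrative sketch).

    This local write is the 'reproduction' step discussed above: the raw
    bytes of the work are stored on disk so the training pipeline can
    read them back repeatedly, independent of the original website.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Content-addressed filename, a common pattern in scraped datasets.
    name = hashlib.sha256(image_bytes).hexdigest() + ".img"
    path = cache_dir / name
    path.write_bytes(image_bytes)  # a durable local copy now exists
    return path
```

Whether the copy lands on a disk, in an object store, or inside a dataset shard, a reproduction of the work is made before any training begins.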
Even if AI companies perform data laundering* and don’t copy the works locally, they are still breaking the licensing terms of the works. The owners intend the works to be viewed exclusively by humans (plus the machines necessary to get them there, such as hosting services and the Google Search** bot that lets people find the work), not by bots or software that feed the works into AI training. This means AI scrapers aren’t even allowed to view the works under the terms on which the owners have made them available.
Licensing differently for different audiences is a completely normal practice in other fields as well, and 100% within the rights of the copyright owner.
Software: Student licensing: free for student use, paid for companies and commercial work.
Museums: Children attend for free, adults must pay a fee.
When it comes to ethics, this is an obvious case:
People should be paid for their work. AI companies are currently not compensating creators for one of the main components of their software development process and the core building blocks of their models, which is simply exploitation.
Furthermore, AI output competes with and devalues the original work. AI companies grant no benefit to the owners of the works in the dataset (the way Google Search provides links back to the owner); instead they impose costs on them by crawling their websites, while AI-generated works flood the platforms and reduce the originals’ visibility even further.
There is no ethical reason why building for-profit software should entitle anyone to exploit other people’s work without compensating them, while at the same time causing them harm.
Similar consent and use-limiting laws already apply to marketing and to IoT data handling.
*(Data laundering is the practice of having a third party download the works and turn them into a rephrased dataset in the name of non-commercial use, and then having a commercial company use that dataset to train its model.)
**Google Image Search is a symbiotic relationship: Google points to your content, thus providing you a service. You can also opt out of Google Image Search if you want.
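In practice, both the EU-style TDM opt-out and the Google opt-out mentioned above are commonly expressed through a site’s robots.txt. As a sketch (crawler names change over time, so check each vendor’s current documentation; GPTBot, CCBot, and Google-Extended are user-agent tokens published by OpenAI, Common Crawl, and Google respectively), a site owner could keep ordinary search indexing while refusing AI-training crawlers:

```txt
# robots.txt — allow normal search indexing, refuse AI-training crawlers.

User-agent: GPTBot           # OpenAI's training crawler
Disallow: /

User-agent: CCBot            # Common Crawl (datasets widely reused for AI training)
Disallow: /

User-agent: Google-Extended  # opts out of Google's AI training, not of Search
Disallow: /

User-agent: Googlebot        # regular search indexing stays allowed
Allow: /
```

Note that robots.txt is only a request: honoring it depends on the crawler, which is exactly why an enforceable legal opt-out matters.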
Common arguments debunked:
“Humans are allowed to learn, so AI should be too.” Simply no.
As pointed out above, even for humans there can be different licensing terms. Furthermore, AI isn’t a sentient being wishing to see those works or encountering them organically. AI uses datasets specifically selected and crawled for use in software development.
“When you abstract away and focus only on the important part (learning), it should be fair use.” Simply no.
Abstracting the whole of AI development down to one narrow element in order to argue fair use is meaningless when the claim is assessed in the real world and in the wider context of reality.
“The work isn’t really used, so training is okay!” Simply a delusional take. When value is extracted from a work by ingesting its core and essence (which is the whole purpose of training), the work certainly is used in the process. And when this is done in a structured, mass-scale setting, it no longer falls under consumer licensing.
Sources:
- TechCrunch – “Commercial image-generating AI raises all sorts of thorny legal issues” (July 2022) – notes that OpenAI’s DALL·E 2 was trained on ~650 million image–text pairs scraped from the internet (techcrunch.com).
- PetaPixel – “OpenAI Claims it is Impossible to Train AI Without Using Copyrighted Content” (Jan 2024) – reports OpenAI’s admission that virtually all modern AI training relies on copyrighted material from the web (petapixel.com).
- Reuters – “Getty Images lawsuit says Stability AI misused photos to train AI” (Feb 2023) – reveals Getty’s claim that Stability AI scraped 12 million Getty Images photos without a license to train Stable Diffusion (reuters.com).
- Frost Brown Todd (law firm blog) – “Midjourney Faces Disney Lawsuit…” (June 2025) – summarizes Disney’s suit and notes Midjourney’s founder acknowledged using web-scraped images in training (“no way to get 100 million images and know where they’re from”), effectively admitting unlicensed data use (frostbrowntodd.com).
- PetaPixel – “Meta Scraped Every Australian Adult’s Public Photos to Train AI” (Sept 2024) – details a hearing where Meta admitted to scraping all public Facebook/Instagram posts since 2007 for AI training (petapixel.com).
- VentureBeat – “Meta launches AI image generator trained on your FB, IG photos” (Dec 2023) – reports that Meta’s Emu model was trained on 1.1 billion Instagram/Facebook images (public user photos) (venturebeat.com).
- PetaPixel – “Photographer Sues Google for Using Her Photos to Train AI” (May 2024) – covers the artists’ lawsuit against Google, noting Google’s use of the LAION-400M scraped image dataset for Imagen and a Google statement that their AI is trained on “publicly available” web content (petapixel.com).
- Ballard Spahr (law firm alert) – “Google Facing New Copyright Suit Over AI Image Generator” (May 2024) – confirms that Google’s Imagen was initially trained on the public LAION-400M dataset, which included the plaintiffs’ copyrighted works scraped from the internet (ballardspahr.com).