While illegal downloading isn’t technically classified as “theft” under the law, since it is handled as copyright infringement, it is still theft in practice: you’re taking something of value without paying for it or having permission. Essentially, you’re depriving the creator of potential earnings, much like stealing physical goods.
When it comes to training the models, it most likely falls under fair use. The models learn a little like a person learns from looking at paintings in an art gallery: their aim is to adapt the data, not replicate it. But if you do not pay the fee to enter the gallery in the first place, you are stealing the experience. This is the case with AI: the “theft” occurs in the process of creating or acquiring a dataset that contains copyrighted data without proper licensing or permission from its owner.
Downloading images or other copyrighted data from the internet is copyright infringement even if the material is publicly available: https://ogc.harvard.edu/pages/copyright-and-fair-use
The exception is fair use, but that rarely applies.
“fair use law allows someone or a company to use copyrighted material without consent as long as certain conditions are met – for instance, if it’s used for teaching or research or criticism or news reporting. You know, this law is intended to encourage freedom of expression, but there are real limits on it. For instance, the Supreme Court has said that if copyrighted material is used to make something new and that new thing competes with the original copyrighted work, that is not fair use.” –BOBBY ALLYN
Source: The Effect of the Use Upon the Potential Market
What does this have to do with AI companies?
Data needs to be locally accessible in order to train an AI model. This means the copyrighted data must be downloaded.
Cases where OpenAI is being sued for illegally using copyrighted data:
- https://www.businessinsider.com/openai-lawsuit-copyrighted-data-train-chatgpt-court-tech-ai-news-2024-6
- https://www.johnpobrienesq.com/openai-sued-for-using-copyrighted-material-to-train-chatgpt/
- https://www.npr.org/2023/08/18/1194562272/openai-is-facing-lawsuits-over-copyrighted-materials-it-uses-to-train-chatgpt (source of the Bobby Allyn quote above)
OpenAI admitting wide use of copyrighted data:
https://www.theguardian.com/technology/2024/jan/08/ai-tools-chatgpt-copyrighted-material-openai
According to the same article, other AI companies are also being sued:
“Getty Images, which owns one of the largest photo libraries in the world, is suing the creator of Stable Diffusion, Stability AI, in the US and in England and Wales for alleged copyright breaches. In the US, a group of music publishers including Universal Music are suing Anthropic, the Amazon-backed company behind the Claude chatbot, accusing it of misusing “innumerable” copyrighted song lyrics to train its model.”
Many AI companies use data laundering to hide their copyright infringement:
- https://waxy.org/2022/09/ai-data-laundering-how-academic-and-nonprofit-researchers-shield-tech-companies-from-accountability/
- https://medium.com/discourse/is-big-tech-using-data-laundering-to-cheat-artists-ccf1a8c87b91
- https://waxy.org/2022/08/exploring-12-million-of-the-images-used-to-train-stable-diffusions-image-generator/
Some of the datasets contain only links to the data. But in order to verify the links, their creators had to download and process the copyrighted material. Some of them claim that briefly downloading the files is fair use because they delete them right away. But considering that they receive funding from parties that profit from these datasets, this is pure data laundering. More importantly, to make use of the links, AI companies then have to download the files themselves.
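To make the link-only point concrete, here is a minimal Python sketch. It is not how any particular dataset or company works: `LinkRecord`, `materialize`, and `make_counting_fetch` are hypothetical names invented for illustration, and the downloader is a stand-in that never touches the network. It only shows that a list of links becomes training data when something fetches every URL.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class LinkRecord:
    """One row of a hypothetical link-only dataset: a URL plus a caption."""
    url: str
    caption: str


def materialize(records: Iterable[LinkRecord],
                fetch: Callable[[str], bytes]) -> List[Tuple[bytes, str]]:
    """Turn link records into (file bytes, caption) pairs usable for training.

    `fetch` is whatever downloads a URL; each record costs exactly one call.
    """
    return [(fetch(record.url), record.caption) for record in records]


def make_counting_fetch() -> Tuple[Callable[[str], bytes], List[str]]:
    """Build a fake downloader (no network) that logs every URL it is asked for."""
    requested: List[str] = []

    def fetch(url: str) -> bytes:
        requested.append(url)
        return b"bytes of " + url.encode()

    return fetch, requested


# Placeholder URLs; the dataset itself stores no image data at all.
records = [
    LinkRecord("https://example.com/a.jpg", "a cat"),
    LinkRecord("https://example.com/b.jpg", "a dog"),
]
fetch, requested = make_counting_fetch()
pairs = materialize(records, fetch)
```

After this runs, `requested` holds both URLs: there is no path from the links to the training pairs that skips the download step.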
More about scraping, data laundering, lawsuits, and generative AI:
https://www.createdontscrape.com/
https://www.createdontscrape.com/pretrainingfine-tuning-why-you-need-to-know
There are AI companies that seek ethical and correctly licensed datasets to train their models. But it would be ridiculously naive to think that the most successful models are not illegally or maliciously pulling copyrighted works into their datasets in order to gain an advantage in model performance.
The problems for owners of copyrighted works
Works being downloaded against their owners’ consent and laundered into AI models cause two main problems.
1: People are not compensated when their works are downloaded without proper licensing to create commercial products against their will. These copyrighted works are a de facto requirement for training an AI model that will be used commercially to make a profit, and should therefore be licensed accordingly. Downloading, i.e. acquiring, copyrighted data for this use is not fair use, even if the training itself might be.
2: The works produced by AI compete with the original pieces. When a person (even one who would never have been a customer for the original pieces) generates output from the copyrighted works and makes it available to others, they flood and saturate the market while also skewing the algorithms that would otherwise lead paying customers to the original owner of the copyrighted material.
Now the original owner and creator of the copyrighted work must compete with output that adapts and imitates their own work.
Together, these problems result in stagnation and loss of innovation, as people no longer get compensated for their work, nor see a point in doing it for free when it gets lost in the void.
These unethical methods of using copyrighted data without consent lead to short-term gains with massive negative long-term effects for both humans and the whole AI field.