AI and the problems with copyright
“A copyright is a type of intellectual property that gives its owner the exclusive legal right to copy, distribute, adapt, display, and perform a creative work.”
What is copyright infringement?
“As a general matter, copyright infringement occurs when a copyrighted work is reproduced, distributed, performed, publicly displayed, or made into a derivative work without the permission of the copyright owner.” – U.S. Copyright Office
“Public availability does not mean free to use. Copyright and licensing still applies to works published online.” – Harvard University
Recent rulings have determined that AI training on unauthorized works does not fall under fair use.
USA: https://www.pbwt.com/publications/ai-training-using-copyrighted-works-ruled-not-fair-use
EU: Copyright opt-out: the AI Act extends an opt-out mechanism originally intended for text and data mining (TDM) to AI training. This means copyright holders can choose to prevent their works from being used in AI training, even if they are publicly available.
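In practice, one common way for a site owner to signal such an opt-out today is a robots.txt file blocking known AI training crawlers (this is a technical signal, not the legal mechanism itself). A sketch using real, documented crawler user agents (GPTBot is OpenAI's trainer, CCBot is Common Crawl's, Google-Extended governs Google's AI training use):

```
# robots.txt at the site root -- disallow known AI-training crawlers
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

Compliance with robots.txt is voluntary on the crawler's side, which is exactly why a legal opt-out mechanism matters.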
Why this is the correct ruling:
In order to build current AI models, you need to download the works locally so they can be turned into embeddings used to train the model. This is simply copyright infringement, as copying is a reproduction without permission. Loading a work into RAM or a browser cache does not create an infringing copy, because that is the use the owner of the work has approved (provided the work wasn't stolen and then put up on the internet). But current AI pipelines, for efficiency, copy the works into local storage.
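To illustrate why this step is a reproduction: a typical dataset-building pipeline fetches each work and writes a persistent local copy before any embedding or training happens. A minimal sketch in Python (file names are hypothetical, and the network fetch is replaced with in-memory bytes standing in for the downloaded work):

```python
from pathlib import Path

# In a real pipeline these bytes would come from an HTTP GET of the
# creator's page; here they stand in for the downloaded image file.
work_bytes = b"\x89PNG...imagine a full image file here..."

# Dataset building persists a copy into a local dataset folder.
# This file is a reproduction of the work: it exists independently of
# the transient RAM/browser-cache copy the owner implicitly approved.
dataset_dir = Path("dataset")
dataset_dir.mkdir(exist_ok=True)
local_copy = dataset_dir / "work_0001.png"
local_copy.write_bytes(work_bytes)

# The next stage would decode this file and turn it into tensors or
# embeddings; the training loop reads from this persistent copy.
print(local_copy.exists())                    # True: the copy persists on disk
print(local_copy.read_bytes() == work_bytes)  # True: byte-identical reproduction
```

The point of the sketch is that the copy outlives the request, which is what distinguishes it from the transient copies made during ordinary browsing.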
Even if AI companies perform data laundering* and don't copy the works locally, they are still breaking the licensing terms of the works. This is because of how the owners have intended the works to be viewed: exclusively by humans (and by necessary machines, like hosting services and the Google Search** bot, that enable people to find the work), but not by bots or software that feeds the works into AI training. This means the AI scrapers aren't even allowed to view the works under the terms on which the owners have made them available.
Licensing differently for different audiences is a completely normal practice in other fields as well, and entirely within the rights of the copyright owner:
Software: Student licensing: free for student use, paid for companies and commercial work.
Museums: Children attend for free, adults must pay a fee.
When it comes to ethics, this is an obvious case:
People should be paid for their work. AI companies are currently not compensating creators for one of the main components of their software development process and a core building block of their models, which is simply exploitation.
Furthermore, AI output competes with and devalues the original work. AI companies grant no benefit to the owners of the works in the dataset (unlike Google Search, which provides links back to the owner); instead, they impose costs on them by crawling their websites, while AI-generated works flood the platforms and reduce the originals' visibility even further.
There is no ethical reason why building for-profit software should be allowed to exploit the work of other people without compensating them, while at the same time causing them harm.
Similar consent and use-limiting laws already apply to marketing and IoT data handling.
*(Data laundering is the act of having a third party do the downloading and turn the works into a rephrased dataset in the name of non-commercial use, and then having a commercial company use that dataset to train the model.)
**Google Image Search is a symbiotic relationship: Google points to your content, thus providing you a service. You can also opt out of Google Image Search if you want.
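For example, blocking Google's documented image crawler in robots.txt removes a site's images from Google Image Search while leaving normal web search untouched:

```
# robots.txt -- opt this site's images out of Google Image Search
User-agent: Googlebot-Image
Disallow: /
```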
Common arguments debunked:
“Humans are allowed to learn, so AI should be too.” Simply no.
As pointed out above, even for humans there can be different licensing terms. Furthermore, AI isn't sentient; it doesn't wish to see these works or encounter them organically. AI training uses datasets specifically selected and crawled for use in software development.
“When you abstract and focus only on the important part (learning), it should be fair use.” Simply no.
Abstracting the whole of AI development down to a very narrow argument in order to claim fair use is meaningless when assessed in the real world and in the wider context of reality.
“The work isn't really used, so training is okay!” Simply a delusional take. When value is extracted from a work by ingesting its core and essence (as that is the whole purpose of training), it certainly is used in the process. And when this is done in a structured, mass-scale environment, it no longer falls under consumer licensing.