80 points | by artninja1988 19 hours ago
Today it's not just one industry - Western IP laws are slowing progress across multiple tech frontiers. While companies navigate complex IP restrictions (in the EU and US), China's development is following a sharp exponential curve. You can already see it clearly in robotics, electric vehicles, and, these last weeks, AI.
While the West deals with predatory licensing (try talking with Siemens, Oracle, or Autodesk) and everyone keeps working on barriers and moats, other nations that allow a more collaborative approach (voluntary or not) are on an accelerating trajectory.
IP law is clearly no longer fit for purpose - we need a system that encourages collaboration more directly. A complete free-for-all isn't ideal either - and I certainly don't advocate that - but even it appears to be better than what we have now.
Versus the content owners?!?
What kind of collaboration are we talking about? Open source people work for free, and big tech steals their work and launders it for a $100 subscription fee?
Laundering is literally the only purpose of LLMs.
If the Chinese steal from the West and internally, then sanction them and bring the industries back.
The way I think about IP is that if you grew up with something, by the time you're an adult it should be possible to remix it in any way you like, because it's part of your culture. Nobody should get to lock down an idea for their lifetime.
If your book hasn't made you rich in the first 14 years after publishing, I'm sorry, the audience just isn't that into you.
(I am aware movie deals can languish for many years before finally landing, and that with 14 years studios could just wait you out, leaving no incentive to write a script anymore - but copyright should be 14 years after publishing, right? And movie scripts are not generally published before they get made into a movie, if ever.)
> Copyright in a work created on or after January 1, 1978, subsists from its creation ... [1]
And "creation" basically means "written down" [2]. IANAL.
[1] https://www.law.cornell.edu/uscode/text/17/302
[2] https://www.law.cornell.edu/wex/fixed_in_a_tangible_medium_o...
It won't ever change, though. No chance ever - other than to be made actually infinite. The nature of "intellectual property", plus money being speech in US politics, locks that in.
GRRM isn't included on ASOIAF content because he's the copyright holder, but because his influence makes people want to see the shows.
I think 20 or 30 years is fairer. But IIRC it's not about fairness; the Constitution says it should only be to encourage innovation, and long copyright inhibits innovation.
If you want to make an official adaptation, hire the author. If not, it's not official. This takes care of itself for authors who cultivate their fan base.
This is a nice compromise in that it disincentivizes simply squatting on properties: you have to pay money to maintain later copyright, so you have a strong push to make money on it.
From the post:
"""
Our first recommendation is straightforward: shorten the copyright term. In the US, copyright is granted for 70 years after the author’s death. This is absurd. We can bring this in line with patents, which are granted for 20 years after filing. This should be more than enough time for authors of books, papers, music, art, and other creative works, to get fully compensated for their efforts (including longer-term projects such as movie adaptations).
"""
Not sure I agree - a lot of work only gets recognized broadly long after it's published.
Information/access to data/works should be totally free, and there should be other ways to support the creators.
For example, I could easily download MP3s of music and MP4s of series/movies, but I don't, for two simple reasons:
- I want to support the artists (to the extent possible)
- Using Spotify/Apple Music/Netflix is much more convenient, with a totally acceptable monthly fee
I know the article is about a library rather than entertainment, but the same rules should apply.
And if one wants to train an LLM, let them: at its essence, it's just a person who has read all the books (and access to information should be free); the person just happens to be a machine instead of a biological human being.
If I gzip a couple hundred thousand books and distribute them freely, can I also claim it's just a person who has read those books and avoid a massive lawsuit?
Please, stop anthropomorphising machine learning models.
Netflix, Spotify, and Valve (Steam) didn’t succeed because of copyright enforcement. They won because they made paying for content easier, faster, and better than piracy.
Piracy isn't hard, but these services solved the friction: instant access, high quality, fair pricing, and features that free alternatives couldn't match. That's why they still thrive today. [1]
[1] https://www.escapistmagazine.com/valves-gabe-newell-says-pir...
PopcornTime would eat their lunch if it were allowed to work [1].
Steam is an unusual case, because games are running software and can't be trivially reproduced in their unencoded form. The publishers can include copy protection, network connection requirements, or even run essential parts of game logic on their own servers. So free downloads became a much worse experience over time.
I question this statement. First two hits:
* https://getd.libs.uga.edu/pdfs/welter_brennan_s_201212_ma.pd... Adding a movie to Netflix reduces piracy directly
* https://ideas.repec.org/a/eee/jeborg/v209y2023icp334-347.htm... Removing movies from Netflix increases piracy.
There's plenty more where that came from. Netflix actually reduces piracy. Not the other way around.
Netflix built their entire streaming business model during a time when piracy was so widespread it was almost as good as legal. They succeeded precisely by proving that people would pay for good service even when free options were readily available. They're a textbook example of a business that thrives by being better than free!
Despite huge investments in enforcement, movie piracy never waned. The reason it declined? Netflix. Why is it now seeing a bit of a resurgence? Also Netflix, actually, or rather the fact that people have splintered the streaming landscape.
Here are some articles from Forbes at the time [1][2][3], and an interview with the Netflix CEO [4].
People following the Netflix/piracy story at the time saw it like this: Netflix doesn't necessarily need to care whether piracy is legal or not, because it removes most of the incentive to pirate. People tried a lot of things against piracy until Netflix came along, and that was the thing that actually worked. Piracy goes down where Netflix is available. I've also provided enough sources to explain why: piracy is a service problem [3]. Netflix provides the missing service, so people don't feel like they need to pirate anymore.
In a world where piracy was fully legal, Netflix would still exist, and still drive down piracy. This is Netflix's entire reason for success!
[1] https://www.forbes.com/sites/matthickey/2013/05/07/netflix-w...
[2] https://www.forbes.com/sites/insertcoin/2014/01/24/whatever-...
[3] https://www.forbes.com/sites/insertcoin/2012/02/03/you-will-...
[4] https://www.stuff.tv/news/netflixs-ted-sarandos-talks-arrest... (under the heading "What are you doing to combat piracy?")
Netflix just made it that much easier to find what you want [2] and watch it. People were and are willing to pay for that.
[1] For movies and other large files you can use a torrent tracking site: type what you want, click the result with the most seeds, download, play.
[2] Not necessarily as easy as it used to be a few years ago. So piracy is going back up.
Do you think these diverse streaming services would admit to what they're doing, or would they try to double down on enforcement again?
It should also be decriminalized.
The new guidelines say that AI prompts currently don't offer enough control to “make users of an AI system the authors of the output.” (AI systems themselves can't hold copyrights.) That stands true whether the prompt is extremely simple or involves long strings of text and multiple iterations. “No matter how many times a prompt is revised and resubmitted, the final output reflects the user's acceptance of the AI system's interpretation, rather than authorship of the expression it contains.”
They are supposed to come out with guidance regarding the first question in a month or so.
Shining a torch at a plane is usually fine; shining a laser at one is usually a crime.
Blake Lemoine, 2.5 years ago, hired a lawyer to make this exact argument: https://www.businessinsider.com/suspended-google-engineer-sa...
Myself, every time this topic comes up, I point to the fact that the philosophy of mind has 40 different definitions of "consciousness", which makes it really hard for any two people to even be sure they're arguing about the same thing when they argue over whether any given AI has it.
(Also: They can be "persons" legally without being humans, cf. corporate personhood; and they can have rights independently of either personhood or humanity, cf. animal welfare).
What I meant ("none" is obviously hyperbole, though not by much) is that people argue that "AI" (they always use this term rather than the more descriptive ML, LLM, or generative models) is somehow special: either the mixing of input material is sufficient to defeat copyright, or copyright somehow magically doesn't apply for reasons they either cannot describe or which include the word "intelligence".
Courts operate on provability (and for good reasons).
However, the reality is that I have used someone else's work and pretended it's my own. Ironically, there are cases where the act of masking it can be more time-consuming than writing it from scratch. That is still plagiarism, even if it might not be provable.
At least I sometimes get replies, but many of them use fallacious arguments to the point of feeling like trolling. No idea if the same people commenting are also downvoting, but I am starting to think that votes should not be anonymous.
There's absolutely no reason rich people owning ML companies should be getting richer by stealing ordinary people's work.
But practicality trumps morality. The West needs to beat China, and China doesn't give a fuck about copyright or individual people's (intellectual) property.
The ML algos demand to be fed so we gotta sink to their level.
- Web pages; hard to argue that royalties are due since these are publicly available for free
- Scientific papers; these do cost money but the copyright is typically owned by scientific publishers
- Github, Stack Exchange, HN (yes); these are freely available, sometimes by license, so hard to argue for royalties
- Wikipedia, Project Gutenberg; these are also free by license
So the actual consequence of what you're proposing (or at least the realistically-enactable version of it) is the big AI firms paying scientific publishers a lot of money. Is this actually good? Is Elsevier, a basically pure rent-seeker, really more worthy than AI labs, which maybe you don't like but at least do something valuable?
If you're going to ignore the existence of copyright and licenses we should extend it to everything that's ever been posted on the internet, not just "web pages". Why shouldn't all books and films count as free too?
I'm actually open to the idea of just abolishing copyright but it's kind of silly to act like it's only about Elsevier. Lots of creatives depend on copyright in order to earn a living, similarly to how patents fund a lot of important research despite how noxious the patent system has become.
If we fixate on examples like Elsevier or Martin Shkreli in order to argue for completely abolishing the copyright or patent systems we risk destroying the framework that enables valuable creative works or new technologies to be developed in the first place. This is part of why people are so upset by AI companies arguing that they should just be able to ignore the whole framework in order to enrich themselves; once you allow the for-profit AI companies to do it, other groups are going to line up to also demand a free ride.
I think we are starting to see, in both historical empirical data and strategic intelligence, that this assumption might be somewhat inaccurate - which is what Anna's Archive is pointing out here in the context of AI development and copyright restrictions.
Now, it boils down to the typical issue of a current system being bad and clearly not serving its stated purpose, so the choice is whether to abolish it or to reform it (and how).
And just like any other time people have to agree on something, the simplest-to-describe solution (abolition) will get a huge following, because the fairer solutions (reform so it actually protects and rewards creators) are way, way more complex.
I fundamentally think that if you build on other people's work, they deserve 1) credit and 2) compensation proportional to how much of your work is based on theirs.
We could, for example, compare the performance of the same model depending on which parts of the training data we omit. If we omit all copyrighted work and the result is useless, we know the copyrighted work is responsible for a large part of the value.
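A minimal sketch of what that ablation comparison could look like (every function here is a hypothetical stand-in, not a real training pipeline):

    def train_model(corpus):
        # Placeholder: stands in for a real from-scratch training run.
        return {"num_docs": len(corpus)}

    def evaluate(model, benchmark):
        # Placeholder: a real benchmark would score the trained model;
        # here the score just grows with corpus size, purely for illustration.
        return model["num_docs"] / (model["num_docs"] + 100)

    def copyrighted_value_share(corpus, benchmark, is_copyrighted):
        # Relative benchmark-score drop when copyrighted docs are omitted.
        full_score = evaluate(train_model(corpus), benchmark)
        public_only = [doc for doc in corpus if not is_copyrighted(doc)]
        ablated_score = evaluate(train_model(public_only), benchmark)
        # A large drop suggests the copyrighted subset carries much of the
        # value. (This ignores interactions between subsets; leave-one-out
        # or Shapley-style attribution would be fairer, but far costlier.)
        return (full_score - ablated_score) / full_score

    corpus = [{"text": "...", "copyrighted": i % 2 == 0} for i in range(1000)]
    print(copyrighted_value_share(corpus, None, lambda d: d["copyrighted"]))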
That’s not a naturally occurring principle—it’s a legal and social construct that people have come to see as an inherent entitlement. If the current system isn’t working, maybe that's the actual bit that's broken and needs reform?
The new owner is also usually a large company, which has fundamentally more bargaining power than an individual, so even if you get a proportional payment, the ratio of effort to reward for the company vs. the individual is unfair (if you put 1000 man-days of work into your game and Steam takes 5%, do you think it's actually putting 50 man-days of work into selling it?).
As for naturally occurring principles... there are some, ehm, methods of negotiating, which you could consider natural but the state you live in would generally not approve.
1) (minor nitpick) I don't see how HN is in the same category as GitHub or Stack Overflow.
2) "Sometimes by license" or "free by license" imply you don't understand how copyright works. Code that is not accompanied by a license is proprietary. Period. [0] And if it has a license, then you have to follow it. If the license says that derivative works have to give credit and use the same license, then LLMs that train on that code, and anything those models generate, are derivative and have to respect the license.
3) Admittedly I didn't say this in the OP, but the idea that the publisher owns the copyright is absolutely nuts and only possible through extensive lobbying and basically extortion. It should be illegal. Copyright should always belong to the person doing the actual work. Fuck rent-seekers.
4) If Western ML companies thought they could produce models of the same quality without, for example, stealing and laundering the entirety of GitHub, they would have. They haven't, so clearly GH is a very important input, and its builders should either be compensated, or the models and ML companies should only use the subset they can without breaking the license.
Please don't get offended, but I've seen this argument multiple times from proponents of A"I", and their entire argument hinged on the idea that current ML models are so large that nobody can understand them and are therefore a form of "intelligence" - which they clearly are not (unless proven otherwise, at which point they should get their own personhood; the fact that no big ML company is arguing for that makes it obvious nobody really considers them intelligent).
[0]: https://opensource.stackexchange.com/questions/1720/what-can...
Many of us publish our open source work under GPL or AGPL with the intention that you can profit from it but you have to give back what you built on it under the same license. LLMs allow anyone to launder copyleft code and profit without giving anything back.
If the people who downvote bothered to reply, they'd probably say that by being "absorbed" into LLM weights, the code served that purpose and is available for everyone to use. That forgets two critical points:
- LLMs give no attribution. I deserve to be credited for my fractional contribution to the collective output of humanity.
- LLMs are not intelligences; they don't suddenly make intellectual work redundant. Using them still requires work to integrate their output, and so companies build (for-profit) products on top of my work without compensating me, without crediting me, and without giving anyone the freedom to modify the code.
I disagree, respectfully. You’re unable to accurately credit the myriad of people, sources, and works that’ve all educated and influenced you along the way. Suppose even if you could, you’d not be able to resolve their dependencies ad infinitum. Nor would it even make sense—arguably, as a species we’re defined by the compounding and shared nature of the knowledge we learn, implicitly and explicitly. Shoulders of giants, and all that.
If your contributions were a significant portion of the entire training corpus, then perhaps we'd have a separate discussion. But there's no single person for whom you can reasonably argue that's true. Maybe for authors of hyper-niche topics or cutting-edge research?
It's just the rich getting richer by taking from everyone so little that no individual bothers to fight back.
2) It's not just ML companies but anyone using their products.
A while ago everyone was upset that ChatGPT regurgitated the fast inverse square root from Quake's GPL code verbatim, including a comment, clearly violating the license unless the program it was generated into was also under GPL.
I am sure that since then they've made sure the copyrighted material used as training data gets mixed up a bit more thoroughly, so it doesn't show up in the output verbatim and is therefore harder to detect.
So what if I spend a couple of days writing an algorithm, give it a nice doc comment and tests, and publish it under AGPL; it gets used as training data, and a random programmer in a random company working on a random for-profit product runs into the same problem, but instead of writing the algorithm himself, he asks a generative model, and it produces my code just mixed up enough that it's not immediately recognizable but clearly based on mine? I deserve to be paid - that's what happens.
Yes, current AIs just remix their training data in a rather direct way. But in the end how different is that from how humans create? I would suggest we should embrace this new way of creating things while finding laws to empower all creators and not only those who were hired by deep pockets.
Laws generally don't encode what is right but a compromise between the state's interests, lobbyists and the general population making enough ruckus if too unsatisfied.
> But in the end how different is that from how humans create?
1) Scale. Some strategies are socially acceptable when done by individuals but not when done at a massive scale, for example because individuals have very limited time and can invest very limited effort. Looking at a website is perfectly OK; making thousands of requests a second might be considered an attack. Human memory is limited. Similar principles apply to humans looking at code.
2) Source of data. Much of human "input" is viewing the real world (not copyrighted material) through our senses. Much of learning is from teachers or documentation, both of which voluntarily give me information.
I don't know about you but when I wanna know how to use a particular function, I don't go looking through random GH repos to see how other people use it, I go to the docs.
> finding laws to empower all creators and not only those who were hired by deep pockets
That is not even the only issue. When I publish something under AGPL, my users have the right to modify the code, even if my code gets to them through some third party. LLMs allow laundering code and taking that right from (my) users.
> Laws generally don't encode what is right but a compromise between the state's interests, lobbyists and the general population making enough ruckus if too unsatisfied.
Of course, but I actually think that this is the correct moral stance. Patenting algorithms is like patenting thoughts.
> I don't know about you but when I wanna know how to use a particular function, I don't go looking through random GH repos to see how other people use it, I go to the docs.
I always look into sources. Usually I look into the code I want to call first. But this is probably also because I mainly use Java.
So if I read your AGPL code and implement something similar in another programming language, after also reading other implementations of the algorithm, is that something I have to attribute to you? Isn't code just executable documentation of an algorithm? Especially if the algorithm is well known, I don't see any injustice here - "Die Gedanken sind frei" ("thoughts are free"). If I copy your code via copy and paste, then that is an infringement, but just retelling a similar story should not be affected by copyright.
OK, I agree there; I should have written "function" or "module" or something similar - something that takes a nontrivial amount of work and, although it is based on some general principles which should not be patentable/copyrightable, its particular implementation is novel/unique enough that it would take a nontrivial amount of work to replicate the functionality without seeing the original.
> is that something I have to attribute to you
Depends how closely you follow my implementation.
If you use my code as the only reference and translate it verbatim (whether manually or using a tool), then you should credit me. If you look at many implementations, form an _understanding_ of the algorithm in general, then write your own implementation based on that understanding, then probably not.
The question is where LLMs stand. They mix enough sources that crediting all of them would be impractical, and in practice they end up crediting none. But their proponents (who always call them AI, sometimes even using pronouns like "he" to refer to the models) argue that the models also form an _understanding_ rather than just regurgitating a mix of inputs. And I have to disagree: what I see is an imitation of reasoning/understanding, sometimes convincing due to how complex the statistics used inside the models are. But they are still just statistical models of existing content, and we see that every time somebody releases a new model: HN upvotes it to the top, and a few hours later we inevitably see people giving it trivial questions which it fails to answer correctly.
My other two points:
- Even if an ML company made a model that is actually intelligent, the burden of proof should be on them; otherwise, or until then, it's just a remix of existing work. BTW, this reminds me of an interesting comparison: remixes vs. cover songs in music.
- Code is famously harder to read than to write. If a human takes the time to understand a piece of code and reimplement it non-verbatim, he generally does not get ahead by much. An LLM can do this at a scale and speed unattainable by humans.
Let's say two products compete (purely on features and quality instead of marketing, for the sake of argument). One is written first, is novel, and is written fully by humans. The other is written by training a model on the first product's code and using the model to generate the same product, all within hours or days instead of months or years. The second puts in less actual work but gets the same result. It is clearly parasitic on the first, benefiting from their work without giving them credit or compensation.
---
The bottom line is that copyright is meant to protect authors who invest effort into creating. Whether it succeeds in that can sometimes be questionable. But using an algorithm (even a very complex one) to take a bit of everyone's work and redistribute it for free, without crediting or compensating them, does not benefit authors.
I hate analogies, but if I write banking software and send 0.000000001% of every transaction to my account, none of the individuals thus affected probably care that much, but I am still going to prison.
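To put toy numbers on that analogy (a hypothetical sketch, obviously not real banking code):

    from decimal import Decimal

    SKIM_RATE = Decimal("0.000000001") / 100   # 0.000000001 % as a fraction

    def skim(amount):
        # Amount silently diverted from a single transaction.
        return amount * SKIM_RATE

    per_transfer = skim(Decimal("100"))        # one $100 transfer
    print(per_transfer)                        # 1.00E-9: a billionth of a dollar

    # Across ten million such transfers the attacker nets about a cent; across
    # a real payment network's volume, far more. Tiny per victim, large in sum.
    print(per_transfer * 10_000_000)           # ~0.01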
I am not sure about that. As long as they also can benefit from it, it just accelerates the creation of new things.
> I hate analogies, but if I write banking software and send 0.000000001% of every transaction to my account, none of the individuals thus affected probably care that much, but I am still going to prison.
As long as the amount goes to everybody, I don't think this "tax" would be a problem. Human beings can only survive in a collective after all.
Totalitarian systems of rule have always devolved into suffering on a massive scale, and although we're seeing a dictatorship prosper, probably for the first time in history, in China, it's just an odd case that goes against the trend. It does not disprove the trend; dictatorships have always devolved into abuse of power, and China is already promising massive suffering externally by threatening to invade Taiwan (which is a continuation of the legitimate Chinese government that was illegally overthrown by communist rebels in mainland China). And I have no doubt China will devolve into internal suffering once its current dictator dies and a new one takes his place.
The west is only throwing away its values (such as individual freedom) because it is afraid of China getting a short-term lead.
To reply point by point:
> they also can benefit from it
One person puts in work and everyone benefits equally. That sounds good on the surface because it only uses positive words, yet when you look at the meaning, I could equally use it to justify taking your entire salary away from you and redistributing it to every person on the planet equally. Would you be OK with that, just because you get your equal share?
You might be tempted to answer yes if everyone received the same treatment. Now consider whether you'd be OK with it if only 1% of the population worked and was forced to feed the rest.
> I don't think this "tax" would be a problem
I describe theft and you call it a tax...
> Human beings can only survive in a collective after all.
That absolutely does not mean that you should be forced to become a part of a collective against your will or that you have no say who is in the collective with you.
Publishing code and data would lead to abolishing copyright.
They are called large language _models_ for a reason. They are just statistical models of existing work.
If anybody seriously thought they were intelligent, they'd be arguing for giving them personhood and we'd see massive protests akin to pro-life vs pro-choice. People (well, ML companies) only use the word "intelligence" as long as it suits their marketing purposes.
That is actually an intriguing idea and at least aligned with the reasons I use AGPL for my code.
If we could extract correct attribution and even licensing from work produced with AI, I don't think it would help that much. I would even assume that especially in this case the rich would profit the most. They wouldn't mind having to pay thousands of artists for the pixels they provided to their AI-generated blockbuster movie. It would effectively exclude the poor from using AI for compliance reasons. Or even worse: rich corps monopolize the training data, and then they can create content practically for free, while indies cannot use AI because they would have to pay traditional prices or give the rich corps money for indirectly using their training data.
I still don't think that's enough to be fair. If their work is used to produce value ad infinitum, then any one-time payment is obviously less than what they deserve.
The payment should be a fraction of the produced value. And that is very hard to do, since you don't know how much money somebody made by using the model.
> It would effectively exclude the poor from using AI for compliance reasons.
Again, this is only an issue if you're thinking in terms of one-time fixed payments.
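As a toy sketch of what recurring, fractional compensation could look like (all shares and numbers here are invented; estimating them fairly is the hard, open problem):

    # Authors get a share of the value generated each period, instead of a
    # one-time fee. The shares could in principle come from something like
    # the ablation comparison sketched earlier in the thread.

    contribution_shares = {
        "author_a": 0.00004,   # hypothetical fraction of the model's value
        "author_b": 0.00001,
    }

    def payouts(revenue_this_period):
        return {author: revenue_this_period * share
                for author, share in contribution_shares.items()}

    print(payouts(10_000_000.0))  # recurs every period, scaling with value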
If it were possible to train AI models on just the public domain, then I am sure ML companies would have, because it's less effort than lobbying and risking lawsuits (though I am surprised how well creators have accepted that their work is used by others to profit without any compensation; I expected way more outrage).
Virtually all code relevant to training code-completion LLMs was written by people who are still alive or who have been dead for far less than 70 years. We can try to come up with a better system over the next decades, but those people's rights are being violated right now.
At least for Java there are search engines for finding code that calls libraries etc. Models could probably be trained on free code and then fed the results of these search engines on demand, even through the client that calls the LLM:
Client -> LLM -> Client automation API -> Code search on the client machine to fill the context -> Code generation with a model that was not trained on the code found through the search, which is merely used as context
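A rough sketch of that flow (all names are hypothetical placeholders; the point is that found code enters only as context, never as training data):

    def local_code_search(query):
        # Placeholder for a client-side code search engine (e.g. one that
        # indexes the user's own workspace and its dependencies).
        return ['public int retry(Callable<Integer> call) { /* ... */ }']

    def call_llm(prompt):
        # Placeholder for a model trained exclusively on freely given code.
        return "// generated code would appear here"

    def generate(task):
        snippets = local_code_search(task)
        context = "\n\n".join(snippets)
        prompt = ("Reference examples found locally (context only, "
                  "not training data):\n" + context + "\n\nTask: " + task)
        return call_llm(prompt)

    print(generate("implement a retry wrapper around an HTTP call"))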
Even if they only fed it code that is freely given, I think the quality of the output would approach current quality, and surpass it over time, especially when using RAG techniques like the above.
Companies can also buy code to feed the model, after all. So besides the injustice you directly experience right now over your own code probably being fed into AI models, do you fear/despise anything more than that from LLMs?
You could make separate versions of an LLM depending on the license of its output. If it has to produce public domain code, it can only be trained on the public domain. If it has to produce permissively licensed code (without attribution), then it can be trained on the public domain and permissive code. If copyleft, then those two and copyleft (though that still does not solve attribution).
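A sketch of how such per-license training corpora could be partitioned (license tags are assumed to be known per document, which is itself a hard problem):

    # Each model tier trains only on licenses at least as permissive as the
    # license its output must carry.

    PUBLIC_DOMAIN = {"CC0", "public-domain"}
    PERMISSIVE = PUBLIC_DOMAIN | {"MIT", "BSD-3-Clause", "Apache-2.0"}
    COPYLEFT = PERMISSIVE | {"GPL-3.0", "AGPL-3.0"}

    TIERS = {
        "public-domain-output": PUBLIC_DOMAIN,
        "permissive-output": PERMISSIVE,
        "copyleft-output": COPYLEFT,  # output must itself be copyleft
    }

    def corpus_for(tier, corpus):
        # Reliably determining each document's license at scale is its own
        # unsolved problem; here it's just a field on the document.
        allowed = TIERS[tier]
        return [doc for doc in corpus if doc["license"] in allowed]

    corpus = [
        {"id": 1, "license": "MIT"},
        {"id": 2, "license": "AGPL-3.0"},
        {"id": 3, "license": "CC0"},
    ]
    print([d["id"] for d in corpus_for("permissive-output", corpus)])  # [1, 3]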
> Companies can also buy code to feed the model, after all.
We've come a long way since the time slavery was common (at least in the West), but we still have a class system where rich people can pay a one-time fee (buying a company) and extract value in perpetuity from people putting in continuous work, while doing nothing of value themselves. This kind of passive income is fundamentally unjust but pervasive enough that people accept it, just like they accepted slavery as a fact of life back then.
Companies have a stronger bargaining position than individuals (one of the reasons, besides defense, why people form states: to represent a common interest against companies). This is gonna lead to companies (the rich) paying a one-time fee, then extracting value forever, while the individual has to invest effort into looking for another job. Compensation for buying code can only be fair if it's a percentage of the value generated by that code.
> So besides the injustice you directly experience right now over your own code probably being fed into AI models, do you fear/despise anything more than that from LLMs?
Umm, I think theft on a civilizational scale is sufficient.
As long as everybody can also benefit from it, I see it as some kind of collective knowledge sharing.
As you stated in the paragraphs before, unless the wealth distribution changes, LLMs may lead to an escalating imbalance, unless the models are shared for free as soon as a critical mass of authors is involved, regardless of who owns the assets.
1) The current system is that the rich get richer faster than the poor.
2) You're proposing a system where everybody gets richer at the same rate at the best of times (but probably devolves into case 1 IMO)
3) I am proposing a system where people get richer at the rate of how much work they put in. If a rich person does not produce value, he doesn't get richer at all. If a poor person produces value he gets richer according to how many people benefit from it. If a poor person puts in 1000 units of work and a rich person puts in 10 units of work to distribute the work to more people (for example through marketing), they get richer at comparative rates 1000:10.
My system is obviously harder to implement (despite all the revolutions in history, societies have always devolved to case 1, or sometimes case 2 that devolved into 1 later). It might be impossible to implement perfectly but it does not mean we should stop trying to make things at least less unfair.
---
We're in agreement that forcing companies who take (steal) work from everyone to release their models for free is better than letting them profit without bounds.
However, I am taking it way further. If AI research leads to fundamental changes in society, we should take it as an opportunity to reevaluate the societal systems we have now and make them more fair.
For example, I don't care who owns assets. I care about who puts in work. It's a more fundamental unit of value. Work produces assets after all.
---
And BTW, making models free does not in any way help restore the users' rights provided by the AGPL. I have yet to come across anybody making a workable proposal for how to protect these rights in an age where AGPL code is remixed through statistics into all software without making that software AGPL too. In fact, I have yet to find anybody who acknowledges it's a problem.
The EU Digital Single Market Directive (2019/790), Articles 3 and 4, allows text and data mining: Art. 3 for scientific purposes, Art. 4 more generally.
Now, some people argue that AI models are somehow compressed databases of the data that was crawled, but that seems patently ridiculous to me (at least mathematically) - so this should be sufficient. (IANAL) (famous last words)
The only question then is if the models have some kind of additional value ("intelligence") beyond being compressed databases.
My take is that either the answer is no, or the burden of proof is on those making the claim. Until they prove it, the models are just databases, and therefore derivative works of their input, and their output is also derivative work.
* LLMs aren't databases: you can't query them for exact stored records, and they can't reconstruct (most of) their training data.
* But they also don’t reason or understand exactly like humans do either.
They're something else: to wit, Transformer models.
1) I don't think being able to query them and reconstructing input 1:1 are requirements. If I build a shitty DB with a buggy query language that retrieves incomplete data and occasionally mixes in data I didn't ask for, then it's still a DB, just a shitty one. (A toy sketch of such a store follows this list.)
If I populate it with copyrighted material and put it online, whether I'm gonna get sued likely depends on how shitty it is; if it's good enough that people can get enough value from it that they don't buy the original works, then the original authors are not gonna be pleased.
2) Yes, comparisons to humans are not always useful though I'd say they don't reason or understand at all.
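The promised toy sketch of such a lossy store (a caricature, not a claim about how transformers actually work):

    import difflib
    import random

    class ShittyDB:
        # Toy "lossy database": stores texts, retrieves them only
        # approximately, and splices in fragments from other entries.
        # It can't reproduce records exactly, yet everything it returns
        # is still derived from the stored material.

        def __init__(self):
            self.records = {}

        def put(self, key, text):
            self.records[key] = text

        def query(self, key):
            match = difflib.get_close_matches(key, list(self.records), n=1)
            if not match:
                return ""
            words = self.records[match[0]].split()
            other = random.choice(list(self.records.values())).split()
            kept = [w for w in words if random.random() > 0.1]  # lose ~10%
            mid = len(kept) // 2
            # splice a fragment of another record into the middle
            return " ".join(kept[:mid] + other[:3] + kept[mid:])

    db = ShittyDB()
    db.put("moby dick", "Call me Ishmael. Some years ago, never mind how long...")
    db.put("two cities", "It was the best of times, it was the worst of times...")
    print(db.query("moby dik"))  # approximate, mixed up, but clearly derived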
Either way the discussion should be about justice and fairness. The fact is LLMs are trained on data which took human work and effort to create. LLMs would not be possible without this data (or ML companies would train on just the public domain and avoid the risk of a massive lawsuit). The people who created the original training data deserve a fair share of the value produced by using their work.
So the real question to me is: how much do they deserve?
My point is that any situation where person A puts in a certain amount of work (normalized by skill, competence, etc.), person B uses person A's work and puts in some work of his own but less than A, and B then gets more reward than A, is fundamentally unfair.
LLMs are this, just at a massive scale.
But to be honest, this is where the discussion went only after I'd thought about it for a while. My real starting point was that when I publish my code under AGPL, I do it because I want anyone who builds on top of it to also have to release their code, so users have the freedom to modify it.
LLMs are used to launder my code, deprive me of credit and deprive users of their rights.
Can we agree this is harm?
I also believe that unfairness is fundamentally the same as harm, just framed a bit differently.
> deprive me of credit
> and deprive users of their rights.
> Can we agree this is harm?
I might consider it if any of those claims were true.
I think the opposite is true, especially with open-weight models, which expand user freedoms rather than restricting them. I wonder if we can get the FSF to come up with GPL-compatible open-weight licenses.
At this point in time I'm not entirely convinced they even need to. But if future lawsuits turn out that way, it might solve issues with some models.
Please step back and consider what you are replying to.
> LLMs are used to launder my code
If an LLM was trained only on AGPL code, would it have to be licensed under AGPL? Would its output?
> deprive me of credit
They _obviously_ deprive me of credit. Even if an LLM was trained entirely on my code, nobody using its output would know. Compare to using a library where my name is right there in the license.
> and deprive users of their rights.
I appeal to you again: re-read my comment. I am not talking about users of the model but users of the software that is in part based on my AGPL code. If my code got there traditionally, by being included in a library, the whole software would have to be AGPL and users would have the right to modify it. If my code is laundered through an LLM, users of my code lose that right.
> Can we agree this is harm?
So all of those things are true. And this is clearly harm.
Stealing a little from everyone is morally no different from stealing a lot from one person. Whenever you think about ML, consider extreme cases, such as training on data under one license, and all the arguments pretending it's not copyright infringement fall apart. (And if you don't think extreme cases set precedent for real cases, then please point out where exactly you draw the line. Give me a number.)
Spreading the harm around means everyone is harmed similarly but that is not the kind of fairness I had in mind.