If Meta (or anyone) had approached publishers with a “we want to buy one copy of every book you publish” offer, that doesn’t seem technically or commercially difficult.
Certainly Amazon would find that extremely easy.
Hypothetical: If the only way we could build AGI would be to somehow read everyone's brain at least once, would it be worth ignoring everyone's wishes regarding privacy just one time to suck up this data and have AGI moving forward?
But it’s not at all a similar dilemma to “should we allow the IP empire-building of the 1900s to claim ownership over the concept of learning from copyrighted material”.
If it matched human intellectual productivity, that would ensure human intelligence no longer gets you more money than it takes to run some GPUs, so it would presumably become optional.
Would you trust a businessman on that?
The fact that this is an active question is depressing.
The suspicion that, if it were possible, some tech bro would absolutely do it (and smugly justify it to themselves using Roko's Basilisk or something) makes me actually angry.
I get that you're just asking a hypothetical. If I asked "Hypothetical: what if we just killed all the technologists?" you'd rightly see me as a horrible person.
Damn. This site and its people. What an experience.
You seem to have set aside any critical thinking and come to "this website" looking for a reason to seethe over complete strangers about whom you know very little and whose motives you belligerently misrepresent, all the while making exaggerated and extremist statements and no doubt embracing worse thoughts.
You're the type of person destroying the planet _I_ live on.
This isn't a defense of technologists, it's a plea to stop tripping over yourself to see the worst in everyone.
tbh human rights are all an illusion, especially if you are at the bottom of society like me. No way I will survive, so if a part of me survives as training data, I guess that's better than nothing?
imo the only way this could happen is a global collaboration without telling anyone. The AGI would know everything about all humans, but its existence would have to be kept secret, at least for the first n generations, so it would lead to life being gamified without anyone knowing. It would be eugenics, but on a global scale.
So many would be culled, but the AGI would know how to make it look normal to prevent resistance from forming: a war here, a war there, a law passed here, etc. So copyright being ignored kind of makes sense.
I don't understand.
Facebook and Google spend billions on training LLMs. Buying 1M ebooks at $50 each would only cost $50M.
They also have >100k engineers. If they shard the ebook buying across their workforce, everyone would have to buy 10 ebooks, which could be done in 10 minutes.
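For a sanity check of that arithmetic, a quick sketch (all the inputs are the parent comment's assumptions, not real figures):

    # Back-of-the-envelope: cost and per-engineer load for buying the corpus outright.
    # All inputs are the parent comment's assumptions, not actual Meta/Google figures.
    num_books = 1_000_000        # assumed catalogue size
    price_per_ebook = 50         # assumed USD per title
    num_engineers = 100_000      # assumed headcount

    total_cost = num_books * price_per_ebook
    books_per_engineer = num_books / num_engineers

    print(f"Total cost: ${total_cost:,}")                   # $50,000,000
    print(f"Books per engineer: {books_per_engineer:.0f}")  # 10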
Absolutely no way. Yup.
> Buying millions of ebooks online would take a lot of effort, downloading data from publishers isn't a thing that can be done efficiently
Oh no, it takes effort and can't be done efficiently, poor Google!
How can this possibly be an excuse? This is such a detached SV Zuckerberg "move fast and break things"-like take.
There's just no way for a lot of people to efficiently get out of poverty without kidnapping and ransoming someone; it would take a lot of effort.
I think this is a dangerous road with little upside for anyone outside of IP aggregators.
YouTube (etc.) classifiers definitely do read others' material, though.
Was thinking: if I train my model on my private docs, for instance finance, how does one prevent the model from sharing that data verbatim?
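One common output-side mitigation is to filter generations for long verbatim overlaps with the private corpus before showing them. A minimal sketch, assuming a plain n-gram check (the 8-word window, function names, and sample strings are illustrative, not any particular library's API):

    # Minimal sketch: reject a model response if it reproduces a long
    # verbatim span from the private training documents.
    # The 8-word window is an arbitrary illustrative threshold.
    def ngrams(text: str, n: int = 8):
        words = text.lower().split()
        return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

    def leaks_verbatim(response: str, private_docs: list[str], n: int = 8) -> bool:
        private_grams = set()
        for doc in private_docs:
            private_grams |= ngrams(doc, n)
        return bool(ngrams(response, n) & private_grams)

    # Usage: check the model output before returning it to a user.
    docs = ["q3 revenue was 4.2 million dollars driven mostly by the enterprise tier"]
    out = "as noted, q3 revenue was 4.2 million dollars driven mostly by the enterprise tier."
    if leaks_verbatim(out, docs):
        out = "[response withheld: overlapped training documents verbatim]"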
"It would take a lot of effort to do it legally" is a pathetic excuse for a company of Meta's size.
They could also simply buy controlling stakes in publishers. For scale comparison, Meta is spending upwards of $30B per year on AI, and the recent sale of Simon & Schuster that didn't go through was for a mere $2.2B.
Surely the author only licenses the copyright to the publisher for hardback, paperback and ebook, with an agreed-upon royalty rate?
And if someone wants the rights for some other purpose, like translation or making a film or producing merchandise, they have to go to the author and negotiate additional rights?
Meta giving a few billion to authors would probably mend a lot of hearts, though.
I totally agree. But since when has that stopped companies like Meta? These big companies are built on breaking/skirting the rules.
Defending themselves with technicalities and expensive lawyers may be financially viable.
Zero ethics, but what would we expect from them?
While it's plausible someone downloaded a bunch of torrents and tossed them in the training directory...again, under whose authority? Like, if this happened, it would potentially be one overzealous data scientist. Hardly "them".
People lean on collective pronouns to avoid actually thinking about the mechanics of human enterprise and you get extremely absurd conclusions.
(it is not outside the bounds of thinkable that an org could in fact have a very bad culture like this, but I know people who work for Meta, who also have law degrees - they're well aware of the potential problems).
These newly unredacted documents reveal exchanges between Meta employees unearthed in the discovery process, like a Meta engineer telling a colleague that they hesitated to access LibGen data because “torrenting from a [Meta-owned] corporate laptop doesn’t feel right”. They also allege that internal discussions about using LibGen data were escalated to Meta CEO Mark Zuckerberg (referred to as "MZ" in the memo handed over during discovery) and that Meta's AI team was "approved to use" the pirated material.
https://www.wired.com/story/new-documents-unredacted-meta-co...

Zuckerberg appeared to know Llama trained on Libgen - https://news.ycombinator.com/item?id=42759546 - Jan 2025 (73 comments)
Zuckerberg approved training Llama on LibGen [pdf] - https://news.ycombinator.com/item?id=42673628 - Jan 2025 (191 comments)
Zuckerberg Approved AI Training on Pirated Books, Filings Say - https://news.ycombinator.com/item?id=42651007 - Jan 2025 (54 comments)
Would love if any lawyers here can speculate.
[1] https://timesofindia.indiatimes.com/technology/tech-news/whe...
> “Put another way, by opting to use a bit torrent system to download LibGen’s voluminous collection of pirated books, Meta ‘seeded’ pirated books to other users worldwide.”
It is possible to (ab)use the bittorrent ecosystem and download without sharing at all. I don't know if this is what Meta did, or not.
I started my piracy journey on Napster. I’ve done all the other biggies. I’ve done off-the-beaten-path stuff like IRC piracy channels. Private trackers. I have a soft spot for Windowmaker and was dumb enough to run Gentoo so long that I got kinda good at the “scary” deep parts of Linux sysadmin. I can deal with fiddliness and allegedly-ugly UI.
Usenet piracy defeated me.
Everyone uses "pirated" content, but some are better at hiding it and/or not talking about it.
There is no other way to do it.
> If the Visitor uses copyrighted material from this site (Hereafter: Site-Content) to train a Generative AI System, in consideration the Visitor grants the Site Owner an irrevocable, royalty-free, worldwide license to use and re-license any output or derivative works created from that trained Generative AI System. (Hereafter: Generated Content.)
> If the Visitor re-trains their Generative AI System to remove use of the Site-Content, the Visitor is responsible for notifying the Site Owner of which Generated Content is no longer subject to the above consideration. The Visitor shall indemnify the Site-Owner for any usage or re-licensing of Generated Content that occurs prior to the Site-Owner receiving adequate notice.
_________
IANAL, but in short: "If you exploit my work to generate stuff, then I get to use or give-away what you made too. If you later stop exploiting my work and forget to tell me, then that's your problem."
Yes, we haven't managed to eradicate a two-tiered justice system where the wealthy and powerful get to break the rules... But still, it would be cool to develop some IP-lawyer-vetted approach like this for anyone to use, some boilerplate ToS and agree-button implementation guidelines.
Google is currently being sued by journalist Jill Leovy for illegally downloading and using her book "Ghettoside" to train Google's LLMs [1].
However, her book is currently stored, indexed and available as a snippet on Google Books [2]. That use case has been established in the courts to be fair use. Additionally, Google has made deals with publishers and the Authors Guild as well.
So many questions! Did Google use its own book database to train Gemini? Even if they got the book file in an illegal way, does the fact that they already have it legally negate the issue? Does resolving all the legal issues related to Google Books immunize them from these sorts of suits? Legally, is training an LLM the same as indexing and providing snippets? I wonder if OpenAI, Meta and the rest will be able to use Google Books as a precedent? Could Google license its model to other companies to immunize them?
Google's decade-long Books battle could produce major dividends in the AI space. But I'm not a lawyer.
1. https://www.bloomberglaw.com/public/desktop/document/LeovyvG...
This whole copyright thing reminds me of when Mark Zuckerberg was mad that someone posted photos of the interior of his house or something.