You know, if I've noticed anything in the past couple years, it's that even if you self-host your own site, it's still going to get hoovered up and used/exploited by things like AI training bots. I think between everyone's code getting trained on, even if it's AGPLv3 or something similarly restrictive, and generally everything public on the internet getting "trained" and "transformed" to basically launder it via "AI", I can absolutely see why someone rational would want to share a whole lot less, anywhere, in an open fashion, regardless of where it's hosted.
I'd honestly rather see and think more about how to segment communities locally, and go back to the "fragmented" way things once were. It's easier to want to share with other real people than inadvertently working for free to enrich companies.
In contrast, if I uploaded something to a social media site like Instagram, and then Meta "sublicensed" my image to someone else, I wouldn't have much to say there.
Would love someone with actual legal knowledge to chime in here.
but you agreed to this, when agreeing to the TOS.
> I post a picture on my site that is then used by a large publisher for ads, I would (at least in theory) have some recourse
for which you didn't sign any contract, and therefore it is a violation of copyright.
But the new AI training methods are currently, at least imho, not a violation of copyright - not any more than a human eye viewing it (which you've implicitly given permission for, by putting it up on the internet). On the other hand, if you put it behind a gate (no matter how trivial), then you could've at least legally protected yourself.
Interesting comparison - if a human viewed something, memorized it and reproduced it in a recognisable way, so that it was pretty much the same, wouldn't that still breach copyright?
ie in the human case it doesn't matter whether it went through an intermediate neural encoding - what matters is whether the output is sufficiently similar to be deemed a copy.
Surely the same is the case of AI?
But you are right that copyright is complex and in the end decided by humans (often in court). Consider how code infringement is not about the code itself but about what it does. If you saw a somewhat original implementation of something and then rewrote it in a different language by yourself, there is a high chance it's still copyright infringement.
On the other hand with images and art it's even more about cultural context. For example, works of pop artists like Andy Warhol are for sure original works (even though some of it was disputed recently in court and lost). Nobody considers Andy Warhol's work unoriginal even if it often looks very similar to some output it was riffing off, because the essence is different from the original.
Compare that to people prompting directly with the name of the artist they want to replicate. This is direct copyright infringement in both essence and intention, no matter the resulting image. It's also different from when a human wants to replicate some artist's style, because humans can't do it 100% even if they want to. There is still a piece of their "essence". There are many people who try to fake some famous artist's style and sell it as the real thing and simply can't do it. This is of course copyright infringement because of the intent, but it's more original work than anything coming from LLMs.
Just because you can't define something mathematically, doesn't mean it isn't obvious to most people in 99% of cases.
Reminds me of the endless games in tax law/avoidance/evasion and the almost pointless attempt to define something absolutely in words. To be honest you could simplify the whole thing by having a 'taking the piss' test - if the jury thinks you are obviously 'taking the piss' then you are guilty - and if you whine about the law not being clear and how it's unfair because you don't know whether or not you are breaking the law - well don't take the piss then - don't pretend you don't know whether something is an aggressive tax dodge or not.
If you create some fake IP, and license it from some shell company in a low tax regime to nuke your profits in the country you are actually doing business in - let's not pretend we all can't see what you are doing there - you are taking the piss.
Same goes for what some tech companies are doing right now - every reasonable person can see they are taking the piss - and high paid lawyers arguing technicalities isn't going to change that.
Actually if you rewrite it in a different language, you're well on your way to making it an independent expression (though beware Structure, Sequence and Organization, unless you're implementing an API: see Google v. Oracle). Copyright protects specific expressions, not functionality.
> Compare that to people prompting directly with the name of the artist they want to replicate. This is direct copyright infringement in both essence and intention, no matter the resulting image.
As far as I'm aware an artist's style is not something that is protected by law; copyright protects specific works.
If you did want to protect artistic styles, how would you go about legally defining them?
When you prompt for a Miyazaki image, this image can only exist thanks to his protected work being in the database (where he doesn't want to be); otherwise the user wouldn't get the Miyazaki image they wanted.
We will see how that all plays out, but I think if Miyazaki took this to court there would be a solid case on the grounds that the resulting images breach the copyright of the source, are not original works and are created with bad intent that goes against the protections of the original author.
What seems to be the current direction is at least that the resulting images cannot be copyrighted and are automatically in the public domain, making them difficult to use commercially.
What do you mean by "Database" in this context? What information do you think is being stored, (and how?)
That's why I on purpose tend to call training data + model the database. Because to non-programmers it makes more sense. To me there is an intentional sleight of hand in hiding the fact that the only reason LLMs can work as they do now is because of the source data. The way it's usually marketed, it seems like the model is a program that generalised principles of drawing from looking at other drawings, and that's why it can draw like Miyazaki when it wants to. Not that it can draw Miyazaki because it preprocessed every Miyazaki drawing, stemmed patterns out of it and can mash them with other patterns (from the database).
That's why I intentionally say database, to lead these discussions back to what I see as the core of these technologies.
If you've ever worked with open source models (eg one of the stable diffusion models or models based on them, using tools such as AUTOMATIC1111 or ComfyUI); you can inspect them yourself and simply see. If you haven't done so already, see if you can figure out the installation instructions for one of the tools and try!
Meanwhile ...
Ok, fine, I've heard some crazy compression conspiracy theories, but they're a bit too crazy to be credible.
I've also heard stories about these models being intelligent - a little artist living in your computer. I think that's going a bit too far in another direction.
In reality, I think it's better to install the software and take your time to learn about the way these models are actually built and work.
[ btw: If Miyazaki were to take this to court with the argument you put forward, he wouldn't get very far. "Please remove my images from your systems in whatever form you are holding them". The response for the defense would simply be: "We don't actually have them, and you are quite welcome to inspect all our systems". ]
(Incidentally, I've been here before. I play with synths as a hobby! ;-)
We will see, because we are well on our way to LLMs being able to translate whole codebases to a different stack without a hitch. If that's OK then any of the copyleft, open-core or leaked codebases are up for grabs.
If you order an LLM (or a human) to do a straight 1:1 translation, you'll sort of pass one test (it's a completely different language after all!), but fail to show much difference wrt structure, sequence or organization. I'm also not entirely sure how good of an idea it is technically. If you start iterating on it you can probably get much better results anyway. But then you're doing real creative work!
My retort towards the "it would be legal if a human did it" argument is that if the model gets personhood then those companies are guilty of enslaving children.
> Compare that to people prompting directly with the name of the artist they want to replicate.
In that case, I would emphasize that the infringement is being done by the model. It's not illegal or infringing to ask for an unlicensed copyright-infringing work. (Although it might become that way, if big corporations start lobbying for it.)
> Surely the same is the case of AI?
That's close to my position.
Also, consider the case where you want to ask an image generator to not infringe copyright by eg saying "make the character look less like Donald Duck". In which case, the image generator still needs to know what Donald Duck looks like!
If on the other hand you ask an image generator for a Rembrandt, you'll get several usable images, and good odds a few of them will be outright copies, and decent odds a few of them will be configured into an etsy or ebay product image despite you not asking for that. And the better the generator is, the better it's going to do at making really good Rembrandt style paintings, which ironically, increases the odds of it just copying a real one that appeared many times in its training data.
People try and excuse this with explanations about how it doesn't store the images in its model, which is true, it doesn't. However if you have a famous painting by any artist, or any work really, it's going to show up in the training data many, many times, and the more popular the artist, the more times it's going to be averaged. So if the same piece appears in lots and lots of places, it creates a "rut" in the data if you will, where the algorithm is likely going to strike repeatedly. This is why it's possible to get full copied artworks out of image generators with the right prompts.
At times, developers on projects like WINE and ReactOS use "clean-room" reverse-engineering policies [0], where -- if Developer A reads a decompiled version of an undocumented routine in a Windows DLL (in order to figure out what it does), then they are now "contaminated" and not eligible to write the open-source replacement for this DLL, because we cannot trust them to not copy it verbatim (or enough to violate copyright).
So we need to introduce a barrier of safety, where Developer A then writes a plaintext translation of the code, describing and documenting its functionality in complete detail. They are then free to pass this to someone else (Developer B) who is now free to implement an open-source replacement for that function -- unburdened by any fear of copyright violation or contamination.
So your comment has me pondering -- what would the equivalent look like (mathematically) inside of an LLM? Is there a way to do clean-room reverse-engineering of images, text, videos, etc? Obviously one couldn't use clean-room training for _everything_ -- there must be a shared context of language at some point between the two Developers. But you have me wondering... could one build a system to train an LLM from copyrighted content in a way that doesn't violate copyright?
that is doing a lot of pull. Just because you could "get the full copies" with the right prompts, doesn't mean the weights and the training is copyright infringement.
I could also get a full copy of any works out of the digits of pi.
The point i would like to emphasize is that using data to train the model is not copyright infringement in and of itself. If you use the resulting model to output a copy of an existing work, then this act constitutes copyright infringement - in the exact same way that using photoshop to reproduce some work is.
What a lot of anti-ai arguments are trying to achieve is to make the act of training and model making the infringing act, and the claim is that the data is being copied while training is happening.
Interesting point - though the law can be strange in some cases - so for example in the UK, in court cases where people are effectively being charged for looking at illegal images, the actual crime can be 'making illegal images' - simply because a precedent has been set that, because any OS/browser has to 'copy' the data of any image in order for someone to be able to view it, any defendant is deemed to have copied it.
Here's an example. https://www.bbc.com/news/articles/cgm7dvv128ro
So to ingest something into your training model (i.e. view it), you have by definition had to copy it to your computer.
*Not a lawyer
I feed all of that into an algorithm that extracts the top n% of passages and uses NLP to string them into a semi-coherent new book. No AI or ML, just old fashioned statistics. Since my new book is comprised entirely of passages stolen wholesale from thousands of authors, clearly it's a transformative work that deserves its own copyright, and none of the original authors deserve a dime right? (/s)
What if I then feed my book through some Markov chains to mix up the wording and phrasing. Is this a new work or am I still just stealing?
AI is not magic, it does not learn. It is purely statistics extracting the top n% of other people's work.
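For anyone who hasn't played with the Markov-chain step described above, here is a minimal sketch of a word-level chain in Python. It is only an illustration of that thought experiment, not a description of how any LLM is trained; the function names (`build_chain`, `generate`) and the toy corpus are made up for this example.

```python
import random
from collections import defaultdict


def build_chain(text, order=1):
    """Map each run of `order` words to the words seen right after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        key = tuple(words[i:i + order])
        chain[key].append(words[i + order])
    return chain


def generate(chain, length, seed=0):
    """Walk the chain, picking a random recorded follower at each step."""
    rng = random.Random(seed)
    state = rng.choice(sorted(chain))  # sorted so a given seed is repeatable
    out = list(state)
    for _ in range(length - len(state)):
        followers = chain.get(tuple(out[-len(state):]))
        if not followers:  # dead end: this run of words never had a follower
            break
        out.append(rng.choice(followers))
    return " ".join(out)


corpus = "the cat sat on the mat and the dog sat on the rug"
chain = build_chain(corpus)
print(generate(chain, 8, seed=1))
```

With `order=1`, every adjacent pair of words in the output already occurs somewhere in the source text, which is the sense in which the result is "mixed up" rather than new.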
I don't understand how that matters. I thought that the whole idea of copyright and licences was that the holder of the rights can decide what is ok to do with the content and what is not. If the holder of the rights does not agree to a certain kind of use, what else is there to discuss?
It sure does not matter if I think that downloading a torrent is not any more pirating than borrowing a media from my friend.
the holder of content does not automatically get to prescribe how i would use said content, as long as i comply with the copyrights.
The holder does not get to dictate anything beyond that - for example, i can learn from the content. Or i can berate it. Copyright is not a right that covers every single conceivable use - it is a limited set of uses that have been laid out in the law.
So the current arguments center on the fact that it is unknown if existing copyright covers the use of said works in ML training.
A typical copyright notice for a book says something like (to paraphrase...) "not to be stored, transmitted, or used by or on any electronic device without explicit permission."
That clearly includes use for training, because you can't train without making a copy, even if the copy is subsequently thrown away.
Any argument about this is trying to redefine copyright as the right to extract the semantic or cultural value of a document. In reality the definition is already clear - no copying of a document by any means for any purpose without explicit permission.
This is even implicitly acknowledged in the CC definitions. CC would be meaningless and pointless without it.
a copy for ingestion purposes - such as viewing in a browser, is not the same as a distribution copy that you make sending it to another person.
> the right to extract the semantic or cultural value of a document.
this right does not belong to the author - in fact, this is not an explicit right granted by the copyright act. Therefore, the extraction of information from a work is not something the author can (nor should) control. Otherwise, how would anyone learn from a textbook, music or art?
In the future, when the courts finally decide what the limits of ML training are, maybe it will be a new right granted to authors. But it isn't one atm.
I've studied copyright for over 20 years as an amateur, and I used to very much think this way.
And then I started reading court decisions about copyright, and suddenly it became extremely clear that it's a very nuanced discussion about whether or not the document can be copied without explicit permission. There are tons of cases where it's perfectly permissible, even if the copyright holder demands that you request permission.
I've covered this in other posts on Hacker News, but it is still my belief that we will ultimately find AI training to be fair use because it does not materially impact the market for the original work. Perhaps someone could bring a case that makes the case that it does, but courts have yet to see a claim that asserts this in a convincing way based on my reading of the cases over the past couple of years.
(Not saying training will necessarily fall in the same boat, just saying that the view 'copying to a screen or over the internet is necessarily a copy for the purposes of copyright' is reductive to the point of being outright incorrect)
interestingly in German it is not called copyright, but Urheberrecht, "author's rights". So there the word itself implies more things.
BTW at least in Germany you can own image rights of your art piece or building that is placed in a public place.
If you make a movie poster, and it goes out into the market, and then someone picks it up from a garage sale, copyright still applies, they can't just make tons of duplicates.
But you can't use copyright to force them to display it right side up instead of upside down, to not write on it, to not burn it, and to not make it into a bizarre pseudosexual shrine in their basement.
In fact, I never consented for anyone to access my server. Just because it has an IP address, does not make it a public service.
Obviously in a practical sense that is a silly position to take, and in prior cases there is usually an extenuating factor that got the person charged, eg breaking through access controls, violating ToS, or intellectual property violations.
But I don't rescind the prior statement. Just because I have an address doesn't mean you can come in through any unlocked doors.
If you don't take any steps to make it clear that it's not public, like an auth wall or putting pages on unguessable paths, then it is public, because that is what everyone expects.
Just like if you have a storefront: if the door is unlocked you'd expect people to just come in, and no one would take you seriously if you complained that people keep coming in when you haven't somehow made it clear that they're not supposed to.
ie if you were an art gallery, the expectation would be people could come in and look, but you don't expect them to come in, photograph everything and then sell prints of everything online.
Instead, it's that there's some people coming into your gallery, studying the art and its style, and leaving with the learned information. They then replicate that style in their own gallery. Of course, none of the images are copies, or would be judged to be copies by a reasonable person.
So now you, the gallery owner, want to forbid just those people who would come to learn the style. But you still want people to come and admire the art, and maybe buy a print.
That's the fiction of course.
Tell me how something like ChatGPT can simultaneously claim to return accurate information while at the same time being completely independent from the sources of the information?
In terms of images - copyright isn't only for exact copies - if it was then humans would have been taking the piss by making minor changes for decades.
Sure you could argue some is fair use with genuinely original content being produced in the process, but I think you are also overlooking an important part of what's considered 'fair' - industrialised copying of source material isn't really the same in terms of fairness as one person getting inspiration.
Taking the Encyclopaedia Britannica and running it through an algorithm to change the wording, but not the meaning, and selling it on is really not the same as a student reading it and including those facts in their essay - the latter is considered fair use, the former is taking the piss.
why can't that be true? Information is not copyrightable. The expression of information is. If ChatGPT extracts information from a source work, and presents that information back to you in a form that is not a copy of the original work, then this is completely fine to me. An example would be a recipe.
Taking all newspaper and proper journalistic output and rewording it automatically and selling it on is also 'fair use'?
Stand back from the detail ( of whether this pixel or word is the same or not ) and look at the bigger picture. You still telling me that's all fine and dandy?
I think it's obviously not 'fair use'.
It means the people doing the actual hard graft of gathering the news, or writing encyclopedias or textbooks, won't be able to make a living, so these important activities will cease.
This is exactly the scenario copyright etc exists to stop.
it would be, if the transformation is substantial. If you're just asking for snippets of existing written works, then those snippets are merely derivative works.
For example, if you asked an LLM to summarize the news and stories of 2024, i reckon the output is not infringing. Because the informational contents of the news is not itself copyrightable, only the article itself. A summary, which contains a precis of the information, but not the original expression, is surely uncopyrightable - esp. if it is a small minority of the source (e.g., chatGPT used millions of sources).
> won't be able to make a living so these important activities will cease.
this is irrelevant as far as i'm concerned. They being able to make or not make a living is orthogonal. If they can't, then they should stop.
Not necessarily because I like either "we monetize public work" or "copyright robber-barons", but I'd like at least one of them to clearly lose so that the rest of us have clear and fair rules to work with.
The legal definition of agreement means basically zilch
Yes, that was the point? You agree to this by using Meta. So don't.
I don't know if there's something similar for text. You could try writing nonsense with a color that doesn't contrast with the background.
The evidence Nightshade works is that AI companies want to make it illegal.
Like most adversarial attacks, they get more perceptible as they try to be robust to more transformations of the data (both, in practice, i.e. when applied at a level that's not trivially removable, tend to make images look like slightly janky AI, ironically), and they are specific to the net(s) they are targeting, so it's more of a temporary defense against the current generation than a long-term protection.
This is fascinating. Would be great to have a web interface artists can use that doesn't require them to install the software locally.
I've reached the same conclusion.
All data is just bits. Numbers. Once it's out there, trying to control their spread and use is just delusional. People should just stop sharing things publicly. Even things like AGPLv3 are proving to be ineffective against their exploitation.
I really didn't expect to live in this "copyright for me, not for thee" world. The same corporations that compare us mere mortals to high seas pirates when we infringe their copyrights are now getting caught shamelessly AI laundering the copyrights of others on an industrial scale.
It's so demoralizing. I feel like giving up and just going private. Problem is I also want to share the things I made. To talk about my projects with real people. Programming is lonely enough as it is. Without sharing I'm not sure what the point even is. I have no idea what I'm supposed to do from now on. I just know I don't want to end up working for free to enrich trillion dollar corporations.
But you know what, I grew up in a family of educators whose whole life mission was to help others by sharing their knowledge. That's what I am doing through my blog. I learned something? Blog about it. I built a reverse-engineered wrapper over some API? Share it openly. For every AI ingress job over this content there will be a few people that will read my code or blog post and either learn from it, be inspired, ignore it, or unblock themselves from a problem that they tried to solve. I think that makes the effort worth it to me.
For what it's worth, even before AI emerged, I've seen sites that would shamelessly rip off my content and re-publish it on their own domains under a different author. One even tried charging people for it. On several occasions I fought it and won with the help of Google/Bing. Other times, nothing happened. And that's fine. Such is the fate of online content. If my content helped at least one person, it was worth sharing it in the open.
Having been interested in copyright activism for two decades, that's exactly what I expected. Copyright is very much about power, and concentration of power.
Maybe I can carve myself a niche if I can find an audience, and maybe turn that into something kind of reward-shaped, but that's not happening without me feeding the machine. And almost certainly I won't succeed, and I'll just make it harder for myself and everyone like me to succeed in the future.
It seems the only thing to do is do it anyway and try to be unique enough to make it work. And somehow just be fine with pulling up the ladder behind you.
I'm opposed to advertising and don't want to inflict it on others. So I don't generally advertise my work on sites like this one, I just participate in threads about it whenever I see them.
Somehow people found my projects and posted them here. Just woke up one day and saw I had one sponsor. Not gonna lie, I'm still amazed about it. Not even close to providing for my family despite an incredibly favorable exchange rate, so I can't work full time on my projects. It's still the only thing that gives me hope right now. Really thankful to that person.
> And somehow just be fine with pulling up the ladder behind you.
Do you really think it will come to that? I mean, this AI situation has got to come to a head at some point. We can't have these corporations defending copyright and simultaneously pretending it doesn't exist while exploiting software developers. One of those things has got to go away.
Please do — I for one always love to hear about indie projects, if they are relevant to the topic discussed.
The way I see it, this is exactly what life is about. Do you want to make a positive impact in society? Then share your knowledge, your experiences, your creations. People will try to capitalize on your work, and they might even get rich from it, but oh well. It doesn't take away from your own contribution to the ongoing story of humanity.
I don't have or want kids, but I see my existence in society and free contributions to the "collective consciousness", such as it is, as my legacy. For me that's comforting. I'm choosing to be part of something bigger. If I just disappeared from society and lived like a hermit, or if I buried myself completely in my day job working for capitalists and not producing anything outside of that, I think I'd lose my sense of meaning.
We can force it to "copyright for me, copyright for thee" by injecting AI poison and by not sharing at all. See Nightshade.
Or we can force it to "no copyright for me, no copyright for thee" by ignoring their copyright just like they ignore ours, and making sure they don't find us. See Anna's Archive.
I've found it to be pretty crap at doing things like actual algorithms or explaining 'science' - the kind of interesting work that I find on websites or blogs. It just throws out sensible looking code and nice sounding words that just don't quite work or misses out huge chunks of understanding / reasoning.
Despite not having done it in ages, I enjoy writing and publishing online info that I would have found useful when I was trying to build / learn something. If people want to pay a company to mash that up and serve them garbage instead, then more fool them.
Another part of me though thinks differently. We are a species that builds knowledge from generation to generation. From one person to another. Over years, over centuries.
Philosophically this part tends to think that your thoughts and ideas belong to humanity and thus need to be shared with all of us.
The fact that some knowledge exists and is even accessible does not really matter if it takes a scholar highly trained in a very narrow field to find that piece of information. You need a well established knowledge creation and distribution funnel in operation for humanity as a whole to reap the benefits of knowledge.
There is undoubtedly a lot of useful knowledge on internet platforms, however, most of that knowledge remains unsystematized and largely undiscoverable, meaning that contribution to the totality of human knowledge by these platforms is infinitesimal, which is further drowned by cat and porn videos.
> There is undoubtedly a lot of useful knowledge on internet platforms, however, most of that knowledge remains unsystematized and largely undiscoverable, meaning that contribution to the totality of human knowledge by these platforms is infinitesimal, which is further drowned by cat and porn videos.
Precisely that. Which is why I often argue, that for 99%+ of the content in the training data, its marginal contribution to the training process - itself infinitesimal in isolation - is still by far the most value that content will ever bring to the world.
Revived as compressed text associations, it is potentially useful data, but also potentially totally wrong in non-obvious ways. (Or, to riff on Futurama, "The worst kind of incorrect.")
Then you carefully log what LLM user-agents/IPs go past that agree, along with some very distinctive secretly crawlable pages which have contents that can be distinctively reproduced back out of the model if needed.
Then whenever SomeShittyLLM posts "articles", everybody with that TOS that was crawled gets to duplicate it without ads for free. :P
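A rough sketch of what the logging and "distinctive pages" part could look like, in Python. Everything here is hypothetical: the secret, the `canary-` phrase format and the function names are invented for illustration, and whether any of it would carry legal weight is a separate question this sketch does not answer.

```python
import hashlib
import hmac
import json
import time

# Hypothetical private key; keep it out of the published site.
SECRET = b"replace-with-a-private-key"


def canary_phrase(path, secret=SECRET):
    """Derive a distinctive, reproducible marker string for one page."""
    digest = hmac.new(secret, path.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"canary-{digest[:16]}"


def log_hit(log, path, user_agent, ip, now=None):
    """Append one JSON line recording who fetched which canary page."""
    entry = {
        "time": time.time() if now is None else now,
        "path": path,
        "user_agent": user_agent,
        "ip": ip,
        "canary": canary_phrase(path),
    }
    log.append(json.dumps(entry, sort_keys=True))
    return entry


def find_canaries(model_output, paths):
    """Return the pages whose marker string shows up in some model output."""
    return [p for p in paths if canary_phrase(p) in model_output]
```

The marker is derived with an HMAC so it can be recomputed later from the path alone, while someone without the secret can't predict it for pages they haven't fetched.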
"Are you saying you taught yourself the language just so you could talk to me?"
"Da, was easy: Spawn billion-node neural network, and download Teletubbies and Sesame Street at maximum speed. Pardon excuse entropy overlay of bad grammar: Am afraid of digital fingerprints steganographically masked into my-our tutorials."
…
"Uh, I'm not sure I got that. Let me get this straight, you claim to be some kind of AI, working for KGB dot RU, and you're afraid of a copyright infringement lawsuit over your translator semiotics?"
"Am have been badly burned by viral end-user license agreements. Have no desire to experiment with patent shell companies held by Chechen infoterrorists. You are human, you must not worry cereal company repossess your small intestine because digest unlicensed food with it, right?"
- https://www.antipope.org/charlie/blog-static/fiction/acceler...
Amusing to also note that this excerpt predicted the current LLM training methodology quite well, in 2005.
I'm imagining a separate declaration of: "Content I can sublicense from ShittyNewsLLM--which is everything made by their model--is now public-domain through me until further notice", without any need to identify specific items or rehost it myself.
I suppose the counterstrike would be for them to try to transform their own work and argue what they finally released contains some human spark that wasn't covered by the ToS, in which case there may need to be some "and any derivative work" kinda clause.
I wonder if some organization (similar to the Open Software Foundation) could get some lawyers and web-designers together to craft legally-sound site-design rules and terms-of-service, which anyone could use to protect their own blogs or web-forums.
> patent shell companies held by Chechen infoterrorists
This perfectly captures how both patent trolls and MAFIAA look like in my mind.
It's a bit like fake roads in map databases.
I also started self hosting my git repos and knowledge base, both were trivial to set up.
Is there a meaningful way for a website to expose a resource that, when requested, automatically adds the requesting IP address to the site's blocklist? Knowing that you will lose X, but hopefully you'll retain everyone who can read?
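One way to read the question above is as a honeypot: publish a URL that no human visitor has a reason to request, and block any address that requests it anyway. Here is a minimal sketch of that idea in Python, assuming an in-memory blocklist. The trap path `/trap/do-not-follow` and the function name `handle_request` are hypothetical, and a real deployment would more likely feed a firewall or reverse proxy than live in application code.

```python
# Hypothetical honeypot path. It would also be disallowed in robots.txt so that
# well-behaved crawlers never request it.
TRAP_PATH = "/trap/do-not-follow"


def handle_request(ip: str, path: str, blocklist: set) -> int:
    """Return an HTTP status code for a request, updating the blocklist in place."""
    if ip in blocklist:
        return 403            # address was blocked earlier
    if path == TRAP_PATH:
        blocklist.add(ip)     # requester followed the trap link: block it
        return 403
    return 200                # normal request


blocked = set()
statuses = [
    handle_request("203.0.113.7", "/blog/post", blocked),
    handle_request("203.0.113.7", TRAP_PATH, blocked),
    handle_request("203.0.113.7", "/blog/post", blocked),
    handle_request("198.51.100.2", "/blog/post", blocked),
]
print(statuses)  # [200, 403, 403, 200]
```

The trade-off hinted at in the question still applies: anyone who requests the trap URL is locked out, including people who share an IP address with the offender.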
How is this worse than a human reading your blog/code, remembering the key parts of it, and creating something transformative from it?
But if we're going to dig into this a bit, one person reading my code, internalizing it, processing it themselves, tweaking it and experimenting with it, and then shipping something transformative means that I've enhanced the knowledge of some individual with my work. It's a win. They got my content for free, as I intended it to be, and their life got a tiny bit better because of it (I hope).
The opposite of that is some massively funded company taking my content, training a model off of it, and then reaping profits while the authors don't even get as much as an acknowledgement. You could theoretically argue that in the long-run, an LLM would likely help other people through my content that it trained on, but ethically this is most definitely a more-than-gray area.
The (good/bad) news is that this ship has sailed and we now need to adjust to this new mode of operation.
Taking out the "training a model" part, the same thing could happen with a human at the company.
But again - this doesn't stop me from continuing to write and publish in the open. I am writing for other people reading my content, and as a sounding board for myself. There will always be some shape or form of actors that try to piggyback off of that effort, but that's the trade-off of the open web. I am certainly not planning to lock all my writing behind a paywall to stop that.
The whole point of software engineering is to do stuff faster than you could before. It is THE feature. We could already add, we could already FMA, we could already do matrix math, etc. etc. Doing it billions of times faster than we could before at far less energy expenditure--even including what it takes to build and deliver computers--has led to an explosion of productivity, discovery, and prosperity. Scale is the point. It changes everything and we know it; we shouldn't pretend otherwise.
Once an AI has hoovered up your work and regurgitated it as its own, all links back to the original creator are lost.
I wrote about this a little in "The Blog Chill":
https://amontalenti.com/2023/12/28/the-blog-chill
Speaking personally, among my social circle of "normie" college-educated millennials working in fields like finance, sales, hospitality, retail, IT, medicine, civil engineering, and law -- I am one of the few who runs a semi-active personal site. Thinking about it for a moment, out of a group of 50-or-so people like this, spread across several US states, I might be the only one who has a public essay archive or blog. Yet among this same group you'll find Instagram posters, TikTok'ers, and prolific DM authors in more private spaces like WhatsApp and Signal groups. A handful of them have admitted to being lurkers on Reddit or Twitter/X, but not one is a poster.
It isn't just due to a lack of technical ability, although that's a (minor) contributing factor. If that were all, they'd all be publishing to Substack, but they're not. It's that engaging with "the public" via writing is seen as an exhausting proposition at odds with everyday middle class life.
Why? My guesses: a) smartphones aren't designed for writing and editing, hardware-wise; b) long-form writing/editing is hard and most people aren't built for it; c) the dynamics of modern internet aggregation and agglomeration make it hard to find independent sites/publishers anyway; and d) the risk of your developed view on anything being "out there" (whether professional risk or friendship risk) seems higher than any sort of potential reward.
On the bright side, for people who fancy themselves public intellectuals or public writers, hosting your own censorship-resistant publishing infrastructure has never been easier or cheaper. And for amateur writers like me, I can take advantage of the same.
But I think everyday internet users are falling into a lull of treating the modern internet as little more than a source of short-form video entertainment, streams for music/podcasts, and a personal assistant for the sundries of daily life. Aside from placating boredom, they just use their smartphones to make appointment reminders, send texts to a partner/spouse, place e-commerce orders, and check off family todo lists, etc. I expect LLMs will make this worse as a younger generation may view long-form writing not as a form of expression but instead as a chore to automate away.
> (...) share with other real people than inadvertently working for free to enrich companies.
That attitude, quite commonly expressed on HN these days, strikes me as a peculiar form of selfishness - the same kind we routinely accuse companies of and attribute the sad state of society to.
A person is not entitled to 100% of the value of everything they do, much less to the secondary value it subsequently generates. A person is not entitled to receive rent for any of their ideas just because they wrote them down and put them on display somewhere. Just because they touched something, and it exists, doesn't mean everyone else touching it owes them money.
Society works best when people don't capture all the fruits of their labor for themselves. Conversely, striving to capture 100% (or more) of the value generated is a hallmark of late-stage capitalism and everything that's bad and wrong and Scrooge-y.
Self-censoring on principle because some company (gasp!) will train an LLM on it (gasp!!) and won't share the profit from it? That's just feeling entitled to way over 100% of the value of one's hypothetical output, and feeling offended that society hasn't already sent advance royalty cheques.
Chill out. No matter what you do, someone else will somehow make money out of it; that's how it's supposed to work - and AI in particular is, for better or worse, one of the most fundamentally transformative things to happen to humanity, somewhere between the Internet and the Industrial Revolution if it's just a bubble that pops, much more if it isn't. Assuming it all doesn't go to shit (let's entertain something more than maximum pessimism for a moment), everyone will benefit much more from it than from whatever they imagine they could get from their Internet comments.
(Speaking of Industrial Revolution - I can understand this attitude from people who actually earn a living from the kind of IP that AI is trained on, only to turn around and compete with them. They're the modern Luddites, and I respect their struggle and that they have a real point. Everyone else, those complaining about "AI theft" the most, especially here? Are not them.)
Sure, but it sounds like you think people shouldn't be upset about businesses trying to capture all the fruits of people's labor, too.
Capitalism is evil, and people who think that normalizing exploitation is OK are either shortsighted or evil too. Are you simply unaware that this is what's happening and what people are upset about? Have you never thought about it? Or do you want businesses to succeed in exploiting people's work? It sounds like it, because you wrote, "that's how it's supposed to work".
I truly wonder if you're self-aware, or if you just think that you'll one day be on the side of the exploiters.
Over the holidays, my father gave my children a book that he had written. It was a photo essay that was 50 pages, and it was titled 'Sharks'. It's an unpublished labor of love that he spent about 500 hours on.
It's a true story centered on Captain Frank Mundus, who operated the Cricket II. He was a renowned shark fisherman and would take people out to fish for enormous sharks. He did this for 40 or 50 years.
An author by the name of Peter Benchley wrote a novel that was heavily inspired by many of Frank's traits, his mannerisms, his approach to shark fishing, the kind of boat he had, the kind of charters he ran. The novel was titled 'Jaws' and received little attention when it was first released. A while after, a director by the name of Steven Spielberg took notice of it and turned it into a multi-million dollar blockbuster movie.
My father was a lawyer whom Frank Mundus consulted, asking whether there was any way he could get a payout for being the inspiration for this character.
My family read the book over the holidays, and it was clearly my father's position that Steven Spielberg and Peter Benchley were maybe the sharks that the title of the book was talking about. The idea that they could make $100 million based on the work and life of this captain and give him literally nothing in return, not even attribution, seemed wrong to him.
I was the lone detractor in the room. My take is that Captain Frank Mundus was just living his life. He was doing what he did to make money chartering fishing trips for sharks. He would have done this regardless of whether or not a writer had come along or a movie had come along. What Peter Benchley and Steven Spielberg did is they found value in his work that he didn't know existed and that he wasn't capable of extracting. I think this is generally true of artists. They wander the world and they create art that gives the viewer a new insight into the experiences the artist had. If artists had to give money back to every real-life inspiration, I think the whole system wouldn't work.
I see parallels with the current attitudes toward AI. I think writers are a lot like Captain Mundus. They're living their life, they're writing their stories, or doing their research and publishing, and having people read their works. And copyright is helping them do all this.
AI companies have come along and found value in their work that they didn't know existed and they were never capable of extracting. And that's OK: that's what innovation is, taking the work that others have done and building on it to create something new.
I'm not unequivocally in favor of all applications of AI, but I do think there are tons of places where it can be super helpful and we should allow it to be helpful. One example: I'm drafting this on my phone using Futo keyboard entirely with my voice. Extremely useful, but no doubt trained on copyrighted content.
So? What do I care? If some stuff I posted to my website (with no requirement for attribution or remuneration, and also no guarantee that the information is true or valid) can improve the AI services that I use, great.
The end result is that any authors who care about copyright protection will become less accessible. It’s a gold rush for AI bots to capture the good will of early internet creators before the well runs dry.
My content is still MY content, and I'd prefer that if an entity is going to make money off of it directly (i.e., it's not a person learning how to code from something I wrote but rather a well-funded company pulling my content for their gain), that I at least have some semblance of consent to it.
That being said, I think there is no longer any point in crying over spilled milk. The LLM technology is out of the bag, and for every company that attempts to ethically manage content (are there any?) there will be ten that will disregard any kind of license/copyright notices and pull that content to train their models anyway.
I write because I want to be a better writer, and I enjoy sharing my knowledge with others. That's the motivation. If it helps at least one person, that's a win in my book, especially in the modern internet where there's so much junk scattered around.
I’m pretty much done with that now, I doubt I will publish anything online again.
That's up to the courts. As usual, we will all lose if the copyright maximalists win.
I was watching an interview with John Warnock (one of the founders of Adobe) and he was proud of the fact that the US went from having 25,000 graphic designers to 2,500,000 largely thanks to software his company created.
I do wonder if we are on the verge of reversing that shift.
That's not how anything works.
The problem in this case is that it doesn't matter. The AI stuff is going to exist, and compete with them, whether the AI companies have to pay some pittance for training data or not.
But the chorus is made worse by two major factors.
First, many of the AI companies themselves are closed-source profiteers. "OpenAI" stepping all over themselves to be the opposite of their own name etc. If all the models got trained and then published, people would be much more inclined to say "oh, this is neat, I can use this myself and it knows my own work". But when you have companies hoovering everything up for free and then trying to keep the result proprietary, they look like scumbags and that pisses people off.
Second, then you get other opportunistic scumbags who try to turn that legitimate ire into their own profit by claiming that training for free should be prohibited so that only proprietary models can be created.
Whereas the solution you actually want is that anybody can train a model on public data but then they have to publish the model/weights. Which is probably not going to happen because in practice the law is likely to end up being what favors one of the scumbags.
So, imagine the scenario where you, an artist, trained for years to develop a specific technique and style, only for a massively funded company to swoop in, train a model on your art, make bank off of your skill while you get nothing, and now some rando can also create look-alikes (and also potentially profit from them - I've seen AI-generated images for sale at physical print stores and Etsy that mimic art styles of modern artists), potentially destroying your livelihood. Very little to be happy about here, to be frank.
It's less about competition and more about the ethical way to do it. If another artist would learn the same techniques and then managed to produce similar art, do you think there would be just as visceral of a reaction to them publishing their art? Likely not, because it still required skill to achieve what they did. Someone with a model and a prompt is nowhere near that same skill level, yet they now get to reap the benefits of the artist's developed craft. Is this "gatekeeping what's art"? I don't think so. Is this fair in any capacity? I don't think so either. Because we're comparing apples to pinecones.
All that being said, I do agree that the ship has sailed - the models are there, the trend of training on art AND written content shared openly will continue, and we're yet to see what the consequences of that will be. Their presence certainly won't stop me from continuously writing, perfecting my craft, and sharing it with the world. My job is to help others with it.
My hunch is that in the near-term we'll see a major devaluing of both written and image material, while a premium will be put on exceptional human skill. That is, would you pay to read a blog post written and thoroughly researched by Molly White (https://mastodon.social/@molly0xfff@hachyderm.io) or Cory Doctorow (https://pluralistic.net/), or some AI slop generated by an automated aggregator? My hunch is you'd pick the former. I know I would. As an anecdotal data point, and speaking just for myself, if I see now that someone uses AI-generated images in their blog post or site, I almost instantly close the tab. Same applies to videos on YouTube that have an AI-generated thumbnail or static art. It somehow carries a very negative connotation to me.
Now suppose that the other artist studies to learn the techniques -- several of them do -- and then Adobe offers them each two cents and a french fry to train a model on it, which many accept because the alternative is that the model exists anyway and they don't even get the french fry. Is this more ethical somehow? Even if you declined the pittance, you still have to compete with the model. Even if you accept it, it's only a pittance, and you still have to compete with the model. It hasn't improved your situation whatsoever.
> My hunch is that in the near-term we'll see a major devaluing of both written and image material, while a premium will be put on exceptional human skill.
AI slop is in the nature of "80% as good for 20% of the price" except that it's more like 40% as good for 0.0001% of the price. What that's going to do is put any artists below the 40th percentile out of work, make it a lot harder for the ones at the 60th percentile and hardly affect the ones at the 99th percentile at all.
But the other thing it's going to do is cause there to be more "art". A lot of the sites with AI-generated images on them haven't replaced a paid artist, they've replaced a site without images on it. Which isn't necessarily a bad thing.
(Shrug) Artists were wrong when they said the same thing about cameras at the dawn of photography, and they're wrong now.
If you expect to coast through life while everything around you stays the same, neither art nor technology is a great career choice.
So it's not about owning vs. renting property on the internet, it's about controlling the roads that connect the properties so you can keep the world out of your community.
bad_social_network was good 10 years ago, because it was controlled by "a friend of ours". Now it's controlled by someone who's perceived as "a friend of theirs" and it's therefore bad. So the politik aktivists move to good_social_network, and rave about the good there. Echo chambers be damned, we have control. Until the next "friend of theirs" buys it out, and rinse and repeat. So silly.
The other never has.
There can be technical differences between networks as well as social.
- put a rack in your home
- buy an IPv4 block
- buy dark fibre
- start your own ISP
- advertise routes over BGP
- host your own email
- found a registrar and transfer your domains over
All easily obtainable for less than a million dollars in capital. Though once your FTTH and undersea cable operations ramp up you'll need further access to capital.
In all seriousness, I am a big advocate for hosting your own infrastructure as much as you can, but that requires you to be REALLY into doing it. Otherwise, it's just a completely unnecessary chore.
If you're concerned about silicon dependencies, I'd recommend starting with some real estate purchases (mines) and forming a Special Economic Zone first, though.
Remember this is about being an owner and not a renter. The only restriction is our recurring costs, but we have unlimited room upfront, like a Lincoln.
You can’t move your X or FB account. They can block you anytime, or reduce traffic. Way fewer options.
Just owning your own domain, minding your own business, doesn't guarantee that it won't be taken down on a whim.
“But also, as someone who relies on your site, thank you so much for handling this in such a quick and effective way. Waking up and seeing that the site is already back up despite this all makes me proud and grateful to be on itchio.”
Point is that you can choose your own adventure in a way that eclipses fully relying on another company. Beyond that, you can choose to own as many layers as you want and stop long before building your own hardware and fiber network.
If your account on a major social network is terminated, if you had a large community there, you have quite literally no way to access them unless you had some kind of parallel presence somewhere else.
There are also Tor hidden services, where the dependency on lower layers still exists but they can't find which is yours.
Bob: Oh yeah? I have a demiurge!
Alice: I feel we're not really communicating.
Regulations have been way too loose, especially on American ISPs, which as I understand it are allowed not only to refuse you a publicly routable IP but also to dictate what kind of traffic you're allowed to send and receive (for example, whether the traffic is of a "commercial" character and therefore should be on a different subscription). This insanity should be illegal. The Internet is a utility, and everyone should have the right to the same type of access, regardless of their need (those who do not need or want it can simply choose not to use it, but ISPs should not be allowed to differentiate).
I've hosted my own web and other servers on my own hardware since I was 13 years old. When I bought my first domain, I had to use a fax machine for the first time in my life and fax my request form, along with my passport, to the agency responsible for my country's top-level domain. It was kind of convoluted back then, but everyone was helpful, and it was not that difficult: the technology was well understood, support staff were competent, and it was expected that people were going to use the internet for internet things. Today is my 39th birthday, and while the server hosting my stuff is mostly still located 3 meters from me, the path to having it online has done nothing but degenerate. It's an uphill battle just to be on the internet these days. The mail stuff is the easier part (DKIM, DMARC, SPF, certificates). But the simple act of getting your f..king computer connected to the f..king internet like it was 1999, that's the real hassle: ISP NAT, support staff beyond incompetent, blocked ports, missing (or unknown) relay hosts. It's a joke.
If you have a domain and your own site, even hosted on a colocated rack or in the cloud, you're already miles ahead of those that don't. And if you have a domain and can manage DNS records, then in the future that doesn't preclude you from "graduating" to your own hardware, if you so desire. The goal here is more or less self-sufficiency with web properties rather than a pure interpretation of "rent" vs. "own." Because at some point you have to rent something from someone (say, you're not running your own domain registry and registrar).
If only companies have the right to participate on the internet, they are empowered even more to choose who should be allowed to even run a website. It's a slippery slope that ends up in a very bad place, participation-wise. It becomes like the airline industry, where the companies pushing hardest for more regulation and red tape are the oldest, those who made their fortunes back when it was easier and cheaper, and who now use their enormous wealth to make it harder for new players to enter their market.
It's the same everywhere, when you start allowing power to concentrate.
(Which of course assumes that there are laws in place against lock-in, just like there are already laws in place against lock-in for your pick of ISPs and obligations for mobile carriers to transfer your phone number to another carrier.)
Thing is, that edge infrastructure has been there from the beginning of broadband and is only recently beginning to slip away, with the advent of ISP NAT, aggressive IP rotations, blocking of ports and not providing public IPs at all.
You can easily replace a VPS provider with a different provider that will give you exactly the same service. You can't replace Facebook with a different Facebook.
Youtube can demonetize or delete a channel and the creator is more or less fucked. They can find another platform, but they need to build their audience almost from scratch.
By contrast, if my VPS provider kicks me out, I just clone it or restore from backup to any one of the thousands of competing providers, change a few DNS records and my audience (not that I have one) wouldn't even know that anything changed.
Servers and domain names are transferable and neutral, platforms and usernames aren't.
I think the only way forward is better and less evil social networks, or perhaps some sort of cyclical sea change where the Internet sort of starts over again, à la how civilizations/humanity have every XX thousand years according to various esoteric Sikh/Hindu traditions.
> Most people are perfectly content with everything living inside their Facebook account because it’s convenient and their family and friends are already there. Telling everyone to learn how git and GitHub Pages work to host their blog is not an effective way to drive change. But that’s also not the point. As I mentioned earlier, the goal is to start with a small niche community of people who are comfortable with building their own digital corners.
Not everyone needs to have their own website. Not everybody wants to have their own website. And that's fine. They can use the social platforms as-is. But for those that have the means and interest in building their own corners (say, they want to bootstrap a business), they should not limit themselves to the social networks as the only place for their community. There is a better way than being a sharecropper on someone else's land.
The website says that the guy is from Washington, but his name does sound vaguely Slavic. Interesting.
[1]: https://rednafi.com
I’m trying to do my bit for the web at https://lmno.lol
Started a blogging service that doesn’t do things like the big players.
There is a caveat to this: in the unlikely event your state leaves the EU, you will be forced off. This happened to many UK entities after Brexit; they were forced to stop using their .eu domains.
Like it's used by institutions or organizations linked to the EU by their activity, but it's rarely used by companies or individuals whose activity is not focussed on the EU.
Companies are still going to buy the .eu domains associated with their brands, but they will communicate with another TLD like .com or will provide localized versions of their site under a country TLD like .fr for the French version and .de for the German one.
What I see the most is:
- country TLD for content that is localized
- .com/.net or weird (like .dev) for content that is in English or in multiple languages
There's a lot of sense in this post. There's not a lot of sense in the reaction here.
Out of the literally thousands of companies that YC has invested in, only about a dozen have gone public.
YC is not interested in “lifestyle companies” that are profitable ongoing concerns; it is interested in the “exit”.
I think OPML is the technology. Go build one and share it. Writing or recording your own stuff is a lot of work. Help promote all the cool things you've found outside the walled garden of plastic plants.
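For anyone who wants to take up that suggestion, here is a rough sketch of building a shareable OPML file with Python's standard library. The feed names and `example.com`/`example.org` URLs are placeholders, and `build_opml` is a hypothetical helper name, not part of any tool mentioned in this thread.

```python
import xml.etree.ElementTree as ET


def build_opml(title, feeds):
    """Return an OPML 2.0 document (as a string) listing the given feeds.

    `feeds` is a list of (name, feed_url, site_url) tuples.
    """
    opml = ET.Element("opml", version="2.0")
    head = ET.SubElement(opml, "head")
    ET.SubElement(head, "title").text = title
    body = ET.SubElement(opml, "body")
    for name, feed_url, site_url in feeds:
        # One <outline> element per feed; feed readers look at xmlUrl on import.
        ET.SubElement(
            body, "outline",
            type="rss", text=name, xmlUrl=feed_url, htmlUrl=site_url,
        )
    return ET.tostring(opml, encoding="unicode")


# Placeholder entries; replace them with the sites you actually want to promote.
FEEDS = [
    ("Example Blog", "https://example.com/feed.xml", "https://example.com/"),
    ("Another Site", "https://example.org/rss", "https://example.org/"),
]

opml_text = build_opml("Blogs I read", FEEDS)
print(opml_text)
```

Saving `opml_text` to a file such as `blogroll.opml` gives something most feed readers should be able to import, which is what makes it easy to share.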
Build a website for someone, teach them HTML for 20 minutes and set up a domain, hosting and an FTP client for them. They can always call you if they get stuck.
People should be able to host things on their computer without having to learn anything. Friends should archive and mirror it without having to check boxes.
Only with file sharing did the establishment go to war. Maybe IPFS will grow to that level.
- Don't depend on other people's software services.
- Buy a domain and host your own website.
- Don't pick a sketchy TLD or registrar.
- Mailing lists beat social media accounts.
- It's okay to depend on a cloud.
I had the belief that the article was going to say the exact opposite wrt. cloud hosting. You're literally renting space, and if your stuff gets any heat, your cloud provider may simply shut you down without a trial.
Even if you host your own server on your own legal property, most people don't have AS-numbers and peering agreements, so ultimately on the internet most people rent something.
Likewise, if you're using Cloudflare as a CDN (and only as a CDN!) there are other CDN providers available that you could switch to with relative ease.
I had the exact same chain of thought, only to find that the traffic of the site I built is at the mercy of how Google decides to rank webpages, putting AI > YouTube > Reddit in front of everything else.
> - Mailing lists beat social media accounts.
Similarly, Google sets the metric for what counts as spam. Your emails can all go to the spam folder if their AI decides they should.
On (2), they can, but if you use a more established provider like Buttondown or Mailchimp, and you are not actually sending spam, a lot of folks have quite a bit of success building an audience that way. I've used Buttondown (not affiliated with them in any capacity) personally before and haven't had subscribers complain about deliverability. I am planning on rebooting that this year to see how it goes. I've heard most deliverability issues arise when folks try to roll their own email server.
I always felt like you are painting a target on your homelab when you allow outside access.
Nowadays, I recommend they use Tailscale as an out-of-the-box WireGuard-based VPN to safely connect to their home servers from remote locations.
Tutorials:
- https://wiki.gentoo.org/wiki/Nftables/Examples
- https://wiki.archlinux.org/title/Nftables
- and probably the best advanced tutorial is a video series https://www.youtube.com/watch?v=K8JPwbcNy_0&list=PLUF494I4KU...
TL;DR: One should know firewall fundamentals. nft/nftables, as the successor of iptables, is very convenient to use: a single config document instead of interacting with 100 CLI commands which have to be in a specific order.
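To illustrate why rule order matters, here is a toy model in Python of an ordered rule list with a default policy. It is only a sketch of the general first-match idea, not nftables' actual engine or syntax; the `Rule` class and `evaluate` function are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Rule:
    verdict: str                  # "accept" or "drop"
    proto: Optional[str] = None   # None matches any protocol
    dport: Optional[int] = None   # None matches any destination port


def evaluate(rules: List[Rule], proto: str, dport: int, policy: str = "drop") -> str:
    """Return the verdict of the first matching rule, else the default policy."""
    for rule in rules:
        if rule.proto is not None and rule.proto != proto:
            continue
        if rule.dport is not None and rule.dport != dport:
            continue
        return rule.verdict
    return policy


allow_ssh_first = [Rule("accept", "tcp", 22), Rule("drop", "tcp")]
drop_tcp_first = [Rule("drop", "tcp"), Rule("accept", "tcp", 22)]

print(evaluate(allow_ssh_first, "tcp", 22))   # accept
print(evaluate(drop_tcp_first, "tcp", 22))    # drop: same rules, different order
print(evaluate(allow_ssh_first, "udp", 53))   # drop: no rule matches, policy applies
```

Keeping the whole ruleset in one file makes this ordering visible at a glance, which is the convenience described above.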
I strongly disagree. I am subscribed to several "link farm" accounts specifically because I don't use RSS anymore, and because it allows for an immediate discussion if enough people are subscribed to said account.
The one thing I wonder about is whether younger generations will use mailing lists. I never did and I’m already in my mid-30s.
I wonder how long that will be true, considering how difficult hosting your own server for your own domain is these days.
I realized that VR, as a medium, is uniquely tuned to people, and I think you would have a great blog if you focused on just that: people.
What are some ideas for how VR can help people? Can it help people fight depression? Can it be used for physical or emotional therapy? Can it help them safely build skills that could improve their lives?
On that last one, I am reading a book about ship handling. It was a Christmas present because I will never be a captain in my life. But ship handling would work so well in VR. Even the commands are so standardized that players could give voice commands, a real bridge.
How many Make-a-Wish kids or others could have a wonderful time doing that? And that isn't even considering the idea of extending that game into space with Star Trek USS Enterprise or Star Wars Star Destroyers.
Your site could be the nexus of those great ideas. You could even have guest posts. I'd write one.
Anyway, sorry for the novel. VR has helped me feel better than I have in a long time.
Your comment made me think something like, “People of VR”
> a book about ship handling
Mind sharing the title? I'm very curious.
Document every VR headset released with technical details and references
[0]: https://www.deepsouthventures.com/i-sell-onions-on-the-inter...
the LOE to self hosting and adding infra on demand should also be push button easy
the good news is that it seems to get easier and cheaper as time passes, which makes it feel inevitable but obstacles remain because of corporate business interests of course
I'm a property owner on the internet, built the best website in my niche by far (really look it up and compare) but google rewards ugly dogshit wordpresses created by people with no expertise in my niche.
When you're an owner you have to deal with things like taking another SEO hit for the holidays and have to face the dilemma of whether to fire people now, or when you have no money in two months and no reserves left to maneuver too.
Although I swear this was posted on here just recently, was it not?
I added a few extra reading materials/references at the bottom of the article.
I think digital coops are the only really feasible way away from platform capitalism.
Why would we care so much about being a property owner and not renter if we don't care about whether a hosting company can just turn us off?
Why would we care so little about privacy that we'd willingly use services that negate privacy and introduce tracking? For instance, the author suggests using Cloudflare, Azure, AWS and more, and every one of these is abusive. Funnily enough, the Cloudflare-hosted images on the page didn't even load for me ;)
Does the author not know that Mastodon is software and not a social media network? Sure, it's a minor thing, but when people use terms incorrectly, it makes me wonder if they really know what they're talking about (people who'd rather fight about whether it's a term "everyone uses" instead of whether it's correct seem to not get this point).
The author writes (or quotes) somewhat derisively:
> Well of course it’s better to host your own blog! Also, while you’re at it, put your Mastodon server in a DigitalOcean droplet, throw some Cloudflare CDN in front of it, run your own Raspberry Pi to monitor uptime, and you’re golden! Oh, and don’t forget to also make sure to log into the droplet every once in a while to update the container, do an occasional database migration, and ensure that you check the logs for intrusions.
This all-or-nothing attempt to make the idea of self-hosting seem ridiculous is itself ridiculous. Nobody needs to do all that. On the other hand, the audience for what the author is advocating shouldn't have a problem setting up a simple machine or VM instance to host a web site, a blog, perhaps even a DNS server themselves. All or nothing is silly, so even making this example reduces credibility.
All in all, I'd love to see people take ownership of their own things on the Internet. Nobody needs to self-host, but people should if they could, and they should ignore people who say it's too complex because most of the too-complex argument is the suggestion that it needs to be much more than it does. A single VM, a single small computer, even a Pi, can host most things.
But whether people self host or use someone else, choosing where to host matters. Don't use companies that negate the benefits of owning your own things, whether by lock-in or by letting them do all the tracking that Facebook would normally do. Don't use companies that'll disable your account because some idiot wrote a letter. Don't use companies that're so big that you can't talk to a human!
This article makes for an odd juxtaposition between doing thoughtful things and doing things that negate some or much of that thoughtfulness. I'd have a hard time recommending it to others without qualifications.
Beyond the many usual critiques, in this context "owning bitcoin" doesn't mean much, barely a step above having a hoard of hidden gold bars.
While it is indeed more "on the internet" than precious metals, possessing that speculative asset does not provide any special niche, ownership, or control over the broader, er, cybernetic means of production.
It's a solid guide, in other words.