Blog post that explains the rationale behind the library: https://philippeoger.com/pages/can-we-rag-the-whole-web
Just submit your XML sitemap to a Python class, and it will do the crawling, chunking, vectorizing, and storage in an SQLite file for you. It currently uses the SQLiteVSS integration with LangChain, but I'm thinking of moving away from that and integrating the new sqlite-vec instead.
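For the curious, the moving parts roughly correspond to off-the-shelf LangChain components. A minimal sketch of that crawl/chunk/vectorize/store loop (the sitemap URL, model, and table names are placeholders, not the library's actual internals):

```python
# Rough sketch of a sitemap -> SQLite RAG pipeline using the LangChain
# pieces mentioned above; not the library's actual code.
from langchain_community.document_loaders.sitemap import SitemapLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import SQLiteVSS
from langchain_text_splitters import RecursiveCharacterTextSplitter

docs = SitemapLoader(web_path="https://example.com/sitemap.xml").load()  # crawl
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100
).split_documents(docs)                                                  # chunk

db = SQLiteVSS.from_texts(                                               # vectorize + store
    texts=[c.page_content for c in chunks],
    embedding=HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2"),
    table="pages",
    db_file="site.db",
)
print(db.similarity_search("what does this site say about pricing?", k=3))
```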
A relational crawler on a particular subject, with nuanced, opaque, seemingly temporally unrelated connections that reveal a particular MIC (military-industrial complex) pattern of conduct:
"Follow all the congress members who have been a part of a particular committee, track their signatory/support for particular ACTs that have been passed, and look at their investment history from open data, quiver, etc - and show language in any public speaking talking about conflicts and arms deals occurring whereby their support of the funding for said conflicts are traceable to their ACTs, committee seat, speaking engagements, investment profit and reporting as compared to their stated net worth over each year as compared to the stated gains stated by their filings for investment. Apply this pattern to all congress, and their public-profile orbit of folks, without violating their otherwise private-related actions."
And give it a series of URLs with known content for which these nuances may be gleaned.
Or have a trainer bot that continually consumes only this context from the open internet over time, so that you end up with a graph of the data over time...
PYTHON: Run it all through txtai / your library as nodes and ask questions of the data in real time?
(And it reminds me of the work of this fine person/it:
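To make the txtai idea concrete, a minimal sketch (the pages list and question are placeholders):

```python
from txtai.embeddings import Embeddings

# Placeholder: texts collected by the crawler described above
pages = ["committee hearing transcript ...", "annual financial disclosure ..."]

# Index with content storage enabled so search can return the text itself
embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2", "content": True})
embeddings.index([(uid, text, None) for uid, text in enumerate(pages)])

# Ask a question of the data in real time
print(embeddings.search("who supported funding for the conflict?", 3))
```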
I haven't used sqlite-vec much because until now it was only in alpha, but it finally came out a few days ago. I'm looking into integrating it to make SQLite more of my go-to RAG database.
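For reference, the sqlite-vec quickstart is pleasantly small. A sketch (the dimension and vectors are placeholders, and the exact KNN query syntax may differ across versions):

```python
import sqlite3
import sqlite_vec

db = sqlite3.connect("rag.db")
db.enable_load_extension(True)
sqlite_vec.load(db)              # load the sqlite-vec extension
db.enable_load_extension(False)

# Virtual table holding one embedding per chunk (384 dims as an example)
db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS chunks USING vec0(embedding float[384])")

vec = [0.1] * 384                # placeholder: embedding from your model
db.execute(
    "INSERT INTO chunks(rowid, embedding) VALUES (?, ?)",
    (1, sqlite_vec.serialize_float32(vec)),
)

# KNN query: nearest chunks to a query vector
rows = db.execute(
    "SELECT rowid, distance FROM chunks WHERE embedding MATCH ? ORDER BY distance LIMIT 3",
    (sqlite_vec.serialize_float32(vec),),
).fetchall()
print(rows)
```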
1) The returned output from a query seems pretty limited in length and breadth.
2) There's no apparent way to adjust my prompts to improve/adjust the output, e.g. it's not really 'conversational' (not sure if that is your intent)
Otherwise keep developing and be sure to push update notifications to your new mailing list! ;-)
Brilliant idea, btw, I like it :-)
Soon websites/apps whatever you want to call them will have their own built-in handling for AI.
It's inefficient and rude to be scraping pages for content. Especially for profit.
It's like every website having its own search engine vs. Google.
I would guess the hardest thing by far in developing the advertised product would be user management, authentication, payments and wrapping the subscription model's business logic around the core loop. And probably scaling, as running embeddings over hundreds of scraped pages adds up quickly when free tier users start hammering you.
My question when deciding to sell something I've built is, if building the service model is harder than building the actual service, where is the value add?
My take on the natural evolution is that collating and caching documents, websites, etc. for search (ideally with source attribution) is a problem that I think will ultimately be solved by OS vendors. Why sign up for a SaaS and expose all your content to untrustworthy 3rd parties, when it's built right in and handled by your "trusty" OS?
In the meantime, I reckon someone more dedicated than me will (or probably already has) open source something like I built but better, probably as a CLI tool, which will eventually reach maturity and be stolen cough I mean adopted by the top end of town.
Ethically I think nothing's changed for centuries in regards to plagiarism and attribution. It gets easier to copy work and thinking, but it also ultimately gets easier to acknowledge sources. Good folk will do the right thing as they always have done.
Regarding efficiency, I think tools like this have a place in making access to relevant, summarised knowledge more efficient during general research, when you're doing the broad strokes to find areas of interest to zoom in on, at which point more traditional approaches take over.
Interesting times anyway. I have to give credit to people that try, but I'm taking a back seat in thinking of ideas to productise in this space, as by the time I've thought it through, something new comes along that instantly makes it obsolete.
This isn’t “niche”, it’s a pretty cool thing OP has built.
How about instead of commenting and trivialising what people have done, you say something positive
Lmfao. God bless HN for keeping this meme going for decades by now.
Nobody cares how you would build it because you haven’t. At least not in any form that we can see.
A key question that the docs should answer (and perhaps the "How it works" page too): chunking. You generate an embedding for the entire page? Or do you generate embeddings for sections? And what's the size limit per page? Some of our docs pages have thousands of words per page. I'm doubtful you can ingest all that, let alone whether the embedding would be that useful in practice.
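(For what it's worth, the common answer for pages with thousands of words is one embedding per overlapping chunk, not per page. A sketch of the usual fixed-window approach, which may or may not be what this product does:)

```python
def chunk_words(text: str, size: int = 300, overlap: int = 50):
    """Split a long page into overlapping word windows so each chunk
    fits the embedding model's context limit; the overlap preserves
    continuity across chunk boundaries."""
    words = text.split()
    step = size - overlap
    for start in range(0, max(len(words) - overlap, 1), step):
        yield " ".join(words[start:start + size])

# Each chunk is embedded and stored separately, with the page URL as
# metadata, so retrieval can point back to the section, not the whole page.
```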
But: I feel that the more of these services come into being, the more likely it is that every website starts putting up gates to keep the bots away.
Sort of like a weird GenAI take on Cixin Liu's Dark Forest hypothesis (https://en.wikipedia.org/wiki/Dark_forest_hypothesis).
(Edited to add a reference.)
"We've been sitting in our tree chirping like foolish birds for over a century now, wondering why no other birds answered. The galactic skies are full of hawks, that's why." (The Forge of God, Legend edition, 1989, pg 315).
Yeah, same concept and even the same imagery.
Source: https://warwick.ac.uk/fac/sci/physics/research/astro/people/...
I think there's a passage that even uses an analogy of a forest, though I'm not sure.
That's why we need microtransactions, because I'd rather be able to have both nice AI services and useful data repositories that they pull from, than have to choose just one. (and that one would be AI services, because you can't stop all the scrapers, so data sources will just keep tightening their restrictions)
Click fraud on adverts is a form of microfraud, and pay-per-click is the existing form of microtransaction.
But all of the systems that I've seen (Blendle, video games) have had no problem at all with fraud, and only a very small amount of annoyance relative to the value delivered.
There's simply no reason to believe that this will be a problem, either empirically or theoretically.
Previously: https://news.ycombinator.com/item?id=15592192
Unsolved, difficult problems of micropayments:
- pay before viewing: how do you know that the thing you're paying for is the thing that you're expecting? What if it's a rickroll or goatse?
- so do you give refunds a la steam?
- pay and adverts: double-dipping is very annoying
- pay and adverts: how do you know who you're paying? A page appears with a micropayment request, but how do you know you've not just paid the advertiser to view their ad?
- pay and frame: can you have multiple payees per displayed page? (this has good and bad ideas)
- pay and popups: it's going to be like those notification or app install modals, yet another annoyance for people to bounce off
- pay limits: contactless has a £30 limit here. Would you have the same payment system suitable for $.01 payments and $1000 payments? How easy is it to trick people into paying over the odds (see refunds)?
- pay and censors: who's excluded from the payment system? Why?
> If it was that easy, it would have been done.
Part 2: business model problems!
- getting money into the system is plagued by usual fraud problems of card TX for pure digital goods
- nobody wants to build a federated system; everyone wants to build a Play/Apple/Steam store where they take 30%
- winner-take-all effects are strong
- Play store et al already exist, why not use that?
- Free substitute goods are just a click away
- Consumers will pirate anything no matter how cheap the original is
- No real consumer demand for micropayments
=> lemma from previous 3 items: market for online goods is efficient enough to drive all marginal prices to zero
- existing problem of the play store letting your kid spend all the money
- friction: it would be great if you didn't have to repeatedly approve things, such as a micropayment for every page of a webcomic archive. But blanket approval lets bad actors drain the jar or inattentive users waste it and then feel conned
- first most obvious model for making this work is porn, which is inevitably blacklisted by the payment processors, has a worse environment for fraud/chargebacks, and is toxic to VCs (see Patreon and even Craigslist)
- Internet has actually killed previously working micro-ish payment systems such as Minitel, paid ringtones (anyone remember the dark era of Crazy Frog?); surviving ones like premium SMS and phone have a scammy, seedy feel.
- accounting requirements: do you have to pay VAT on that micropayment? do you have to declare it? Is it a federal offence to sell something to an Iranian or North Korean for one cent?
You seem to have conjured the impression that micropayment systems have to be radically different than current payment models, which is wildly mistaken.
You can build an effective micropayment system using only currently available tools (digital wallets, microcurrencies, digital storefronts, review systems) that have most/all of the nice properties of existing platforms, which invalidates almost every single point you make.
Few of these points seem very well thought-out - they're mostly relatively easily refuted by using logic and/or pointing to what the industry is already doing.
> pay before viewing: how do you know that the thing you're paying for is the thing that you're expecting?
In the exact same way as current digital storefronts.
> How easy is it to trick people into paying over the odds
What are "the odds"? Are we betting now?
> so do you give refunds a la steam?
Yes, exactly like current digital storefronts.
> - pay and adverts: double-dipping is very annoying
Exactly like current digital platforms (e.g. Spotify, YouTube premium).
> - pay and adverts: how do you know who you're paying?
What does this even mean - how do "adverts" factor into "how do you know who you're paying"??
> pay and frame: can you have multiple payees per displayed page?
What does this mean??
> - pay and popups: it's going to be like those notification or app install modals, yet another annoyance for people to bounce off
A theory that is trivially dispelled by empirical evidence of the tens of billions of dollars in microtransactions that US players spend on free-to-play games. You create a microtransaction wallet currency that is roughly equivalent to normal money, and then you pay for things by clicking on them, like in a normal game with microtransactions. Empirically, people get used to this very quickly and the friction becomes unnoticeable.
> - pay limits: contactless has a £30 limit here
What does any of this have to do with contactless payments???
> Would you have the same payment system suitable for $.01 payments and $1000 payments?
It's pretty easy with a few seconds of thought to think of systems that handle both of those cases well. For instance, you can make it so you have to hold down a button to purchase something with your microcurrency, with the duration of the hold (nonlinearly) proportional to the cost of the item.
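A sketch of one such mapping (the constants are made up):

```python
import math

def hold_seconds(price_usd: float, base: float = 0.3) -> float:
    # Hypothetical nonlinear mapping: cheap items confirm almost
    # instantly, expensive ones require a deliberately long press.
    return base + 1.5 * math.log1p(price_usd)

print(hold_seconds(0.01))   # ~0.31s for a one-cent page view
print(hold_seconds(1000))   # ~10.7s for a big-ticket purchase
```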
> - pay and censors: who's excluded from the payment system? Why?
Exactly the same as current platforms - the platforms/wallet providers determine that.
>> If it was that easy, it would have been done.
Objectively false. There are many good ideas that have failed because of market factors or poor marketing. In this case, the prevalence of ads, and people's generosity at a time before scraping has truly taken off, are subsidizing the market. That generosity will decrease and doesn't scale, and I shouldn't have to point out the problems with ads.
> - getting money into the system is plagued by usual fraud problems of card TX for pure digital goods
So you handle that exactly the same way as current platforms - you use a payment processor.
> - nobody wants to build a federated system; everyone wants to build a Play/Apple/Steam store where they take 30%
I have to pay 30% already. I'd rather pay directly than with my eyeballs and brain (through ads). This is a problem, but it's better to implement a solution, and then lobby for regulation requiring an open, interoperable payment protocol.
> - winner-take-all effects are strong
Sure? We have the same problem for ads and platforms currently.
> - Play store et al already exist, why not use that?
This was already answered by vlehto in a response[1] to your comment you linked above, which you conveniently left out here. I'll quote:
> Play store does not have the content I want. And it seems overly difficult to post the content I want to make. And if I want to pay for engagement with the content, that's not an option.
> Patreon is lot closer to what suits my needs. But even that is too inflexible on how to use payments and to what you should pay. So far everybody seems to be making their micro-transaction payment models "flexible" by making the amount paid flexible. But that's exactly the one thing I want to be fixed. I'd like to host my entirely own webpage and patreon just to handle the money from exactly the kind of transaction I want.
Patreon is obviously not a micropayment platform and grossly inadequate for, well, almost anything - it's run by bad people who take large cuts and screw over creators, the friction to use it is incredibly high, and the payment model (subscriptions) does not scale well and isn't fair to smaller creators.
> - Free substitute goods are just a click away
...and yet, somehow people still pay for things online. This is quite the non-argument.
> - Consumers will pirate anything no matter how cheap the original is
Piracy is obviously evil and I'm doing my best to fight it. But this isn't an argument. In fact, the commonly-touted line "piracy is a service problem" logically implies that low-friction micropayments will make piracy less prevalent, not more.
> - No real consumer demand for micropayments
See above points about the market being subsidized by ads.
> => lemma from previous 3 items: market for online goods is efficient enough to drive all marginal prices to zero
...and yet people still pay money for things.
> - existing problem of the play store letting your kid spend all the money
This has nothing to do with microtransactions at all. This is just a platform permissions problem.
> - friction: it would be great if you didn't have to repeatedly approve things, such as a micropayment for every page of a webcomic archive. But blanket approval lets bad actors drain the jar or inattentive users waste it and then feel conned
It's trivial for someone to break up their webcomic into chapters and have users pay for the bundle. If a particular comic creator doesn't, then they'll very quickly implement that as their readers get incredibly annoyed and leave. And the vast majority of comic creators will use an existing platform to host instead of rolling the microtransaction system themselves. As for being conned? We handle that in exactly the same way as current digital storefronts.
> - first most obvious model for making this work is porn
And? I don't see how this is relevant.
> - Internet has actually killed previously working micro-ish payment systems
See above points about the market being subsidized by ads, systems being launched before their time, etc.
> surviving ones like premium SMS and phone have a scammy, seedy feel
That's purely a function of those things, and is not intrinsic to microtransactions, as evidenced by F2P games.
> - accounting requirements: do you have to pay VAT on that micropayment? do you have to declare it? Is it a federal offence to sell something to an Iranian or North Korean for one cent?
We handle this in exactly the same way as current platforms/payment processors.
His claim is that the existing system has fraud, therefore micro-transactions will have analogous thing he named "micro-fraud" — so you agree with him now?
https://github.com/langchain-ai/langchain/blob/master/cookbo...
FWIW, the pricing model of jumping from free to "contact us" is slightly ominous.
> Turn any website into a knowledge base for LLMs
I would pay for the opposite product: make your website completely unusable/unreadable by LLMs while readable by real humans, with low false positive rates.

https://github.com/harvard-lil/warc-gpt
https://lil.law.harvard.edu/blog/2024/02/12/warc-gpt-an-open...
https://github.com/MittaAI/SlothAI/blob/main/SlothAI/lib/pro...
https://github.com/MittaAI/mitta-community/tree/main/service...
There's code in there that just reads PDF metadata as well, but you can't always guarantee the metadata is present in a PDF.
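(Reading that metadata is a few lines with, e.g., pypdf; the caveat above stands that the fields are often missing:)

```python
from pypdf import PdfReader

reader = PdfReader("report.pdf")   # placeholder path
meta = reader.metadata             # may be None, or have empty fields
if meta:
    print(meta.title, meta.author, meta.creation_date)
```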
People still being ignorant of their publicly posted policies 5 years later is annoying.
In addition, this scraper doesn't even identify itself (I checked). It pretends to be a normal browser, without saying it's a scraper.
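For contrast, a well-behaved scraper declares itself and checks robots.txt first. A minimal sketch (the bot name and URLs are made up):

```python
import urllib.robotparser
import requests

UA = "ExampleRagBot/0.1 (+https://example.com/bot)"  # hypothetical bot identity

# Honor the site's robots.txt before fetching anything
rp = urllib.robotparser.RobotFileParser("https://example.com/robots.txt")
rp.read()

url = "https://example.com/some-page"
if rp.can_fetch(UA, url):
    resp = requests.get(url, headers={"User-Agent": UA}, timeout=10)
```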
You can of course decouple from the big discussion and isolate your content with access restrictions, but the really interesting activity will be outside. Look, for example, at llama.cpp and the other open source AI tools we've gotten recently. So much energy and enthusiasm, so much collaboration. Closed stuff doesn't get that level of energy.
I think IP laws are in for a reckoning; protecting creativity by restricting it is not the best idea in the world. There are better models. Copyright is anachronistic: it was invented in the era of the printing press, when copying first became easy. LLMs remix, they don't simply copy; even the name no longer fits the new reality. We need to rename it remixright.
The LLM era doesn't give credit or attribution to its sources. It erases exposure. So there's a disincentive to collaborate with it, because it only takes.
> I think IP laws are in for a reckoning, protecting creativity by restricting it is not the best idea in the world.
We've been having this discussion for over 20 years, since the Napster era, or even since the era of elaborate anti-piracy measures for computer games distributed on tapes 40 years ago.
I've reached the conclusion that the stable equilibrium is "small shadow world": enough IP leakage for piracy and preservation, but on a noncommercial scale. We sit with our Plex boxes and our adblockers, knowing that 90% of the world isn't doing that and is paying for it. Too much control is an IP monopoly stranglehold where it costs multiple dollars to set a song as your phone ringtone or briefly heard background music gets your video vaporised off social media. Too _little_ control and eventually there is actually a real economic loss from piracy, and original content does not get made.
AI presents a third threat: unlimited pseudo-creative "slop", which is cheap and adequate enough to fill people's scrolling time but does not pay humans for its creation and atrophies the creative ecosystem.
This is not an easy problem to solve. In my naive take, authors get to decide how their work is used, not scrapers.
Inasmuch as they've put it on the public web they've already made a decision on who gets to see it, and you really can't stop people from doing what they want with it on a personal level.
If that's print it out and put it on a wall in my house, or use whatever tools I have at my disposal to consume it in any way I please, there's not really anything the author can do about it.
As to what constitutes fair use, that's a whole other story: some scraping may be found to be legal while others may not. Benefiting monetarily from legally dubious scraping only makes that scraping look more infringe-y. Of course, nothing is settled law until a court decides.
That was actually a big enlightening moment for me: as long as money is involved, the so-called ethics went out the window instantly. From the far-left newspapers to the far-right ones, they all lied on this topic. Only a handful of tech blogs and newspapers told the truth.
There's still room for an ethical development of such crawlers and technologies, but it needs to be consent-first, with strong ethical and legal standards. The crazy development of such tools has been a massive issue for a number of small online organisations that struggle with poorly implemented or maintained bots (as discussed for OpenStreetMap or Read The Docs).
Because if you save the pages you browse on some site, they're yours (authors don't own your cache).
Perhaps you're arguing that if you wrote a lightweight script/browser (which is just your user agent) to save some website for offline use, that'd be unethical and GDPR violating? Again, I don't think so but maybe I'm missing something. But perhaps this turns on what defines a "user agent".
Perhaps this becomes a "depth of pre-fetch" question. If your browser prefetches linked pages, that's "automated" downloading, akin to the script approach above. Downloading. To your cache. Which you own. (Where I struggle to see an ethical violation)
Genuinely curious where the line is, or what exactly here is triggering ethics, GDPR and practical standards?
In this case, if this tool is used to scrape a website, there are two direct issues: 1/ no immediate way for the website owner to exclude this particular scraper (what is the user agent?); 2/ no way for data subjects (whose data is present on the website) to check whether the scraper captured their personal data in the embeddings. Data being available publicly doesn't mean it can be used however you like [at least outside the US; elsewhere we have much stricter rules on scraping].
Twitter and Reddit locked down their APIs. Soon enough, you’ll need an account to even access any content.
#2. If you are interested in knowledge bases, see #1
If there are no certifications or compliance information, then I don't think there's anything to discuss about an enterprise plan.
All of that code is Open Source, and works well for most sites. Some sites block Google IPs, but the Playwright container can run locally, so you should be able to work around that with minimal effort.
> Tech stack is a mix of serverless Laravel, with Cloudflare and AWS functions, and some Pinecone for vector storage. Still experimenting on a few things but don't want to over-engineer unless I know where I'm going.
You can also do this on AWS now fairly easily. https://medium.com/data-reply-it-datatech/how-to-build-a-cus...
The lablab.ai Discord community is a pretty good place to learn how this product category is evolving.
Also, I've checked their docs to see if there's any mention of the user agents or IP ranges they use for scraping, with no luck.