95 points | by hbamoria22 hours ago
Is there a tool/technique to achieve this? I’m aware that I can use LLMs to do so, or read all pages and find identical text (header/footer), but I want to keep the page number as part of the metadata to ensure better citation on retrieval.
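One non-LLM way to do this is to count lines that recur at the top or bottom of most pages and drop them, while carrying the page number along as metadata. Below is a minimal sketch, assuming you already have one text string per page from whatever extractor you use. The function name `strip_repeated_edges` and the parameters `edge_lines` and `min_fraction` are made up for illustration and would need tuning per document.

```python
import re
from collections import Counter


def _normalize(line: str) -> str:
    # Collapse digits so "Page 3 of 10" and "Page 4 of 10" compare equal.
    return re.sub(r"\d+", "#", line.strip().lower())


def strip_repeated_edges(pages, edge_lines=2, min_fraction=0.6):
    """pages: list of per-page text strings, in reading order.

    Returns a list of {"page": int, "text": str} with likely
    headers/footers removed. Thresholds are assumptions, not defaults
    from any library.
    """
    split_pages = [[ln for ln in p.splitlines() if ln.strip()] for p in pages]

    # Count how many pages each normalized edge line appears on.
    counts = Counter()
    for lines in split_pages:
        edge = lines[:edge_lines] + lines[-edge_lines:]
        counts.update({_normalize(ln) for ln in edge})

    threshold = max(2, int(min_fraction * len(pages)))
    repeated = {key for key, n in counts.items() if n >= threshold}

    chunks = []
    for page_number, lines in enumerate(split_pages, start=1):
        kept = [
            ln
            for i, ln in enumerate(lines)
            # Only drop a line if it sits at the page edge AND recurs.
            if not (
                (i < edge_lines or i >= len(lines) - edge_lines)
                and _normalize(ln) in repeated
            )
        ]
        chunks.append({"page": page_number, "text": "\n".join(kept)})
    return chunks
```

Each returned chunk keeps its 1-based page number, so it can be stored next to the embedding and used in citations at retrieval time.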
A better approach would be to use Textract, as it maintains the flow of the document, for example when a table spans multiple pages.
Btw, Tesseract is not that good at extracting accurate data from tables. Use it with caution, especially in a financial context.
I have made an open source tool that shows the data Tesseract and EasyOCR miss: https://github.com/orasik/parsevision/
ColPali is the standard implementation & SOTA. Much better than OCR. We maintain a ready-to-go retrieval API that implements this: https://github.com/tjmlabs/ColiVara
It is abstraction hell, and will set you back thousands of engineering hours the moment you want to do something differently.
RAG is actually a very simple thing to do; there is just too much VC money in the space & too many complexity merchants.
The best way to learn is outside of notebooks (the hard parts of RAG are all around the actual product), using as few frameworks as possible.
My preferred stack is FastAPI/numpy/redis. Simple as pie. You can swap redis for pgVector/Postgres when ready for the next complexity step.
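To give a sense of how little code the retrieval core needs, here is a minimal sketch of the numpy part only: cosine-similarity top-k over embeddings you have already computed. The function name `top_k` is made up for illustration. Producing the embeddings, storing them in redis, and the FastAPI endpoint are all left out, and this is not taken from any particular project.

```python
import numpy as np


def top_k(query_vec, doc_vecs, k=3):
    """Return [(doc_index, cosine_score), ...] for the k closest documents.

    query_vec: 1-D embedding of the query.
    doc_vecs:  2-D array, one row per document embedding.
    Assumes no embedding is the zero vector.
    """
    doc_vecs = np.asarray(doc_vecs, dtype=np.float32)
    query_vec = np.asarray(query_vec, dtype=np.float32)

    # Normalize so that a dot product equals cosine similarity.
    doc_norm = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    q_norm = query_vec / np.linalg.norm(query_vec)

    scores = doc_norm @ q_norm
    order = np.argsort(-scores)[:k]  # highest score first
    return [(int(i), float(scores[i])) for i in order]
```

The returned indices map back to whatever chunk metadata you keep alongside the vectors (text, source, page number), which is what gets passed to the LLM as context.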
My experience with LangChain has been a mixed bag. On the one hand it has been very easy to get up and running quickly. Following their examples actually works!
Trying to go beyond the examples to mix and match concepts was a real challenge because of the abstractions. As with any young framework in a fast-moving field, the concepts and abstractions seem to be changing quickly, so examples within the documentation show multiple ways to do something, but it isn't clear which is the "right" way.
RAG section: https://github.com/neuml/txtai?tab=readme-ov-file#retrieval-...
Disclaimer: I'm the primary developer
These usually stem from overly strict constraints in the underlying SDKs for the integrations, and in general we've been pretty successful asking for those constraints to be loosened. The main "problem" constraint we've seen in the past has been on httpx. Curious if you've seen others!
If you want notebooks that do some of this with local open models: https://github.com/neuml/txtai/tree/master/examples and here: https://gist.github.com/davidmezzetti