The Modern Document Processing Stack

(github.com)

7 points | by marcelmarais 12 hours ago

2 comments

alhirzel 1 hour ago
Surprised to not see Pandoc.
gwern 11 hours ago
This would benefit from examples. What's a gnarly set of documents that this will process to clean useful Markdown, which a much simpler stack like 'pdftotext' would fail on, and what would this buy me over just running Zerox or another OCR tool directly?
- marcelmarais 11 hours ago
  This should make the use case a bit clearer. It's basically a starting point / wrapper around a few tools, for when you know you'll probably build something custom later, so you want to invest zero time at the beginning but still need something workable: https://www.differentiated.io/blog/the-modern-document-proce...
  gwern 4 hours ago
  That doesn't really answer my question. Like, I have a website, and I have many references; I also use LLM embeddings for nearest-neighbors recommendations of references to each other.
  What might this... do... for me? Don't dump a bunch of JS on me, which is how I would 'do' whatever it does. What does it do? Like, can I dump the URL 'https://pmc.ncbi.nlm.nih.gov/articles/PMC4543385/' into it and get out nice usable clean text of the abstract, say? What about a complicated PDF like https://gwern.net/doc/psychiatry/anxiety/2025-he.pdf (these are the last two references I added)? What do I get? Do I have to install the whole darn thing just to see what it does?