The Modern Document Processing Stack

(github.com)

7 points | by marcelmarais 12 hours ago

2 comments

  • alhirzel 1 hour ago
    Surprised to not see Pandoc.
  • gwern 11 hours ago
    This would benefit from examples. What's a gnarly set of documents that this will process to clean useful Markdown, which a much simpler stack like 'pdftotext' would fail on, and what would this buy me over just running Zerox or another OCR tool directly?
    • marcelmarais 11 hours ago
      This should make the use case a bit clearer. It's basically a starting point, a wrapper around a few tools, for when you know you'll probably build something custom later and so want to invest zero time up front but still need something workable: https://www.differentiated.io/blog/the-modern-document-proce...
      • gwern 4 hours ago
        That doesn't really answer my question. Like, I have a website, and I have many references; I also use LLM embeddings for nearest-neighbors recommendations of references to each other.

        What might this... do... for me? Don't dump a bunch of JS which is how I would 'do' whatever it does. What does it do? Like, can I dump the URL 'https://pmc.ncbi.nlm.nih.gov/articles/PMC4543385/' into it and get out nice usable clean text of the abstract, say? What about a complicated PDF like https://gwern.net/doc/psychiatry/anxiety/2025-he.pdf (these are the last two references I added)? What do I get? Do I have to install the whole darn thing just to see what it does?