It's both. The OCR part is of course CPU bound, but the whole text extraction pipeline also involves reading files, or writing and then reading files.
Without async, these operations simply block.
As for efficiency: if you're working in an async application context, you have to "asyncify" these operations or suffer the consequences.
There are alternative options to Tesseract, of course.
This sounds... I guess Pythonic? Sheesh.
Even with CPU-bound code in Python, there are valid reasons to be using async code. Recognizing that the code is CPU bound, it is possible to use thread and/or process pools to achieve a certain level of parallelism in Python. Threading won't buy you much in Python until 3.13t, due to the GIL. Even with 3.12+ (with the GIL enabled), it's possible (but not trivial) to use threading with sub-interpreters (each of which has its own, separate GIL). See PEP 734 [0].
I'm currently investigating the use of sub-interpreters on a project at work where I'm now CPU bound. I already use multiprocessing and async elsewhere, but I'm curious whether PEP 734 is easier/faster/slower, or even feasible, for me. I haven't gotten as far as actually running any code to compare; I need to refactor my code a bit, splitting the work up differently to account for being CPU bound instead of just IO bound.
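For the process-pool route, here's a minimal sketch of offloading a CPU-bound step from async code; `ocr_page` is a hypothetical stand-in, not any library's actual API:

    import asyncio
    from concurrent.futures import ProcessPoolExecutor

    def ocr_page(path: str) -> str:
        # Hypothetical stand-in for a CPU-bound step such as OCRing one page.
        return f"text of {path}"

    async def extract_all(paths: list[str]) -> list[str]:
        loop = asyncio.get_running_loop()
        with ProcessPoolExecutor() as pool:
            # Each call runs in its own process, so the GIL is not a
            # bottleneck, and the event loop stays free in the meantime.
            futures = [loop.run_in_executor(pool, ocr_page, p) for p in paths]
            return await asyncio.gather(*futures)

    if __name__ == "__main__":
        print(asyncio.run(extract_all(["a.png", "b.png"])))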
Yeah. As an API consumer I would not expect a PDF API to do IO, and hence to be async. Have the library be sans-io, keep the interfaces sync, and let callers in async code handle IO on their end, offloading to IO threads.
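A minimal sketch of that split, with a hypothetical synchronous `extract_text` as the library entry point and `asyncio.to_thread` on the caller's side:

    import asyncio

    def extract_text(path: str) -> str:
        # Hypothetical sync library entry point; the library itself does
        # no event-loop management.
        with open(path, "rb") as f:
            return f.read().decode(errors="replace")

    async def caller(path: str) -> str:
        # The async caller owns the offloading to an IO thread.
        return await asyncio.to_thread(extract_text, path)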
Async is often referred to as a "best practice", but it's just a tool for specific use cases. And I say that as an "async fan"!
That said, perhaps it’s easier nowadays to just do async by default, as you say. The real world is async anyway, so why not program closer to that reality.
Can you speak to how this differs in PDF extraction from, say, pymupdf, pdfplumber, unsloth and so on?
I know the async part is probably a thing, but when building a RAG pipeline I would be brutally focused on the quality of the text extraction. Have you noticed an ability to do better than the others?
1. Text extraction from a searchable PDF.
2. OCR.
For 1., Kreuzberg uses pypdfium2, which is a Python binding for PDFium - the Chromium PDF engine. In this regard Kreuzberg has top-notch performance - much faster than pdfminer.six, pdfplumber, etc.
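For reference, basic text extraction with pypdfium2 looks roughly like this (a sketch based on its documented API, not Kreuzberg's internals; the filename is a placeholder):

    import pypdfium2 as pdfium

    pdf = pdfium.PdfDocument("document.pdf")  # placeholder filename
    for i in range(len(pdf)):
        textpage = pdf[i].get_textpage()
        print(textpage.get_text_range())  # all text on the page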
Note that PyMuPDF also has top-notch performance, but it has an AGPL license, which makes it almost unusable without paying.
For 2., Kreuzberg uses Tesseract, which is very solid. Performance is good, and Kreuzberg uses async worker processes to optimize concurrency.
OCR, though, is a complex world. If what you need is to extract text from standard text documents (broadly speaking), Tesseract - and hence Kreuzberg - is a good choice.
If what you need is things like layout extraction, handwriting recognition, complete bounding-box metadata, etc., then you need an alternative - probably a commercial one.
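For the plain-text-document case, the underlying Tesseract call is simple; a sketch via the pytesseract wrapper (illustrative only - not how Kreuzberg invokes Tesseract; the filename and language code are placeholders):

    from PIL import Image
    import pytesseract

    # OCR a scanned page with the English model.
    text = pytesseract.image_to_string(Image.open("scan.png"), lang="eng")
    print(text)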
Quite a few years ago I saw this translated as Silesian Gate on Google Maps (IIRC), which - for some reason - just brought up "Tannhäuser Gate" in my mind right now.
OCR was discussed here several times lately (https://news.ycombinator.com/item?id=42952605 and https://news.ycombinator.com/item?id=42871143), and some cool projects like https://github.com/Future-House/paper-qa?tab=readme-ov-file#... are using PyMuPDF. My experience with Tesseract is pretty sad; it's usually not good enough, and modern LLMs are better.
In my tests I found Tesseract quite good for regular text documents. For other kinds of texts it's not great.
As for using models - there are some good small language models as well, and of course LLMs.
I sorta feel, though, that if one needs complex OCR, or a vision model for layout, one should opt for either a commercial solution that abstracts away the deployment and GPU management, or bake one's own system.
For most use cases involving text documents, though, my subjective opinion is that Tesseract is sufficient.
I modified a piece of library catalog software (Blacklight) into a searchable-PDF industrial manual system a while back, on a one-off basis. It couldn't go any further than a contract project that delivered the source code, because (at the time) it was hard to do anything programmatically to a PDF without Ghostscript.
I've often thought of rewriting it in Python (and Postgres, to get rid of Solr or Elastic as the search backend) - maybe now's the time...
I trust you enough for a second look, because I ctrl-F'd the readme and found "pdfium" - so I know I don't have to retread old ground in your GitHub issues about how there are really only a couple of ways to parse a PDF with any semblance of reliability, lol...
(For anyone else reading this who's getting started with documents: Adobe's and Chrome's are really the only PDF rendering libraries that work. PDF.js, aka Firefox's, has always been broken, and Apple's is problematic as well - in both cases the problems rear their heads as incorrect word/letter spacing.)
You should try using a different PSM (page segmentation mode) to see if you get better results.
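With pytesseract, for example, the page segmentation mode is passed as a config flag (the mode value and filename here are illustrative):

    from PIL import Image
    import pytesseract

    # --psm 6 treats the image as a single uniform block of text;
    # run `tesseract --help-extra` for the full list of modes.
    text = pytesseract.image_to_string(Image.open("scan.png"), config="--psm 6")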
If it's scientific texts specifically, look at GROBID.
On the command line, first install `uv` from https://github.com/astral-sh/uv?tab=readme-ov-file#installat..., then run `uv tool install -U "docling[tesserocr,ocrmac,vlm]"`. The extras include tesserocr, ocrmac (macOS only), and vlm (for running a small image-to-text model that generates descriptions of images).
See https://github.com/DS4SD/docling/blob/main/pyproject.toml#L1... for all the extra installation options.
For cached/offline use, run `docling-tools models download` to download their models.
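Once installed, converting a document is short; here's a sketch following docling's quickstart (the input path is a placeholder):

    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert("paper.pdf")  # local path or URL
    print(result.document.export_to_markdown())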
This is Python usage as intended. Being executable pseudo-code - a glue language - is its selling point. When has it ever been any different?
I'm not sure C++/Rust projects are easier to understand, though.
You might want to look into https://github.com/VikParuchuri/surya as an alternative to Tesseract. Yes, it's associated with a commercial company, but as long as you aren't a company with $5M in ARR or $5M in funding, it's free to use.
My reasons for using Tesseract: EasyOCR is larger, and it has a significant cold start.
It benchmarks better for many OCR tasks, though, so I'm thinking of adding it as an alternative backend.
Personally I've used Tesseract before but the results were underwhelming, so I'm curious how PaddleOCR performs in comparison.
Jokes aside, really cool library. I'm currently working on a bigger project where we're building a data lake with a wide variety of input sources and formats - this could be quite interesting for us.
Btw, I liked the name, as a Turk with a few relatives who lived in Germany :D
But seriously, in 13 years of living here, only one guy tried to pickpocket me.