OCR seems to be mostly solved for 'normal' text laid out according to Latin alphabet norms (left to right, normal spacing etc.), but I'd love to see more adversarial examples. We've seen lots of regressions around faxed or scanned documents where the text boxes may be slightly rotated (e.g.
https://www.cad-notes.com/autocad-tip-rotate-multiple-texts-...), not to mention handwriting and poorly scanned docs. Then there's context-dependent information like X-axis labels that are implicit from a legend somewhere, so it's not clear, even with the bounding boxes, what the numbers refer to. This is where VLMs really shine: they can extract the text and then use other examples on the page to map the extracted values to what they actually represent, which the bounding boxes alone don't give you for free.
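
For the slight-rotation case specifically, a classic preprocessing step is to estimate the skew and rotate the page back before running OCR. A minimal sketch using OpenCV plus pytesseract (the file name and the choice of OCR engine are just illustrative assumptions, and the sign/range of the angle from minAreaRect differs across OpenCV versions, so you may need to flip it):

  import cv2
  import pytesseract

  def deskew_and_ocr(path: str) -> str:
      img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
      # Binarize (inverted Otsu) so text pixels become foreground.
      _, thresh = cv2.threshold(img, 0, 255,
                                cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
      # Estimate the dominant skew from the minimum-area rectangle around
      # all text pixels. NOTE: the angle convention changed in OpenCV 4.5+,
      # so the sign may need flipping on your version.
      coords = cv2.findNonZero(thresh)
      angle = cv2.minAreaRect(coords)[-1]
      if angle > 45:
          angle -= 90
      # Rotate the original image back by the estimated skew.
      h, w = img.shape
      M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
      deskewed = cv2.warpAffine(img, M, (w, h), flags=cv2.INTER_CUBIC,
                                borderMode=cv2.BORDER_REPLICATE)
      return pytesseract.image_to_string(deskewed)

  # e.g. print(deskew_and_ocr("scanned_fax.png"))  # hypothetical file

That kind of fix handles mild skew, but it does nothing for the legend/implicit-axis-label problem, which is where the VLM approach earns its keep.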