Show HN: Kreuzberg – Modern async Python library for document text extraction

197 points by nhirschfeld 8 months ago

I'm excited to showcase Kreuzberg!

Kreuzberg is a modern Python library built from the ground up with async/await, type hints, and optimized I/O handling.

It provides a unified interface for extracting text from documents (PDFs, images, office files) without external API dependencies.

Key technical features:
- Built with modern Python best practices (async/await, type hints, functional-first)
- Optimized async I/O with anyio for multi-loop compatibility
- Smart worker process pool for CPU-bound tasks (OCR, doc conversion)
- Efficient batch processing with concurrent extractions
- Clean error handling with context-rich exceptions
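
To illustrate the batch-processing item above, here is a minimal sketch of concurrent extraction with a concurrency cap. It uses plain asyncio rather than Kreuzberg itself; `fake_extract`, `batch_extract`, and `max_concurrency` are hypothetical names invented for this example and are not part of the library's API.

```python
import asyncio


async def fake_extract(name: str) -> str:
    """Hypothetical stand-in for one document extraction."""
    await asyncio.sleep(0.01)  # pretend to wait on I/O or a worker process
    return f"text of {name}"


async def batch_extract(names: list[str], max_concurrency: int = 4) -> list[str]:
    """Run extractions concurrently, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(name: str) -> str:
        async with semaphore:
            return await fake_extract(name)

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(bounded(n) for n in names))
```

Calling `asyncio.run(batch_extract(["a.pdf", "b.pdf"]))` returns one result per input, in input order.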

I built this after struggling with existing solutions that were either synchronous-only, required complex deployments, or had poor async support. The goal was to create something that works well in modern async Python applications, can be easily dockerized or used in serverless contexts, and relies only on permissive OSS.

Key advantages over alternatives:
- True async support with optimized I/O
- Minimal dependencies (much smaller than alternatives)
- Perfect for serverless and async web apps
- Local processing without API calls
- Built for modern Python codebases with rigorous typing and testing

I would love feedback!

The library is MIT licensed and open to contributions.

Here is the repo: https://github.com/Goldziher/kreuzberg

Starring is caring

diarrhea 8 months ago

I’m curious about the async aspect of this. I was under the impression PDF processing like OCR is purely CPU bound. OS file I/O interfaces are sync, so async does not help. With GIL, so single threaded Python, I can’t see how async improves performance for the PDF use case. Only parallelism helps, and concurrency doesn’t. When would it yield back to the event loop when it’s busy number crunching?

nhirschfeld 8 months ago

Thanks for asking!
It's both. The OCR part is ofc CPU bound, but the entire text extraction involves reading files, or writing and then reading files.
Without async, these simply block.
As for efficiency - if you're working in an async application context you have to "asyncify" these operations or suffer the consequences.
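A minimal sketch of what "blocking" means here, using only the standard library. `blocking_read` is a hypothetical stand-in for a synchronous file read or subprocess call, and the ticker task counts how often the event loop gets to run anything else in the meantime. This is an illustration, not Kreuzberg's code.

```python
import asyncio
import time


def blocking_read(seconds: float) -> str:
    """Stand-in for a blocking file read or a synchronous tool invocation."""
    time.sleep(seconds)
    return "done"


async def count_ticks(stop: asyncio.Event, interval: float = 0.01) -> int:
    """Count how often the event loop gets a chance to run this task."""
    ticks = 0
    while not stop.is_set():
        ticks += 1
        await asyncio.sleep(interval)
    return ticks


async def measure(offload: bool, seconds: float = 0.2) -> int:
    """Return the ticker count observed while the blocking call runs."""
    stop = asyncio.Event()
    ticker = asyncio.create_task(count_ticks(stop))
    await asyncio.sleep(0)  # give the ticker its first turn
    if offload:
        # The call runs in a worker thread; the loop stays responsive.
        await asyncio.to_thread(blocking_read, seconds)
    else:
        # The call runs on the loop thread; nothing else is scheduled meanwhile.
        blocking_read(seconds)
    stop.set()
    return await ticker
```

Comparing `asyncio.run(measure(offload=False))` with `asyncio.run(measure(offload=True))` shows the ticker barely advancing in the first case and ticking throughout in the second.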
- skavi 8 months ago
  
  in that case, what’s the deal with extract_bytes being async? i’m not incredibly familiar with python, but i’d expect a “byte string” to be in memory.
  
  nhirschfeld 8 months ago
  
  You still need to write it to file to process it via pandoc/tesseract etc.
  There are alternative options to tesseract ofc.
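A rough sketch of that write-then-process pattern, assuming an external command-line tool that takes a file path as its last argument. `run_tool_on_bytes` and `UPPERCASE_TOOL` are hypothetical names for this example; a tiny Python one-liner stands in for pandoc/tesseract so the snippet runs anywhere. This is not Kreuzberg's actual implementation.

```python
import asyncio
import sys
import tempfile
from pathlib import Path


async def run_tool_on_bytes(data: bytes, command: list[str], suffix: str = ".bin") -> str:
    """Write bytes to a temp file, run `command <path>`, and return its stdout."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / f"input{suffix}"
        # The file write is blocking, so it goes to a worker thread.
        await asyncio.to_thread(path.write_bytes, data)
        # The external tool runs as a subprocess; awaiting it does not block the loop.
        proc = await asyncio.create_subprocess_exec(
            *command,
            str(path),
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
        stdout, stderr = await proc.communicate()
        if proc.returncode != 0:
            raise RuntimeError(stderr.decode(errors="replace"))
        return stdout.decode()


# Stand-in for a real tool: reads the file it is given and prints it uppercased.
UPPERCASE_TOOL = [
    sys.executable,
    "-c",
    "import sys; print(open(sys.argv[1]).read().upper(), end='')",
]
```

Both awaits are the reason an in-memory "byte string" API can still be async: the bytes are in memory, but the temp file and the subprocess are not.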
  
  LoganDark 8 months ago
  
  > You still need to write it to file to process it via pandoc/tesseract etc.
  This sounds... I guess Pythonic? Sheesh.
nurettin 8 months ago

It just litters perfectly reasonable python code with async/await. Maybe they are preparing for something we don't know, like a parallel async executor which can be set up to use native threads without changing code and somehow protects you if it detects shared state.
- hermitdev 8 months ago
  
  Caveat: I have not looked at either the API or the implementation of Kreuzberg; this is purely from personal work.
  Even with CPU bound code in Python, there are valid reasons to be using async code. Recognizing that the code is CPU bound, it is possible to use thread and/or process pools to achieve a certain level of parallelism in Python. Threading won't buy you much in Python, until 3.13t, due to the GIL. Even with 3.12+ (with the GIL enabled), it's possible (but not trivial) to use threading with sub interpreters (that have their own, separate GIL). See PEP 734 [0].
  I'm currently investigating the use of sub interpreters on a project at work where I'm now CPU bound. I already use multiprocessing & async elsewhere, but I am curious if PEP 734 is easier/faster/slower or even feasible for me. I haven't gotten as far as to actually run any code to compare (I need to refactor my code a bit with the idea of splitting the work up a bit differently to account for being CPU instead of just IO bound).
  [0] https://peps.python.org/pep-0734/
  
  impoppy 8 months ago
  
  Will it lock the GIL if you use thread executor with asyncio for a native c / ffi extension? If that’s the case, that would also add to benefits of asyncio.
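Whether the GIL is held depends on the extension: a C function has to release it explicitly around its heavy work. As one example from the standard library, CPython's `hashlib` documents that it releases the GIL while hashing more than 2047 bytes at once, so such calls can overlap in a thread executor. A minimal sketch, where `sha256_hex` and `hash_many` are hypothetical helper names:

```python
import asyncio
import hashlib


def sha256_hex(data: bytes) -> str:
    """CPU-bound call into C code; hashlib drops the GIL for large inputs."""
    return hashlib.sha256(data).hexdigest()


async def hash_many(blobs: list[bytes]) -> list[str]:
    """Hash each blob in the default thread executor and keep input order."""
    return await asyncio.gather(*(asyncio.to_thread(sha256_hex, b) for b in blobs))
```

An extension that never releases the GIL would still run correctly this way, just without any parallel speedup.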
- diarrhea 8 months ago
  
  > It just litters perfectly reasonable python code with async/await
  Yeah. As an API consumer I would not expect a PDF API to do IO, hence be async. Have the library be sans-io, the interfaces sync and callers from async code handle IO on their end, offloading to IO threads.
  Async is also referred to as “best practice”, but it’s just a tool, for specific use cases. And I say that as an “async fan”!
  That said, perhaps it’s easier nowadays to just do async by default, as you say. The real world is async anyway, so why not program closer to that reality.
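A minimal sketch of the sans-io layout described above, under the assumption that extraction can be expressed as a pure function from bytes to text. `extract_printable` is a toy stand-in for a real extractor, and all three function names are hypothetical.

```python
import asyncio
from pathlib import Path


def extract_printable(data: bytes) -> str:
    """Toy sans-io core: a pure function from bytes to text, no file access."""
    return "".join(chr(b) for b in data if 32 <= b < 127 or b in (9, 10))


def extract_file_sync(path: str) -> str:
    """Sync caller: does its own I/O, then calls the pure core."""
    return extract_printable(Path(path).read_bytes())


async def extract_file_async(path: str) -> str:
    """Async caller: offloads the read and the CPU work to worker threads."""
    data = await asyncio.to_thread(Path(path).read_bytes)
    return await asyncio.to_thread(extract_printable, data)
```

The core stays synchronous and testable on its own; only the thin wrappers know whether an event loop exists.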
  
  nhirschfeld 8 months ago
  
  That's why Kreuzberg also exposes a sync API for you to consume.
  
  PDFBolt 8 months ago
  
  Async is great when you truly need it, but it can overcomplicate things when misused. Having both sync and async options seems like the best approach. It lets devs choose based on their needs rather than forcing one paradigm.
ismailmaj 8 months ago

It is probably not worth the complexity currently but considering they are using small local CPU models for OCR like tesseract, if they add the support of reading files on the web then I wouldn't be so sure of the CPU bound aspect.

pseudony 8 months ago

Interesting, thanks for sharing :)

Can you speak to how this differs in PDF extraction from, say, pymupdf, pdfplumber, unsloth and so on ?

I know the async part is probably a thing, but when building a RAG I would be brutally focused on the quality of text extraction. Have you noticed an ability to do better than others ?

nhirschfeld 8 months ago

So, for PDF we need to distinguish between two types of text extraction:
1. Text extraction from a searchable PDF.
2. OCR.
For 1. Kreuzberg uses pypdfium2, which is a Python binding for pdfium - the Chromium PDF engine. In this regard Kreuzberg has top notch performance. Much faster than pdfminer.six, pdfplumber etc.
Note PyMuPDF has top notch performance but also an AGPL license, and is almost unusable because of this without paying.
For 2. Kreuzberg uses Tesseract, which is very solid. Performance is good, and Kreuzberg utilizes async worker processes to optimize concurrency.
OCR though is a complex world. If what you need is to extract text from standard text documents (broadly speaking), Tesseract and hence Kreuzberg are a good choice.
If what you need is things like layout extraction, handwriting recognition, complete bounding box metadata, etc., then you need to use an alternative, probably a commercial one.
- dleeftink 8 months ago
  
  An oldy but goody for layout extraction is Cermine by Dominika Tkaczyk and colleagues[0]. Java required.
  [0]: http://cermine.ceon.pl/about.html
  
  mdaniel 8 months ago
  
  Also AGPLv3 https://github.com/CeON/CERMINE/blob/cermine-parent-1.13/LIC...
  
  nhirschfeld 8 months ago
  
  Didn't know this!
- ilaksh 8 months ago
  
  PaddleOCR layout works, and so do some open source large language vision models
tomcam 8 months ago

What is a RAG?
- nhirschfeld 8 months ago
  
  Retrieval Augmented Generation. It's a class of techniques for generating content using LLMs. I'd recommend Googling this.
  
  tomcam 8 months ago
  
  Was going to reply indignantly that it's hard to google rag and get that answer when I read your comment. Then I did, and it was the first result.
  Apologies!
  
  maxnoe 8 months ago
  
  I understood the comment as "Google <the long version I provided> to get more info"

rednafi 8 months ago

Gotta write something named Wedding, Schoneberg, or Pankow. Kewt names.

a012 8 months ago

Don’t forget Neukölln
- martin_balsam 8 months ago
  
  Garbage collect module (cfr. Neuköllner for the past 12 years)
  
  socksy 8 months ago
  
  Not sure I would trust a garbage collector called Neukölln
  
  rednafi 8 months ago
  
  But multicultural. So I don't mind.
- nhirschfeld 8 months ago
  
  I'm actually considering another library with optional API called `Kreuzköln` - probably without the Umlaut!
  
- madduci 8 months ago
  
  What about Mitte, Steglitz or Charlottenburg?
- mohsen1 8 months ago
  
  can you import it in python with ö in the name?
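For what it's worth, Python 3 accepts non-ASCII identifiers (PEP 3131), so a module file with ö in its name can be imported as long as the filesystem can store the name. A small sketch that creates a throwaway module and imports it; `kreuzkölln` and `import_umlaut_module` are hypothetical names used only for this demo. Package names on PyPI are a separate matter and are restricted to ASCII.

```python
import importlib
import sys
import tempfile
from pathlib import Path

MODULE_NAME = "kreuzkölln"  # hypothetical module name, used only for this demo


def import_umlaut_module() -> str:
    """Create a module file with an umlaut in its name, import it, and clean up."""
    with tempfile.TemporaryDirectory() as tmp:
        source = 'GREETING = "Hallo"\n'
        (Path(tmp) / f"{MODULE_NAME}.py").write_text(source, encoding="utf-8")
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            module = importlib.import_module(MODULE_NAME)
            return module.GREETING
        finally:
            sys.path.remove(tmp)
            sys.modules.pop(MODULE_NAME, None)
```

A plain `import kreuzkölln` statement works the same way when the file is on the import path.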
flessner 8 months ago

Moabit - maybe a name for a new crypto currency?
- a3w 8 months ago
  
  Eight moabit to a moabyte? More like some moabit to a charlottenburg, if I remember the geography correctly.
ant6n 8 months ago

Python Zoo, Python Tiergarten...
- rednafi 8 months ago
  
  Python dependencies are tear garden for sure.
jacomoRodriguez 8 months ago

Mitte?
- herval 8 months ago
  
  Too gentrified for Python
- jenadine 8 months ago
  
  Neuhohenschönhausen?
  
  rednafi 8 months ago
  
  Imagine having to import this or some nightmare like Hausvogteiplatz or Schlesisches Tor. Not German, and I wanna cry every time I have to pronounce these :v
  
  BjoernKW 8 months ago
  
  > Schlesisches Tor
  Quite a few years ago I saw this translated as Silesian Gate on Google Maps (IIRC), which - for some reason - just brought up "Tannhäuser Gate" in my mind right now.

eamag 8 months ago

Love the name!

OCR was discussed here lately several times (https://news.ycombinator.com/item?id=42952605 and https://news.ycombinator.com/item?id=42871143), and some cool projects like https://github.com/Future-House/paper-qa?tab=readme-ov-file#... are using PyMuPDF. My experience with Tesseract is pretty sad, it's usually not good enough and modern LLMs are better.

nhirschfeld 8 months ago

Thanks, I'll check these links.
In my tests I found tesseract quite good for regular text documents. For other kinds of texts it's not great.
As for using models - there are some good small language models as well, and of course LLMs.
I sorta feel though that if one needs complex OCR, or a vision model for layout, one should opt for either a commercial solution that abstracts the deployment and GPU management, or bake one's own system.
For most use cases involving text documents though, my subjective opinion is that tesseract is sufficient.
FlyingSnake 8 months ago

Can’t wait for non-Germans to butcher that name.

RNCTX 8 months ago

Awesome.

I modified a library card software (Blacklight) into a searchable PDF industrial manual system awhile back on a one-off basis. It couldn't go any further than a contract project that delivered the source code because it's hard to do anything programmatically (at the time) to a PDF without Ghostscript.

I've often thought of rewriting it with Python (and Postgres, to get rid of Solr or Elastic as the search backend), maybe now's the time...

I trust you long enough for a second look because I ctrl-f'd the readme and found "pdfium" so I know I don't have to retread old ground in your github issues about how there's really only a couple of ways to parse a PDF with a semblance of reliability, lol...

(for anyone else reading this getting started with documents.. Adobe and Chrome are really the only PDF rendering libraries that work. PDF.js aka Firefox has always been broken, and Apple's is problematic as well, in both cases rearing their heads in terms of incorrect word / letter spacing).

maleldil 8 months ago

The API is pretty nice and easy to get started with, but I couldn't get good results with parsing scientific paper PDFs, unfortunately (including OCR). Are there plans to use other backends? Docling works alright, and LLMs like Gemini Flash are interesting too.

nhirschfeld 8 months ago

Yes, there have already been several suggestions here for other backends etc.
You should try using a different PSM to see if you get better results.
If it's scientific texts specifically, look at grobid
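PSM here is Tesseract's page segmentation mode, set with the `--psm` flag on its command line (values 0 to 13, default 3). A minimal sketch of trying different modes by calling the `tesseract` CLI directly; `build_tesseract_cmd` and `ocr_with_psm` are hypothetical helper names, and this bypasses Kreuzberg entirely rather than showing how it configures Tesseract.

```python
import shutil
import subprocess


def build_tesseract_cmd(image_path: str, psm: int = 3, lang: str = "eng") -> list[str]:
    """Build a tesseract command line that prints recognized text to stdout."""
    if not 0 <= psm <= 13:
        raise ValueError("Tesseract page segmentation modes range from 0 to 13")
    return ["tesseract", image_path, "stdout", "--psm", str(psm), "-l", lang]


def ocr_with_psm(image_path: str, psm: int) -> str:
    """Run tesseract with the given PSM; requires the tesseract binary on PATH."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not installed or not on PATH")
    result = subprocess.run(
        build_tesseract_cmd(image_path, psm),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

For multi-column papers, comparing the output of a few modes (for example 3, 4, and 6) on the same page image is a quick way to see whether segmentation is the problem.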

leif_lundberg 8 months ago

Very cool, we've been using https://github.com/DS4SD/docling in our project, but will give this a try :)

kachau 8 months ago

Can you please share some details on how you are using docling? It looks very promising, but I am not sure how to use it. Basically, we have built a document parser for all types of documents that extracts text and then feeds that text to LLMs to work out its semantics. Do you think docling would help here with efficiency and latency?
- rapjul 8 months ago
  
  Docling works quite well for me to convert a scanned book PDF to Markdown text.
  On the command line, first install `uv` from https://github.com/astral-sh/uv?tab=readme-ov-file#installat..., then run `uv tool install -U "docling[tesserocr,ocrmac,vlm]"` (this includes the tesserocr, ocrmac (macOS only), and vlm extras; the last is for running a small image-to-text model to get descriptions of images).
  You can go here https://github.com/DS4SD/docling/blob/main/pyproject.toml#L1... to see all the extra installation options.
  For cached/offline use, run `docling-tools models download` to download their models.

taosx 8 months ago

I know this is contrary to popular opinion but I wish people would slowly move away from python. I've wasted so much time in understanding, integrating or just making python projects work that at this point I'm just avoiding anything python. The best python projects that I can confidently say are high quality are the ones where a lot of the code is c,c++ or rust and python is just a high level wrapper.

d0mine 8 months ago

"python is a high level wrapper"
is Python used as intended. Being executable pseudo-code and a glue language is its selling point. When has it ever been any different?
I'm not sure C++/Rust projects are easier to understand though.

madisonmay 8 months ago

pypdfium2 is a great choice and a solid piece of software!

You might want to look into https://github.com/VikParuchuri/surya as an alternative to tesseract. Yes, it's associated with a commercial company, but as long as you aren't a company with $5M in ARR or $5M in funding it's free to use.

pzo 8 months ago

This still seems GPL. Another OCR worth considering is EasyOCR [0] (Apache license). AFAIK there is no layout detection, but they do provide bounding boxes and support many languages, and it also detects text on many different real-world objects in images (signposts, etc.)
[0] https://github.com/JaidedAI/EasyOCR
- nhirschfeld 8 months ago
  
  Yup, easy OCR is good.
  My reasons for using Tesseract - easy OCR is larger, and it has a significant cold start.
  It benchmarks better for many OCR tasks though, so I'm thinking of adding it as an alternative backend.
  
  cdrini 8 months ago
  
  Where did you find benchmarks for OCR tools? There have been so many OCR engines coming lately, I would love to see benchmarks!
  
  nhirschfeld 8 months ago
  
  I googled this for a while...
  
  alex_suzuki 8 months ago
  
  Any experience with Paddle OCR? https://github.com/PaddlePaddle/PaddleOCR
  Personally I‘ve used Tesseract before but the results were underwhelming, so I‘m curious how Paddle OCR performs in comparison.
  
  nhirschfeld 8 months ago
  
  I haven't, testing it out is on my todo list for sure
nhirschfeld 8 months ago

interesting!

richrichardsson 8 months ago

What led to the name choice?

nhirschfeld 8 months ago

That's my neighborhood in Berlin, which I love
- jacomoRodriguez 8 months ago
  
  Amazing that half of the comments revolve around the name and the neighbourhood. But I also clicked the topic because of the name, hello neighbour :)
  Jokes aside, really cool library. I'm currently working on a bigger project where we build a data lake with a wide variety of input sources and formats - this could be quite interesting for us.
  
  nhirschfeld 8 months ago
  
  Amazing, would be interested in reading your experience
- richrichardsson 8 months ago
  
  Ah, cool. I have a friend who lives there, so knew the name from that.
- lippihom 8 months ago
  
  Berlin SO 36!

umitkaanusta 8 months ago

Say I have a regular job that parses thousands of PDFs in bulk each day, how would kreuzberg help me?

btw, liked the name as a turk with a few relatives who lived in germany :D

v3ss0n 8 months ago

We are building something similar and waiting for my partners'/clients' approval to open-source it. Looks like we should join forces.

ulrischa 8 months ago

A really impressive feature list, but pretty heavy system-level dependencies. On Windows, Chocolatey is needed for them.

ideashower 8 months ago

Is there something like this for handwritten documents? I know newer models have been really good at handwriting transcription.

nhirschfeld 8 months ago

You'll need to use a different OCR engine. Look at easy ocr

odiroot 8 months ago

Do you have to watch your pockets when using this library?

nhirschfeld 8 months ago

lol ;).
But seriously, in 13 years living here, only one guy tried to pickpocket me.
- tymm 8 months ago
  
  I've lived in 36 for 15 years or so. Wasn't as lucky as you :)
  
  nhirschfeld 8 months ago
  
  Sorry to hear...

m00dy 8 months ago

Good naming, it feels so warm that I feel at home :)

coderstartup 8 months ago

That's Great.
