EmbeddedDocumentExtractor or OCR module for extracting images?

Cristian Zamfir Mon, 07 Apr 2025 08:21:52 -0700

Hey!

I’m trying to figure out the best way to use the /tika endpoint in
tika-server. My goal is to extract all the text recursively and, for images,
either save them to disk or return them inline (e.g., as base64).

Right now, I’m using the plain text output, but it seems like I’ll need a
more structured format. Is XHTML the only option, or is there a built-in
way to get JSON instead? If not, I can probably build something to output
JSON myself; please let me know if you have any pointers on where to start.
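
To make the "build it myself" option concrete, here is a rough sketch of the
kind of conversion I have in mind. It is only an illustration, not a working
integration: it assumes the XHTML output is well-formed XML in the standard
XHTML namespace, and the function name, the sample input, and the JSON layout
(a "metadata" map plus a "paragraphs" list) are all made up for this example.

```python
import json
import xml.etree.ElementTree as ET

# Assumption: elements live in the standard XHTML namespace.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def xhtml_to_json(xhtml: str) -> str:
    """Convert an XHTML string into a small JSON document (hypothetical layout)."""
    root = ET.fromstring(xhtml)

    # Collect <meta name="..." content="..."/> pairs into a flat dictionary.
    metadata = {}
    for meta in root.iter(XHTML_NS + "meta"):
        name = meta.get("name")
        if name is not None:
            metadata[name] = meta.get("content", "")

    # Collect the text of every non-empty <p>, including nested inline markup.
    paragraphs = []
    body = root.find(XHTML_NS + "body")
    if body is not None:
        for p in body.iter(XHTML_NS + "p"):
            text = "".join(p.itertext()).strip()
            if text:
                paragraphs.append(text)

    return json.dumps(
        {"metadata": metadata, "paragraphs": paragraphs}, ensure_ascii=False
    )


# Hand-written sample input, not real server output.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><meta name="Content-Type" content="application/pdf"/><title></title></head>
<body><p>Hello <b>world</b></p><p> </p></body></html>"""

if __name__ == "__main__":
    print(xhtml_to_json(SAMPLE))
```

This only flattens paragraphs and metadata; it does nothing yet about embedded
documents or images, which is the part I am least sure how to approach.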

As for images, I could handle them in a new OCR module, but since I only
need the image binaries, is there a simpler way? Maybe by tweaking
EmbeddedDocumentExtractor?

Thanks,
Cristi
