Re: EmbeddedDocumentExtractor or OCR module for extracting images?

Cristian Zamfir Wed, 09 Apr 2025 03:22:31 -0700

Hello,

I have been looking at the code and got some answers:


On Mon, Apr 7, 2025 at 4:58 PM Cristian Zamfir <cri...@cyberhaven.com>
wrote:

> Hey!
>
> I’m trying to figure out the best way to use the /tika endpoint in
> tika-server. My goal is to extract all the text recursively and for images,
> either save them to disk or return them as, e.g., base64.
>
> Right now, I’m using the plain text output, but it seems like I’ll need a
> more structured format. Is XHTML the only option, or is there a built-in
> way to get JSON instead? If not, I can probably build something to output
> JSON myself. Please let me know if you have any pointers on where to start.
>

It looks like the easiest option is to use the XHTML output that is already
provided. Quick question on that, though: the output is declared as XML 1.1.
Are any specific 1.1 features required, or is 1.0 OK as well?
Also, the output does not pass lint as 1.1 because it can contain NUL
character references. The linter reports: Character reference "&#0" is an
invalid XML character. Here is an example, which should be easy to reproduce:

<?xml version="1.1" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">

    <head>

        <meta name="X-TIKA:Parsed-By"
content="org.apache.tika.parser.CompositeParser"/>

        <meta name="X-TIKA:Parsed-By"
content="org.apache.tika.parser.DefaultParser"/>

        <meta name="X-TIKA:Parsed-By"
content="org.apache.tika.parser.microsoft.ooxml.OOXMLParser"/>

        <meta name="extended-properties:DocSecurityString" content="None"/>

        <meta name="Content-Length" content="256283"/>

        <meta name="Content-Type"
content="application/vnd.openxmlformats-officedocument.wordprocessingml.document"/>

        <title>&#0;</title>

    </head>

    <body>
        <p/>
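
Until the output itself is fixed, one possible client-side workaround is to
drop the offending character references before handing the XHTML to a parser.
The following is only a minimal sketch in Python, not anything Tika provides:
the function names are hypothetical, and it assumes that silently removing the
invalid references (rather than replacing them) is acceptable. The sample
string is a shortened, declaration-free stand-in for the output above.

```python
import re
import xml.etree.ElementTree as ET

# Matches decimal (&#0;) and hexadecimal (&#x0;) character references.
_CHAR_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def _is_valid_xml10_char(cp: int) -> bool:
    """Return True if the code point is allowed by the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def strip_invalid_char_refs(xhtml: str) -> str:
    """Hypothetical helper: remove character references XML 1.0 does not allow."""

    def _replace(match: re.Match) -> str:
        hex_digits, dec_digits = match.group(1), match.group(2)
        cp = int(hex_digits, 16) if hex_digits is not None else int(dec_digits)
        # Keep valid references untouched; drop invalid ones such as &#0;.
        return match.group(0) if _is_valid_xml10_char(cp) else ""

    return _CHAR_REF.sub(_replace, xhtml)


# Shortened stand-in for the XHTML above (assumption: no XML declaration).
sample = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>&#0;</title></head>"
    "<body><p>a&#38;b</p></body></html>"
)

cleaned = strip_invalid_char_refs(sample)
root = ET.fromstring(cleaned)  # parses once the NUL reference is gone
```

Valid references such as &#38; are left alone, so only the characters a
conforming parser would reject anyway are lost.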




>
> As for images, I could handle this from a new OCR module, but is there a
> simpler way to handle this, since I would only be extracting the image
> binaries? Maybe by tweaking EmbeddedDocumentExtractor?
>

I favor the OCR module route. By the way, I am planning to contribute my
HTTP OCR module, which uses an external service for OCR. I just have not had
a chance to write the tests and create a PR yet.

Thanks,
Cristi


>
> Thanks,
> Cristi
>
