I'm sorry for not replying sooner. Try the /rmeta endpoint. That will get you recursive text and metadata in JSON format. You can select whether you want the extracted text as plain text or xhtml (/rmeta/text, /rmeta/xml... IIRC). That won't get you everything, but it will get you into the recursive handling framework, and it maintains embedded file metadata and embedded file exceptions, neither of which comes through in the /tika endpoint.
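Something like this should work against a stock tika-server -- untested and from memory, so treat it as a sketch (it assumes the server is running on the default localhost:9998 and that you have a local test.docx):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

public class RmetaSketch {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // PUT the raw document bytes; tika-server will detect the type.
        // Use /rmeta/xml instead if you want the content as xhtml.
        HttpRequest req = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9998/rmeta/text"))
                .PUT(HttpRequest.BodyPublishers.ofFile(Path.of("test.docx")))
                .build();
        HttpResponse<String> resp =
                client.send(req, HttpResponse.BodyHandlers.ofString());
        // The response is a JSON array: the container document first, then
        // one object per embedded file, each carrying its own metadata, with
        // the extracted text under the "X-TIKA:content" key.
        System.out.println(resp.body());
    }
}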
If the usual /tika endpoint with xhtml works for you, great! That's simple. Go with that. Please do open a PR for the illegal content -- the xhtml/xml output should be parseable. :D

I've developed and am deploying an "extract everything" option that uses the tika-pipes framework. Configuring it is fiddly, and it is cutting edge (er, there will be dragons), but what it buys you is the JSON output from the /rmeta endpoint plus all of the bytes from the embedded files written anywhere a tika-emitter can write: a local fileshare, s3, or... In the deployment I'm working with, I've packed that into a custom Kafka consumer, but I _think_ you should be able to configure it for use with tika-server.

As for an external OCRParser, that sounds super useful. As you probably did, you can rely on our TesseractOCRParser as a model for implementing it.

You can see why I didn't respond sooner. LOL.

Best,
Tim

On Wed, Apr 9, 2025 at 6:21 AM Cristian Zamfir <cri...@cyberhaven.com.invalid> wrote:

> Hello,
>
> I have been looking at the code and got some answers:
>
> On Mon, Apr 7, 2025 at 4:58 PM Cristian Zamfir <cri...@cyberhaven.com> wrote:
>
> > Hey!
> >
> > I'm trying to figure out the best way to use the /tika endpoint in
> > tika-server. My goal is to extract all the text recursively and, for
> > images, either save them to disk or return them as, e.g., base64.
> >
> > Right now I'm using the plain text output, but it seems like I'll need
> > a more structured format. Is XHTML the only option, or is there a
> > built-in way to get JSON instead? If not, I can probably build
> > something to output JSON myself; please let me know if you have any
> > pointers on where to start.
>
> Looks like the easiest option is to use the XHTML output that is already
> provided. A quick question on that, though: the output is XML 1.1 -- are
> any specific features of 1.1 required, or is 1.0 OK as well?
>
> Also, the output does not pass lint as 1.1, because it can produce null
> values:
>
> Character reference "&#0;" is an invalid XML character.
>
> Here is an example; it should be easy to reproduce:
>
> <?xml version="1.1" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.CompositeParser"/>
> <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
> <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.microsoft.ooxml.OOXMLParser"/>
> <meta name="extended-properties:DocSecurityString" content="None"/>
> <meta name="Content-Length" content="256283"/>
> <meta name="Content-Type" content="application/vnd.openxmlformats-officedocument.wordprocessingml.document"/>
> <title>&#0;</title>
> </head>
> <body>
> <p/>
>
> > As for images, I could handle this from a new OCR module, but is there
> > a simpler way to handle this, since I would only be extracting the
> > image binaries? Maybe by tweaking EmbeddedDocumentExtractor?
>
> I favor the OCR module route. By the way, I am planning to contribute my
> HTTP OCR module that uses an external service for OCR; I just did not
> get a chance to write the tests and create a PR.
>
> Thanks,
> Cristi
>
> > Thanks,
> > Cristi
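P.S. On the EmbeddedDocumentExtractor idea above: if you do go that route rather than an OCR parser, the shape is roughly this. Untested sketch with names from memory, so check the signatures against the Tika version you're building on:

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Dumps every embedded image to a directory instead of parsing it.
public class ImageSavingExtractor implements EmbeddedDocumentExtractor {

    private final Path outputDir;
    private int count = 0;

    public ImageSavingExtractor(Path outputDir) {
        this.outputDir = outputDir;
    }

    @Override
    public boolean shouldParseEmbedded(Metadata metadata) {
        // Only intercept images; returning false skips everything else,
        // so a real version would delegate non-images to the default extractor.
        String type = metadata.get(Metadata.CONTENT_TYPE);
        return type != null && type.startsWith("image/");
    }

    @Override
    public void parseEmbedded(InputStream stream, ContentHandler handler,
                              Metadata metadata, boolean outputHtml)
            throws SAXException, IOException {
        String name = metadata.get(TikaCoreProperties.RESOURCE_NAME_KEY);
        // Prefix with a counter so colliding embedded names don't overwrite each other.
        Path target = outputDir.resolve(
                (count++) + "-" + (name == null ? "embedded" : name));
        Files.copy(stream, target);
    }
}

You'd register it on the ParseContext before parsing, e.g. context.set(EmbeddedDocumentExtractor.class, new ImageSavingExtractor(outDir)), and hand that context to AutoDetectParser. Note that this is the in-process route, not something tika-server exposes directly.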