[I] Allow indexing of embedded documents/attachments as individual docs [stormcrawler]

via GitHub Thu, 19 Jun 2025 06:53:31 -0700


tballison opened a new issue, #361:
URL: https://github.com/apache/stormcrawler/issues/361

Tika's legacy behavior was to concatenate the content of embedded documents
into one handler and ignore metadata from embedded documents. This was
probably driven by the desire to allow Tika to handle reads and writes in a
streaming fashion.

If you're willing to forgo streaming and store the extracted
content in memory, you might consider Jukka Zitting's and Nick Burch's
"RecursiveParserWrapper", which returns a list of Metadata objects for each
input file. The first Metadata object in the list represents the container
document, and the rest represent the embedded documents. The "text" for
each document/embedded document is stored in its Metadata object under the
RecursiveParserWrapper.TIKA_CONTENT key.

You can see the output in JSON format via tika-app's -J option or the
/rmeta endpoint in tika-server.
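
As a rough illustration of consuming that JSON output, here is a minimal Python sketch that splits the list into the container and its embedded documents. The sample data is hand-written and hypothetical, the helper names `split_container` and `text_of` are made up for this example, and the key string "X-TIKA:content" is assumed to be the value behind RecursiveParserWrapper.TIKA_CONTENT, so check it against your Tika version.

```python
import json

# Assumption: "X-TIKA:content" is the string behind
# RecursiveParserWrapper.TIKA_CONTENT; verify it against your Tika version.
CONTENT_KEY = "X-TIKA:content"

# Hypothetical sample, hand-written to mimic the shape of the JSON output:
# one array, container document first, embedded documents after it.
SAMPLE = """[
  {"resourceName": "report.docx", "X-TIKA:content": "container text"},
  {"resourceName": "chart.xlsx", "X-TIKA:content": "spreadsheet text"},
  {"resourceName": "logo.png", "X-TIKA:content": ""}
]"""


def split_container(rmeta_json):
    """Parse the JSON array and return (container, embedded_documents)."""
    docs = json.loads(rmeta_json)
    if not isinstance(docs, list) or not docs:
        raise ValueError("expected a non-empty JSON array of metadata objects")
    return docs[0], docs[1:]


def text_of(doc):
    """Return the extracted text stored in a metadata object, or '' if absent."""
    return doc.get(CONTENT_KEY, "")


if __name__ == "__main__":
    container, embedded = split_container(SAMPLE)
    print("container:", container["resourceName"], "->", repr(text_of(container)))
    for doc in embedded:
        print("embedded:", doc["resourceName"], "->", repr(text_of(doc)))
```

Everything other than the text key stays in each object as ordinary metadata, so nothing from the embedded documents is lost by the split.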

See recursiveParserWrapperExample() [in this
example](https://git-wip-us.apache.org/repos/asf?p=tika.git;a=blob;f=tika-example/src/main/java/org/apache/tika/example/ParsingExample.java;h=5b8a9f36cd2e1bce9bc1e4443fb6e5fd23bb9302;hb=1b72a3863b8eeb5f4f5d290e5f02c7d072b1cd9b).
You can specify whether you want the content as text, HTML or XHTML via the
BasicContentHandlerFactory.HANDLER_TYPE.

This is critical for maintaining metadata from embedded objects. Imagine,
as one use case, that you have a zip of JPEGs with lat/longs: this will allow
you to index each one individually.
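
To make the zip-of-JPEGs case concrete, here is a small Python sketch that turns an already-parsed metadata list into one index record per document. The sample values, the `to_index_docs` helper, and the `id`/`parent_id` scheme are all hypothetical; the key names "geo:lat", "geo:long", and "X-TIKA:content" are assumptions about what Tika emits and should be checked against real output.

```python
# Assumed name of the key that holds the extracted text.
CONTENT_KEY = "X-TIKA:content"

# Hypothetical parsed output for a zip holding two geotagged JPEGs:
# container first, then one metadata object per embedded image.
PARSED = [
    {"resourceName": "photos.zip", CONTENT_KEY: ""},
    {"resourceName": "paris.jpg", "geo:lat": "48.8584", "geo:long": "2.2945", CONTENT_KEY: ""},
    {"resourceName": "rome.jpg", "geo:lat": "41.8902", "geo:long": "12.4922", CONTENT_KEY: ""},
]


def to_index_docs(metadata_list, container_id):
    """Build one index record per metadata object.

    The container keeps container_id as its id; each embedded document gets
    a derived id and points back to the container through parent_id.
    """
    docs = []
    for position, meta in enumerate(metadata_list):
        is_container = position == 0
        docs.append({
            "id": container_id if is_container else f"{container_id}#{position}",
            "parent_id": None if is_container else container_id,
            "text": meta.get(CONTENT_KEY, ""),
            # Keep every other key, so per-image fields such as lat/long survive.
            "metadata": {k: v for k, v in meta.items() if k != CONTENT_KEY},
        })
    return docs


if __name__ == "__main__":
    for doc in to_index_docs(PARSED, "photos.zip"):
        print(doc["id"], doc["metadata"].get("geo:lat"), doc["metadata"].get("geo:long"))
```

Each image ends up as its own record with its own coordinates, which is exactly what the concatenating legacy behavior cannot provide.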

See [SOLR-7229](https://issues.apache.org/jira/browse/SOLR-7229) for work to
integrate this into Solr's DIH...I haven't gotten around to submitting a PR for
that. :(

