tballison opened a new issue, #361:
URL: https://github.com/apache/stormcrawler/issues/361

   Tika's legacy behavior was to concatenate the content of embedded documents 
into one handler and ignore metadata from embedded documents.  This was 
probably driven by the desire to allow Tika to handle reads and writes in a 
streaming fashion.
   
   If you're willing to forgo streaming and to hold the extracted content in 
memory, you might consider Jukka Zitting's and Nick Burch's 
"RecursiveParserWrapper", which returns a list of Metadata objects for each 
input file.  The first Metadata object in the list represents the container 
document, and the rest represent the embedded documents, one object each.  The 
"text" for each document/embedded document is stored in its own Metadata object 
under the RecursiveParserWrapper.TIKA_CONTENT key.
   
   You can see the output in JSON format via tika-app's -J option or the 
/rmeta endpoint in tika-server.
   
   See recursiveParserWrapperExample() [in this 
example](https://git-wip-us.apache.org/repos/asf?p=tika.git;a=blob;f=tika-example/src/main/java/org/apache/tika/example/ParsingExample.java;h=5b8a9f36cd2e1bce9bc1e4443fb6e5fd23bb9302;hb=1b72a3863b8eeb5f4f5d290e5f02c7d072b1cd9b).
  You can specify whether you want the content as text, HTML or XHTML via the 
BasicContentHandlerFactory.HANDLER_TYPE.
   
   This is critical for maintaining metadata from embedded objects.  As one use 
case, imagine you have a zip of JPEGs with lat/longs: this will allow you to 
index each image individually.
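
   To make the zip-of-JPEGs case concrete, here is a minimal sketch of turning such a list into one index record per embedded document. It deliberately does not call Tika: plain `Map<String, String>` objects stand in for Tika's Metadata, and the class name, record fields, and sample values are all hypothetical. The key strings (`X-TIKA:content`, `X-TIKA:embedded_resource_path`, `geo:lat`, `geo:long`) are my assumption of what the wrapper emits, so check them against the JSON from your own Tika version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class EmbeddedIndexSketch {

    // Assumed key names; verify against the output of your Tika version.
    static final String CONTENT_KEY = "X-TIKA:content";
    static final String PATH_KEY = "X-TIKA:embedded_resource_path";
    static final String LAT_KEY = "geo:lat";
    static final String LONG_KEY = "geo:long";

    /**
     * Builds one index record per embedded document.
     * Index 0 is treated as the container and skipped.
     */
    static List<Map<String, String>> toIndexRecords(List<Map<String, String>> metadataList) {
        List<Map<String, String>> records = new ArrayList<>();
        for (int i = 1; i < metadataList.size(); i++) {
            Map<String, String> embedded = metadataList.get(i);
            Map<String, String> record = new LinkedHashMap<>();
            // Fall back to a positional id when no embedded path is present.
            record.put("id", embedded.getOrDefault(PATH_KEY, "embedded-" + i));
            record.put("text", embedded.getOrDefault(CONTENT_KEY, ""));
            // Only emit a location when both coordinates are available.
            if (embedded.containsKey(LAT_KEY) && embedded.containsKey(LONG_KEY)) {
                record.put("location", embedded.get(LAT_KEY) + "," + embedded.get(LONG_KEY));
            }
            records.add(record);
        }
        return records;
    }

    /** Small helper that builds an ordered map from alternating keys and values. */
    static Map<String, String> meta(String... keysAndValues) {
        Map<String, String> m = new LinkedHashMap<>();
        for (int i = 0; i + 1 < keysAndValues.length; i += 2) {
            m.put(keysAndValues[i], keysAndValues[i + 1]);
        }
        return m;
    }

    public static void main(String[] args) {
        // Hand-built sample standing in for the parsed output of a zip with two JPEGs.
        List<Map<String, String>> metadataList = Arrays.asList(
                meta(CONTENT_KEY, ""),
                meta(PATH_KEY, "/a.jpg", CONTENT_KEY, "", LAT_KEY, "48.85", LONG_KEY, "2.35"),
                meta(PATH_KEY, "/b.jpg", CONTENT_KEY, ""));

        for (Map<String, String> record : toIndexRecords(metadataList)) {
            System.out.println(record);
        }
    }
}
```

   With the legacy concatenating behavior there would be only one record for the whole zip, and the per-image coordinates would be lost.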
   
   See [SOLR-7229](https://issues.apache.org/jira/browse/SOLR-7229) for work to 
integrate this into Solr's DIH...I haven't gotten around to submitting a PR for 
that. :(
   


