RE: fetching content from archives and images

Maciej Liżewski Tue, 08 Jan 2013 02:13:28 -0800

Hi,


Thank you all for the suggestions. They led me to the solution. There was a
problem with how Solr initializes the parser context (described here:
https://issues.apache.org/jira/browse/SOLR-2416), which caused EmptyParser
to be used for archived files. Applying the patch and recompiling the
solr-cell jar solved my problem.
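
For anyone hitting the same symptom, here is a rough sketch of the mechanism as I understand it. It is a toy model written against the JDK only, not Tika or Solr code: EntryParser, Context, EMPTY, PLAIN_TEXT, extract and sampleZip are all names made up for this illustration. It shows why an archive walker can still list file names while returning no content when no delegate parser has been registered in the context.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class EmbeddedParserSketch {

    /** Hypothetical stand-in for a parser: turns an entry stream into text. */
    public interface EntryParser {
        String parse(InputStream in) throws IOException;
    }

    /** Fallback that reads nothing and returns no text (assumed "empty parser" behaviour). */
    public static final EntryParser EMPTY = in -> "";

    /** Treats the entry as UTF-8 plain text. */
    public static final EntryParser PLAIN_TEXT = in -> {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    };

    /** Hypothetical type-keyed context, loosely modelled on a parse context. */
    public static final class Context {
        private final Map<Class<?>, Object> values = new HashMap<>();

        public <T> void set(Class<T> key, T value) {
            values.put(key, value);
        }

        public <T> T get(Class<T> key, T fallback) {
            Object value = values.get(key);
            return value == null ? fallback : key.cast(value);
        }
    }

    /** Walks a ZIP archive; entry content is delegated to whatever the context holds. */
    public static String extract(byte[] zipBytes, Context context) throws IOException {
        // If nobody registered a parser, we silently fall back to the empty one.
        EntryParser delegate = context.get(EntryParser.class, EMPTY);
        StringBuilder out = new StringBuilder();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                // The walker itself emits the entry name in every case.
                out.append(entry.getName()).append('\n');
                String text = delegate.parse(zip);
                if (!text.isEmpty()) {
                    out.append(text).append('\n');
                }
            }
        }
        return out.toString();
    }

    /** Builds a one-entry ZIP in memory so the sketch needs no files on disk. */
    public static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("a.txt"));
            zip.write("hello".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] zip = sampleZip();

        // No parser registered: only the entry name is printed.
        System.out.print(extract(zip, new Context()));

        // Parser registered: the entry name is followed by its content.
        Context configured = new Context();
        configured.set(EntryParser.class, PLAIN_TEXT);
        System.out.print(extract(zip, configured));
    }
}
```

With an unconfigured context the walker prints only "a.txt"; once a parser is registered it prints "a.txt" followed by "hello". That matches what I saw in Solr before and after the patch, although the real classes involved are more elaborate than this.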

 

Btw, Tika App does not show even the file names. The <body> tag in the
structured output is empty.

 

Maciek

 

 

From: Dave Meikle [mailto:[email protected]] 
Sent: Tuesday, January 08, 2013 12:54 AM
To: [email protected]
Subject: Re: fetching content from archives and images

 

Hi Maciej,

 

On 7 Jan 2013, at 20:53, Maciej Liżewski <[email protected]> wrote:





Hi,

I downloaded the Tika sources and noticed that the tests (ZipParserTest) check
whether AutoDetectParser, run on a ZIP file, returns all file names and the
text content extracted from those files... and this test passes without
errors. However, when trying with tika-app (and in Solr) I do not get this
content. I tried to debug, and for ZIP files PackageParser is used. The parser
iterates through all archived entries, but then in tika-app the output is
empty, even for the same ZIP file as in the tests... There is also one
difference between tika-app and Solr: Solr returns at least the file names,
while tika-app shows nothing at all.

I simply do not get it... If the tests confirm that extracting the content of
archived files works, then why do I not get any content in the application/Solr?

 

By default the Tika GUI app does not extract the files, because the
DocumentSelector set on the context, which decides whether an embedded entry
should be parsed, only allows images to be processed. Note that the Tika CLI
app doesn't have this problem, as it uses its own EmbeddedDocumentExtractor.

 

Furthermore, I suspect you will find that the Tika App shows the file names
only in the structured text view, as they are output as div tags.

 

Would need to check Solr code to see what it was doing.

 

Cheers,

Dave
