RE: fetching content from archives and images

Maciej Liżewski Mon, 07 Jan 2013 12:53:31 -0800

Hi,

I downloaded the Tika sources and noticed that the tests (ZipParserTest) check
that AutoDetectParser, run on a ZIP file, returns all file names and the text
content extracted from those files, and this test passes without errors.
However, when I try the same with tika-app (and in Solr) I do not get this
content. I debugged it, and for ZIP files PackageParser is used. The parser
iterates through all archive entries, but in tika-app the output is still
empty, even for the same ZIP file as in the tests. There is also one
difference between tika-app and Solr: Solr returns at least the file names,
while tika-app shows nothing at all.


I simply do not get it. If the test confirms that extracting the content of
archived files works, then why do I not get any content in tika-app or Solr?


-----Original Message-----
From: Nick Burch [mailto:[email protected]] 
Sent: Monday, January 07, 2013 5:29 PM
To: [email protected]
Subject: RE: fetching content from archives and images

On Mon, 7 Jan 2013, Maciej Liżewski wrote:
> Could you provide an example of how to use it to recursively index files in
> an archive? Let's say I have archive.zip with 3 files: file.txt, file.doc,
> file.pdf. I would like to have output with the text content of all those
> files.

Just set the AutoDetectParser as Parser.class in the ParseContext, and you
should be sorted.
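
A minimal sketch of what that could look like, assuming tika-core and
tika-parsers are on the classpath. The class name RecursiveZipExtract and the
extract() helper are made up for illustration and are not part of Tika; the
Tika calls used are AutoDetectParser, ParseContext.set(), BodyContentHandler
and Parser.parse().

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical example class, not part of Tika.
public class RecursiveZipExtract {

    // Parses the stream and returns the text of the document, including
    // the text of any embedded entries the registered parser can handle.
    public static String extract(InputStream in) throws Exception {
        Parser parser = new AutoDetectParser();

        // Registering a Parser in the context is what tells container
        // parsers which parser to use for the entries they find.
        ParseContext context = new ParseContext();
        context.set(Parser.class, parser);

        // -1 disables the default limit on the amount of text collected.
        BodyContentHandler handler = new BodyContentHandler(-1);
        parser.parse(in, handler, new Metadata(), context);
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the path to the archive, e.g. archive.zip
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(extract(in));
        }
    }
}
```

Run against archive.zip from the question, this should print the text of
file.txt, file.doc and file.pdf, provided the parsers for those formats are
available. Without the context.set() line, only the container itself is
handled.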

It's probably worth getting the Tika source code, including the unit tests,
and looking for places where a Parser.class is set on the ParseContext.
That'll give you several examples to compare.

Nick
