Hi, I downloaded the Tika sources and noticed that the tests (ZipParserTest) check whether AutoDetectParser, run on a ZIP file, returns all file names and the text content extracted from those files, and this test passes without errors. However, when I try the same thing with tika-app (and in Solr), I do not get this content. I tried to debug it, and for ZIP files PackageParser is used. The parser iterates through all archived entries, but in tika-app the output is still empty, even for the same ZIP file that the tests use. There is also one difference between tika-app and Solr: Solr returns at least the file names, while tika-app shows nothing at all.
I simply do not get it. If the test confirms that extracting the content of archived files works, why do I not get any content in tika-app or Solr?

-----Original Message-----
From: Nick Burch [mailto:[email protected]]
Sent: Monday, January 07, 2013 5:29 PM
To: [email protected]
Subject: RE: fetching content from archives and images

On Mon, 7 Jan 2013, Maciej Liżewski wrote:
> Could you provide an example of how to use it to recursively index files in
> an archive? Let's say I have archive.zip with 3 files: file.txt, file.doc,
> file.pdf. I would like to have output with the text content of all those files.

Just set the AutoDetectParser as Parser.class in the ParseContext, and you should be sorted.

It's probably worth you getting the tika source code, including unit tests, and looking for places where a Parser.class is set on the ParseContext; that'll give you several examples to compare.

Nick
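
For reference, the suggestion quoted above (setting the AutoDetectParser as Parser.class in the ParseContext) might look roughly like the sketch below. It is only an illustration, not code taken from the Tika tests: the class name RecursiveZipText and the method extractText are made up for this example, and it assumes tika-core and tika-parsers are on the classpath. Whether tika-app or Solr configure their ParseContext the same way is not something this sketch shows.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical helper class, named for this example only.
public class RecursiveZipText {

    // Parses the given file and returns the extracted text, including the
    // text of embedded entries when a parser for them is available.
    public static String extractText(Path archive) throws Exception {
        Parser parser = new AutoDetectParser();

        // Register the parser in the context so that container parsers
        // (such as the one used for ZIP files) can hand each embedded
        // entry back to it.
        ParseContext context = new ParseContext();
        context.set(Parser.class, parser);

        // A write limit of -1 disables the default character limit.
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();

        try (InputStream stream = Files.newInputStream(archive)) {
            parser.parse(stream, handler, metadata, context);
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // Usage: java RecursiveZipText archive.zip
        System.out.println(extractText(Paths.get(args[0])));
    }
}
```

Running it against archive.zip from the question should print whatever text Tika extracts from the entries, assuming a parser for each entry type is available. Comparing the output with and without the context.set(...) line is a quick way to see whether the missing parser registration explains an empty result.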
