On Sat, 8 Dec 2012, Lewis John Mcgibbney wrote:
We use Tika 1.2 over in Nutch, I wonder what kind of support Tika has for parsing .zip files and whether someone can comment on whether I can work towards dropping the legacy parser for Nutch?

Tika has pretty good support for archive formats, including .zip and .tar, via its Apache Commons Compress integration.

What you'll want to do is attach a recursing parser to the ParseContext, and Tika will then call it for each entry in the zip file. What you then do with each entry is up to you. Take a look at the Tika CLI for an example of a recursing parser that extracts all the embedded entries to files on the filesystem.
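
As a rough sketch of the ParseContext wiring (written against the Tika 1.x API as I understand it; the class name ZipRecursion and the method extractAll are made up for illustration and are not part of Tika or Nutch): registering the parser itself under Parser.class makes Tika hand every zip entry back to it, and the text of each entry ends up in the parent's handler.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical helper class, for illustration only.
public class ZipRecursion {

    // Parses the stream and returns the text of the container plus all of its entries.
    public static String extractAll(InputStream in) throws Exception {
        Parser parser = new AutoDetectParser();

        // Registering the parser in the context is what turns on recursion:
        // Tika looks up Parser.class when it meets an embedded document.
        ParseContext context = new ParseContext();
        context.set(Parser.class, parser);

        // -1 disables the default write limit on the collected text.
        BodyContentHandler handler = new BodyContentHandler(-1);
        parser.parse(in, handler, new Metadata(), context);
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(extractAll(in));
        }
    }
}
```

This needs tika-core and tika-parsers on the classpath. Without the context.set call, I'd expect the zip entries to be listed but not parsed.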

Depending on your needs with Nutch, you'll likely want either to process each entry in the zip file as a standalone resource, or to roll them all into the output of the parent file.
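
For the standalone-resource option, one possible approach (a sketch only, assuming the Tika 1.x EmbeddedDocumentExtractor interface; PerEntryExtraction and extractEntries are hypothetical names) is to register your own extractor in the ParseContext and parse each entry into its own handler, keyed by entry name:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.tika.exception.TikaException;
import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
public class PerEntryExtraction {

    // Returns a map of entry name -> extracted text, one item per zip entry.
    public static Map<String, String> extractEntries(InputStream in) throws Exception {
        final Parser parser = new AutoDetectParser();
        final Map<String, String> entries = new LinkedHashMap<>();

        ParseContext context = new ParseContext();
        context.set(EmbeddedDocumentExtractor.class, new EmbeddedDocumentExtractor() {
            @Override
            public boolean shouldParseEmbedded(Metadata metadata) {
                return true; // handle every entry
            }

            @Override
            public void parseEmbedded(InputStream stream, ContentHandler handler,
                                      Metadata metadata, boolean outputHtml)
                    throws SAXException, IOException {
                // The package parser puts the entry name in the metadata.
                String name = metadata.get(Metadata.RESOURCE_NAME_KEY);
                BodyContentHandler entryHandler = new BodyContentHandler(-1);
                try {
                    // Fresh context: archives nested inside this entry are not recursed into.
                    parser.parse(stream, entryHandler, metadata, new ParseContext());
                } catch (TikaException e) {
                    throw new IOException(e);
                }
                entries.put(name, entryHandler.toString());
            }
        });

        // The parent's own output is discarded; only the per-entry text is kept.
        parser.parse(in, new DefaultHandler(), new Metadata(), context);
        return entries;
    }
}
```

Each map entry could then be turned into its own document on the Nutch side. How that maps onto Nutch's parse results is something you'd have to decide there.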

Nick
