Perfect
Ta
Lewis

On Mon, Dec 10, 2012 at 2:21 AM, Nick Burch <[email protected]> wrote:
> On Sat, 8 Dec 2012, Lewis John Mcgibbney wrote:
>>
>> We use Tika 1.2 over in Nutch, I wonder what kind of support Tika has for
>> parsing .zip files and whether someone can comment on whether I can work
>> towards dropping the legacy parser for Nutch?
>
>
> Tika has pretty good support for archive formats, including .zip and .tar,
> via the commons compress integration
>
> What you'll want to do is attach a recursing parser onto the parsecontext,
> and Tika will then call that for each entry in the zip file. It's up to you
> what you then do with it. Take a look at the tika cli for an example of a
> recursing parser that extracts out all the embedded entries to files on the
> fs.
>
> Depending on your needs with nutch, you'll either likely want to process
> each entry in the zip file as a standalone resource, or roll them all into
> the output of the parent file.
>
> Nick

--
Lewis
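
For anyone following the thread, the two options Nick describes (each zip entry as a standalone resource, or everything rolled into the parent's output) can be sketched roughly as below. This is only an illustration of the two strategies using the JDK's java.util.zip, not Tika's API and not what Nutch does today; the class name ZipEntryStrategies, its methods, and the sample entry names are all made up for the example, and entries are assumed to be UTF-8 text.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper: illustrates the two strategies only, not Tika's API.
final class ZipEntryStrategies {

    // Strategy 1: treat every entry as a standalone resource, keyed by entry name.
    static Map<String, String> standalone(byte[] zip) throws IOException {
        Map<String, String> resources = new LinkedHashMap<>();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zip))) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                if (!entry.isDirectory()) {
                    resources.put(entry.getName(), readEntry(in));
                }
            }
        }
        return resources;
    }

    // Strategy 2: roll the text of all entries into a single parent output.
    static String rolledUp(byte[] zip) throws IOException {
        StringBuilder parent = new StringBuilder();
        for (String text : standalone(zip).values()) {
            parent.append(text).append('\n');
        }
        return parent.toString();
    }

    // Reads the current entry to its end (assumes UTF-8 text content).
    private static String readEntry(ZipInputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    // Builds a small in-memory zip so the sketch runs without any files on disk.
    static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            out.putNextEntry(new ZipEntry("a.txt"));
            out.write("alpha".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
            out.putNextEntry(new ZipEntry("b.txt"));
            out.write("beta".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] zip = sampleZip();
        System.out.println(standalone(zip).keySet());
        System.out.print(rolledUp(zip));
    }
}
```

With Tika itself, the per-entry work would presumably live in whatever recursing parser is attached to the parse context, as Nick suggests; the split between the two methods above is just meant to show where that choice shows up.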
