Perfect

Ta

Lewis

On Mon, Dec 10, 2012 at 2:21 AM, Nick Burch <[email protected]> wrote:
> On Sat, 8 Dec 2012, Lewis John Mcgibbney wrote:
>>
>> We use Tika 1.2 over in Nutch. What kind of support does Tika have for
>> parsing .zip files, and can someone comment on whether I can work
>> towards dropping the legacy zip parser in Nutch?
>
>
> Tika has pretty good support for archive formats, including .zip and .tar,
> via the Commons Compress integration.
>
> What you'll want to do is attach a recursing parser onto the ParseContext,
> and Tika will then call that for each entry in the zip file. It's up to you
> what you then do with it. Take a look at the Tika CLI for an example of a
> recursing parser that extracts all the embedded entries to files on the
> filesystem.
>
> Depending on your needs with Nutch, you'll likely want either to process
> each entry in the zip file as a standalone resource, or to roll them all
> into the output of the parent file.
>
> Nick
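
For anyone reading this later, here is a rough sketch of the two options Nick
describes: one resource per zip entry, or everything rolled into the parent.
Note that this deliberately does not use Tika at all. It reads the archive
with plain java.util.zip and treats every entry as UTF-8 text, so it only
illustrates the shape of the two results. The class and method names
(ZipEntryStrategies, standalone, rolledUp, sampleZip) are made up for this
example and are not part of Tika or Nutch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration only: plain java.util.zip, not the Tika API.
public class ZipEntryStrategies {

    // Option 1: each entry becomes a standalone resource, keyed by entry name.
    // Assumes every entry is UTF-8 text (a real parser would detect the type).
    public static Map<String, String> standalone(byte[] zipBytes) throws IOException {
        Map<String, String> docs = new LinkedHashMap<>();
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue; // directories carry no content of their own
                }
                // read() returns -1 at the end of the current entry, not the archive
                ByteArrayOutputStream content = new ByteArrayOutputStream();
                byte[] buffer = new byte[4096];
                int n;
                while ((n = zin.read(buffer)) != -1) {
                    content.write(buffer, 0, n);
                }
                docs.put(entry.getName(),
                         new String(content.toByteArray(), StandardCharsets.UTF_8));
            }
        }
        return docs;
    }

    // Option 2: roll all entries into one text for the parent file,
    // in archive order, separated by newlines.
    public static String rolledUp(byte[] zipBytes) throws IOException {
        return String.join("\n", standalone(zipBytes).values());
    }

    // Builds a small two-entry archive in memory so the sketch runs on its own.
    public static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(bytes)) {
            zout.putNextEntry(new ZipEntry("a.txt"));
            zout.write("alpha".getBytes(StandardCharsets.UTF_8));
            zout.closeEntry();
            zout.putNextEntry(new ZipEntry("b.txt"));
            zout.write("beta".getBytes(StandardCharsets.UTF_8));
            zout.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] zip = sampleZip();
        System.out.println(standalone(zip)); // one document per entry
        System.out.println(rolledUp(zip));   // single combined document
    }
}
```

Which of the two fits depends on whether the entries should show up in the
index as documents in their own right or only as part of the archive's text.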



-- 
Lewis
