Lucene has no built-in recognition of content types. You have to parse
the header yourself and index whatever parts you need.
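
As a rough illustration of that parsing step, here is a minimal sketch in
Python (it is not Lucene code, and the function names are made up for this
example). It assumes the file starts with the status line and headers as
written by wget, followed by a blank line and then the body. It only
handles text/plain and text/html; anything else would need its own parser.

```python
import re
from email.parser import BytesParser
from html.parser import HTMLParser


def parse_saved_response(raw):
    """Split a file saved with headers into (mime, charset, body).

    mime and charset are None when the response does not declare them.
    """
    # Headers end at the first blank line; accept CRLF or bare LF endings.
    parts = re.split(rb"\r?\n\r?\n", raw, maxsplit=1)
    if len(parts) != 2:
        raise ValueError("no blank line separating headers from body")
    head, body = parts
    # The first line is the status line ("HTTP/1.1 200 OK"), not a header.
    _status, _, header_block = head.partition(b"\n")
    headers = BytesParser().parsebytes(header_block, headersonly=True)
    if headers.get("Content-Type") is None:
        return None, None, body
    return headers.get_content_type(), headers.get_content_charset(), body


class _TextCollector(HTMLParser):
    """Collects the character data of an HTML page (crude: keeps scripts)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_text(mime, charset, body):
    """Return indexable text for the types handled here, else None."""
    if mime not in ("text/plain", "text/html"):
        return None  # e.g. PDF or images need a format-specific parser
    # Assumption: fall back to Latin-1 when no charset is declared.
    text = body.decode(charset or "latin-1", errors="replace")
    if mime == "text/html":
        collector = _TextCollector()
        collector.feed(text)
        collector.close()
        text = " ".join(collector.chunks)
    # Collapse runs of whitespace so the result is one clean string.
    return " ".join(text.split())
```

Whatever extract_text returns (when it is not None) is what you would then
hand to your own indexing code, along with the content type if you want to
store it as a field.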

There are projects *based* on Lucene that do web crawls and that
you might want to look into; Nutch comes to mind.

Erick

On 4/5/07, Developer Developer <[EMAIL PROTECTED]> wrote:

I am using wget to download content from the web with the --save-headers
option.
The --save-headers option saves the HTTP headers to the downloaded files.
Does Lucene make use of the content type while indexing, or do I have to parse
the headers, determine the content type, and choose the right set of
actions myself?

Thanks !
