Lucene has no built-in recognition of anything. You have to parse the header and index the relevant bits as you need to.
There are projects *based* upon lucene that do web crawls that you might want to look into, Nutch comes to mind. Erick On 4/5/07, Developer Developer <[EMAIL PROTECTED]> wrote:
I am using WGET to download content from the www with ---save-header option. The save-header option saves the hppt header to the downloaded files. Does Lucene make use of content type while indexing or I have to parse the header , determine the content-type and determine the right set of actions to do ? Thanks !