Hi,

On Mon, Jun 20, 2011 at 11:29 PM, Koorosh Vakhshoori
<koorosh.vakhsho...@synopsys.com> wrote:
> I am parsing a PDF document for the purpose of indexing it in Solr,
> specifically Solr Cell. My problem is to exclude certain areas of the
> document from being indexed, for example the copyright section. The same
> argument applies to HTML pages, where I don't want to index the footer,
> header, or other irrelevant sections.

Sounds like a useful feature, though currently the only thing we have
along those lines is the Boilerpipe support for HTML.

As a general rule, it'll probably be easiest to build such exclusion
rules into the format-specific parsers (for example, it's easy to
exclude headers and footers within the office format parsers).
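
Until something like that exists in the parsers, a rough client-side
approximation is to filter the SAX events a parser emits before the text
reaches the indexer. The sketch below is only an illustration using the
JDK's own SAX classes, not an existing Tika feature: the class name
ExclusionSketch, the extract() helper, and the assumption that the
unwanted sections arrive as elements carrying class="header" or
class="footer" are all hypothetical. Whether a given parser marks those
sections that way would need to be checked per format.

```java
import java.io.StringReader;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ExclusionSketch {

    /** Collects text, skipping elements whose class attribute is excluded. */
    static class ExcludingTextHandler extends DefaultHandler {
        private final Set<String> excludedClasses; // hypothetical marker classes
        private final StringBuilder text = new StringBuilder();
        private int skipDepth = 0; // > 0 while inside an excluded element

        ExcludingTextHandler(Set<String> excludedClasses) {
            this.excludedClasses = excludedClasses;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            String cls = atts.getValue("class");
            // Start skipping at an excluded element; keep counting nested ones.
            if (skipDepth > 0 || (cls != null && excludedClasses.contains(cls))) {
                skipDepth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (skipDepth > 0) {
                skipDepth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Only keep text that is outside every excluded element.
            if (skipDepth == 0) {
                text.append(ch, start, length);
            }
        }

        String text() {
            return text.toString().trim().replaceAll("\\s+", " ");
        }
    }

    /** Parses well-formed XHTML and returns the text outside excluded elements. */
    static String extract(String xhtml, Set<String> excludedClasses) throws Exception {
        ExcludingTextHandler handler = new ExcludingTextHandler(excludedClasses);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.text();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body>"
                + "<div class=\"header\">Page 1</div>"
                + "<p>Body text to index.</p>"
                + "<div class=\"footer\">Copyright notice</div>"
                + "</body></html>";
        // Prints: Body text to index.
        System.out.println(extract(xhtml, Set.of("header", "footer")));
    }
}
```

The same depth-counting idea could be wrapped around whatever
ContentHandler is passed to a parser, but a rule living inside the
format-specific parser has access to much better structural information
than a class attribute.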

BR,

Jukka Zitting
