Hi,

On Mon, Jun 20, 2011 at 11:29 PM, Koorosh Vakhshoori <koorosh.vakhsho...@synopsys.com> wrote:
> I am parsing a PDF document for the purpose of indexing it in Solr, specifically
> Solr Cell. My problem is to exclude certain areas of the document from being
> indexed, for example the Copyright section. The same argument applies to HTML
> pages where I don't want to index the footer, header, or other irrelevant
> sections.

Sounds like a useful feature, though currently the only thing we have along those lines is the Boilerpipe support for HTML.

As a general rule it'll probably be easiest to build such exclusion rules into the format-specific parsers (for example, it's easy to exclude headers and footers within the office format parsers).

BR,

Jukka Zitting
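
To illustrate the kind of exclusion rule discussed above, here is a minimal sketch that uses only the JDK's SAX classes. Tika parsers report their output as XHTML SAX events, so a handler along these lines could in principle sit between a parser and the indexer. The class name ExcludingTextHandler, the extract helper, and the "footer" class value are all hypothetical and chosen for illustration. This is not an existing Tika or Solr Cell feature, and real PDF output would rarely carry such convenient markers.

```java
import java.io.StringReader;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects character data, but skips everything inside
// elements whose class attribute exactly matches one of the excluded values.
public class ExcludingTextHandler extends DefaultHandler {
    private final Set<String> excludedClasses;
    private final StringBuilder text = new StringBuilder();
    private int skipDepth = 0; // > 0 while inside an excluded element

    public ExcludingTextHandler(Set<String> excludedClasses) {
        this.excludedClasses = excludedClasses;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (skipDepth > 0) {
            skipDepth++; // nested element inside an excluded region
            return;
        }
        String cls = atts.getValue("class");
        if (cls != null && excludedClasses.contains(cls)) {
            skipDepth = 1; // entering an excluded region
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (skipDepth > 0) {
            skipDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (skipDepth == 0) {
            text.append(ch, start, length);
        }
    }

    public String getText() {
        return text.toString().trim();
    }

    // Convenience helper (an assumption of this sketch): parse a well-formed
    // XHTML string and return the text outside the excluded elements.
    public static String extract(String xhtml, Set<String> excluded) throws Exception {
        ExcludingTextHandler handler = new ExcludingTextHandler(excluded);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getText();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><p>Main text.</p>"
                + "<div class=\"footer\">Copyright notice</div></body></html>";
        System.out.println(extract(xhtml, Set.of("footer"))); // prints "Main text."
    }
}
```

The depth counter is there so that markup nested inside an excluded element is skipped along with it. Matching on a class attribute is only one possible rule; a format-specific parser would more likely decide what to drop before emitting any events at all.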