On Jul 26, 2009, at 7:24 AM, starz10de wrote:


Hi,

I am indexing a set of HTML websites using Lucene (IndexHtml). The indexer works fine and I can find the indexed terms, but the problem is that this class (IndexHtml) indexes all the text inside the HTML page, even the advertisements. I am interested only in the body text, not in the advertisements or side-link text.

Any help on how to solve this problem? Did I use the class wrongly?



No, you didn't do anything wrong. That class has no capability for what you want (in fact, it's a pretty basic bit of demo code). You might look into one of the more robust HTML parsing libraries out there to extract just the main content before indexing it.
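
As a rough illustration of that idea, here is a minimal sketch that uses the HTML parser bundled with the JDK (javax.swing.text.html) to collect page text while skipping any <div> whose class or id marks it as an advertisement or side-link block. The class name BodyTextExtractor and the marker names "ad", "ads", "sidebar" and "nav" are assumptions made up for this example; real sites use their own markup, and a dedicated HTML parsing library will likely cope better with messy pages.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of Lucene or its demo code.
final class BodyTextExtractor {

    // Assumed class/id names for unwanted blocks; adjust them per site.
    private static final Set<String> SKIP =
            new HashSet<>(Arrays.asList("ad", "ads", "sidebar", "nav"));

    // Returns the text of the page, minus text inside skipped <div> blocks.
    static String extract(String html) {
        final StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private int skipDepth = 0; // > 0 while inside an unwanted <div>

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag != HTML.Tag.DIV) {
                    return;
                }
                // Count nested <div>s so the matching end tag can be found.
                if (skipDepth > 0 || isUnwanted(attrs)) {
                    skipDepth++;
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.DIV && skipDepth > 0) {
                    skipDepth--;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (skipDepth == 0) {
                    out.append(data).append(' ');
                }
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString().trim();
    }

    // True if any token of the class attribute, or the id, is in SKIP.
    private static boolean isUnwanted(MutableAttributeSet attrs) {
        Object cls = attrs.getAttribute(HTML.Attribute.CLASS);
        if (cls != null) {
            for (String token : cls.toString().trim().split("\\s+")) {
                if (SKIP.contains(token)) {
                    return true;
                }
            }
        }
        Object id = attrs.getAttribute(HTML.Attribute.ID);
        return id != null && SKIP.contains(id.toString());
    }

    public static void main(String[] args) {
        String page = "<html><body><div class=\"ad\">Buy now!</div>"
                + "<div class=\"content\"><p>Main article text.</p></div></body></html>";
        System.out.println(extract(page));
    }
}
```

The string returned by extract() could then be indexed in place of the full page text. Matching on class and id names is only a heuristic, so the list of names would need tuning for each site being indexed.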

-Grant

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search

