Re: Best practice for searching html

Yonik Seeley Thu, 09 Mar 2006 06:33:09 -0800

On 3/9/06, Raul Raja Martinez <[EMAIL PROTECTED]> wrote:
> Hi I have a lot of html indexed such as:
>
> Mart&iacute;nez
>
> Of course my users are gonna search for Martínez and they're not gonna
> get a match.
>
> Is there a common approach to solve this kind of problem in lucene,
> Maybe some utility class or something?


If you might have other random HTML markup as well as entities, check out
Solr's HTMLStrip* tokenizers:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

It's good if your input is dirty - if you don't know whether it's HTML or
not, or if there are HTML fragments that would cause a normal HTML
parser to choke.

If you actually have HTML documents, I would go with an HTML parser.
If you have *just* entities, there is probably a simpler approach.
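
For the entities-only case, one simple approach is to decode the entities
to plain Unicode before the text reaches the analyzer, so that the indexed
term is "Martínez" rather than "Mart&iacute;nez". The following is only a
minimal sketch: the class name EntityDecoder, its decode() method, and the
deliberately tiny entity table are hypothetical and are not part of Lucene
or Solr. A real table would need the full HTML entity list.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a Lucene or Solr class.
public class EntityDecoder {

    // Illustrative subset of named entities; extend as needed.
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("aacute", "\u00e1");
        NAMED.put("eacute", "\u00e9");
        NAMED.put("iacute", "\u00ed");
        NAMED.put("oacute", "\u00f3");
        NAMED.put("uacute", "\u00fa");
        NAMED.put("ntilde", "\u00f1");
    }

    // Matches &name;  &#123;  and  &#x1F;  style references.
    private static final Pattern ENTITY = Pattern.compile(
        "&(#[0-9]{1,7}|#[xX][0-9a-fA-F]{1,6}|[A-Za-z][A-Za-z0-9]*);");

    // Replaces every recognized entity in a single pass;
    // unknown or invalid entities are left exactly as they were.
    public static String decode(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = resolve(m.group(1), m.group());
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    private static String resolve(String body, String whole) {
        if (body.charAt(0) == '#') {
            boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
            int codePoint = hex
                ? Integer.parseInt(body.substring(2), 16)
                : Integer.parseInt(body.substring(1));
            return Character.isValidCodePoint(codePoint)
                ? new String(Character.toChars(codePoint))
                : whole;
        }
        String named = NAMED.get(body);
        return named != null ? named : whole;
    }
}
```

The decoded string would then be what you hand to the field at index time.
User queries typed as "Martínez" need no decoding, but they must go through
the same analyzer as the indexed text.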


-Yonik
http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server

