You actually wouldn't have to maintain two versions. You could,
instead, inject the accentless (stemmed) terms into your single
index as synonyms (see Lucene in Action). This is easier to
search and to maintain....

But it also bloats your index by some factor, since you're storing two
terms for every accented word in your corpus. And it gives you
headaches if there is more than one accent in a word (do you
then store all 4 possibilities for two accents? 8 for 3? etc.?).
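To make the synonym idea concrete, here is a minimal sketch of the folding step such a filter would perform, using java.text.Normalizer rather than any Lucene class (the class and method names below are illustrative, not from Lucene):

```java
import java.text.Normalizer;

// Illustrative sketch: compute the accent-stripped variant that a
// synonym-style filter would index alongside the original token
// (at the same position, so phrase queries still work).
public class AccentVariants {

    // Decompose to NFD, then drop the combining marks, leaving
    // the base letters: "civilización" -> "civilizacion".
    static String fold(String term) {
        return Normalizer.normalize(term, Normalizer.Form.NFD)
                         .replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        System.out.println(fold("civilización")); // civilizacion
    }
}
```

Note that if you only ever inject the fully folded form (all accents stripped at once), each accented word adds exactly one extra term; the 4-for-2, 8-for-3 blow-up only arises if you try to index every partial combination of stripped and unstripped accents.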

I think your notion of running the search terms through a dictionary
is a very good one. That way, your searcher doesn't have to care
about all this nonsense, and can assume correctly-accented characters.
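One way that dictionary step might look, sketched as a plain lookup table keyed by the accent-stripped spelling (everything here, including the class name, is a hypothetical illustration, not an existing API):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of correcting accentless query terms before
// they reach the analyzer: look each term up by its stripped form
// and substitute the correctly-accented dictionary entry if found.
public class QueryCorrector {
    private final Map<String, String> dictionary = new HashMap<>();

    // Register a word: key is the accent-stripped spelling,
    // value is the correctly-accented form.
    void add(String stripped, String accented) {
        dictionary.put(stripped, accented);
    }

    // Return the accented form if the dictionary knows one;
    // otherwise pass the term through untouched.
    String correct(String term) {
        return dictionary.getOrDefault(term, term);
    }
}
```

With this in place, a user typing "civilizacion" would have the term silently rewritten to "civilización" before analysis, so the regular accent-aware stemmer applies.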

Erick

On 4/28/07, Andrew Green <[EMAIL PROTECTED]> wrote:

El vie, 27-04-2007 a las 16:59 -0700, Chris Hostetter escribió:
> : In order to do this, we tried subclassing the SnowballAnalyzer... it
> : doesn't work yet, though. Here is the code of our custom class:
>
> At first glance, what you've got seems fine, can you elaborate on what
you
> mean by "it doesn't work" ?
>
> Perhaps the issue is that the SnowballStemmer can't handle the accented
> characters, and you should strip them first, then stem?
>
>   public TokenStream tokenStream(String fieldName, Reader reader) {
>     TokenStream result = new StandardTokenizer(reader);
>     result = new StandardFilter(result);
>     result = new LowerCaseFilter(result);
>     if (stopSet != null)
>       result = new StopFilter(result, stopSet);
>     result = new ISOLatin1AccentFilter(result);
>     result = new SnowballFilter(result, name);
>     return result;
>   }
>
Thanks for your answer, Chris.

It doesn't work for the opposite reason: it requires words to be spelled
correctly, including accents, in order to stem them. So, for example,
"civilización" and its plural, "civilizaciones" are stemmed correctly,
but the accentless version, "civilizacion", doesn't get stemmed at all.
So if someone misspells the word, omitting the accent, in the search
query--a likely scenario--the only hits they get are identical
misspellings in the documents, if such things exist. But we need
stemming of both accented and unaccented versions of the word. Stemming
misspellings may sound inherently evil, I suppose, but it seems to be
our best bet.
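The point of Chris's suggested filter ordering (accent filter before the stemmer) is that both spellings are folded to the same string before stemming, so they end up at the same stem even if that stem isn't the one the real Spanish stemmer would produce for the accented form. A toy, self-contained illustration of that ordering, where the "stemmer" is just a crude plural rule standing in for Snowball (both the folding and the "-es"/"-s" rule are assumptions for demonstration, not Lucene code):

```java
import java.text.Normalizer;

// Toy sketch: fold accents first, then stem, so that "civilización",
// "civilizaciones", and the misspelled "civilizacion" all meet at the
// same indexed term. The stem() rule below is a crude stand-in for a
// real Snowball stemmer.
public class FoldThenStem {

    // Strip diacritics via NFD decomposition.
    static String fold(String term) {
        return Normalizer.normalize(term, Normalizer.Form.NFD)
                         .replaceAll("\\p{M}", "");
    }

    // Stand-in stemmer: strip a trailing "-es" or "-s".
    static String stem(String term) {
        if (term.endsWith("es")) return term.substring(0, term.length() - 2);
        if (term.endsWith("s"))  return term.substring(0, term.length() - 1);
        return term;
    }

    // Analysis chain in the order Chris suggested: lowercase,
    // fold accents, then stem.
    static String analyze(String term) {
        return stem(fold(term.toLowerCase()));
    }
}
```

Here analyze("civilización"), analyze("civilizaciones"), and analyze("civilizacion") all yield "civilizacion", which is exactly the behavior you're after; the trade-off is that the stemmer now only ever sees accentless input, so its stems may differ from what it would produce on correctly-accented Spanish.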

We're currently trying to modify the SpanishStemmer to do this, but
haven't quite gotten it working yet.

Another option that I'm imagining might work, though less well, would be
to simultaneously maintain two indexes, one of correctly stemmed words
generated without the accents filter, and another of unstemmed words
with the accents stripped, and query both indexes when searching.

Yet another possibility would be, I think, to silently use a dictionary
to correct spellings in queries before searching.

A few Google queries suggest that they do things roughly the way we're
trying to, though perhaps not quite...

Thanks again,
Andrew

