On Fri, 27-04-2007 at 16:59 -0700, Chris Hostetter wrote:
> : In order to do this, we tried subclassing the SnowballAnalyzer... it
> : doesn't work yet, though. Here is the code of our custom class:
>
> At first glance, what you've got seems fine, can you elaborate on what you
> mean by "it doesn't work" ?
>
> Perhaps the issue is that the SnowballStemmer can't handle the accented
> characters, and you should strip them first, then stem?
>
>   public TokenStream tokenStream(String fieldName, Reader reader) {
>     TokenStream result = new StandardTokenizer(reader);
>     result = new StandardFilter(result);
>     result = new LowerCaseFilter(result);
>     if (stopSet != null)
>       result = new StopFilter(result, stopSet);
>     result = new ISOLatin1AccentFilter(result);
>     result = new SnowballFilter(result, name);
>     return result;
>   }
>

Thanks for your answer, Chris.
It doesn't work for the opposite reason: the stemmer requires words to be spelled correctly, including accents, in order to stem them. So, for example, "civilización" and its plural, "civilizaciones", are stemmed correctly, but the accentless version, "civilizacion", doesn't get stemmed at all. If someone misspells the word in the search query by omitting the accent (a likely scenario), the only hits they get are identical misspellings in the documents, if any exist.

We need stemming of both the accented and the unaccented versions of a word. Stemming misspellings may sound inherently evil, I suppose, but it seems to be our best bet. We're currently trying to modify the SpanishStemmer to do this, but haven't quite gotten it working yet.

Another option that I imagine might work, though less well, would be to maintain two indexes side by side: one of correctly stemmed words, generated without the accent filter, and another of unstemmed words with the accents stripped. We would then query both indexes when searching.

Yet another possibility, I think, would be to silently use a dictionary to correct spellings in queries before searching. A few Google queries suggest that they do things more or less the way we're trying to, though perhaps not exactly.

Thanks again,
Andrew
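
For what it's worth, here is a minimal sketch of the dictionary idea mentioned above: restore the missing accents on a query term before it reaches the stemmer. It uses only the JDK, not Lucene. The class name `AccentRestorer` and its methods are hypothetical, made up for illustration, and the word list is assumed to come from somewhere else (for example, the terms already in the index).

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Lucene: maps accentless query terms
// back to a correctly accented dictionary word before stemming.
final class AccentRestorer {
    // Words exactly as they appear in the dictionary (lowercased).
    private final Set<String> known = new HashSet<>();
    // Accent-stripped form -> accented dictionary form.
    private final Map<String, String> byStripped = new HashMap<>();

    AccentRestorer(List<String> dictionary) {
        for (String word : dictionary) {
            String lower = word.toLowerCase(Locale.ROOT);
            known.add(lower);
            // On collisions the first dictionary entry wins (an arbitrary choice).
            byStripped.putIfAbsent(stripAccents(lower), lower);
        }
    }

    // Decompose to NFD, then drop the combining marks.
    // Note that this also turns "ñ" into "n".
    static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    // Returns the term unchanged (lowercased) if it is already a dictionary
    // word or has no accented counterpart; otherwise returns that counterpart.
    String restore(String term) {
        String lower = term.toLowerCase(Locale.ROOT);
        if (known.contains(lower)) {
            return lower;
        }
        return byStripped.getOrDefault(stripAccents(lower), lower);
    }
}
```

With a dictionary containing "civilización", a query term typed as "civilizacion" would be rewritten to the accented form, which the stemmer can then handle like any correctly spelled word. Terms that are already valid words are left alone, so pairs such as "si" and "sí" are not collapsed into each other when both are in the dictionary.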