Mark Dilger <mark.dil...@enterprisedb.com> writes:
> I am a bit surprised to see that you are right about this, because non-latin
> languages often have transliteration/romanization schemes for writing the
> language in the Latin alphabet, developed before computers had wide spread
> adoption of non-ASCII character sets, and still in use today for text
> messaging. I expected to find stemming rules for transliterated words, but
> can't find any indication of that, neither in the postgres sources, nor in
> the snowball sources I pulled from their repo. Is there some architectural
> separation of stemming from transliteration such that we'd never need to
> worry about it? If snowball ever published stemmers for transliterated text,
> we might have to revisit this issue, but for now your proposed change sounds
> fine to me.

Agreed, if the Snowball stemmers worked on romanized texts then the
situation would be different.  But they don't, AFAICS.  Don't know if
that is architectural, or a policy decision, or just lack of round
tuits.

The thing that I actually find a bit shaky in this area is our
architectural decision to route words to different dictionaries
depending on whether they are all-ASCII or not.  AIUI that was done
purely on the basis of the Russian/English case; it would fail badly
if say you wanted to separate Russian from French.  However, I have
no great desire to revisit that design right now.

			regards, tom lane
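
To make the Russian/French concern concrete, here is a minimal Python
sketch of the all-ASCII routing rule. It is an illustration only, not the
PostgreSQL parser: the token type names asciiword and word follow the
default text search parser, while token_type(), route() and the ROUTING
table are hypothetical names made up for this example, and the mapping to
french_stem is an assumed configuration.

```python
# Illustration only: mimics routing by "is the word all-ASCII?".
# token_type(), route() and ROUTING are hypothetical, not PostgreSQL APIs.

def token_type(word: str) -> str:
    """Classify a word the way the all-ASCII rule does."""
    return "asciiword" if word.isascii() else "word"

# Assumed configuration: ASCII words go to French, the rest to Russian.
ROUTING = {
    "asciiword": "french_stem",
    "word": "russian_stem",
}

def route(word: str) -> str:
    """Return the name of the dictionary the word would be sent to."""
    return ROUTING[token_type(word)]

if __name__ == "__main__":
    for w in ["maison", "château", "дом"]:
        print(w, "->", route(w))
```

Under these assumptions "maison" goes to the French stemmer and "дом" to
the Russian one, but "château" also lands in the Russian stemmer because
of its one accented letter. With English in place of French the problem
mostly does not arise, since English words are nearly always all-ASCII.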