Mark Dilger <mark.dil...@enterprisedb.com> writes:
> I am a bit surprised to see that you are right about this, because non-Latin 
> languages often have transliteration/romanization schemes for writing the 
> language in the Latin alphabet, developed before computers had widespread 
> adoption of non-ASCII character sets, and still in use today for text 
> messaging.  I expected to find stemming rules for transliterated words, but 
> can't find any indication of that, neither in the postgres sources, nor in 
> the snowball sources I pulled from their repo.  Is there some architectural 
> separation of stemming from transliteration such that we'd never need to 
> worry about it?  If snowball ever published stemmers for transliterated text, 
> we might have to revisit this issue, but for now your proposed change sounds 
> fine to me.

Agreed, if the Snowball stemmers worked on romanized texts then the
situation would be different.  But they don't, AFAICS.  Don't know
if that is architectural, or a policy decision, or just lack of
round tuits.

The thing that I actually find a bit shaky in this area is our
architectural decision to route words to different dictionaries
depending on whether they are all-ASCII or not.  AIUI that was
done purely on the basis of the Russian/English case; it would
fail badly if, say, you wanted to separate Russian from French.
However, I have no great desire to revisit that design right now.
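
To make the routing concern concrete, here is a minimal sketch in Python
of an all-ASCII-or-not split.  It is an illustration only, not the actual
parser code: the function names and sample words are hypothetical, and
the mapping of asciiword to english_stem and word to russian_stem is
assumed here for a Russian/English setup.

```python
# Hypothetical sketch of routing words by "all-ASCII or not".
# Names are illustrative; this is not PostgreSQL's implementation.

# Assumed mapping for a Russian/English configuration.
ROUTING = {
    "asciiword": "english_stem",  # words made only of ASCII characters
    "word": "russian_stem",       # words containing any non-ASCII character
}


def token_type(word: str) -> str:
    """Classify a word purely on whether every character is ASCII."""
    return "asciiword" if word.isascii() else "word"


def route(word: str) -> str:
    """Return the dictionary a word would be sent to under this scheme."""
    return ROUTING[token_type(word)]


# English and Russian separate cleanly, but French words end up in
# both dictionaries depending on whether they carry an accent.
samples = ["running", "лето", "sac", "été"]
routes = {w: route(w) for w in samples}
```

Under these assumptions the unaccented French word lands with the English
stemmer and the accented one with the Russian stemmer, which is the sense
in which the split cannot separate Russian from French.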

                        regards, tom lane

