Tom Lane wrote:
Frans <fr...@geodan.nl> writes:
Tom Lane wrote:
The fuzzystrmatch module doesn't really work with utf8 (nor any other
multibyte encoding), because it depends on the <ctype.h> functions.
What you'll probably get when applying it to non-ascii utf8 is
an invalidly encoded string.
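To illustrate the point about multibyte encodings, here is a minimal Python sketch (not taken from fuzzystrmatch itself). It assumes the test character is code point 944, the value ascii() reports later in this thread, and shows that its UTF-8 form is two non-ASCII bytes, so code that works one byte at a time can end up keeping only part of the character. The helper name is_valid_utf8 is made up for this example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Code point 944 is assumed to be the test character (see ascii() = 944).
encoded = chr(944).encode("utf-8")

# A byte-oriented routine sees two separate bytes, both >= 0x80.
print(encoded.hex())                   # ceb0
print(is_valid_utf8(encoded))          # True

# Keeping only the lead byte leaves an incomplete sequence.
first_byte_only = encoded[:1]
print(is_valid_utf8(first_byte_only))  # False
```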

Well, in 8.2.6 the result for non-ASCII UTF-8 was an empty string (ASCII code 0).

A comparison of the 8.2 and 8.3 fuzzystrmatch sources shows no
difference.  The behavior of the ascii() function has indeed changed,
but soundex() is no more nor less broken than it was before.

[ thinks for a bit... ]  If you are seeing a difference in what soundex
itself does, the most likely explanation is a difference in the behavior
of isalpha() or perhaps toupper().  Are you running on the same
underlying C library as before?  Are you quite sure you have the same
encoding and locale selected?

Thank you for pointing me in the right direction. I have done some more research now. I installed 8.2.13 and 8.3.7 on the same workstation, selecting locale=C and character encoding=UTF-8 in both cases. In both cases soundex() behaved as desired, i.e. it produces an empty string if it cannot handle the input.

It looks like the difference in behaviour I noticed was not caused by the different PostgreSQL versions after all, but by a different locale setting. I see (in postgresql.ini) that for the database in which soundex() produces the 'wrong' output the locale was apparently set to 'Dutch_Netherlands'. I cannot recall consciously selecting this locale, but it might have been set by the installer. Does it make sense that the locale setting influences the workings of the soundex function?

In the database where I noticed the undesired soundex() behaviour I ran a further test, using the bit_length() function. The query "select bit_length(soundex('?'))" returns 0 in the case where ascii() also returns 0, but it returns 32 in the other case (where ascii() returns 944). So it seems soundex() really does produce different output in the two cases.
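One possible way to account for the 0 versus 32 bits is sketched below in Python. This is a simplified byte-wise model, not the fuzzystrmatch source: the two classifier functions are assumptions standing in for isalpha() under the C locale and under a hypothetical Latin-1-style single-byte locale, and the real C library may classify bytes differently.

```python
SOUNDEX_LEN = 4
# Standard Soundex digit for each letter A..Z.
SOUNDEX_TABLE = "01230120022455012623010202"


def is_alpha_c_locale(b: int) -> bool:
    """Model of isalpha() in the C locale: ASCII letters only."""
    return 0x41 <= b <= 0x5A or 0x61 <= b <= 0x7A


def is_alpha_single_byte(b: int) -> bool:
    """Hypothetical Latin-1-style classifier: 0xC0-0xFF also count as
    letters, except the multiplication and division signs."""
    return is_alpha_c_locale(b) or (0xC0 <= b <= 0xFF and b not in (0xD7, 0xF7))


def code_for(b: int) -> str:
    """Soundex digit for an ASCII letter, empty string for any other byte."""
    if is_alpha_c_locale(b):
        return SOUNDEX_TABLE[(b & 0xDF) - 0x41]
    return ""


def soundex_bytes(data: bytes, is_alpha) -> bytes:
    """Byte-at-a-time Soundex model with a pluggable isalpha()."""
    i = 0
    # Skip leading bytes that the classifier does not accept as letters.
    while i < len(data) and not is_alpha(data[i]):
        i += 1
    if i == len(data):
        return b""  # nothing usable: empty result
    first = data[i]
    if 0x61 <= first <= 0x7A:  # upper-case ASCII; other bytes kept as-is
        first -= 0x20
    out = bytearray([first])
    prev = code_for(data[i])
    for b in data[i + 1:]:
        if len(out) == SOUNDEX_LEN:
            break
        code = code_for(b) if is_alpha(b) else ""
        if code and code != "0" and code != prev:
            out += code.encode("ascii")
        prev = code
    out += b"0" * (SOUNDEX_LEN - len(out))  # pad with '0'
    return bytes(out)


data = chr(944).encode("utf-8")  # assumed test character, bytes ce b0
print(len(soundex_bytes(data, is_alpha_c_locale)) * 8)     # 0
print(len(soundex_bytes(data, is_alpha_single_byte)) * 8)  # 32
```

In this model the second result is the lead byte 0xCE followed by three '0' padding characters, which is four bytes but not valid UTF-8.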

I don't know whether this issue should still be regarded as a bug. At the least, it seems to me that the fact that the locale setting also affects the soundex function should be documented.

                        regards, tom lane
