Tom Lane wrote:
Frans <fr...@geodan.nl> writes:
Tom Lane wrote:
The
fuzzystrmatch module doesn't really work with utf8 (nor any other
multibyte encoding), because it depends on the <ctype.h> functions.
What you'll probably get when applying it to non-ascii utf8 is
an invalidly encoded string.
Well, in 8.2.6 the result for non-ASCII UTF-8 was an empty string (ASCII
code 0).
A comparison of the 8.2 and 8.3 fuzzystrmatch sources shows no
difference. The behavior of the ascii() function has indeed changed,
but soundex() is no more nor less broken than it was before.
[ thinks for a bit... ] If you are seeing a difference in what soundex
itself does, the most likely explanation is a difference in the behavior
of isalpha() or perhaps toupper(). Are you running on the same
underlying C library as before? Are you quite sure you have the same
encoding and locale selected?
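To make the isalpha()/toupper() point concrete, here is a minimal C
sketch. It is not the fuzzystrmatch source; count_alpha_bytes() and
first_letter() are hypothetical helpers invented for illustration. They
walk a string byte by byte, as code built on <ctype.h> does, and ask the
active LC_CTYPE locale whether each byte is a letter. The bytes of a
non-ASCII UTF-8 character are classified one at a time, so the result
depends on the locale rather than on the character.

```c
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: count the bytes of s that isalpha() accepts
 * under the currently active LC_CTYPE locale. */
static size_t count_alpha_bytes(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        /* Cast to unsigned char: passing a negative char value to
         * isalpha() is undefined behaviour. */
        if (isalpha((unsigned char) *s))
            n++;
    }
    return n;
}

/* Hypothetical helper: store the first byte that isalpha() accepts,
 * uppercased with toupper(), in out. out is left as an empty string
 * when no byte qualifies. */
static void first_letter(const char *s, char out[2])
{
    out[0] = '\0';
    out[1] = '\0';

    for (; *s != '\0'; s++) {
        if (isalpha((unsigned char) *s)) {
            out[0] = (char) toupper((unsigned char) *s);
            return;
        }
    }
}
```

In the "C" locale isalpha() accepts only ASCII letters, so for the
two-byte UTF-8 sequence "\xCF\x80" count_alpha_bytes() returns 0 and
first_letter() leaves out empty. Under a single-byte locale (selected
with setlocale(LC_CTYPE, ...)) some bytes above 0x7F may count as
letters, in which case a lone byte of a multibyte character would be
copied into the result.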
Thank you for pointing me in the right direction. I have done some more
research now. I have installed 8.2.13 and 8.3.7 on the same
workstation, selecting locale=C and character encoding=UTF-8 in both
cases. In both cases soundex() behaved as desired, i.e. it returns an
empty string if it cannot handle the input. It looks like the difference
in behaviour I noticed was not caused by the different PostgreSQL
versions after all, but by a different locale setting. I see (in
postgresql.conf) that for the database in which soundex() produces the
'wrong' output the locale apparently was set to 'Dutch_Netherlands'. I
cannot recall consciously selecting this locale but it might have been
set by the installer. Does it make sense that the locale setting
influences the workings of the soundex function?
In the database where I noticed the undesired soundex() behaviour I ran
a further test, using the bit_length() function. The query "select
bit_length(soundex('?'))" returns a value of 0 where ascii() also
returns 0, but it returns a value of 32 in the other case (where ascii()
returns 944). So it seems soundex() really does produce different output
in the two cases.
I don't know now whether this issue should still be regarded as a bug.
At least it seems to me that the fact that the locale setting also
affects the soundex function should be documented.
regards, tom lane