Hello folks,

my system is a SuSE 10.0 Linux and a plain PostgreSQL 8.1.2 (compiled by 
myself, NLS enabled). LOCALE is set to de_DE.UTF-8.

The bug shows up using the operator '~*' with umlauts. An easy way to produce 
a faulty result is

select 'XXXMÜLLERYyyy' ~* '.*müller.*'; 

The result should be "TRUE", however Postgres thinks, it's "FALSE" (see also 
discussion in www.pg-forum.de, subject "Konfiguration", thread "Umlaute bei 
Regular Expressions"). It seems that this problem does not exist in Windows 
based installations.

It seems to me that this bug is originated in the file 
src/backend/regex/regc_locale.c. The functions pg_wc_tolower(pg_wchar) and 
pg_wc_toupper(pg_wchar) rely on the C-functions toupper(unsigned char) and 
tolower(unsigned char) which definitely are the wrong choice for UTF8 
characters beyond the ASCII coding.

To check my estimation, I replaced the bodies of pg_wc_tolower and 
pg_wc_toupper simply by "return towlower(c);" and "return towupper(c);", 
which lead to the correct results of 
select 'XXXMÜLLERYyyy' ~* '.*müller.*';

Since I don't have any idea concerning the side effects of this change, please 
let me know as soon as an "official" patch is available - I definitely do 
need regular expressions handling UTF8 correctly...

Thanks,
Helmar Spangenberg
e-mail: [EMAIL PROTECTED]

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Reply via email to