On Nov 4, 2004, at 1:24 PM, Edmund Lian wrote:
I am running a web-based accounting package (SQL-Ledger) that supports multiple languages on PostgreSQL. When a database encoding is set to Unicode, multilingual operation is possible.
<snip />
Semantically, one might expect U+FF17 U+FF19 to be identical to U+0037 U+0039, but of course they aren't if a simple-minded byte-by-byte or character-by-character comparison is done.
In the ideal case, one would probably want to convert all full-width chars to their half-width equivalents, because the numbers look weird on the screen (e.g., "7 9 B r i s b a n e S t r e e t" instead of "79 Brisbane Street"). Is there any way to get PostgreSQL to do so?
Failing this, is there any way to get PostgreSQL to be a bit smarter in doing comparisons? I think I'm SOL, but I thought I'd ask anyway.
I've thought this would be a useful addition to PostgreSQL, but currently I think it's best handled in the application layer. A brief glance at the SQL-Ledger homepage shows that it's written in Perl. I'm still in the early learning stages of Perl (heck, I'm in the early learning stages of nearly everything), but I'd assume that with Perl's good Unicode support there should be a way to do this, similar to PHP's mb_convert_kana (which handles much more than just kana, btw). Ideally, I'd think you'd want to store all numbers and latin characters as single-width characters, so you'd filter them before they enter the database.
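For what it's worth, here's a rough sketch of what such a filter might look like in Perl. The function name double_to_single is just a placeholder, and I'm relying on the fact that the full-width forms U+FF01..U+FF5E sit at a constant offset of 0xFEE0 from ASCII 0x21..0x7E, so a single tr/// covers digits, latin letters, and punctuation in one go:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;

# Hypothetical helper: fold full-width ASCII variants (U+FF01-U+FF5E)
# and the ideographic space (U+3000) down to plain ASCII.
sub double_to_single {
    my ($s) = @_;
    # Full-width forms are ASCII + 0xFEE0, so the ranges line up 1:1.
    $s =~ tr/\x{FF01}-\x{FF5E}/\x{0021}-\x{007E}/;
    $s =~ s/\x{3000}/ /g;    # ideographic space -> ASCII space
    return $s;
}

# U+FF17 U+FF19 ("７９") folds to plain "79":
print double_to_single("\x{FF17}\x{FF19} Brisbane Street"), "\n";
# prints: 79 Brisbane Street
```

You'd call this on the relevant fields before the INSERT/UPDATE, so everything in the database is already normalized and ordinary comparisons just work.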
I'd think this might be best placed in the SQL-Ledger code, though you might be able to fashion a plperl function that would do the same thing. You could either update all entries (UPDATE foo SET bar = double_to_single(bar)) or make a functional index on double_to_single(bar).
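To make that concrete, the plperl route might look something like the following. This is only a sketch (it assumes plperl is installed in the database, uses dollar quoting, and foo/bar are stand-ins for your real table and column); note the function must be declared IMMUTABLE for the functional index to be allowed:

```sql
-- Hypothetical plperl wrapper for the full-width -> half-width fold.
CREATE FUNCTION double_to_single(text) RETURNS text AS $$
    my $s = shift;
    $s =~ tr/\x{FF01}-\x{FF5E}/\x{0021}-\x{007E}/;
    $s =~ s/\x{3000}/ /g;
    return $s;
$$ LANGUAGE plperl IMMUTABLE;

-- Either normalize the existing rows in place:
UPDATE foo SET bar = double_to_single(bar);

-- ...or leave the data alone and index the normalized form,
-- then compare via double_to_single(bar) in your queries:
CREATE INDEX foo_bar_single_idx ON foo (double_to_single(bar));
```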
I'm not sure which would be best, and others out there have more informed opinions than mine, which I'd love to read.
Hope this helps a bit.
Michael
---------------------------(end of broadcast)--------------------------- TIP 4: Don't 'kill -9' the postmaster