Re: [GENERAL] UTF-8 and LIKE vs =

2004-08-25 Thread Lincoln Yeoh
At 09:20 PM 8/24/2004 +0200, Peter Eisentraut wrote: David Wheeler wrote: > That's not the trouble so much as that the locales can be badly If we always followed the principle "X could be broken, so let's not use X", then we would never get anything done. Instead, "X is broken, so fix it". > broke

Re: [GENERAL] UTF-8 and LIKE vs =

2004-08-24 Thread Tom Lane
David Wheeler <[EMAIL PROTECTED]> writes: >>> Hmm. I tried putting your string into a UNICODE database and I got >>> ERROR: invalid byte sequence for encoding "UNICODE": 0xc7 >> >> Really? Curious. > Oh, are you sure that you got my UTF-8 data? Because it came back in > your reply all mangled.

Re: [GENERAL] UTF-8 and LIKE vs =

2004-08-24 Thread David Wheeler
On Aug 24, 2004, at 12:20 PM, Peter Eisentraut wrote: broken, and that they're useless for multilingual use. I don't agree with that, but perhaps we differ in our interpretation of "multilingual use". If you have special requirements, you can always turn the locales off. Well, we're getting beyond

Re: [GENERAL] UTF-8 and LIKE vs =

2004-08-24 Thread David Wheeler
On Aug 23, 2004, at 10:25 PM, Joel wrote: If the locale machinery is functioning correctly (and if I understand correctly), there ought to be a setting that would allow those to collate to the same point. Bleh. There must be some distinction between them. It sounds like querying for synonyms. I'm

Re: [GENERAL] UTF-8 and LIKE vs =

2004-08-24 Thread Peter Eisentraut
David Wheeler wrote: > But given what you've said, Tatsuo, it makes me wonder if it's worth > it to use the system locale default when running initdb? Yes, because that is the locale that the user prefers. If a locale is broken then you shouldn't set it as system locale in the first place. --

Re: [GENERAL] UTF-8 and LIKE vs =

2004-08-23 Thread Joel
On Tue, 24 Aug 2004 01:34:46 +0200, Ian Barwick <[EMAIL PROTECTED]> wrote: > ... > wild speculation in need of a Korean speaker, but: > > [EMAIL PROTECTED]:~/tmp> cat j.txt > $Bec,e$;ec(Bˆ > $ByyPl%$%9wd!"(B > $Bx"(l%$(B€l$B%i(B > $Bw{%1v.%/wd(Bœ >

Re: [GENERAL] UTF-8 and LIKE vs =

2004-08-23 Thread David Wheeler
On Aug 23, 2004, at 6:49 PM, Tim Allen wrote: One possible clue: your original post in this thread was using encoding euc-kr, not unicode (utf-8). If your mailer was set to use that encoding, perhaps your other client software is/was also? Bah! Stupid Mail.app was trying to be too smart! Thanks,
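A minimal sketch of the mismatch Tim Allen describes, assuming the query string was sent as EUC-KR bytes while the database expected UTF-8. The sample string is one of the Korean words quoted elsewhere in this thread; the helper name `is_valid_utf8` is hypothetical, and this uses Python's own decoder, not PostgreSQL's encoding check.

```python
# Sketch only: shows that the same text has different byte sequences in
# EUC-KR and UTF-8, and that the EUC-KR bytes are not valid UTF-8.
# `is_valid_utf8` is a hypothetical helper, not a PostgreSQL function.

def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

text = "북한의"                      # sample word from the thread
euc_kr_bytes = text.encode("euc_kr")  # what a client using euc-kr would send
utf8_bytes = text.encode("utf-8")     # what a UNICODE database expects

print(euc_kr_bytes.hex(" "), is_valid_utf8(euc_kr_bytes))
print(utf8_bytes.hex(" "), is_valid_utf8(utf8_bytes))
```

If the client really is sending EUC-KR, the usual remedy is to make the client encoding setting match what the client actually sends, so that the server can convert instead of storing or comparing mislabeled bytes.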

Re: [GENERAL] UTF-8 and LIKE vs =

2004-08-23 Thread Tim Allen
Tom Lane wrote: David Wheeler <[EMAIL PROTECTED]> writes: bric=# reindex index udx_keyword__name; REINDEX bric=# select * from keyword where name ='=BA=CF=C7=D1=C0=C7'; id | name | screen_name | sort_name | active --++-+---+ 1218 | =B1=B9=B9=E6=BA

Re: [GENERAL] UTF-8 and LIKE vs =

2004-08-23 Thread David Wheeler
On Aug 23, 2004, at 5:22 PM, Tatsuo Ishii wrote: Locales for multibyte encodings are often broken on many platforms. I see identical things with Japanese on Red Hat. This is one of the reasons why I tell Japanese PostgreSQL users not to enable locale while running initdb... Yep, and exporting my data, delet

Re: [GENERAL] UTF-8 and LIKE vs =

2004-08-23 Thread Tatsuo Ishii
> > > > On Mon, 23.08.2004, at 23:04, David Wheeler wrote: > > > On Aug 23, 2004, at 1:58 PM, Ian Barwick wrote: > > > > > > > er, the characters in "name" don't seem to match the characters in the > > > > query - '국방비' vs. '북한의' - does that have any bearing? > > > > > > Yes, it means that = is doin

Re: [GENERAL] UTF-8 and LIKE vs =

2004-08-23 Thread David Wheeler
On Aug 23, 2004, at 5:07 PM, Ian Barwick wrote: Does this go away if you change your locale to C? Yes. Hallelujah! I'm running initdb again now. Cheers, David

Re: [GENERAL] UTF-8 and LIKE vs =

2004-08-23 Thread Ian Barwick
On Mon, 23 Aug 2004 16:50:04 -0700, David Wheeler <[EMAIL PROTECTED]> wrote: > On Aug 23, 2004, at 4:34 PM, Ian Barwick wrote: > > > wild speculation in need of a Korean speaker, but: > > > > [EMAIL PROTECTED]:~/tmp> cat j.txt > > テスト > > 환경설 > > 전검색 > > 웹문서 > > 국방비 > > 북한의 > > てすと > > [EMAIL PROT

Re: [GENERAL] UTF-8 and LIKE vs =

2004-08-23 Thread David Wheeler
On Aug 23, 2004, at 4:49 PM, David Wheeler wrote: Hmm. I tried putting your string into a UNICODE database and I got ERROR: invalid byte sequence for encoding "UNICODE": 0xc7 Really? Curious. Oh, are you sure that you got my UTF-8 data? Because it came back in your reply all mangled. Cheers, Da

Re: [GENERAL] UTF-8 and LIKE vs =

2004-08-23 Thread David Wheeler
On Aug 23, 2004, at 4:34 PM, Ian Barwick wrote: wild speculation in need of a Korean speaker, but: [EMAIL PROTECTED]:~/tmp> cat j.txt テスト 환경설 전검색 웹문서 국방비 북한의 てすと [EMAIL PROTECTED]:~/tmp> uniq j.txt テスト 환경설 てすと All but the first and last lines are random Korean (Hangul) characters. Evidently our re

Re: [GENERAL] UTF-8 and LIKE vs =

2004-08-23 Thread David Wheeler
On Aug 23, 2004, at 4:35 PM, Tom Lane wrote: Hmm. I tried putting your string into a UNICODE database and I got ERROR: invalid byte sequence for encoding "UNICODE": 0xc7 Really? Curious. So there's something funny happening here. What is your client_encoding setting? It's not set. I've had it c

Re: [GENERAL] UTF-8 and LIKE vs =

2004-08-23 Thread Tom Lane
David Wheeler <[EMAIL PROTECTED]> writes: >> Is the problem query using an index? If so, does REINDEX help? > Doesn't look like it: > bric=# reindex index udx_keyword__name; > REINDEX > bric=# select * from keyword where name ='=BA=CF=C7=D1=C0=C7'; >id | name | screen_name | sort_na

Re: [GENERAL] UTF-8 and LIKE vs =

2004-08-23 Thread Ian Barwick
On Tue, 24 Aug 2004 00:46:50 +0200, Markus Bertheau <[EMAIL PROTECTED]> wrote: > > > On Mon, 23.08.2004, at 23:04, David Wheeler wrote: > > On Aug 23, 2004, at 1:58 PM, Ian Barwick wrote: > > > > > er, the characters in "name" don't seem to match the characters in the > > > query - '국방비' vs. '북한의'

Re: [GENERAL] UTF-8 and LIKE vs =

2004-08-23 Thread David Wheeler
On Aug 23, 2004, at 4:08 PM, Tom Lane wrote: [ looks back at discussion... ] Actually I misremembered --- the discussion was about how we would *reject* legal UTF-8 codes that are more than 2 bytes long. So the code is broken, but not in the direction that would cause your problem. Time for ano

Re: [GENERAL] UTF-8 and LIKE vs =

2004-08-23 Thread Tom Lane
David Wheeler <[EMAIL PROTECTED]> writes: > Is the encoding check fixed in 8.0beta1? [ looks back at discussion... ] Actually I misremembered --- the discussion was about how we would *reject* legal UTF-8 codes that are more than 2 bytes long. So the code is broken, but not in the direction that

Re: [GENERAL] UTF-8 and LIKE vs =

2004-08-23 Thread David Wheeler
On Aug 23, 2004, at 3:59 PM, Tom Lane wrote: But is it possible to store non-UTF-8 data in a UNICODE database? In theory not ... but I think there was a discussion earlier that concluded that our check for encoding validity is not airtight ... Well, if it was mostly right, I wouldn't expect it to b

Re: [GENERAL] UTF-8 and LIKE vs =

2004-08-23 Thread Tom Lane
David Wheeler <[EMAIL PROTECTED]> writes: > But is it possible to store non-UTF-8 data in a UNICODE database? In theory not ... but I think there was a discussion earlier that concluded that our check for encoding validity is not airtight ... regards, tom lane ---

Re: [GENERAL] UTF-8 and LIKE vs =

2004-08-23 Thread David Wheeler
On Aug 23, 2004, at 3:46 PM, Markus Bertheau wrote: The collation rules of your (and my) locale say that these strings are the same: [EMAIL PROTECTED] markus]$ cat > t 국방비 북한의 [EMAIL PROTECTED] markus]$ uniq t 국방비 [EMAIL PROTECTED] markus]$ Interesting. Make sure that you have initdb'd the database
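The uniq result above comes down to the locale's collation function reporting two different strings as equal. A rough way to check a given system, sketched here in Python under the assumption that the locale names passed in ("C", and "en_US.UTF-8" as an example that may not be installed) are meaningful on the machine; the helper `collates_equal` is hypothetical, and the answer for any non-C locale depends on the platform's locale data.

```python
import locale

def collates_equal(a: str, b: str, locale_name: str):
    """Return True if strcoll() treats a and b as equal under locale_name,
    False if it does not, or None if the locale is not installed.
    Hypothetical helper for illustration only."""
    saved = locale.setlocale(locale.LC_COLLATE)  # query current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None  # locale not available on this system
    try:
        return locale.strcoll(a, b) == 0
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)  # restore previous setting

a, b = "국방비", "북한의"  # the two strings from Markus's uniq test

# The strings are different code point sequences ...
print("same string:", a == b)
# ... so any True below means the locale's collation rules conflate them.
for name in ("C", "en_US.UTF-8"):
    print(name, "collates equal:", collates_equal(a, b, name))
```

In the C locale the comparison is not driven by language-specific collation rules, which is consistent with the reports in this thread that re-running initdb with locale C made the problem go away.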