Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread Oliver Jowett
Tatsuo Ishii wrote: Tom Lane wrote: If I understood what I was reading, this would take several things: * Remove the "special UTF-8 check" in pg_verifymbstr; * Extend pg_utf2wchar_with_len and pg_utf_mblen to handle the 4-byte case; * Set maxmblen to 4 in the pg_wchar_table[] entry for UTF-8. Are

Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread Tatsuo Ishii
> Tom Lane wrote: > > > If I understood what I was reading, this would take several things: > > * Remove the "special UTF-8 check" in pg_verifymbstr; > > * Extend pg_utf2wchar_with_len and pg_utf_mblen to handle the 4-byte case; > > * Set maxmblen to 4 in the pg_wchar_table[] entry for UTF-8. > >

Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread Tom Lane
Oliver Jowett <[EMAIL PROTECTED]> writes: > Does this change what client_encoding = UNICODE might produce? The JDBC > driver will need some tweaking to handle this -- Java uses UTF-16 > internally and I think some supplementary character (?) scheme for > values above 0xFFFF as of JDK 1.5. You'r

Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread Oliver Jowett
Tom Lane wrote: If I understood what I was reading, this would take several things: * Remove the "special UTF-8 check" in pg_verifymbstr; * Extend pg_utf2wchar_with_len and pg_utf_mblen to handle the 4-byte case; * Set maxmblen to 4 in the pg_wchar_table[] entry for UTF-8. Are there any other place
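As a rough illustration of what "handling the 4-byte case" involves, here is a minimal sketch of a decoder that accepts UTF-8 sequences of 1 to 4 bytes. It is not the PostgreSQL implementation of pg_utf2wchar_with_len or pg_utf_mblen; the name utf8_decode_one and its return convention are assumptions made for this example, and it deliberately omits overlong-form and surrogate checks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not PostgreSQL source: decode one UTF-8 sequence
 * of 1 to 4 bytes starting at s, with at most len bytes available.
 * Returns the number of bytes consumed, or 0 if the sequence is truncated,
 * has a bad continuation byte, or starts with a byte this sketch rejects.
 * Overlong forms and surrogate values are NOT checked here.
 */
int
utf8_decode_one(const unsigned char *s, size_t len, uint32_t *out)
{
    int         nbytes;
    uint32_t    cp;

    assert(s != NULL && out != NULL);

    if (len == 0)
        return 0;

    if (s[0] < 0x80)
    {
        nbytes = 1;
        cp = s[0];
    }
    else if ((s[0] & 0xE0) == 0xC0)
    {
        nbytes = 2;
        cp = s[0] & 0x1F;
    }
    else if ((s[0] & 0xF0) == 0xE0)
    {
        nbytes = 3;
        cp = s[0] & 0x0F;
    }
    else if ((s[0] & 0xF8) == 0xF0)
    {
        /* the 4-byte case: lead byte 11110xxx */
        nbytes = 4;
        cp = s[0] & 0x07;
    }
    else
        return 0;

    if ((size_t) nbytes > len)
        return 0;

    /* each continuation byte must look like 10xxxxxx and adds 6 bits */
    for (int i = 1; i < nbytes; i++)
    {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    *out = cp;
    return nbytes;
}
```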

Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread John Hansen
> -----Original Message----- > From: Tom Lane [mailto:[EMAIL PROTECTED] > Sent: Sunday, August 08, 2004 2:43 AM > To: Dennis Bjorklund > Cc: Tatsuo Ishii; John Hansen; [EMAIL PROTECTED]; > [EMAIL PROTECTED] > Subject: Re: [PATCHES] [HACKERS] UNICODE characters above

Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread Tom Lane
Dennis Bjorklund <[EMAIL PROTECTED]> writes: > On Sat, 7 Aug 2004, Tatsuo Ishii wrote: >> Anyway my point is if current specification of Unicode only allows >> 24-bit range, why we need to allow usage against the specification? > Is there a specific reason you want to restrict it to 24 bits? I se

Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread John Hansen
> -----Original Message----- > From: Dennis Bjorklund [mailto:[EMAIL PROTECTED] > Sent: Saturday, August 07, 2004 11:23 PM > To: John Hansen > Cc: Takehiko Abe; [EMAIL PROTECTED] > Subject: RE: [PATCHES] [HACKERS] UNICODE characters above 0x10000 > > On Sat, 7 Aug

Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread Dennis Bjorklund
On Sat, 7 Aug 2004, John Hansen wrote: > Now, is it really 24 bits tho? > Afaict, it's really 21 (0 - 10FFFF or 0 - xxx10000 11111111 11111111) Yes, up to 0x10FFFF should be enough. The 24 is not really important, this is all about what utf-8 strings to accept as input. The strings are stored

Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread John Hansen
> -----Original Message----- > From: [EMAIL PROTECTED] > [mailto:[EMAIL PROTECTED] On Behalf Of > Dennis Bjorklund > Sent: Saturday, August 07, 2004 10:48 PM > To: Takehiko Abe > Cc: [EMAIL PROTECTED] > Subject: Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread Dennis Bjorklund
On Sat, 7 Aug 2004, Takehiko Abe wrote: It looked like you sent the last mail only to me and not the list. I assume it was a mistake and I send the reply to both. > > Is there a specific reason you want to restrict it to 24 bits? > > ISO 10646 is said to have removed its private use codepoints

Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread John Hansen
: [PATCHES] [HACKERS] UNICODE characters above 0x10000 On Sat, 7 Aug 2004, John Hansen wrote: > should not allow them to be stored, since there might be someone using > the high ranges for a private character set, which could very well be > included in the specification some day. There

Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread Dennis Bjorklund
On Sat, 7 Aug 2004, Tatsuo Ishii wrote: > More seriously, Unicode is filled with tons of confusion and > inconsistency IMO. Remember that once Unicode advocates said that the > merit of Unicode was it only requires 16-bit width. Now they say they > need surrogate pairs and 32-bit width chars... >

Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread John Hansen
s are allowed? Regards, John Hansen -----Original Message----- From: Tatsuo Ishii [mailto:[EMAIL PROTECTED] Sent: Saturday, August 07, 2004 8:46 PM To: John Hansen Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED] Subject: Re: [PATCHES] [HACKERS] UNICODE characters

Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread Dennis Bjorklund
On Sat, 7 Aug 2004, John Hansen wrote: > should not allow them to be stored, since there might be someone using > the high ranges for a private character set, which could very well be > included in the specification some day. There are areas reserved for private character sets. -- /Dennis Björk

Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread Tatsuo Ishii
> Yes, but the specification allows for 6byte sequences, or 32bit > characters. UTF-8 is just an encoding specification, not character set specification. Unicode only has 17 256x256 planes in its specification. > As dennis pointed out, just because they're not used, doesn't mean we > should not a
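The "17 256x256 planes" figure can be checked with a little arithmetic: 17 planes of 0x10000 code points each give a code space of 0..0x10FFFF. The sketch below only illustrates that arithmetic; the macro and function names are hypothetical and do not come from the PostgreSQL sources.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for illustration only */
#define UNICODE_PLANE_COUNT     17
#define UNICODE_PLANE_SIZE      0x10000u    /* 256 x 256 code points */
#define UNICODE_MAX_CODE_POINT  (UNICODE_PLANE_COUNT * UNICODE_PLANE_SIZE - 1)

/* Returns the plane number (0..16) of cp, or -1 if cp is outside the code space. */
int
unicode_plane(uint32_t cp)
{
    if (cp > UNICODE_MAX_CODE_POINT)
        return -1;
    return (int) (cp >> 16);
}

/* Returns the first code point of the given plane (0..16). */
uint32_t
unicode_plane_first(int plane)
{
    assert(plane >= 0 && plane < UNICODE_PLANE_COUNT);
    return (uint32_t) plane << 16;
}
```

Since 0x10FFFF fits in 21 bits, a 4-byte UTF-8 sequence (which carries 21 payload bits) is enough to reach every plane.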

Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread John Hansen
] [HACKERS] UNICODE characters above 0x10000 > Dennis Bjorklund <[EMAIL PROTECTED]> writes: > > ... This also means that the start byte can never start with 7 or 8 > > ones, that is illegal and should be tested for and rejected. So the > > longest utf-8 sequence is 6 byt

Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread Tatsuo Ishii
> Dennis Bjorklund <[EMAIL PROTECTED]> writes: > > ... This also means that the start byte can never start with 7 or 8 > > ones, that is illegal and should be tested for and rejected. So the > > longest utf-8 sequence is 6 bytes (and the longest character needs 4 > > bytes (or 31 bits)). > > Tatsu
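The start-byte rule quoted here can be sketched as a classifier over the leading byte, using the original UTF-8 scheme with sequences of up to 6 bytes (31 payload bits). This is an illustration only: the function name utf8_sequence_length is hypothetical, and the later RFC 3629 definition of UTF-8 is stricter, allowing at most 4 bytes.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not PostgreSQL source: return the sequence length
 * announced by the first byte at s under the original (up to 6-byte)
 * UTF-8 scheme, or 0 if that byte cannot start a sequence.
 */
int
utf8_sequence_length(const unsigned char *s)
{
    unsigned char b;

    assert(s != NULL);
    b = s[0];

    if (b < 0x80)
        return 1;               /* 0xxxxxxx */
    if (b < 0xC0)
        return 0;               /* 10xxxxxx: continuation byte, not a start */
    if (b < 0xE0)
        return 2;               /* 110xxxxx */
    if (b < 0xF0)
        return 3;               /* 1110xxxx */
    if (b < 0xF8)
        return 4;               /* 11110xxx */
    if (b < 0xFC)
        return 5;               /* 111110xx */
    if (b < 0xFE)
        return 6;               /* 1111110x */
    return 0;                   /* 0xFE, 0xFF: 7 or 8 leading ones, rejected */
}
```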