Re: [HACKERS] Unicode support

2009-04-20 Thread Peter Eisentraut
On Sunday 19 April 2009 18:54:45 Tom Lane wrote: > Peter Eisentraut writes: > > On Monday 13 April 2009 20:18:31 - - wrote: > >> 1) Functions like char_length() or length() do NOT return the number > >> of characters (the manual says they do), instead they return the > >> number of code points. >

Re: [HACKERS] Unicode support

2009-04-19 Thread Tom Lane
Peter Eisentraut writes: > On Monday 13 April 2009 20:18:31 - - wrote: >> 1) Functions like char_length() or length() do NOT return the number >> of characters (the manual says they do), instead they return the >> number of code points. > I have added a Todo item about possibly fixing this. I th

Re: [HACKERS] Unicode support

2009-04-19 Thread Peter Eisentraut
On Monday 13 April 2009 20:18:31 - - wrote: > 1) Functions like char_length() or length() do NOT return the number > of characters (the manual says they do), instead they return the > number of code points. I have added a Todo item about possibly fixing this. -- Sent via pgsql-hackers mailing l

Re: [HACKERS] Unicode support

2009-04-15 Thread Martijn van Oosterhout
On Tue, Apr 14, 2009 at 11:32:57AM -0700, David E. Wheeler wrote: > I've no idea what it would require, but the mapping table must be > pretty substantial. Still, I'd love to have this functionality in the > database. The Unicode tables in ICU outweigh the size of the code by a factor of 5 or so.

Re: [HACKERS] Unicode support

2009-04-14 Thread Andrew Gierth
> "Peter" == Peter Eisentraut writes: > On Tuesday 14 April 2009 07:07:27 Andrew Gierth wrote: >> FWIW, the SQL spec puts the onus of normalization squarely on the >> application; the database is allowed to assume that Unicode >> strings are already normalized, is allowed to behave in >>

Re: [HACKERS] Unicode support

2009-04-14 Thread David E. Wheeler
On Apr 14, 2009, at 11:10 AM, Tom Lane wrote: Andrew Dunstan writes: I think there's a good case for some functions implementing the various Unicode normalization functions, though. I have no objection to that so long as the code footprint is in line with the utility gain (i.e. not all tha

Re: [HACKERS] Unicode support

2009-04-14 Thread - -
>> I don't believe that the standard forbids the use of combining chars at all. >> RFC 3629 says: >> >> ... This issue is amenable to solutions based on Unicode Normalization >> Forms, see [UAX15]. > This is the relevant part. Tom was claiming that the UTF8 encoding required > normalizing the st

Re: [HACKERS] Unicode support

2009-04-14 Thread Peter Eisentraut
On Tuesday 14 April 2009 19:26:41 Tom Lane wrote: > Another question is "what is the purpose of a database"? To me it would > be quite the wrong thing for the DB to not store what is presented, as > long as it's considered legal. Normalization of legal variant forms > seems pretty questionable.

Re: [HACKERS] Unicode support

2009-04-14 Thread Peter Eisentraut
On Monday 13 April 2009 20:18:31 - - wrote: > 2) PG has no support for the Unicode collation algorithm. Collation is > offloaded to the OS, which makes this quite inflexible. This argument is unclear. Do you want the Unicode collation algorithm or do you want flexibility? Some OS do implement t

Re: [HACKERS] Unicode support

2009-04-14 Thread Peter Eisentraut
On Tuesday 14 April 2009 18:49:45 Greg Stark wrote: > What's really at issue is "what is a string?". That is, is it a sequence > of characters or a sequence of code points. I think a sequence of codepoints would be about as silly a definition as the antiquated notion of a string as a sequence of byt

Re: [HACKERS] Unicode support

2009-04-14 Thread Kevin Grittner
Greg Stark wrote:
> Peter Eisentraut wrote:
>> SELECT U&'\00E9', char_length(U&'\00E9');
>>  ?column? | char_length
>> ----------+-------------
>>  é        |           1
>> (1 row)
>>
>> SELECT U&'\0065\0301', char_length(U&'\0065\0301');
>>  ?column? | char_length
>> ----------+-------------
>
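For readers following along outside the database, the distinction the two quoted queries draw can be reproduced with a small sketch. This is an illustration only, written in Python with the standard unicodedata module; it is not PostgreSQL code, and the helper names code_points and nfc_length are made up for this example.

```python
import unicodedata

# The same visible character, spelled two ways.
precomposed = "\u00e9"        # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "\u0065\u0301"   # "e" followed by COMBINING ACUTE ACCENT


def code_points(s: str) -> int:
    """Count code points, which is what a code-point-based length reports."""
    return len(s)


def nfc_length(s: str) -> int:
    """Count code points after canonical composition (Normalization Form C)."""
    return len(unicodedata.normalize("NFC", s))


if __name__ == "__main__":
    for label, s in (("precomposed", precomposed), ("decomposed", decomposed)):
        print(label, code_points(s), nfc_length(s))
```

The two strings are canonically equivalent, yet a plain code-point count tells them apart; counting after NFC normalization gives the same answer for both in this particular case.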

Re: [HACKERS] Unicode support

2009-04-14 Thread Tom Lane
Andrew Dunstan writes: > I think there's a good case for some functions implementing the various > Unicode normalization functions, though. I have no objection to that so long as the code footprint is in line with the utility gain (i.e. not all that much). If we have to bring in ICU or somethin

Re: [HACKERS] Unicode support

2009-04-14 Thread Andrew Dunstan
Kevin Grittner wrote: I'm curious -- can every multi-code-point character be normalized to a single-code-point character? I don't believe so. Those combinations used in the most common orthographic languages have their own code points, but I understand you can use the combining chars
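Kevin's question can be checked empirically with a short sketch. It uses Python's standard unicodedata module rather than anything in PostgreSQL, and the helper name composes_to_single is hypothetical. The choice of "q" plus a combining acute accent as a sequence with no precomposed form is an example picked for this illustration, not one taken from the thread.

```python
import unicodedata


def composes_to_single(base: str, mark: str) -> bool:
    """Return True if NFC collapses base + combining mark into one code point."""
    return len(unicodedata.normalize("NFC", base + mark)) == 1


if __name__ == "__main__":
    acute = "\u0301"  # COMBINING ACUTE ACCENT
    print("e + acute:", composes_to_single("e", acute))
    print("q + acute:", composes_to_single("q", acute))
```

Where no precomposed code point exists, NFC leaves the base character and the combining mark as separate code points, so normalization alone does not make a code-point count equal a character count.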

Re: [HACKERS] Unicode support

2009-04-14 Thread Andrew Dunstan
David E. Wheeler wrote: On Apr 14, 2009, at 9:26 AM, Tom Lane wrote: Another question is "what is the purpose of a database"? To me it would be quite the wrong thing for the DB to not store what is presented, as long as it's considered legal. Normalization of legal variant forms seems prett

Re: [HACKERS] Unicode support

2009-04-14 Thread David E. Wheeler
On Apr 14, 2009, at 9:26 AM, Tom Lane wrote: Another question is "what is the purpose of a database"? To me it would be quite the wrong thing for the DB to not store what is presented, as long as it's considered legal. Normalization of legal variant forms seems pretty questionable. So I'm w

Re: [HACKERS] Unicode support

2009-04-14 Thread Tom Lane
Greg Stark writes: > What's really at issue is "what is a string?". That is, is it a sequence > of characters or a sequence of code points. If it's the former then we > would also have to prohibit certain strings such as U&'\0301' > entirely. And we have to make substr() pick out the right number of

Re: [HACKERS] Unicode support

2009-04-14 Thread Greg Stark
On Tue, Apr 14, 2009 at 1:32 PM, Peter Eisentraut wrote: > On Monday 13 April 2009 22:39:58 Andrew Dunstan wrote: >> Umm, but isn't that because your encoding is using one code point? >> >> See the OP's explanation w.r.t. canonical equivalence. >> >> This isn't about the number of bytes, but about

Re: [HACKERS] Unicode support

2009-04-14 Thread Peter Eisentraut
On Tuesday 14 April 2009 07:07:27 Andrew Gierth wrote: > FWIW, the SQL spec puts the onus of normalization squarely on the > application; the database is allowed to assume that Unicode strings > are already normalized, is allowed to behave in implementation-defined > ways when presented with string

Re: [HACKERS] Unicode support

2009-04-14 Thread Peter Eisentraut
On Monday 13 April 2009 22:39:58 Andrew Dunstan wrote: > Umm, but isn't that because your encoding is using one code point? > > See the OP's explanation w.r.t. canonical equivalence. > > This isn't about the number of bytes, but about whether or not we should > count characters encoded as two or mo

Re: [HACKERS] Unicode support

2009-04-13 Thread Andrew Gierth
> "Gregory" == Gregory Stark writes: >>> I don't believe that the standard forbids the use of combining >>> chars at all. RFC 3629 says: >>> >>> ... This issue is amenable to solutions based on Unicode >>> Normalization Forms, see [UAX15]. Gregory> This is the relevant part. Tom was

Re: [HACKERS] Unicode support

2009-04-13 Thread Gregory Stark
- - writes: >>> The original post seemed to be a contrived attempt to say "you should >>> use ICU". >> >> Indeed. The OP should go read all the previous arguments about ICU >> in our archives. > > Not at all. I just was making a suggestion. You may use any other > library or implement it yourse

Re: [HACKERS] Unicode support

2009-04-13 Thread - -
Tom Lane wrote: > Greg Stark writes: >> Is it really true that canonical encodings never contain any composed >> characters in them? I thought there were some glyphs which could only >> be represented by composed characters. > > AFAIK that's not true. However, in my original comment I was think

Re: [HACKERS] Unicode support

2009-04-13 Thread Andrew Dunstan
Tom Lane wrote: Andrew Dunstan writes: This isn't about the number of bytes, but about whether or not we should count characters encoded as two or more combined code points as a single char or not. It's really about whether we should support non-canonical encodings. AFAIK that's a

Re: [HACKERS] Unicode support

2009-04-13 Thread Tom Lane
Greg Stark writes: > Is it really true that canonical encodings never contain any composed > characters in them? I thought there were some glyphs which could only > be represented by composed characters. AFAIK that's not true. However, in my original comment I was thinking about UTF16 surrogate

Re: [HACKERS] Unicode support

2009-04-13 Thread Greg Stark
On Mon, Apr 13, 2009 at 9:15 PM, Tom Lane wrote: > Andrew Dunstan writes: >> This isn't about the number of bytes, but about whether or not we should >> count characters encoded as two or more combined code points as a single >> char or not. > > It's really about whether we should support non-can

Re: [HACKERS] Unicode support

2009-04-13 Thread Tom Lane
Andrew Dunstan writes: > This isn't about the number of bytes, but about whether or not we should > count characters encoded as two or more combined code points as a single > char or not. It's really about whether we should support non-canonical encodings. AFAIK that's a hack to cope with imple

Re: [HACKERS] Unicode support

2009-04-13 Thread Andrew Dunstan
Alvaro Herrera wrote: - - wrote: 1) Functions like char_length() or length() do NOT return the number of characters (the manual says they do), instead they return the number of code points. I think you have client_encoding misconfigured. alvherre=# select length('á'::text); length

Re: [HACKERS] Unicode support

2009-04-13 Thread Kevin Grittner
Alvaro Herrera wrote: >> 1) Functions like char_length() or length() do NOT return the number >> of characters (the manual says they do), instead they return the >> number of code points. > > I think you have client_encoding misconfigured. > > alvherre=# select length('á'::text); > length > -

Re: [HACKERS] Unicode support

2009-04-13 Thread Alvaro Herrera
- - wrote: > 1) Functions like char_length() or length() do NOT return the number > of characters (the manual says they do), instead they return the > number of code points. I think you have client_encoding misconfigured. alvherre=# select length('á'::text); length 1 (1 fila) a

[HACKERS] Unicode support

2009-04-13 Thread - -
Hi. While PostgreSQL is a great database, it lacks some fundamental Unicode support. I want to present some points that have--to my knowledge--not been addressed so far. In the following text, it is assumed that the database and client encoding is UTF-8. 1) Functions like char_length() or length

[HACKERS] Unicode support in postgresql code

2009-01-06 Thread Kalyankumar Ramaseshan
Hi, Could anyone shed some light on how Unicode support is enabled in the PostgreSQL code? I know that there is a step during installation to select the default locale of the PostgreSQL system, and that during the creation of a database there is an option to select the languag