On Sunday 19 April 2009 18:54:45 Tom Lane wrote:
> Peter Eisentraut writes:
> > On Monday 13 April 2009 20:18:31 - - wrote:
> >> 1) Functions like char_length() or length() do NOT return the number
> >> of characters (the manual says they do), instead they return the
> >> number of code points.
>
Peter Eisentraut writes:
> On Monday 13 April 2009 20:18:31 - - wrote:
>> 1) Functions like char_length() or length() do NOT return the number
>> of characters (the manual says they do), instead they return the
>> number of code points.
> I have added a Todo item about possibly fixing this.
I th
On Monday 13 April 2009 20:18:31 - - wrote:
> 1) Functions like char_length() or length() do NOT return the number
> of characters (the manual says they do), instead they return the
> number of code points.
I have added a Todo item about possibly fixing this.
On Tue, Apr 14, 2009 at 11:32:57AM -0700, David E. Wheeler wrote:
> I've no idea what it would require, but the mapping table must be
> pretty substantial. Still, I'd love to have this functionality in the
> database.
The Unicode tables in ICU outweigh the size of the code by a factor 5
or so.
> "Peter" == Peter Eisentraut writes:
> On Tuesday 14 April 2009 07:07:27 Andrew Gierth wrote:
>> FWIW, the SQL spec puts the onus of normalization squarely on the
>> application; the database is allowed to assume that Unicode
>> strings are already normalized, is allowed to behave in
>>
On Apr 14, 2009, at 11:10 AM, Tom Lane wrote:
Andrew Dunstan writes:
I think there's a good case for some functions implementing the various
Unicode normalization functions, though.
I have no objection to that so long as the code footprint is in line
with the utility gain (i.e. not all tha
>> I don't believe that the standard forbids the use of combining chars at all.
>> RFC 3629 says:
>>
>> ... This issue is amenable to solutions based on Unicode Normalization
>> Forms, see [UAX15].
> This is the relevant part. Tom was claiming that the UTF8 encoding required
> normalizing the st
On Tuesday 14 April 2009 19:26:41 Tom Lane wrote:
> Another question is "what is the purpose of a database"? To me it would
> be quite the wrong thing for the DB to not store what is presented, as
> long as it's considered legal. Normalization of legal variant forms
> seems pretty questionable.
On Monday 13 April 2009 20:18:31 - - wrote:
> 2) PG has no support for the Unicode collation algorithm. Collation is
> offloaded to the OS, which makes this quite inflexible.
This argument is unclear. Do you want the Unicode collation algorithm or do
you want flexibility? Some OS do implement t
On Tuesday 14 April 2009 18:49:45 Greg Stark wrote:
> What's really at issue is "what is a string?". That is, is it a sequence
> of characters or a sequence of code points.
I think a sequence of codepoints would be about as silly a definition as the
antiquated notion of a string as a sequence of byt
Greg Stark wrote:
> Peter Eisentraut wrote:
>> SELECT U&'\00E9', char_length(U&'\00E9');
>>  ?column? | char_length
>> ----------+-------------
>>  é        |           1
>> (1 row)
>>
>> SELECT U&'\0065\0301', char_length(U&'\0065\0301');
>>  ?column? | char_length
>> ----------+-------------
>
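The two queries quoted above differ only in how é is encoded. As a rough illustration outside the database (a Python sketch using the standard `unicodedata` module, not PostgreSQL code; the variable names are made up for this example), the same effect appears wherever length is counted in code points:

```python
import unicodedata

composed = "\u00e9"          # é as a single precomposed code point
decomposed = "\u0065\u0301"  # e followed by COMBINING ACUTE ACCENT

# Code-point counts differ although both render as the same character.
print(len(composed), len(decomposed))

# The raw strings compare unequal; they only match once both are in the
# same normalization form (NFC composes, NFD decomposes).
print(composed == decomposed)
print(unicodedata.normalize("NFC", decomposed) == composed)
print(unicodedata.normalize("NFD", composed) == decomposed)
```

This only shows canonical equivalence at the string level; it says nothing about how a server-side char_length() should behave.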
Andrew Dunstan writes:
> I think there's a good case for some functions implementing the various
> Unicode normalization functions, though.
I have no objection to that so long as the code footprint is in line
with the utility gain (i.e. not all that much). If we have to bring in
ICU or somethin
Kevin Grittner wrote:
I'm curious -- can every multi-code-point character be normalized to a
single-code-point character?
I don't believe so. Those combinations used in the most common
orthographic languages have their own code points, but I understand you
can use the combining chars
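A small check of the point about combining characters, as a sketch in Python rather than anything PostgreSQL provides (`nfc_length` is a hypothetical helper name; `unicodedata` is Python's standard library). NFC composes a sequence only when a precomposed code point exists, so some sequences stay at two code points:

```python
import unicodedata

def nfc_length(text: str) -> int:
    """Number of code points left after NFC normalization."""
    return len(unicodedata.normalize("NFC", text))

e_acute = "e\u0301"  # has a precomposed form, U+00E9
q_acute = "q\u0301"  # Unicode defines no precomposed q with acute

print(nfc_length(e_acute))  # 1
print(nfc_length(q_acute))  # 2
```

So normalization alone does not make "number of code points" equal to "number of characters" in every case.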
David E. Wheeler wrote:
On Apr 14, 2009, at 9:26 AM, Tom Lane wrote:
Another question is "what is the purpose of a database"? To me it would
be quite the wrong thing for the DB to not store what is presented, as
long as it's considered legal. Normalization of legal variant forms
seems prett
On Apr 14, 2009, at 9:26 AM, Tom Lane wrote:
Another question is "what is the purpose of a database"? To me it would
be quite the wrong thing for the DB to not store what is presented, as
long as it's considered legal. Normalization of legal variant forms
seems pretty questionable. So I'm w
Greg Stark writes:
> What's really at issue is "what is a string?". That is, is it a sequence
> of characters or a sequence of code points. If it's the former then we
> would also have to prohibit certain strings such as U&'\0301'
> entirely. And we have to make substr() pick out the right number of
On Tue, Apr 14, 2009 at 1:32 PM, Peter Eisentraut wrote:
> On Monday 13 April 2009 22:39:58 Andrew Dunstan wrote:
>> Umm, but isn't that because your encoding is using one code point?
>>
>> See the OP's explanation w.r.t. canonical equivalence.
>>
>> This isn't about the number of bytes, but about
On Tuesday 14 April 2009 07:07:27 Andrew Gierth wrote:
> FWIW, the SQL spec puts the onus of normalization squarely on the
> application; the database is allowed to assume that Unicode strings
> are already normalized, is allowed to behave in implementation-defined
> ways when presented with string
On Monday 13 April 2009 22:39:58 Andrew Dunstan wrote:
> Umm, but isn't that because your encoding is using one code point?
>
> See the OP's explanation w.r.t. canonical equivalence.
>
> This isn't about the number of bytes, but about whether or not we should
> count characters encoded as two or mo
> "Gregory" == Gregory Stark writes:
>>> I don't believe that the standard forbids the use of combining
>>> chars at all. RFC 3629 says:
>>>
>>> ... This issue is amenable to solutions based on Unicode
>>> Normalization Forms, see [UAX15].
Gregory> This is the relevant part. Tom was
- - writes:
>>> The original post seemed to be a contrived attempt to say "you should
>>> use ICU".
>>
>> Indeed. The OP should go read all the previous arguments about ICU
>> in our archives.
>
> Not at all. I just was making a suggestion. You may use any other
> library or implement it yourse
Tom Lane wrote:
> Greg Stark writes:
>> Is it really true that canonical encodings never contain any composed
>> characters in them? I thought there were some glyphs which could only
>> be represented by composed characters.
>
> AFAIK that's not true. However, in my original comment I was think
Tom Lane wrote:
Andrew Dunstan writes:
This isn't about the number of bytes, but about whether or not we should
count characters encoded as two or more combined code points as a single
char or not.
It's really about whether we should support non-canonical encodings.
AFAIK that's a
Greg Stark writes:
> Is it really true that canonical encodings never contain any composed
> characters in them? I thought there were some glyphs which could only
> be represented by composed characters.
AFAIK that's not true. However, in my original comment I was thinking
about UTF16 surrogate
On Mon, Apr 13, 2009 at 9:15 PM, Tom Lane wrote:
> Andrew Dunstan writes:
>> This isn't about the number of bytes, but about whether or not we should
>> count characters encoded as two or more combined code points as a single
>> char or not.
>
> It's really about whether we should support non-can
Andrew Dunstan writes:
> This isn't about the number of bytes, but about whether or not we should
> count characters encoded as two or more combined code points as a single
> char or not.
It's really about whether we should support non-canonical encodings.
AFAIK that's a hack to cope with imple
Alvaro Herrera wrote:
- - wrote:
1) Functions like char_length() or length() do NOT return the number
of characters (the manual says they do), instead they return the
number of code points.
I think you have client_encoding misconfigured.
alvherre=# select length('á'::text);
length
Alvaro Herrera wrote:
>> 1) Functions like char_length() or length() do NOT return the number
>> of characters (the manual says they do), instead they return the
>> number of code points.
>
> I think you have client_encoding misconfigured.
>
> alvherre=# select length('á'::text);
> length
> -
- - wrote:
> 1) Functions like char_length() or length() do NOT return the number
> of characters (the manual says they do), instead they return the
> number of code points.
I think you have client_encoding misconfigured.
alvherre=# select length('á'::text);
 length
--------
      1
(1 fila)
a
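A sketch of what a mismatched client encoding could look like, in Python rather than psql. It assumes the client sends UTF-8 bytes that are then interpreted as Latin-1; that particular pairing is an assumption for illustration, not something stated in the thread.

```python
text = "\u00e1"                         # á as one precomposed code point
utf8_bytes = text.encode("utf-8")       # what a UTF-8 client sends
misread = utf8_bytes.decode("latin-1")  # same bytes read as Latin-1

print(len(text))        # 1 code point
print(len(utf8_bytes))  # 2 bytes
print(len(misread))     # 2 characters, because each byte became one
```

This is a different effect from the combining-character case: here the length is wrong because the bytes are misinterpreted, not because the character was sent in decomposed form.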
Hi.
While PostgreSQL is a great database, it lacks some fundamental
Unicode support. I want to present some points that have--to my
knowledge--not been addressed so far. In the following text, it is
assumed that the database and client encoding is UTF-8.
1) Functions like char_length() or length
Hi,
Could anyone shed some light on how Unicode support is enabled in
the PostgreSQL code? I know that one step during installation selects
the default locale of the PostgreSQL system, and that another place is
database creation, where there is an option to select the
languag