Re: [HACKERS] invalidly encoded strings

Andrew Dunstan Mon, 10 Sep 2007 09:11:06 -0700


Tatsuo Ishii wrote:


I don't understand the whole discussion.

Why do you think that employing the Unicode code point as the chr()
argument could avoid endianness issues? Are you going to represent
a Unicode code point as UCS-4? Then you have to specify the endianness
anyway.  (see the UCS-4 standard for more details)

The code point is simply a number. The result of chr() will be a text value one char (not one byte) wide, in the relevant database encoding.

U+nnnn maps to the same Unicode char and hence the same UTF8 encoding pattern regardless of endianness. e.g. U+00a9 is the copyright symbol on all machines. So to get this char in a UTF8 database you could call "select chr(169)" and get back the byte pattern \xC2A9.
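
To make the point concrete, here is a minimal sketch in Python (not the proposed SQL function). The helper name chr_utf8 is hypothetical and only stands in for what chr() would do in a UTF8 database. It shows that the code point 169 yields the same UTF8 bytes everywhere, while byte order only matters once you pick a multi-byte fixed-width form such as UCS-4.

```python
def chr_utf8(code_point: int) -> bytes:
    """Hypothetical stand-in for chr() in a UTF8 database:
    turn a Unicode code point (a plain number) into UTF8 bytes."""
    return chr(code_point).encode("utf-8")


# U+00A9 (decimal 169) is the copyright symbol; its UTF8 form is C2 A9.
copyright_bytes = chr_utf8(169)
print(copyright_bytes.hex())  # c2a9

# Endianness only appears if the code point is serialized as UCS-4/UTF-32.
ucs4_big_endian = chr(169).encode("utf-32-be")     # 00 00 00 a9
ucs4_little_endian = chr(169).encode("utf-32-le")  # a9 00 00 00
print(ucs4_big_endian.hex(), ucs4_little_endian.hex())
```

The argument stays a number throughout; no UCS-4 representation is ever needed to produce the UTF8 result.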

Or are you going to represent a Unicode code point as a character
string such as 'U+0259'? Then representing any encoding as a string
could avoid endianness issues anyway, and I don't see how a Unicode
code point is any better than the others.


The argument will be a number, as now.

Also I'd like to point out that all encodings have their own code
point systems as far as I know. For example, EUC-JP has its
corresponding code point systems: ASCII, JIS X 0208 and JIS X 0212.
So I don't see why we can't use a "code point" as chr()'s argument for
other encodings (of course we would need an optional parameter
specifying which character set is meant).

Where can I find the tables that map code points (as opposed to encodings) to characters for these others?
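
For illustration only, the suggestion of a chr() with an optional character-set parameter might look roughly like the Python sketch below. The function name chr_in_charset and the charset labels are hypothetical, not an existing or proposed PostgreSQL API. It assumes a JIS X 0208 code point is given as the usual two-byte value (each byte in 0x21..0x7E), and it relies on the EUC-JP rule that each of those bytes is stored with the high bit set.

```python
def chr_in_charset(code_point: int, charset: str = "unicode") -> str:
    """Hypothetical sketch: map a code point in a named character set
    to the corresponding character."""
    if charset == "unicode":
        return chr(code_point)
    if charset == "jisx0208":
        high, low = code_point >> 8, code_point & 0xFF
        # Both bytes of a JIS X 0208 code point lie in 0x21..0x7E.
        if not (0x21 <= high <= 0x7E and 0x21 <= low <= 0x7E):
            raise ValueError("not a JIS X 0208 code point")
        # EUC-JP stores JIS X 0208 with the high bit set on each byte.
        return bytes([high | 0x80, low | 0x80]).decode("euc_jp")
    raise ValueError("unsupported character set: " + charset)


# JIS X 0208 0x2422 is HIRAGANA LETTER A, the same character as U+3042.
hiragana_a = chr_in_charset(0x2422, "jisx0208")
print(hiragana_a == chr_in_charset(0x3042))
```

Here the mapping table lives inside the euc_jp codec; a real implementation would still need such tables for every character set it accepts.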


cheers

andrew


