> Tatsuo Ishii wrote:
>
> > BTW, every encoding has its own charset. However, the relationship
> > between encoding and charset is not as simple as in Unicode. For
> > example, encoding EUC_JP corresponds to multiple charsets, namely
> > ASCII, JIS X 0201, JIS X 0208 and JIS X 0212. So a function which
> > returns a "code point" is not quite useful since it lacks the charset
> > info. I think we need to continue the design discussion, probably
> > targeting 8.4, not 8.3.
>
> Is Unicode complete as far as Japanese chars go?  I mean, is there a
> character in EUC_JP that is not representable in Unicode?

I don't think Unicode is "complete" in this case. The problems are:
EUC_JP allows user-defined characters, which are not mapped to
Unicode. Also, some characters in EUC_JP correspond to multiple
Unicode code points.

> Because if Unicode is complete, ISTM it makes perfect sense to have a
> unicode_char() (or whatever we end up calling it) that takes a Unicode
> code point and returns a character in whatever JIS set you want
> (specified by setting client_encoding to that).  Because then you solved
> the problem nicely.

I'm not sure what kind of use case for unicode_char() you are thinking
about. Anyway, if you want a "code point" from a character, we could
easily add such functions for all the backend encodings we currently
support. Probably it would look like:

    to_code_point(str TEXT) returns TEXT

Example outputs are:

    ASCII - 41
    ISO 10646 - U+0041
    ISO 10646 - U+29E3D
    ISO 8859-1 - a5
    JIS X 0208 - 4141

It's a little bit too late for 8.2 though.

> One thing that I find confusing in your text above is whether EUC_JP is
> an encoding or a charset?  I would think that the various JIS X are
> encodings, and EUC_JP is the charset; or is it the other way around?

No, EUC_JP is an encoding. JIS X are the charsets.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
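
To illustrate why a bare "code point" is ambiguous for EUC_JP, here is a minimal sketch in Python (not backend code) that splits one EUC_JP-encoded character into a charset name and a code point, in the same "charset - code" style as the example outputs above. The function name euc_jp_code_point is hypothetical and is not a proposed PostgreSQL API; the byte ranges are the standard EUC_JP layout (single byte for ASCII, 0x8E prefix for JIS X 0201 kana, 0x8F prefix for JIS X 0212, two high bytes for JIS X 0208). User-defined characters are not treated specially here.

```python
from typing import Tuple


def _is_high(b: int) -> bool:
    """True if b is in the 0xA1..0xFE range used for EUC_JP multibyte data."""
    return 0xA1 <= b <= 0xFE


def euc_jp_code_point(raw: bytes) -> Tuple[str, str]:
    """Return (charset name, hex code point) for ONE EUC_JP-encoded character.

    Hypothetical helper for illustration only; it does not validate that
    the code point is actually assigned in the charset.
    """
    # Single byte below 0x80: plain ASCII.
    if len(raw) == 1 and raw[0] < 0x80:
        return ("ASCII", f"{raw[0]:02x}")
    # 0x8E prefix (single shift 2): JIS X 0201 half-width katakana.
    if len(raw) == 2 and raw[0] == 0x8E and 0xA1 <= raw[1] <= 0xDF:
        return ("JIS X 0201", f"{raw[1]:02x}")
    # 0x8F prefix (single shift 3): JIS X 0212, two bytes follow.
    if len(raw) == 3 and raw[0] == 0x8F and all(_is_high(b) for b in raw[1:]):
        return ("JIS X 0212", "".join(f"{b & 0x7F:02x}" for b in raw[1:]))
    # Two high bytes: JIS X 0208; clearing the top bit gives the JIS code.
    if len(raw) == 2 and all(_is_high(b) for b in raw):
        return ("JIS X 0208", "".join(f"{b & 0x7F:02x}" for b in raw))
    raise ValueError("not a single valid EUC_JP character: %r" % (raw,))


if __name__ == "__main__":
    for sample in (b"A", b"\xc1\xc1", b"\x8e\xb1", b"\x8f\xb0\xa1"):
        charset, code = euc_jp_code_point(sample)
        print(f"{charset} - {code}")
```

The same hex value can appear under more than one charset, which is why the charset name has to travel with the code point.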