On Sat, Jul 12, 2008 at 10:02:24AM +0200, Zdenek Kotala wrote:
> Background:
> We specify encoding in initdb phase. ANSI specify repertoire, charset,
> encoding and collation. If I understand it correctly, then charset is
> subset of repertoire and specify list of allowed characters for
> language->collation. Encoding is mapping of character set to binary format.
> For example for Czech alphabet(charset) we have 6 different encoding for
> 8bit ASCII, but on other side for UTF8 there is specified multi charsets.
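
To make the quoted distinction between charset and encoding concrete, here is a minimal Python sketch. It is an illustration only, not anything from PostgreSQL: the letter "š" and the three codecs are arbitrary choices, picked because the same Czech letter maps to different bytes in each.

```python
# Sketch: one member of a character set, several encodings of it.
# The letter and the codec list are illustrative assumptions.
ch = "š"  # U+0161, a letter of the Czech alphabet

codecs = ("iso8859_2", "cp1250", "utf-8")
encoded = {name: ch.encode(name) for name in codecs}

for name, raw in encoded.items():
    # Same abstract character, different binary representation.
    print(f"{name:10} {raw.hex()}")

# Every encoding decodes back to the same character.
decoded = {name: raw.decode(name) for name, raw in encoded.items()}
```

The character set answers "is this letter allowed?", while the encoding answers "which bytes represent it?", which is why one alphabet can have several encodings.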

Oh, so you're thinking of a charset as a sort of check constraint. If
your locale is Turkish and you have a column marked charset ASCII, then
storing lower('HI') results in an error, since the Turkish lowercase of
'I' is the dotless 'ı', which is not in ASCII.

A collation must be defined over all possible characters; it can't
depend on the character set. That doesn't mean sorting in en_US must do
something meaningful with Japanese characters, but it does mean it can't
throw an error (the usual procedure is to sort on Unicode code point).

> I think if we support UTF8 encoding, than it make sense to create own
> charsets, because system locales could have defined collation for that. We
> need conversion only in case when client encoding is not compatible with
> charset and conversion is not defined.

The problem is that locales in POSIX are defined on an encoding, not a
charset. Locale en_US.UTF-8 doesn't actually sort any differently than
en_US.latin1; it's just that Japanese characters are not representable
in the latter. locale-gen can create a locale for any pair of (locale
code, encoding); whether the result is meaningful is another question.

Have a nice day,
-- 
Martijn van Oosterhout   <[EMAIL PROTECTED]>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.
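
The points above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not PostgreSQL behaviour: fits_charset and turkish_lower are hypothetical helpers, Python codecs stand in for charsets, and Turkish casing is emulated by hand because Python's str.lower() ignores the locale.

```python
def fits_charset(text: str, codec: str) -> bool:
    """Hypothetical check constraint: can text be stored in this charset?"""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def turkish_lower(text: str) -> str:
    """Hand-rolled Turkish lowercasing (assumption: only I and İ differ)."""
    return text.replace("I", "ı").replace("İ", "i").lower()


# Locale-independent lowercasing stays inside ASCII ...
plain_ok = fits_charset("HI".lower(), "ascii")
# ... but the Turkish result contains dotless ı and violates the constraint.
turkish_ok = fits_charset(turkish_lower("HI"), "ascii")

# Japanese text is not representable in latin1, but it is in UTF-8.
latin1_ok = fits_charset("日本", "latin-1")
utf8_ok = fits_charset("日本", "utf-8")

# Fallback collation: ordering by code point is defined for every
# character, so it is never meaningful for all scripts but never fails.
words = ["zebra", "日本", "apple", "čaj"]
by_code_point = sorted(words)

print(plain_ok, turkish_ok, latin1_ok, utf8_ok)
print(by_code_point)
```

The charset check can reject a value, whereas the code point ordering always produces some result, which is the asymmetry between charsets and collations described above.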