On Tue, May 22, 2012 at 11:50 AM, Tatsuo Ishii <is...@postgresql.org> wrote:
>
> I think it's possible. The first characters are defined like this:
>
> #define IS_LCPRV1(c) ((unsigned char)(c) == 0x9a || (unsigned char)(c) == 0x9b)
> #define IS_LCPRV2(c) ((unsigned char)(c) == 0x9c || (unsigned char)(c) == 0x9d)
>
> It seems IS_LCPRV1 is not used in any of PostgreSQL's supported
> encodings at this point, which means there's zero chance that existing
> databases include LCPRV1. So you could safely ignore it.
>
> For IS_LCPRV2, it is only used for Chinese encodings (EUC_TW and BIG5)
> in backend/utils/mb/conversion_procs/euc_tw_and_big5/euc_tw_and_big5.c
> and it is fixed to 0x9d. So you can always restore the value to 0x9d.
>
> > Also in this part of code we're shifting first byte by 16 bits:
> >
> > if (IS_LC1(*from) && len >= 2)
> > {
> >     *to = *from++ << 16;
> >     *to |= *from++;
> >     len -= 2;
> > }
> > else if (IS_LCPRV1(*from) && len >= 3)
> > {
> >     from++;
> >     *to = *from++ << 16;
> >     *to |= *from++;
> >     len -= 3;
> > }
> >
> > Why don't we shift it by 8 bits?
>
> Because we want the first byte of the LC1 case to be placed in the second
> byte of wchar, i.e.
>
> 0th byte: always 0
> 1st byte: leading byte (the first byte of the multibyte)
> 2nd byte: always 0
> 3rd byte: the second byte of the multibyte
>
> Note that we always assume that the 1st byte (called "leading byte":
> LB in short) represents the id of the character set (from 0x81 to
> 0xff) in MULE INTERNAL encoding. For the mapping between LB and
> charsets, see pg_wchar.h.
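
The byte layout described above can be checked with a small standalone C sketch. This is not the code from wchar.c: wchar_sketch, pack_lc1_sketch and wchar_byte are hypothetical names, pg_wchar is assumed to be 32 bits wide, and any first byte that is not an LCPRV1 prefix is simply treated as an LC1 leading byte.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for pg_wchar; assumed here to be 32 bits wide. */
typedef uint32_t wchar_sketch;

/* Macro as quoted above. */
#define IS_LCPRV1(c) ((unsigned char)(c) == 0x9a || (unsigned char)(c) == 0x9b)

/*
 * Hypothetical helper: pack one 2-byte LC1 sequence or one 3-byte LCPRV1
 * sequence the same way the quoted fragment does.
 */
static wchar_sketch
pack_lc1_sketch(const unsigned char *from, int len)
{
    wchar_sketch to;

    if (IS_LCPRV1(*from))
    {
        assert(len >= 3);
        from++;                         /* skip the 0x9a/0x9b prefix */
    }
    else
        assert(len >= 2);               /* assumed to be an LC1 leading byte */

    to = (wchar_sketch) *from++ << 16;  /* leading byte goes to bits 16..23 */
    to |= *from;                        /* second byte goes to bits 0..7 */
    return to;
}

/* Return byte n of the value, counting from the most significant (0th). */
static unsigned char
wchar_byte(wchar_sketch w, int n)
{
    return (unsigned char) (w >> (8 * (3 - n)));
}
```

For the arbitrary example bytes 0x81 0xe9, pack_lc1_sketch() gives 0x008100e9: the 0th and 2nd bytes are zero, the 1st byte is the leading byte and the 3rd byte is the second byte of the multibyte sequence.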

Thanks for your comments. They clarify a lot.
But I still don't understand how we can distinguish IS_LCPRV2 from IS_LC2.
Isn't it possible for them to produce the same pg_wchar?

------
With best regards,
Alexander Korotkov.
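
To make the question concrete, here is a minimal sketch of the two cases. It assumes that the LC2 and LCPRV2 branches mirror the quoted LC1/LCPRV1 ones (prefix byte dropped, leading byte shifted by 16, then the two remaining bytes); those branches are not quoted in this thread, so that is an assumption. wchar_sketch, pack_lc2_sketch and wchar_charset_id are hypothetical names, and any first byte that is not an LCPRV2 prefix is treated as an LC2 leading byte.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for pg_wchar; assumed here to be 32 bits wide. */
typedef uint32_t wchar_sketch;

/* Macro as quoted earlier in the thread. */
#define IS_LCPRV2(c) ((unsigned char)(c) == 0x9c || (unsigned char)(c) == 0x9d)

/*
 * Hypothetical helper: pack a 3-byte LC2 sequence or a 4-byte LCPRV2
 * sequence, assuming the same pattern as the quoted LC1/LCPRV1 branches.
 */
static wchar_sketch
pack_lc2_sketch(const unsigned char *from, int len)
{
    wchar_sketch to;

    if (IS_LCPRV2(*from))
    {
        assert(len >= 4);
        from++;                         /* the 0x9c/0x9d prefix is dropped */
    }
    else
        assert(len >= 3);               /* assumed to be an LC2 leading byte */

    to = (wchar_sketch) *from++ << 16;  /* charset id / leading byte */
    to |= (wchar_sketch) *from++ << 8;  /* first data byte */
    to |= *from;                        /* second data byte */
    return to;
}

/* Charset id byte (the "1st byte") recovered from a packed value. */
static unsigned char
wchar_charset_id(wchar_sketch w)
{
    return (unsigned char) (w >> 16);
}
```

Under these assumptions the prefix is lost, so the two forms give the same value only when the byte following the LCPRV2 prefix equals an LC2 leading byte. Whether that can actually happen depends on the charset id ranges assigned in pg_wchar.h.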