Arjen Nienhuis <a.g.nienh...@gmail.com> writes:
> GB18030 is a special case, because it's a full mapping of all unicode
> characters, and most of it is algorithmically defined.
True.

> This makes UtfToLocal a bad choice to implement it.

I disagree with that conclusion.  There are still 30000+ characters that
need to be translated via lookup table, so we still need either UtfToLocal
or a clone of it; and as I said previously, I'm not on board with cloning
it.

> I think the best solution is to get rid of UtfToLocal for GB18030. Use
> a specialized algorithm:
> - For characters > U+FFFF use the algorithm from my patch
> - For characters <= U+FFFF use special mapping tables to map from/to
> UTF32. Those tables would be smaller, and the code would be faster (I
> assume).

I looked at what Wikipedia claims is the authoritative conversion table:

http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/gb-18030-2000.xml

According to that, about half of the characters below U+FFFF can be
processed via linear conversions, so I think we ought to save table space
by doing that.

However, the remaining stuff that has to be processed by lookup still
contains a pretty substantial number of characters that map to 4-byte
GB18030 characters, so I don't think we can get any table-size savings by
adopting a bespoke table format.  We might as well use UtfToLocal.

(Worth noting in this connection is that we haven't seen fit to sweat
about UtfToLocal's use of 4-byte table entries for other encodings, even
though most of the others are not concerned with characters outside the
BMP.)

			regards, tom lane
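
For anyone following along, here is a minimal standalone sketch (untested,
not taken from the patch or from our conversion code, function names made
up for illustration) of what "algorithmically defined" amounts to for the
non-BMP range.  The constants fall out of the GB18030 four-byte format
(first and third bytes 0x81..0xFE, second and fourth bytes 0x30..0x39)
plus the fact that U+10000 encodes as 0x90 0x30 0x81 0x30:

#include <stdint.h>
#include <assert.h>

/*
 * Treat the four bytes of a GB18030 four-byte sequence as digits of a
 * mixed-radix number; code points above the BMP are then a fixed offset
 * into that number space.
 */

/* Map a code point in U+10000..U+10FFFF to its 4-byte GB18030 form. */
static void
cp_to_gb18030_4byte(uint32_t cp, uint8_t out[4])
{
	uint32_t	idx;

	assert(cp >= 0x10000 && cp <= 0x10FFFF);

	/* U+10000 corresponds to linear index 189000 (0x90 0x30 0x81 0x30) */
	idx = cp - 0x10000 + 189000;

	out[0] = 0x81 + idx / 12600;		/* 12600 = 10 * 126 * 10 */
	out[1] = 0x30 + idx / 1260 % 10;
	out[2] = 0x81 + idx / 10 % 126;
	out[3] = 0x30 + idx % 10;
}

/* Map a 4-byte GB18030 sequence in that range back to a code point. */
static uint32_t
gb18030_4byte_to_cp(const uint8_t b[4])
{
	uint32_t	idx;

	idx = (uint32_t) (b[0] - 0x81) * 12600 +
		(uint32_t) (b[1] - 0x30) * 1260 +
		(uint32_t) (b[2] - 0x81) * 10 +
		(uint32_t) (b[3] - 0x30);

	assert(idx >= 189000 && idx <= 189000 + (0x10FFFF - 0x10000));
	return idx - 189000 + 0x10000;
}

As a sanity check, U+10FFFF comes out as 0xE3 0x32 0x9A 0x35, which matches
the published GB18030 encoding of that code point.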
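
And for concreteness on the table-entry point: what UtfToLocal consumes is
essentially a sorted array of (utf, code) pairs searched by binary search
on the utf field.  The sketch below is simplified and does not use the
exact declarations from pg_wchar.h, but it shows why a row that maps to a
4-byte GB18030 code costs no more than a row mapping to a 2-byte code in
some other encoding:

#include <stdint.h>
#include <stddef.h>

/*
 * Simplified stand-in for the mapping-table entry; both sides are already
 * uint32, so 4-byte local codes don't enlarge the entries at all.
 */
typedef struct
{
	uint32_t	utf;	/* UTF-8 character, packed into a uint32 */
	uint32_t	code;	/* local (GB18030) character, packed the same way */
} utf_local_pair;

/* Binary search over a table sorted by the utf field. */
static uint32_t
lookup_local_code(const utf_local_pair *map, size_t nentries, uint32_t utf)
{
	size_t		lo = 0,
				hi = nentries;

	while (lo < hi)
	{
		size_t		mid = lo + (hi - lo) / 2;

		if (map[mid].utf == utf)
			return map[mid].code;
		if (map[mid].utf < utf)
			lo = mid + 1;
		else
			hi = mid;
	}
	return 0;			/* caller treats 0 as "no mapping" */
}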