On Fri, May 15, 2015 at 4:10 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Arjen Nienhuis <a.g.nienh...@gmail.com> writes:
>> GB18030 is a special case, because it's a full mapping of all Unicode
>> characters, and most of it is algorithmically defined.
>
> True.
>
>> This makes UtfToLocal a bad choice to implement it.
>
> I disagree with that conclusion. There are still 30000+ characters
> that need to be translated via lookup table, so we still need either
> UtfToLocal or a clone of it; and as I said previously, I'm not on board
> with cloning it.
>
>> I think the best solution is to get rid of UtfToLocal for GB18030. Use
>> a specialized algorithm:
>> - For characters > U+FFFF use the algorithm from my patch
>> - For characters <= U+FFFF use special mapping tables to map from/to
>> UTF32. Those tables would be smaller, and the code would be faster (I
>> assume).
>
> I looked at what Wikipedia claims is the authoritative conversion table:
>
> http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/gb-18030-2000.xml
>
> According to that, about half of the characters below U+FFFF can be
> processed via linear conversions, so I think we ought to save table
> space by doing that. However, the remaining stuff that has to be
> processed by lookup still contains a pretty substantial number of
> characters that map to 4-byte GB18030 characters, so I don't think
> we can get any table size savings by adopting a bespoke table format.
> We might as well use UtfToLocal. (Worth noting in this connection
> is that we haven't seen fit to sweat about UtfToLocal's use of 4-byte
> table entries for other encodings, even though most of the others
> are not concerned with characters outside the BMP.)
>

It's not about 4 vs. 2 bytes per entry; it's about 8 bytes vs. 4.
UtfToLocal uses a sparse array that stores the code point next to each
value:

map = {{0, x}, {1, y}, {2, z}, ...}

vs. a plain array indexed by code point:

map = {x, y, z, ...}

The sparse form is fine when not every code point is used, but GB18030
is different: almost all code points are used. Using a plain array
saves space and saves a binary search.

Gr. Arjen
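
To make the 8-vs-4 comparison concrete, here is a minimal sketch in C of the two layouts. All names (SparseEntry, sparse_lookup, dense_lookup, RANGE_FIRST) are made up for illustration and are not PostgreSQL's actual structures or functions, and the table values are placeholders, not real GB18030 codes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sparse layout: every entry stores the code point next to its value. */
typedef struct
{
    uint32_t utf;   /* key: source code point */
    uint32_t code;  /* value: code in the target encoding */
} SparseEntry;

/* 8 bytes per mapped character, vs. 4 for a plain array element. */
static_assert(sizeof(SparseEntry) == 2 * sizeof(uint32_t),
              "unexpected padding in SparseEntry");

/* Placeholder data for four consecutive code points (hypothetical). */
#define RANGE_FIRST 0x0100u

static const SparseEntry sparse_map[] = {
    {0x0100, 0xA1A1}, {0x0101, 0xA1A2}, {0x0102, 0xA1A3}, {0x0103, 0xA1A4}
};
#define SPARSE_COUNT (sizeof(sparse_map) / sizeof(sparse_map[0]))

static const uint32_t dense_map[] = {0xA1A1, 0xA1A2, 0xA1A3, 0xA1A4};
#define RANGE_COUNT (sizeof(dense_map) / sizeof(dense_map[0]))

/* Binary search over entries sorted by utf; returns 0 if not found. */
uint32_t
sparse_lookup(const SparseEntry *map, size_t n, uint32_t utf)
{
    size_t lo = 0;
    size_t hi = n;

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (map[mid].utf == utf)
            return map[mid].code;
        if (map[mid].utf < utf)
            lo = mid + 1;
        else
            hi = mid;
    }
    return 0;
}

/* Direct index into a contiguous range; returns 0 if out of range. */
uint32_t
dense_lookup(const uint32_t *map, uint32_t first, size_t n, uint32_t utf)
{
    if (utf < first || utf - first >= n)
        return 0;
    return map[utf - first];
}
```

Both lookups return the same value for a mapped code point, but the plain array is half the size and needs no search. The trade-off is that a plain array must reserve a slot (here 0) for every unused code point in its range, so it only pays off when the range is densely populated.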