Arjen Nienhuis <a.g.nienh...@gmail.com> writes:
> GB18030 is a special case, because it's a full mapping of all unicode
> characters, and most of it is algorithmically defined.
True.

> This makes UtfToLocal a bad choice to implement it.

I disagree with that conclusion.  There are still 30000+ characters that
need to be translated via lookup table, so we still need either UtfToLocal
or a clone of it; and as I said previously, I'm not on board with cloning
it.

> I think the best solution is to get rid of UtfToLocal for GB18030. Use
> a specialized algorithm:
> - For characters > U+FFFF use the algorithm from my patch
> - For characters <= U+FFFF use special mapping tables to map from/to
> UTF32. Those tables would be smaller, and the code would be faster (I
> assume).

I looked at what Wikipedia claims is the authoritative conversion table:

http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/gb-18030-2000.xml

According to that, about half of the characters below U+FFFF can be
processed via linear conversions, so I think we ought to save table space
by doing that.

However, the remaining stuff that has to be processed by lookup still
contains a pretty substantial number of characters that map to 4-byte
GB18030 characters, so I don't think we can get any table-size savings by
adopting a bespoke table format.  We might as well use UtfToLocal.

(Worth noting in this connection is that we haven't seen fit to sweat
about UtfToLocal's use of 4-byte table entries for other encodings, even
though most of the others are not concerned with characters outside the
BMP.)

			regards, tom lane
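
For anyone following along, here is a minimal standalone sketch (untested,
not taken from the patch or from our conversion code, function names made
up for illustration) of what "algorithmically defined" amounts to for the
non-BMP range.  The constants fall out of the GB18030 four-byte format
(first and third bytes 0x81..0xFE, second and fourth bytes 0x30..0x39)
plus the fact that U+10000 encodes as 0x90 0x30 0x81 0x30:

#include <stdint.h>
#include <assert.h>

/*
 * Treat the four bytes of a GB18030 four-byte sequence as digits of a
 * mixed-radix number; code points above the BMP are then a fixed offset
 * into that number space.
 */

/* Map a code point in U+10000..U+10FFFF to its 4-byte GB18030 form. */
static void
cp_to_gb18030_4byte(uint32_t cp, uint8_t out[4])
{
	uint32_t	idx;

	assert(cp >= 0x10000 && cp <= 0x10FFFF);

	/* U+10000 corresponds to linear index 189000 (0x90 0x30 0x81 0x30) */
	idx = cp - 0x10000 + 189000;

	out[0] = 0x81 + idx / 12600;		/* 12600 = 10 * 126 * 10 */
	out[1] = 0x30 + idx / 1260 % 10;
	out[2] = 0x81 + idx / 10 % 126;
	out[3] = 0x30 + idx % 10;
}

/* Map a 4-byte GB18030 sequence in that range back to a code point. */
static uint32_t
gb18030_4byte_to_cp(const uint8_t b[4])
{
	uint32_t	idx;

	idx = (uint32_t) (b[0] - 0x81) * 12600 +
		(uint32_t) (b[1] - 0x30) * 1260 +
		(uint32_t) (b[2] - 0x81) * 10 +
		(uint32_t) (b[3] - 0x30);

	assert(idx >= 189000 && idx <= 189000 + (0x10FFFF - 0x10000));
	return idx - 189000 + 0x10000;
}

As a sanity check, U+10FFFF comes out as 0xE3 0x32 0x9A 0x35, which matches
the published GB18030 encoding of that code point.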
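
And for concreteness on the table-entry point: what UtfToLocal consumes is
essentially a sorted array of (utf, code) pairs searched by binary search
on the utf field.  The sketch below is simplified and does not use the
exact declarations from pg_wchar.h, but it shows why a row that maps to a
4-byte GB18030 code costs no more than a row mapping to a 2-byte code in
some other encoding:

#include <stdint.h>
#include <stddef.h>

/*
 * Simplified stand-in for the mapping-table entry; both sides are already
 * uint32, so 4-byte local codes don't enlarge the entries at all.
 */
typedef struct
{
	uint32_t	utf;	/* UTF-8 character, packed into a uint32 */
	uint32_t	code;	/* local (GB18030) character, packed the same way */
} utf_local_pair;

/* Binary search over a table sorted by the utf field. */
static uint32_t
lookup_local_code(const utf_local_pair *map, size_t nentries, uint32_t utf)
{
	size_t		lo = 0,
				hi = nentries;

	while (lo < hi)
	{
		size_t		mid = lo + (hi - lo) / 2;

		if (map[mid].utf == utf)
			return map[mid].code;
		if (map[mid].utf < utf)
			lo = mid + 1;
		else
			hi = mid;
	}
	return 0;			/* caller treats 0 as "no mapping" */
}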