Re: Rewrite of IBM doublebyte charsets

Ulf Zibis Sat, 09 May 2009 10:49:29 -0700

On 01.05.2009 08:48, Xueming Shen wrote:

Hi,
While I'm waiting for Alan's code-review result for my rewriting of EUC_TW
   http://cr.openjdk.java.net/~sherman/6831794_6229811/webrev
(much faster, much smaller, near 8% decrease of size of charsets.jar with one
charset update. OK, it's a shame...I meant the old data structure)



EUC_TW statistics:

Plane   range   length  segments  segments-usage-ratio

0    a1a1-fdcb   5868  5d = 93   66 %
_0   a1a1-a744    434   7 = 7    65 %
_1   c2a1-fdcb   5434  3c = 60   95 %

1:8ea2   -f2c4   7650  52 = 82   98 %
2:8ea3   -e7aa   6394  47 = 71   95 %
3:8ea4   -eedc   7286  4e = 78   98 %
4:8ea5   -fcd1   8601  5c = 92   98 %
5:8ea6   -e4fa   6385  44 = 68   99 %
6:8ea7   -ebd5   6532  4b = 75   98 %
7:8eaf   -edb9   8721  4d = 77   92 %

Sum:             55446  262 = 610

memory amount for all segments (not truncated):
610 * 95 = 57950 code points



*** Decoder-Suggestions:

(1) Increase dimension of b2c and decouple plane 0:
     String[] b2c = new String[0x10]
     String b2c_0 = ...

   Benefit[1]: save calculation of plane no. to range 0..7 (but mask by 0xa0)
   Benefit[2]: save range-check for plane (catch malformed plane by NPE)
   sophisticated (additionally save masking of plane no.):
     String[] b2c = new String[0xb0]
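
To illustrate (1), here is a minimal sketch, not the actual EUC_TW code. The class name, the placeholder table contents, and the reading of "mask by 0xa0" as an XOR with 0xa0 are assumptions. A malformed plane byte surfaces as a NullPointerException (unassigned slot) or an ArrayIndexOutOfBoundsException (byte outside a0..af), so no explicit range check is written.

```java
final class PlaneTableSketch {
    // One slot per low nibble of the plane byte; slots 0x2..0x7 and 0xf
    // correspond to the plane bytes a2..a7 and af from the statistics.
    static final String[] b2c = new String[0x10];
    static {
        // Placeholder contents (assumption); real tables hold the mapped chars.
        b2c[0x2] = "AB";
        b2c[0xf] = "YZ";
    }

    // Returns the mapped char, or -1 if the plane byte or index is invalid.
    static int lookup(byte planeByte, int index) {
        try {
            // XOR maps a0..af to 0x0..0xf and every other byte out of range.
            return b2c[(planeByte & 0xff) ^ 0xa0].charAt(index);
        } catch (RuntimeException e) {
            return -1;
        }
    }
}
```

Whether the exception path is cheaper than an explicit check depends on how often malformed input occurs, so this would need measuring.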

(2) Save Strings in 2-dimensional array:
     String[][] b2c = new String[0x10][]
     String[] b2c_0 = new String[0x5d]
     b2c[0x2] = new String[0x52]
     b2c[0x3] = new String[0x47]
     b2c[0x4] = new String[0x4e]
     b2c[0x5] = new String[0x5c]
     b2c[0x6] = new String[0x44]
     b2c[0x7] = new String[0x4b]
     b2c[0xf] = new String[0x4d]
   sophisticated (segments a8..c1 are unused in plane 0):
     String[] b2c_0 = new String[0x07]
     String[] b2c_1 = new String[0x3c]

   Benefit[3]: save calculation of index (multiplying with dbSegSize), but add 1 indirection
   Benefit[4]: save range-check for segment index (catch malformed segment index by NPE)
   Benefit[5]: save range-check for String index (catch malformed String indexes by IndexOutOfBoundsException)

   Benefit[6]: avoid 22 % superfluous memory and disk-footprint

(3) Truncate Strings (catch unmappable String indexes by IndexOutOfBoundsException):

   Benefit[7]: save another 4 % superfluous memory and disk-footprint

Note: All exceptions can be caught at once, as they are all subclasses of RuntimeException.
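
A rough sketch of (2) and (3) combined, under the assumption that the caller wants to tell malformed input from unmappable input. The class name, the two result constants and the table contents are made up for illustration. The segment count 0x52 is taken from the list in (2).

```java
final class SegmentTableSketch {
    static final int MALFORMED = -1;
    static final int UNMAPPABLE = -2;

    // Ragged table: b2c[plane][segment] is a (possibly truncated) String.
    static final String[][] b2c = new String[0x10][];
    static {
        b2c[0x2] = new String[0x52];   // segment count from the list in (2)
        b2c[0x2][0] = "AB";            // placeholder data, truncated to 2 positions
    }

    // No explicit range checks: the JVM's own checks report bad input.
    static int decode(int plane, int segment, int pos) {
        try {
            return b2c[plane][segment].charAt(pos);
        } catch (StringIndexOutOfBoundsException e) {
            return UNMAPPABLE;         // position beyond the truncated String
        } catch (RuntimeException e) {
            return MALFORMED;          // null plane/segment or bad array index
        }
    }
}
```

If the distinction is not needed, the two catch clauses collapse into a single `catch (RuntimeException e)`.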

(4) Save mappings in data file (modified UTF-8-saved chars need 2.97 bytes on average):

   Benefit[8]: save modified UTF-8 decoding while loading class file
   Benefit[9]: avoid another 48 % superfluous disk-footprint
   Note: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6795536

   (I have just created a patch, but I'm waiting for the launch of the OpenJDK-7 project "charset-enhancement".)

   Disadvantage[1]: loading data from jar-file may be slow, but ...

   - host data file outside of jar, as loading by java.nio.channels.FileChannel into a direct buffer should be fast

   - resolve Bugs:
     http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6818736
     http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6818737
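
A minimal sketch of how (4) might look, assuming the data file simply holds the mapping chars as raw big-endian 16-bit values (2 bytes per char instead of the 2.97 quoted above). The file layout, class name and method names are assumptions, and the temp file only stands in for a real data file shipped next to the jar.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MappingFileSketch {
    // Writes each char as 2 raw bytes (big-endian, the ByteBuffer default).
    static void write(Path file, String mappings) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(mappings.length() * 2);
        buf.asCharBuffer().put(mappings);   // the view's position moves, buf's does not
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            while (buf.hasRemaining()) ch.write(buf);
        }
    }

    // Reads the whole file into a direct buffer and views it as chars.
    static char[] load(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocateDirect((int) ch.size());
            while (buf.hasRemaining() && ch.read(buf) >= 0) { /* keep reading */ }
            buf.flip();
            char[] chars = new char[buf.remaining() / 2];
            buf.asCharBuffer().get(chars);
            return chars;
        }
    }

    // Convenience for trying it out: write to a temp file, load it back.
    static String roundTrip(String mappings) {
        try {
            Path file = Files.createTempFile("b2c", ".dat");
            try {
                write(file, mappings);
                return new String(load(file));
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

No modified UTF-8 decoding is involved on load, which is the point of Benefit[8].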

(5) Generate mappings as surrogate pairs:

   High surrogates could be saved as bytes and ANDed by 0xdc00, as they won't exceed 0xdc80
   Benefit[10]: save decoding to surrogate pairs (I guess, this would significantly increase performance)
   Benefit[11]: save b2cIsSupp[] (saves another 4 % memory and disk-footprint)

   Disadvantage[2]: memory and disk-footprint would again increase by 50 %
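
One possible reading of the byte-sized high surrogate in (5), as a sketch only. It assumes that all supplementary targets lie in U+20000..U+2FFFF, in which case the high surrogate falls in 0xd840..0xd87f and its low 8 bits are enough to rebuild it. The constants used here (OR with 0xd800) follow from that assumption and are not taken from the text. The class and method names are hypothetical.

```java
final class SurrogateByteSketch {
    // Keeps only the low 8 bits of the high surrogate.
    static byte packHigh(int codePoint) {
        return (byte) Character.highSurrogate(codePoint);
    }

    // Rebuilds the high surrogate; valid only under the plane-2 assumption.
    static char unpackHigh(byte packed) {
        return (char) (0xd800 | (packed & 0xff));
    }

    // Stored form: 1 byte for the high surrogate, 1 char for the low one.
    static int toCodePoint(byte packedHigh, char low) {
        return Character.toCodePoint(unpackHigh(packedHigh), low);
    }
}
```

For example, U+2A6D6 has the high surrogate 0xd869, which is stored as the byte 0x69.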

(6) Change parameters of decode() method:

   static void decode(byte[] src, char[] dst, int sp, int sl, int dp, int dl, int p) ("beta" approach)

   speeds up buffer access + avoids c1, c1 buffering
   Benefit[12]: increase performance
   Disadvantage[3]: need different methods for direct buffers

(7) Provide 4-way fork from de/encodeLoop():

   See: https://java-nio-charset-enhanced.dev.java.net/source/browse/java-nio-charset-enhanced/trunk/src/sun/nio/cs/SingleByteEncoder_new.java?rev=&view=markup

   Benefit[13]: increase performance, if there is only 1 direct buffer
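
The 4-way fork in (7) could be sketched as follows. This is not the code from SingleByteEncoder_new.java. The class name, the returned path labels and the trivial ISO-8859-1 style mapping are placeholders chosen so the dispatch itself stays visible.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;

final class FourWayForkSketch {
    // Picks one of four loops depending on which side has a backing array.
    static String decodeLoop(ByteBuffer src, CharBuffer dst) {
        if (src.hasArray()) {
            return dst.hasArray() ? arrayToArray(src, dst) : arrayToBuffer(src, dst);
        }
        return dst.hasArray() ? bufferToArray(src, dst) : bufferToBuffer(src, dst);
    }

    static String arrayToArray(ByteBuffer src, CharBuffer dst) {
        byte[] sa = src.array();
        char[] da = dst.array();
        int sp = src.arrayOffset() + src.position();
        int dp = dst.arrayOffset() + dst.position();
        int n = Math.min(src.remaining(), dst.remaining());
        for (int i = 0; i < n; i++) da[dp + i] = (char) (sa[sp + i] & 0xff);
        src.position(src.position() + n);
        dst.position(dst.position() + n);
        return "array->array";
    }

    static String arrayToBuffer(ByteBuffer src, CharBuffer dst) {
        byte[] sa = src.array();
        int sp = src.arrayOffset() + src.position();
        int n = Math.min(src.remaining(), dst.remaining());
        for (int i = 0; i < n; i++) dst.put((char) (sa[sp + i] & 0xff));
        src.position(src.position() + n);
        return "array->buffer";
    }

    static String bufferToArray(ByteBuffer src, CharBuffer dst) {
        char[] da = dst.array();
        int dp = dst.arrayOffset() + dst.position();
        int n = Math.min(src.remaining(), dst.remaining());
        for (int i = 0; i < n; i++) da[dp + i] = (char) (src.get() & 0xff);
        dst.position(dst.position() + n);
        return "buffer->array";
    }

    static String bufferToBuffer(ByteBuffer src, CharBuffer dst) {
        while (src.hasRemaining() && dst.hasRemaining()) {
            dst.put((char) (src.get() & 0xff));
        }
        return "buffer->buffer";
    }
}
```

With only two paths, a single direct buffer forces both sides through the slower buffer accessors. With four, the array side keeps its fast path.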

(8) Quit coders xBufferLoop by exception on xflow:
   http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6806227
   Benefit[14]: increase performance
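
A sketch of the idea behind (8), under the assumption that "xflow" means overflow or underflow of the array loops. It is not taken from the bug report. The class name, the string results and the placeholder byte-to-char mapping are made up. The loop has no explicit bounds test, and the JVM's own array check ends it.

```java
final class XflowSketch {
    // Copies bytes to chars until one array is exhausted.
    // Returns "UNDERFLOW" if the source ran out, "OVERFLOW" if the target did.
    static String copy(byte[] src, char[] dst) {
        int sp = 0, dp = 0;
        try {
            while (true) {
                // Right-hand side is evaluated first, so an exhausted
                // source is detected before a full destination.
                dst[dp] = (char) (src[sp] & 0xff);
                sp++;
                dp++;
            }
        } catch (ArrayIndexOutOfBoundsException e) {
            return sp == src.length ? "UNDERFLOW" : "OVERFLOW";
        }
    }
}
```

Any gain depends on the exception being rare compared to the number of loop iterations, so short inputs might even get slower.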

(9) Get rid of sun.io package dependency:

https://java-nio-charset-enhanced.dev.java.net/source/browse/java-nio-charset-enhanced/tags/milestone2/src/sun/io/

   Benefit[15]: avoid superfluous disk-footprint
   Benefit[16]: save maintenance of sun.io converters

   Disadvantage[4]: published under JRL (waiting for launch of OpenJDK-7 project "charset-enhancement") ;-)


*** Encoder-Suggestions (not complete, just some thoughts):

(11) Initialize encoder mappings lazily:
   Benefit[17]: increase startup performance for decoder
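
One common way to get the lazy initialization of (11) is the holder-class idiom, sketched here with placeholder tables. The class names, the tiny "ABC" decoder table and the counter are assumptions for illustration. The counter exists only to make the laziness observable.

```java
final class LazyEncoderSketch {
    static int c2bBuilds = 0;            // instrumentation for the sketch only
    static final String B2C = "ABC";     // placeholder decoder table

    // Holder class: its static initializer runs on first use of C2B only.
    private static final class EncoderTables {
        static final byte[] C2B = build();
    }

    // Inverts the decoder table; -1 marks unmappable chars.
    private static byte[] build() {
        c2bBuilds++;
        byte[] c2b = new byte[0x80];
        java.util.Arrays.fill(c2b, (byte) -1);
        for (int i = 0; i < B2C.length(); i++) c2b[B2C.charAt(i)] = (byte) i;
        return c2b;
    }

    static char decode(int index) {
        return B2C.charAt(index);        // never touches EncoderTables
    }

    static int encode(char c) {
        return c < 0x80 ? EncoderTables.C2B[c] : -1;
    }
}
```

A program that only decodes never pays for building the encoder table.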

(12) Generate mappings for surrogate pairs:

   Benefit[18]: save encoding from surrogate pairs (I guess, this would significantly increase performance)

(13) Introduce 16-bit intermediate mapping ("beta"-thoughts: overall count of code points is < 65536):

   Benefit[19]: avoid superfluous memory and disk-footprint

-Ulf
