Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

Xueming Shen Fri, 30 Sep 2011 13:44:54 -0700

On 09/30/2011 07:09 AM, Ulf Zibis wrote:

(1) new byte[]{(byte)0xE1, (byte)0x80, (byte)0x42} ---> CoderResult.malformedForLength(1)

It appears the Unicode Standard now explicitly recommends returning malformed length 2, which is what UTF-8 does now, for this scenario.

My idea is that, with a malformed length of 1, a consecutive call to the decode loop would return another malformed length 1, ensuring 2 replacement chars in the output string. (Not sure if that is expected in this corner case.)

The Unicode Standard's "best practices" (D93a/b) recommend returning 2 in this case.
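To make the case concrete, here is a minimal sketch that feeds the three bytes to the JDK's UTF-8 decoder and prints the malformed length it reports. The class and method names are made up for illustration; the result depends on the JDK in use, and I would expect a length of 2 only on a JDK that includes this change.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the webrev.
class MalformedLengthDemo {
    // Decodes the bytes with the JDK's UTF-8 decoder and returns the first
    // CoderResult. A freshly created decoder reports malformed input
    // (CodingErrorAction.REPORT) instead of replacing it.
    static CoderResult firstResult(byte[] bytes) {
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder();
        CharBuffer out = CharBuffer.allocate(bytes.length);
        return dec.decode(ByteBuffer.wrap(bytes), out, true);
    }

    public static void main(String[] args) {
        // 0xE1 starts a 3-byte sequence, 0x80 is a valid continuation byte,
        // 0x42 ('B') is not, so the sequence is truncated after two bytes.
        byte[] in = {(byte) 0xE1, (byte) 0x80, (byte) 0x42};
        CoderResult cr = firstResult(in);
        System.out.println("malformed=" + cr.isMalformed() + " length=" + cr.length());
    }
}
```

A malformed length of 2 means the caller skips E1 80 as one unit and emits a single replacement char before decoding 0x42 normally, whereas length 1 would lead to two replacement chars.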

3. Consider additionally 6795537 - UTF_8$Decoder returns wrong results <http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6795537>
I'm not sure I understand the suggested b1 < -0x3e patch. I don't see how we can simply replace
((b1 >> 5) == -2) with (b1 < -0x3e).
You must see the b1 < -0x3e in combination with the following b1 < -0x20. ;-)
But now I have a better "if...else if" switch. :-)
- saves the shift operations
- only 1 comparison per case
- only 1 constant to load per case
- helps compiler to benefit from 1 byte constants and op-codes
- much more readable
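For readers following along, the two styles of lead-byte dispatch can be sketched side by side. This is my reconstruction from the constants quoted above, not the code from either webrev; the class and method names are hypothetical, and the (b1 & 0x1e) != 0 test in the shift version is an assumption about how the overlong leads C0/C1 are rejected.

```java
// Hypothetical sketch comparing the two lead-byte classifications.
class LeadByteCheck {
    // "Shift" style: test the high bits of the (signed) lead byte.
    // Returns the sequence length, or 0 if b1 cannot start a sequence.
    static int lengthByShift(byte b1) {
        if (b1 >= 0) return 1;                                     // 0x00..0x7F
        if ((b1 >> 5) == -2 && (b1 & 0x1e) != 0) return 2;         // 0xC2..0xDF
        if ((b1 >> 4) == -2) return 3;                             // 0xE0..0xEF
        if ((b1 >> 3) == -2) return 4;                             // 0xF0..0xF7
        return 0;
    }

    // "Compare" style: an if...else if chain with one signed comparison per case.
    // Each test only makes sense in combination with the one before it.
    static int lengthByCompare(byte b1) {
        if (b1 >= 0) return 1;         // 0x00..0x7F
        if (b1 < -0x3e) return 0;      // 0x80..0xC1: continuation or overlong lead
        if (b1 < -0x20) return 2;      // 0xC2..0xDF
        if (b1 < -0x10) return 3;      // 0xE0..0xEF
        if (b1 < -0x08) return 4;      // 0xF0..0xF7
        return 0;                      // 0xF8..0xFF
    }

    public static void main(String[] args) {
        // Check that both classifiers agree on every possible byte value.
        for (int i = 0; i < 256; i++) {
            byte b = (byte) i;
            if (lengthByShift(b) != lengthByCompare(b)) {
                throw new AssertionError("mismatch at 0x" + Integer.toHexString(i));
            }
        }
        System.out.println("both classifiers agree on all 256 byte values");
    }
}
```

Both versions still accept 0xF5..0xF7 as 4-byte leads, so in either case a later check on the full sequence has to reject those.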

I believe we changed from (b1 < xyz) to (b1 >> x) == -2 back in 2009(?) because the benchmark showed the "shift" version was slightly faster. Do you have any numbers that show a difference now? My non-scientific benchmark still suggests the "shift" type is faster on the -server VM, with no significant difference on the -client VM.

  ------------------ your new switch ---------------
(1) -server
Method                      Millis  Ratio
Decoding 1b UTF-8 :            125  1.000
Decoding 2b UTF-8 :           2558 20.443
Decoding 3b UTF-8 :           3439 27.481
Decoding 4b UTF-8 :           2030 16.221
(2) -client
Method                      Millis  Ratio
Decoding 1b UTF-8 :            335  1.000
Decoding 2b UTF-8 :           1041  3.105
Decoding 3b UTF-8 :           2245  6.694
Decoding 4b UTF-8 :           1254  3.741

  ------------------ existing "shift" ---------------
(1) -server
Method                      Millis  Ratio
Decoding 1b UTF-8 :            134  1.000
Decoding 2b UTF-8 :           1891 14.106
Decoding 3b UTF-8 :           2934 21.886
Decoding 4b UTF-8 :           2133 15.913
(2) -client
Method                      Millis  Ratio
Decoding 1b UTF-8 :            341  1.000
Decoding 2b UTF-8 :            949  2.560
Decoding 3b UTF-8 :           2321  6.255
Decoding 4b UTF-8 :           1278  3.446
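
For anyone who wants to produce comparable numbers, a rough timing loop along these lines might do. It is not the harness that produced the figures above; the class name, the sample code points, and the input and round counts are all assumptions, and a naive loop like this is only good for a coarse comparison.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical timing sketch, not the benchmark used for the tables above.
class Utf8DecodeTiming {
    // Builds `count` copies of one code point, encoded as UTF-8.
    static byte[] sample(int codePoint, int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) sb.appendCodePoint(codePoint);
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Decodes `input` `rounds` times and returns the elapsed milliseconds.
    static long timeDecode(byte[] input, int rounds) {
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder();
        // UTF-8 never yields more chars than it has bytes.
        CharBuffer out = CharBuffer.allocate(input.length);
        long start = System.nanoTime();
        for (int r = 0; r < rounds; r++) {
            out.clear();
            dec.reset().decode(ByteBuffer.wrap(input), out, true);
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // One representative code point per encoded width (1 to 4 bytes).
        int[] codePoints = {0x41, 0xE9, 0x4E2D, 0x1F600};
        for (int i = 0; i < codePoints.length; i++) {
            byte[] input = sample(codePoints[i], 100_000);
            timeDecode(input, 200);                 // warm-up so the JIT kicks in
            long ms = timeDecode(input, 2_000);
            System.out.println("Decoding " + (i + 1) + "b UTF-8 : " + ms + " ms");
        }
    }
}
```

Running it once with -server and once with -client, before and after swapping the dispatch code, gives the same kind of four-row comparison as above.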



-sherman
