On 09/30/2011 07:09 AM, Ulf Zibis wrote:
(1) new byte[]{(byte)0xE1, (byte)0x80, (byte)0x42} --->
CoderResult.malformedForLength(1)
It appears the Unicode Standard now explicitly recommends returning
malformed length 2 for this scenario, which is what UTF-8 does now.
My idea was that, with malformed length 1, a subsequent call to the
decode loop would return another malformed length 1, ensuring
2 replacement chars in the output string. (Not sure if that is
expected in this corner case.)
The Unicode Standard's "best practices" (D93a/b) recommend returning 2
in this case.
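
To see which length a given JDK actually reports for this input, a small
probe along the following lines may help. It is only a sketch: the class
and method names (MalformedLengthProbe, firstMalformedLength) are made up
for illustration, and the reported value depends on the JDK version in use.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the JDK or of the patch under review.
final class MalformedLengthProbe {

    // Returns the malformed length reported for the first decoding error,
    // or 0 if the decoder reports no malformed input.
    static int firstMalformedLength(byte[] input) {
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer src = ByteBuffer.wrap(input);
        CharBuffer dst = CharBuffer.allocate(input.length);
        CoderResult cr = dec.decode(src, dst, true);
        // length() is only defined for error results, so guard the call.
        return cr.isMalformed() ? cr.length() : 0;
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xE1, (byte) 0x80, (byte) 0x42};
        System.out.println("malformed length: " + firstMalformedLength(bytes));
        // With the default REPLACE action, the number of U+FFFD chars in
        // the result reflects whether the decoder reported 1+1 or 2.
        String s = new String(bytes, StandardCharsets.UTF_8);
        System.out.println("decoded chars: " + s.length());
    }
}
```

A malformed length of 2 should give one replacement char followed by 'B';
a malformed length of 1 reported twice would give two replacement chars.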
3. Consider additionally 6795537 - UTF_8$Decoder returns wrong results
<http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6795537>
I'm not sure I understand the suggested b1 < -0x3e patch. I don't
see how we can simply replace
((b1 >> 5) == -2) with (b1 < -0x3e).
You must see the b1 < -0x3e in combination with the following b1 <
-0x20. ;-)
But now I have a better "if...else if" switch. :-)
- saves the shift operations
- only 1 comparison per case
- only 1 constant to load per case
- helps the compiler benefit from 1-byte constants and opcodes
- much more readable
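
For readers following along, the two styles can be compared side by side
with a sketch like the one below. Only the b1 < -0x3e and b1 < -0x20
comparisons come from this thread; the remaining constants (-0x10, -0x08),
the 3- and 4-byte shift tests, and the names LeadByteSwitch, byShift and
byCompare are assumptions added to make the example complete. It is not
the actual UTF_8$Decoder code.

```java
// Hypothetical illustration of the two lead-byte classification styles.
final class LeadByteSwitch {

    // "Shift" style: returns the sequence length implied by the lead
    // byte, or 0 if b1 cannot start a sequence.
    static int byShift(byte b1) {
        if (b1 >= 0) return 1;          // 0x00..0x7F
        if ((b1 >> 5) == -2) return 2;  // 0xC0..0xDF (0xC0/0xC1 rejected later)
        if ((b1 >> 4) == -2) return 3;  // 0xE0..0xEF
        if ((b1 >> 3) == -2) return 4;  // 0xF0..0xF7
        return 0;                       // 0x80..0xBF, 0xF8..0xFF
    }

    // "if...else if" style: one signed comparison per case.
    static int byCompare(byte b1) {
        if (b1 >= 0) return 1;          // 0x00..0x7F
        if (b1 < -0x3e) return 0;       // 0x80..0xC1: continuation or overlong lead
        if (b1 < -0x20) return 2;       // 0xC2..0xDF
        if (b1 < -0x10) return 3;       // 0xE0..0xEF (assumed constant)
        if (b1 < -0x08) return 4;       // 0xF0..0xF7 (assumed constant)
        return 0;                       // 0xF8..0xFF
    }

    public static void main(String[] args) {
        // Print every byte value where the two styles disagree.
        for (int i = 0; i < 256; i++) {
            byte b = (byte) i;
            if (byShift(b) != byCompare(b)) {
                System.out.printf("0x%02X: shift=%d compare=%d%n",
                        i, byShift(b), byCompare(b));
            }
        }
    }
}
```

Under these assumptions the two versions differ only for 0xC0 and 0xC1,
which the comparison chain rejects up front while the shift version
leaves them to a later overlong check. This is why b1 < -0x3e has to be
read together with the following b1 < -0x20.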
I believe we changed from (b1 < xyz) to (b1 >> x) == -2 back in 2009(?)
because the benchmark showed the "shift" version to be slightly faster.
Do you have any numbers that show a difference now? My non-scientific
benchmark still suggests the "shift" version is faster on the -server
vm, with no significant difference on the -client vm.
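
For anyone who wants to reproduce this kind of measurement, a rough
timing loop might look like the sketch below. This is not the benchmark
that produced the numbers in this mail; the class name, sample code
points, sizes and round counts are all arbitrary assumptions, and a
naive loop like this is subject to the usual JIT warm-up caveats.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical timing harness, not the benchmark used in this thread.
final class Utf8DecodeTiming {

    // Builds UTF-8 input consisting of `count` copies of one code point.
    static byte[] sample(int codePoint, int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) sb.appendCodePoint(codePoint);
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Decodes `input` `rounds` times and returns elapsed milliseconds.
    static long timeDecode(byte[] input, int rounds) {
        long sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            sink += new String(input, StandardCharsets.UTF_8).length();
        }
        long millis = (System.nanoTime() - start) / 1_000_000;
        // Use the accumulated length so the loop cannot be optimized away.
        if (sink < 0) throw new IllegalStateException();
        return millis;
    }

    public static void main(String[] args) {
        // One code point per encoded length: 1, 2, 3 and 4 bytes.
        int[] codePoints = {'A', 0x00E9, 0x20AC, 0x1F600};
        for (int n = 0; n < codePoints.length; n++) {
            byte[] input = sample(codePoints[n], 100_000);
            timeDecode(input, 100); // warm-up pass, result discarded
            System.out.println("Decoding " + (n + 1) + "b UTF-8 : "
                    + timeDecode(input, 1_000));
        }
    }
}
```

Run it once with -server and once with -client to compare the two VMs.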
------------------ your new switch ---------------
(1) -server
Method                 Millis    Ratio
Decoding 1b UTF-8 :       125    1.000
Decoding 2b UTF-8 :      2558   20.443
Decoding 3b UTF-8 :      3439   27.481
Decoding 4b UTF-8 :      2030   16.221
(2) -client
Method                 Millis    Ratio
Decoding 1b UTF-8 :       335    1.000
Decoding 2b UTF-8 :      1041    3.105
Decoding 3b UTF-8 :      2245    6.694
Decoding 4b UTF-8 :      1254    3.741
------------------ existing "shift" ---------------
(1) -server
Method                 Millis    Ratio
Decoding 1b UTF-8 :       134    1.000
Decoding 2b UTF-8 :      1891   14.106
Decoding 3b UTF-8 :      2934   21.886
Decoding 4b UTF-8 :      2133   15.913
(2) -client
Method                 Millis    Ratio
Decoding 1b UTF-8 :       341    1.000
Decoding 2b UTF-8 :       949    2.560
Decoding 3b UTF-8 :      2321    6.255
Decoding 4b UTF-8 :      1278    3.446
-sherman