Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

Ulf Zibis Fri, 14 Oct 2011 01:31:34 -0700

On 13.10.2011 21:13, Xueming Shen wrote:

On 10/13/2011 09:55 AM, Ulf Zibis wrote:

On 11.10.2011 19:49, Xueming Shen wrote:


I don't know which one is better. I did a run on

    private static boolean op1(int b) {
        return (b >> 6) != -2;
    }
    private static boolean op2(int b) {
        return (b & 0xc0) != 0x80;
    }
    private static boolean op3(byte b) {
        return b >= (byte)0xc0;
    }

with 1000000 iterations on my Linux machine, and got these scores

op1=1149
op2=1147
op3=1146

I would interpret it as they are identical.
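
Beyond timing, the three variants are meant to be interchangeable in behavior. A minimal sketch of an exhaustive check over all 256 byte values follows; the class name `ContinuationCheck` and the helper `countMismatches` are made up for illustration and are not part of the proposed patch.

```java
class ContinuationCheck {
    // The three candidates from the message above, copied unchanged.
    static boolean op1(int b) {
        return (b >> 6) != -2;
    }
    static boolean op2(int b) {
        return (b & 0xc0) != 0x80;
    }
    static boolean op3(byte b) {
        return b >= (byte)0xc0;
    }

    // Hypothetical helper: counts the byte values on which the three
    // predicates do not all return the same result.
    static int countMismatches() {
        int mismatches = 0;
        for (int i = Byte.MIN_VALUE; i <= Byte.MAX_VALUE; i++) {
            byte b = (byte) i;
            boolean r1 = op1(b);   // b is sign-extended to int here
            boolean r2 = op2(b);
            boolean r3 = op3(b);
            if (r1 != r2 || r2 != r3) {
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        System.out.println("mismatches = " + countMismatches());
    }
}
```

Note that op1 and op2 receive the byte sign-extended to int, which is the situation in a decoder that reads from a byte array.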

Me too. Thanks for your effort.
Maybe the comparison would come out differently on other architectures.

So I would prefer op3, because the others ...
1. possibly need 1 more CPU register to save the original value of b for
later use
2. need 1 more constant to load into the CPU
and op3 ...
3. is the most readable source code
4. possibly gives HotSpot the best chance of finding the best optimization on
arbitrary architectures
5. has the smallest bytecode footprint
6. so the interpreter would be faster too.

I doubt it's more "readable" :-), given that the "byte operation" means
"<0x80 && >= 0xc0" in int.

If b were an unsigned int in the range [0..0xFF], half yes (it would be: b < 0x80 ||
b >= 0xc0).

But b is in the range [-0x80..0x7F] because it originates from a byte array, so the operation translated to int would be: "b < -0x80 || b >= -0x40"

You need "b" to be byte for b >= (byte)0xc0

No, it works the same for int, because b never goes below the lower limit -0x80 and (byte)0xc0 is -0x40.

So the notation "b >= (byte)0xc0" looks closest to its real unsigned
counterpart.

to be the equivalent of "<0x80 && >= 0xc0", and in all use cases in the
existing UTF-8 implementation the "b" has already been stored in an "int".
Arguably you could update the whole implementation to achieve this,

yes, that's exactly what I wanted to say.

but personally
I would like to just stick to the problem this proposal is trying to solve.

I agree, but it's not much more than declaring the bx as byte.
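
The point made earlier in this exchange, that "b >= (byte)0xc0" gives the same answer when b is an int holding a value read from a byte array, can be checked with a small sketch. The class and method names below are invented for illustration only.

```java
class SignedVsUnsigned {
    // Signed form applied to an int; (byte)0xc0 is promoted to the int -0x40.
    static boolean signedForm(int b) {
        return b >= (byte)0xc0;
    }

    // Unsigned reading of the same test: b viewed in [0..0xFF].
    static boolean unsignedForm(int b) {
        int u = b & 0xff;
        return u < 0x80 || u >= 0xc0;
    }

    // Hypothetical helper: compares both forms for every value a byte array can hold.
    static boolean agreeOnAllBytes() {
        byte[] src = new byte[256];
        for (int i = 0; i < src.length; i++) {
            src[i] = (byte) i;
        }
        for (byte value : src) {
            int b = value;  // sign-extended, so b is in [-0x80..0x7F]
            if (signedForm(b) != unsignedForm(b)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("agree = " + agreeOnAllBytes());
    }
}
```

The check only covers ints that came from a byte; for an int already masked to [0..0xFF] the signed form is true for every value, so the two forms are not interchangeable there.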


And, no, for the same reason I don't want to replace all "(b & 0xc0) != 0x80"
by "isNotContinuation(b)", they just look fine to me, together with their
neighbors, such as "<0x80 && >= 0xc0".

Yes, they look fine, but the reader must always keep in mind that "(b & 0xc0) != 0x80" is semantically the same as "isNotContinuation(b)". Why do you introduce isNotContinuation(b) at all? It could always be inlined, as I don't think the tiny operation has any effect on HotSpot's optimization strategy, and as a side effect, I guess C1 code would be faster.


-Ulf

-Sherman
Additionally I guess always using byte variables would possibly help HotSpot to optimize with 1-byte-operand CPU instructions.
Wouldn't you like to replace all "(bx & 0xc0) != 0x80" by "isNotContinuation(bx)"?

-Ulf
