On 01/30/09 10:52 AM, Elad Lahav wrote:
>> Just to state what might be obvious (but then again maybe not), it
>> sounds like the original expectation was that in 64-bit mode, the
>> 64-bit integers would be used to reduce the number of operations
>> needed for a given input size. However, depending on how the
>> libgcrypt code is written, this may or may not happen. IIRC, on
>> Solaris (at least), in 32-bit mode, 'int', 'long', and pointer types
>> are 32 bits (though you can still use 64-bit integers if you like);
>> in 64-bit mode, the only ones that change are 'long' and the pointer
>> types. So if the code uses '[unsigned] int' for all of its types, it
>> will still be using 32-bit integers, and thus see no difference even
>> if such a thing would have an impact.
>
> My expectation was that something like the following loop would run
> faster on 64 bit:
>
>     unsigned long array[ARR_SIZE];
>     for (i = 0; i < ARR_SIZE; i++) {
>         array[i] ^= (unsigned long)SOME_CONSTANT;
>     }
>
> However, the two previous replies seem to suggest that there should be
> no performance advantage to running 64-bit applications. So which of
> these options is true:
>
> 1. There may be a performance advantage to running 64-bit code, but
>    applications are not tuned to exploit it; or
I think the point was that with 64-bit ints you might be able to use a
different algorithm, or perhaps require fewer iterations and converge
faster. However, you could always use 'long long', which is a 64-bit
data type.

> 2. There is no performance advantage (under SPARC) in running 64 bit code.

There's an improved ABI (floating-point variables are passed in the
floating-point registers), but there are no other 'gains' (unless you
restrict the code to v8 rather than the default of v8plusa, in which
case you don't get prefetch).

Running through your example code:

Code compiled for v8plusa:

.L900000105:
/* 0x0040   4 */  prefetch  [%o2+264],22
/* 0x0044     */  xor       %g3,%o0,%g2
/* 0x0048   3 */  add       %g5,4,%g5
/* 0x004c   4 */  ld        [%o2],%g3         << Load
/* 0x0050     */  st        %g2,[%o2-8]
/* 0x0054   3 */  cmp       %g5,%o5
/* 0x0058     */  add       %o2,16,%o2        << Step size
/* 0x005c   4 */  xor       %o3,%o0,%g1
/* 0x0060     */  ld        [%o2-12],%o3      << Load
/* 0x0064     */  st        %g1,[%o2-20]
/* 0x0068     */  xor       %g3,%o0,%o1
/* 0x006c     */  ld        [%o2-8],%g3       << Load
/* 0x0070     */  st        %o1,[%o2-16]
/* 0x0074     */  xor       %o3,%o0,%g4
/* 0x0078     */  ld        [%o2-4],%o3       << Load
/* 0x007c   3 */  ble,pt    %icc,.L900000105
/* 0x0080   4 */  st        %g4,[%o2-12]

The loop is unrolled and pipelined 4 times; each iteration of the loop
consumes 16 bytes.

Code compiled for v8 (pure 32-bit code):

.L900000105:
/* 0x0040   4 */  xor       %g3,%o0,%g1
/* 0x0044   3 */  add       %o1,3,%o1
/* 0x0048   4 */  ld        [%o2],%o3
/* 0x004c     */  st        %g1,[%o2-8]
/* 0x0050   3 */  cmp       %o1,%o5
/* 0x0054     */  add       %o2,12,%o2
/* 0x0058   4 */  xor       %g4,%o0,%g4
/* 0x005c     */  ld        [%o2-8],%g3
/* 0x0060     */  st        %g4,[%o2-16]
/* 0x0064     */  xor       %o3,%o0,%g2
/* 0x0068     */  ld        [%o2-4],%g4
/* 0x006c   3 */  ble       .L900000105
/* 0x0070   4 */  st        %g2,[%o2-12]

The loop is unrolled and pipelined 3 times; each iteration consumes
12 bytes. [No prefetch.]

Code compiled for v9:

.L900000105:
/* 0x0040   4 */  prefetch  [%o2+272],22
/* 0x0044     */  xor       %g3,%o0,%g2
/* 0x0048   3 */  add       %g5,4,%g5
/* 0x004c   4 */  ldx       [%o2],%g3
/* 0x0050     */  stx       %g2,[%o2-16]
/* 0x0054   3 */  cmp       %g5,%o5
/* 0x0058     */  add       %o2,32,%o2        << Step size
/* 0x005c   4 */  xor       %o3,%o0,%g1
/* 0x0060     */  ldx       [%o2-24],%o3
/* 0x0064     */  stx       %g1,[%o2-40]
/* 0x0068     */  xor       %g3,%o0,%o1
/* 0x006c     */  ldx       [%o2-16],%g3
/* 0x0070     */  stx       %o1,[%o2-32]
/* 0x0074     */  xor       %o3,%o0,%g4
/* 0x0078     */  ldx       [%o2-8],%o3
/* 0x007c   3 */  ble,pt    %icc,.L900000105
/* 0x0080   4 */  stx       %g4,[%o2-24]

The loop is unrolled and pipelined 4 times; each iteration consumes
32 bytes. [Notice that the memory ops become eXtended (ldx/stx) -
64-bit rather than 32-bit.]

If I change the array to be of type long long, I get identical code for
both the 32-bit and 64-bit variants. If I use the long long type and
compile for v8, I get:

.L900000105:
/* 0x0044   4 */  xor       %o0,%g3,%o0       << %g3
/* 0x0048   3 */  add       %o2,2,%o2
/* 0x004c   4 */  ld        [%o3],%o4
/* 0x0050     */  st        %o0,[%o3-8]
/* 0x0054   3 */  cmp       %o2,%o5
/* 0x0058     */  add       %o3,16,%o3        << step size 16
/* 0x005c   4 */  xor       %o1,%g4,%g1       << %g4
/* 0x0060     */  ld        [%o3-12],%o1
/* 0x0064     */  st        %g1,[%o3-20]
/* 0x0068     */  xor       %o4,%g3,%g1       << %g3
/* 0x006c     */  ld        [%o3-8],%o0
/* 0x0070     */  st        %g1,[%o3-16]
/* 0x0074     */  xor       %o1,%g4,%o4       << %g4
/* 0x0078     */  ld        [%o3-4],%o1
/* 0x007c   3 */  ble       .L900000105
/* 0x0080   4 */  st        %o4,[%o3-12]

There's no native support for 64-bit integers here, so each 32-bit half
has to be handled independently.

The conclusion: if you compare code compiled for v8 against v9, then v9
is likely to be faster. However, if you compile for v8plusa (which is
the default), you will not get any gain from the instruction set.
Obviously, if the data set has doubled in size (the array elements are
now 64-bit), then you'll have to stream twice as many bytes through the
chip, so it will take twice as long.

Regards,

Darryl.
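[A note on reproducing the three listings above: they were generated
from the test case shown below. Only the generic compile line is
quoted, so the per-target commands are an assumption, but with Sun
Studio cc they would presumably be along the lines of

    cc -O -S -xarch=v8 test.c
    cc -O -S -xarch=v8plusa test.c
    cc -O -S -xarch=v9 test.c

with -xarch selecting the target instruction set.]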
Code:

void func(unsigned long long *a, int n)
{
    for (int i = 0; i < n; i++)
        a[i] ^= 1024410244ull;
}

Compiler: cc -O -S test.c

> Thanks,
> --Elad

--
Darryl Gove
Compiler Performance Engineering
Blog: http://blogs.sun.com/d/
Book: http://www.sun.com/books/catalog/solaris_app_programming.xml
_______________________________________________
perf-discuss mailing list
perf-discuss@opensolaris.org
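For readers who want to see the type-width point from the start of the
thread directly, here is a minimal sketch (not part of the original
exchange) that prints the sizes of the relevant types. Building it once
as 32-bit and once as 64-bit (e.g. with -m64 on recent compilers)
should show that only 'long' and pointers widen, while 'long long' is
64-bit in both modes:

/* sizes.c - minimal sketch, not from the original post: shows which C
 * types actually widen in 64-bit (LP64) mode compared with 32-bit
 * (ILP32) mode on Solaris/SPARC. */
#include <stdio.h>

int main(void)
{
    printf("int       : %u bytes\n", (unsigned)sizeof(int));
    printf("long      : %u bytes\n", (unsigned)sizeof(long));
    printf("long long : %u bytes\n", (unsigned)sizeof(long long));
    printf("void *    : %u bytes\n", (unsigned)sizeof(void *));
    /* Expected: ILP32 gives 4 / 4 / 8 / 4, LP64 gives 4 / 8 / 8 / 8.
     * 'int' stays 32-bit in both modes, which is why code written
     * around 'unsigned int' sees no width change from a 64-bit build. */
    return 0;
}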