On Friday, June 17, 2016 at 2:30:12 AM UTC+2, gordo...@gmail.com wrote:
>
> > Modern x86 CPUs don't work like that. 
>
> > In general, optimally scheduled assembly code which uses more registers 
> > has higher performance than optimally scheduled assembly code which uses 
> > a smaller number of registers, assuming both listings correspond to the 
> > same source code. 
>
> > Register renaming: since Intel Pentium Pro and AMD K5. 
>
> > Suggestion for reading: 
> > http://www.agner.org/optimize/microarchitecture.pdf 
>
> > An excerpt from the above PDF document (Section 10, on the Haswell and 
> > Broadwell pipeline): "... the register file has 168 integer registers and 
> > 168 vector registers ..." 
>
> I am aware of all of the above and have already read Agner Fog's 
> publications.  In addition, modern CPUs do Out-of-Order Execution (OOE), 
> rearranging instructions to best reduce instruction latencies and increase 
> throughput given the parallel execution pipelines and ahead-of-time 
> execution, so the actual execution order is almost certainly not as per 
> the assembly listing. 
>
> Yes, both assembly listings are from the same tight loop code, but the 
> "C/C++" one has been converted from another assembly format to the golang 
> assembly format. 
>
> Daniel Bernstein, the author of "primegen", wrote for the Pentium 3 in x86 
> (32-bit) code, as the Pentium Pro processor wasn't commonly available at 
> that time and 64-bit code didn't exist.  His hand-optimized C code for the 
> Sieve of Eratosthenes (the "while k < B" loop in the "doit()" function of 
> "eratspeed.c") uses six registers for the inner culling loop being 
> discussed, and takes about 3.5 CPU clock cycles per loop iteration on a 
> modern CPU (Haswell).
>

The following isn't an argument for or against your post; it is just some 
assembly code that I find interesting.

$ clang-3.9 -S -O3 eratspeed.c -mtune=bdver3
.LBB1_4:                                # culling loop: edx = index (call it k), esi = stride (q)
        movl    %edx, %edi
        shrl    $5, %edi                # edi = k >> 5          (word index)
        movl    %edx, %ecx
        andl    $31, %ecx               # ecx = k & 31          (bit index)
        movl    two(,%rcx,4), %ecx      # ecx = two[k & 31], overwriting the index register
        addl    %esi, %edx              # k += q
        orl     %ecx, a(%rbx,%rdi,4)    # a[k >> 5] |= two[k & 31]
        cmpl    $32032, %edx
        jb      .LBB1_4                 # repeat while k < 32032

5 registers: DX DI CX SI BX (note that the load writes its result back 
into %ecx, the just-computed index register).
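
For concreteness, here is what that loop computes, written out as a small
self-contained Go program (a sketch: the names k and q and the values in
main are assumptions read off the registers; a, two and the 32032 bound
come straight from the assembly):

package main

import "fmt"

// two is the bit-mask table the assembly indexes as two(,%rcx,4);
// two[i] = 1 << i is an assumption consistent with its use as a mask.
var two [32]uint32

func init() {
	for i := range two {
		two[i] = 1 << uint(i)
	}
}

// cull sets every q-th bit starting at bit k, which is what the loop
// above does: a[k>>5] |= two[k&31]; k += q; repeat while k < 32032.
func cull(a []uint32, k, q uint32) {
	for ; k < 32032; k += q {
		a[k>>5] |= two[k&31]
	}
}

func main() {
	a := make([]uint32, 32032/32) // 1001 words = 32032 bits
	cull(a, 7, 7)                 // hypothetical start and stride, just to exercise it
	fmt.Printf("a[0] = %#08x\n", a[0])
}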

$ perf stat --repeat=100 -- ./eratspeed.orig >/dev/null

 Performance counter stats for './eratspeed.orig' (100 runs):

        466.592220      task-clock (msec)         #    0.999 CPUs utilized            ( +-  0.11% )
                 1      context-switches          #    0.002 K/sec                    ( +- 12.37% )
                 0      cpu-migrations            #    0.000 K/sec                    ( +-100.00% )
                86      page-faults               #    0.185 K/sec                    ( +-  0.09% )
     1,862,076,369      cycles                    #    3.991 GHz                      ( +-  0.11% )
        74,805,707      stalled-cycles-frontend   #    4.02% frontend cycles idle     ( +-  1.20% )
       272,721,373      stalled-cycles-backend    #   14.65% backend cycles idle      ( +-  0.16% )
     4,116,301,949      instructions              #    2.21  insn per cycle
                                                  #    0.07  stalled cycles per insn  ( +-  0.00% )
       473,019,237      branches                  # 1013.774 M/sec                    ( +-  0.00% )
         8,443,283      branch-misses             #    1.78% of all branches          ( +-  1.05% )

       0.467167554 seconds time elapsed                                               ( +-  0.11% )

-----

Hand optimization of the above:

.LBB1_4:
        movl    %edx, %edi
        shrl    $5, %edi
        movl    %edx, %ecx
        andl    $31, %ecx
        movl    two(,%rcx,4), %r11d     # the load now writes %r11d instead of reusing %ecx
        addl    %esi, %edx
        orl     %r11d, a(%rbx,%rdi,4)
        cmpl    $32032, %edx
        jb      .LBB1_4

6 registers: DX DI CX SI BX R11.  The only change is that the loaded mask 
now lands in %r11d instead of overwriting %ecx; as the runs below show, 
backend stalls drop from 14.65% to 11.36%, IPC rises from 2.21 to 2.32, 
and total cycles fall by about 4.7%.

$ perf stat --repeat=100 -- ./eratspeed.opt >/dev/null

 Performance counter stats for './eratspeed.opt' (100 runs):

        444.740487      task-clock (msec)         #    0.999 CPUs utilized            ( +-  0.01% )
                 1      context-switches          #    0.002 K/sec                    ( +- 11.68% )
                 0      cpu-migrations            #    0.000 K/sec                    ( +-100.00% )
                85      page-faults               #    0.191 K/sec                    ( +-  0.10% )
     1,775,283,795      cycles                    #    3.992 GHz                      ( +-  0.00% )
        68,397,472      stalled-cycles-frontend   #    3.85% frontend cycles idle     ( +-  0.04% )
       201,687,390      stalled-cycles-backend    #   11.36% backend cycles idle      ( +-  0.02% )
     4,116,277,783      instructions              #    2.32  insn per cycle
                                                  #    0.05  stalled cycles per insn  ( +-  0.00% )
       473,014,763      branches                  # 1063.575 M/sec                    ( +-  0.00% )
         8,263,323      branch-misses             #    1.75% of all branches          ( +-  0.00% )

       0.445287360 seconds time elapsed                                               ( +-  0.01% )

> The number of internal CPU registers actually used by the CPU to effect 
> OOE is beside the point, as they have to do with the CPU's internal 
> optimizations and not compiler optimizations; my point is that the 
> compiler's incorrect use of registers still costs time. 

> While I don't expect golang, with its philosophy of preserving "safe" 
> paradigms by doing array bounds checks by default, to run as fast as 
> C/C++ code that doesn't have that philosophy, I do expect it to run at 
> least as fast as C#/Java code, which is Just-In-Time (JIT) compiled and 
> does have the "safe" philosophy.  The golang compiler version 1.7beta1 is 
> not quite there yet, for the indicated reasons: inconsistent use of 
> registers; using one register too many in one place in order to avoid an 
> immediate load that doesn't cost any execution time; and saving one 
> register by use of the "trick", which does cost execution time as 
> compared to the use of a single register. 
>
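
For anyone experimenting with that trade-off: the check lands on the
indexed store in a loop like the cull sketch above.  It can be disabled
wholesale for measurement with the compiler's -B flag (go build
-gcflags=-B), or one can try the usual source-level hint below, which
re-slices to a fixed length up front so the compiler has a constant bound
to reason from; whether 1.7 actually drops the per-iteration check is
best verified by reading the output of go tool compile -S.  (A sketch
building on the cull example above, not measured code.)

// cullHinted re-slices once to the exact length the loop can touch;
// the single check here (it panics if cap(a) is too small) is the hint.
func cullHinted(a []uint32, k, q uint32) {
	b := a[:32032/32]
	for ; k < 32032; k += q {
		b[k>>5] |= two[k&31]
	}
}
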
> However, there is hope: version 1.7 has made great advances over version 
> 1.6, and version 1.8, which is said to target further improvements here, 
> will surely be faster yet.  At any rate, version 1.7's speed is 
> "adequate" for many purposes, as it at least comes close to C#/Java speed 
> (probably within about 15% to 20% or less) on many of the most demanding 
> tight-loop algorithms, and thus is quite usable compared to previous 
> versions.  But even the most avid golang protagonists must admit that it 
> isn't the language in which to re-write Kim Walisch's "primesieve", with 
> its extreme loop unrolling that takes an average of about 1.4 CPU clock 
> cycles per composite-number cull for small ranges of primes; even with 
> array bounds checking turned off, golang would still take at least twice 
> and more likely three times as long. 
>
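
On the unrolling point: "extreme loop unrolling" here means culling
several strides per loop iteration so the add/compare/branch overhead is
paid once per group rather than once per bit.  Reduced to the bit-array
setting of the sketches above, a mild 4x version looks like the
following; primesieve itself goes much further and specializes the code
per wheel offset:

// cullUnrolled4 marks four strides per iteration; the loop condition
// guarantees all four indices stay below the 32032-bit bound, and the
// tail loop finishes whatever is left.
func cullUnrolled4(a []uint32, k, q uint32) {
	for ; k+3*q < 32032; k += 4 * q {
		a[k>>5] |= two[k&31]
		a[(k+q)>>5] |= two[(k+q)&31]
		a[(k+2*q)>>5] |= two[(k+2*q)&31]
		a[(k+3*q)>>5] |= two[(k+3*q)&31]
	}
	for ; k < 32032; k += q {
		a[k>>5] |= two[k&31]
	}
}
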
> That is also why I first started posting to this thread: the only reason 
> the golang version of "primegen" is reasonably comparable in speed to the 
> C/C++ "primegen" is that it uses multi-threading on a multi-core 
> processor, something that wasn't available to Daniel Bernstein when he 
> wrote "primegen".  My point was that one should compare like with like. 
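
On the multi-threading point, the shape of it is simple because sieve
segments are independent of one another.  Roughly (the segs/starts
bookkeeping is invented for illustration and says nothing about how the
golang primegen port actually organizes its work; add "sync" to the
imports of the earlier sketch):

// cullParallel culls each segment on its own goroutine. starts[i] is
// the first bit to mark in segs[i]; computing those offsets per prime
// is the real bookkeeping a segmented sieve has to do.
func cullParallel(segs [][]uint32, starts []uint32, q uint32) {
	var wg sync.WaitGroup
	for i := range segs {
		wg.Add(1)
		go func(seg []uint32, k uint32) {
			defer wg.Done()
			limit := uint32(len(seg)) * 32
			for ; k < limit; k += q {
				seg[k>>5] |= two[k&31]
			}
		}(segs[i], starts[i])
	}
	wg.Wait()
}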
