On Friday, June 17, 2016 at 2:30:12 AM UTC+2, gordo...@gmail.com wrote:
>
> > Modern x86 CPUs don't work like that.
> >
> > In general, optimally scheduled assembly code which uses more registers
> > has higher performance than optimally scheduled assembly code which uses
> > a smaller number of registers, assuming both assembly codes correspond
> > to the same source code.
> >
> > Register renaming: since Intel Pentium Pro and AMD K5.
> >
> > Suggestion for reading:
> > http://www.agner.org/optimize/microarchitecture.pdf
> >
> > An excerpt from the above PDF document (Section 10 about the Haswell and
> > Broadwell pipeline): "... the register file has 168 integer registers
> > and 168 vector registers ..."
>
> I am aware of all of the above and have already read Agner Fog's
> publications. In addition, modern CPUs do Out-of-Order Execution (OOE),
> so they rearrange the instructions to best reduce instruction latencies
> and increase throughput, given the parallel execution pipelines and
> ahead-of-time execution; the actual execution order is therefore almost
> certainly not as per the assembly listing.
>
> Yes, both assembly listings are from the same tight loop code, but the
> "C/C++" one has been converted from another assembly format to the golang
> assembly format.
>
> Daniel Bernstein, the author of "primegen", wrote for the Pentium 3 in
> x86 (32-bit) code, as the Pentium Pro processor wasn't commonly available
> at that time and 64-bit code didn't exist. His hand-optimized C code for
> the Sieve of Eratosthenes ("eratspeed.c", in the "doit()" function's
> "while k < B" loop) uses six registers for the inner culling loop being
> discussed, and takes about 3.5 CPU clock cycles per loop iteration on a
> modern CPU (Haswell).
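For readers following along, the "while k < B" culling loop referred to above has roughly the following shape when written in Go. This is only my sketch of the loop, not Bernstein's code and not the golang primegen port; the names a, two, k and B match the quoted discussion and the assembly below, while the step name q and the function name are mine:

package sieve

// cull marks one composite per iteration: a is the sieve segment as
// 32-bit words, two[i] holds the single-bit mask 1 << i, q is the step
// (the prime being culled) and B is the segment size in bits.
func cull(a []uint32, two *[32]uint32, k, q, B uint32) {
    for ; k < B; k += q {
        a[k>>5] |= two[k&31]
    }
}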
The following isn't an argument against your post or in favor of your post.
It is just some assembly code that I find interesting.

$ clang-3.9 -S -O3 eratspeed.c -mtune=bdver3

.LBB1_4:
        movl    %edx, %edi
        shrl    $5, %edi
        movl    %edx, %ecx
        andl    $31, %ecx
        movl    two(,%rcx,4), *%ecx*
        addl    %esi, %edx
        orl     *%ecx*, a(%rbx,%rdi,4)
        cmpl    $32032, %edx
        jb      .LBB1_4

5 registers: DX DI CX SI BX

$ perf stat --repeat=100 -- ./eratspeed.orig >/dev/null

 Performance counter stats for './eratspeed.orig' (100 runs):

        466.592220      task-clock (msec)        #    0.999 CPUs utilized            ( +-  0.11% )
                 1      context-switches         #    0.002 K/sec                    ( +- 12.37% )
                 0      cpu-migrations           #    0.000 K/sec                    ( +-100.00% )
                86      page-faults              #    0.185 K/sec                    ( +-  0.09% )
     1,862,076,369      cycles                   #    3.991 GHz                      ( +-  0.11% )
        74,805,707      stalled-cycles-frontend  #    4.02% frontend cycles idle     ( +-  1.20% )
       272,721,373      stalled-cycles-backend   #  *14.65%* backend cycles idle     ( +-  0.16% )
     4,116,301,949      instructions             #   *2.21*  insn per cycle
                                                 #    0.07  stalled cycles per insn  ( +-  0.00% )
       473,019,237      branches                 # 1013.774 M/sec                    ( +-  0.00% )
         8,443,283      branch-misses            #    1.78% of all branches          ( +-  1.05% )

       0.467167554 seconds time elapsed                                          ( +-  0.11% )

-----

Hand optimization of the above:

.LBB1_4:
        movl    %edx, %edi
        shrl    $5, %edi
        movl    %edx, %ecx
        andl    $31, %ecx
        movl    two(,%rcx,4), *%r11d*
        addl    %esi, %edx
        orl     *%r11d*, a(%rbx,%rdi,4)
        cmpl    $32032, %edx
        jb      .LBB1_4

6 registers: DX DI CX SI BX R11

$ perf stat --repeat=100 -- ./eratspeed.opt >/dev/null

 Performance counter stats for './eratspeed.opt' (100 runs):

        444.740487      task-clock (msec)        #    0.999 CPUs utilized            ( +-  0.01% )
                 1      context-switches         #    0.002 K/sec                    ( +- 11.68% )
                 0      cpu-migrations           #    0.000 K/sec                    ( +-100.00% )
                85      page-faults              #    0.191 K/sec                    ( +-  0.10% )
     1,775,283,795      cycles                   #    3.992 GHz                      ( +-  0.00% )
        68,397,472      stalled-cycles-frontend  #    3.85% frontend cycles idle     ( +-  0.04% )
       201,687,390      stalled-cycles-backend   #  *11.36%* backend cycles idle     ( +-  0.02% )
     4,116,277,783      instructions             #   *2.32*  insn per cycle
                                                 #    0.05  stalled cycles per insn  ( +-  0.00% )
       473,014,763      branches                 # 1063.575 M/sec                    ( +-  0.00% )
         8,263,323      branch-misses            #    1.75% of all branches          ( +-  0.00% )

       0.445287360 seconds time elapsed                                          ( +-  0.01% )

> The number of internal CPU registers actually used by the CPU to effect
> OOE is beside the point, as they have to do with the CPU's internal
> optimizations and not with compiler optimizations; my point is that the
> compiler's incorrect use of registers still costs time.
>
> While I don't expect golang, with its philosophy of preserving "safe"
> paradigms by doing array bounds checks by default, to run as fast as
> C/C++ code that doesn't have that philosophy, I do expect it to run at
> least as fast as C#/Java code, which is Just-In-Time (JIT) compiled and
> does have the "safe" philosophy. The golang compiler version 1.7beta1 is
> not quite there yet, for the indicated reasons: inconsistent use of
> registers, using one register too many in one place in order to avoid an
> immediate load that doesn't cost any execution time, and saving one
> register by use of the "trick", which does cost execution time as
> compared to the use of a single register.
>
> However, there is hope, as version 1.7 has made great advances since
> version 1.6; surely version 1.8, which is said to be intended to improve
> this further, will be faster yet. At any rate, version 1.7 speed is
> "adequate" for many purposes, as it at least comes close (probably within
> about 15% to 20% or less) to C#/Java speed in many of the most demanding
> tight-loop algorithms, and thus is quite usable as compared to previous
> versions.
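If anyone wants to put rough numbers on the Go side of this comparison, a micro-benchmark along the following lines can be run with "go test -bench=." and the generated loop inspected with "go tool compile -S". The file layout, identifiers, and the start/step values are mine, chosen only for illustration, and it times just this loop, so the figures are not directly comparable to the whole-program perf runs above:

package sieve

import "testing"

// BenchmarkCull repeatedly runs one pass of the same word-indexed culling
// loop over a 32032-bit segment (hypothetical parameters, sketch only).
func BenchmarkCull(b *testing.B) {
    var a [32032 / 32]uint32
    var two [32]uint32
    for i := range two {
        two[i] = 1 << uint(i)
    }
    for n := 0; n < b.N; n++ {
        // arbitrary start k=7 and step q=13, for illustration only
        for k, q := uint32(7), uint32(13); k < 32032; k += q {
            a[k>>5] |= two[k&31]
        }
    }
}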
> But even the most avid golang protagonists must admit that it isn't the
> language in which to re-write Kim Walisch's "primesieve", with its extreme
> loop unrolling, which takes an average of about 1.4 CPU clock cycles per
> composite-number cull for small ranges of primes; even with array bounds
> checking turned off, golang would still take at least twice and more
> likely three times as long.
>
> That is also why I first started posting to this thread: the only reason
> the golang version of "primegen" is reasonably comparable in speed to the
> C/C++ "primegen" is that it uses multi-threading on a multi-core
> processor, which wasn't available to Daniel Bernstein when he wrote
> "primegen". My point was one should compare like with like.
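Regarding the multi-threading point above, for what it's worth: the usual Go approach is simply to farm independent sieve segments out to goroutines. A minimal sketch of that pattern follows; the identifiers are mine and this is not the actual code of the golang primegen port:

package sieve

import "sync"

// sieveSegments fans numSegments independent segments out to `workers`
// goroutines; doSegment is whatever routine culls one segment.
func sieveSegments(numSegments, workers int, doSegment func(seg int)) {
    jobs := make(chan int)
    var wg sync.WaitGroup
    for w := 0; w < workers; w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for seg := range jobs {
                doSegment(seg) // each segment is culled independently
            }
        }()
    }
    for seg := 0; seg < numSegments; seg++ {
        jobs <- seg
    }
    close(jobs)
    wg.Wait()
}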