On Sun, 31 Jul 2016, Slawa Olhovchenkov wrote:

On Sun, Jul 31, 2016 at 11:11:25PM +1000, Bruce Evans wrote:

Misalignment of this loop made it almost twice as slow on old Turion2 with
slow DDR2 memory.  It made no difference on Haswell.  I added an extra
movnti, but that makes little or no difference.  2 more movnti's wouldn't
fit in a 16-byte cache line so are slower unless even more care is taken
with alignment (or with less care, 4 with misalignment are not less than
twice as slow as 1 with alignment).
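
For readers without the loop in front of them, here is a rough C sketch of
this kind of loop: two 4-byte non-temporal stores per iteration.  It is not
the committed code.  zero_nt() is a hypothetical name, the SSE2 intrinsic
_mm_stream_si32() stands in for a hand-written movnti, and the alignment of
the loop's own instructions (the point of the paragraph above) cannot be
expressed in C, so it is left to the compiler here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>	/* _mm_stream_si32(), _mm_sfence() */
#endif

/*
 * Hypothetical helper: zero len bytes at dst using 4-byte non-temporal
 * stores, two per loop iteration.  Falls back to ordinary stores when
 * SSE2 is not available at compile time.
 */
void
zero_nt(void *dst, size_t len)
{
	int32_t *p = dst;
	int32_t *end = p + len / sizeof(*p);

	/* The sketch only handles 4-byte-aligned buffers in 8-byte units. */
	assert((uintptr_t)dst % sizeof(*p) == 0);
	assert(len % (2 * sizeof(*p)) == 0);

	while (p < end) {
#if defined(__SSE2__)
		_mm_stream_si32(&p[0], 0);	/* compiles to movnti */
		_mm_stream_si32(&p[1], 0);	/* the "extra" store */
#else
		p[0] = 0;
		p[1] = 0;
#endif
		p += 2;
	}
#if defined(__SSE2__)
	_mm_sfence();	/* order the non-temporal stores before later accesses */
#endif
}
```

Calling zero_nt(page, 4096) on a 4-byte-aligned buffer zeroes one 4K page.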

I thought that alignment and unrolling didn't matter here, because movnti
has to wait for memory and almost any loop runs fast enough to keep up.
The timing on my old system is something like: CPUs at 2 GHz; main memory
at 4 GB/sec; movnti is only 4 bytes wide on i386 (so this problem
only affects i386, at least with slow memory).  So sustaining 4 GB/sec
requires 1 G movnti's/sec, so the loop needs to run at 2 cycles/iteration
to keep up.  But when it is misaligned, it runs at 3-4 cycles/iteration.
Alignment makes it take about 2, and the extra movnti is for safety and
to work with faster memory.
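
The arithmetic above can be written out as a small C sketch.  The function
names are made up for illustration, and it assumes 1 GB = 10^9 bytes and a
loop that does nothing but the stores.

```c
#include <assert.h>

/*
 * Cycles the loop may spend per iteration and still keep memory busy:
 * cpu_hz * bytes_per_iter / mem_bytes_per_sec.
 */
double
cycles_per_iter_budget(double cpu_hz, double mem_bytes_per_sec,
    double bytes_per_iter)
{
	assert(cpu_hz > 0 && mem_bytes_per_sec > 0 && bytes_per_iter > 0);
	return (cpu_hz * bytes_per_iter / mem_bytes_per_sec);
}

/*
 * Upper bound on store bandwidth for a loop that takes cycles_per_iter
 * cycles and writes bytes_per_iter bytes per iteration.
 */
double
loop_bytes_per_sec(double cpu_hz, double cycles_per_iter,
    double bytes_per_iter)
{
	assert(cpu_hz > 0 && cycles_per_iter > 0 && bytes_per_iter > 0);
	return (cpu_hz / cycles_per_iter * bytes_per_iter);
}
```

With one 4-byte movnti per iteration, cycles_per_iter_budget(2e9, 4e9, 4)
gives the 2 cycles/iteration budget for the 2 GHz system, and a second
movnti per iteration doubles the budget to 4.  loop_bytes_per_sec(4e9, 2, 4)
and loop_bytes_per_sec(4e9, 2, 8) give 8 GB/sec and 16 GB/sec for a 4 GHz
CPU at 2 cycles/iteration.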

On Haswell with CPUs at 4 GHz, 2 cycles/iteration gives 8 GB/sec on
i386 and 16 GB/sec on amd64 with wider movnti.  IIRC, 16 GB/sec is about
the main memory speed, so nothing better is possible, but just 1 extra
movnti gives more with faster memory.  This is just worse than bzero()

What about a modern system with 120 GB/sec main memory speed?

Is there such a system?  It would have main memory almost twice as fast
as Haswell L2 and almost half as fast as Haswell L1.

My fastest memory actually does 20001 MB/s according to old memtest
and that is about right according to other tests.

Bruce
_______________________________________________
svn-src-all@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/svn-src-all