On 18.08.2015 at 21:06, Karel Gardas wrote:
Thanks a lot for doing this. It looks like g++ is memory-bound in this
case, isn't it? What does the stream[1] benchmark report on the host and
emulated, as a 32/64-bit sparc binary? Let's see if the ratio is
similar to the times you get...
[1]: https://www.cs.virginia.edu/stream/
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
==>host Ubuntu 15.04 x64
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 14147 microseconds.
(= 14147 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            8877.1     0.018049     0.018024     0.018074
Scale:           8842.7     0.018206     0.018094     0.018749
Add:            10312.9     0.023367     0.023272     0.023901
Triad:          10114.3     0.023758     0.023729     0.023871
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
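
For anyone wanting to cross-check the table, here is a minimal sketch of how the "Best Rate" column follows from the "Min time" column. It assumes STREAM's usual byte counting (Copy and Scale touch two arrays per element, Add and Triad touch three) and uses the array size and element size printed above. The names `ARRAYS_TOUCHED` and `best_rate_mb_s` are made up for this illustration and are not part of STREAM.

```python
# Sketch only: recompute STREAM's "Best Rate MB/s" from the reported "Min time".
# Assumption: Copy/Scale move 2 arrays per element, Add/Triad move 3.
BYTES_PER_ELEMENT = 8       # "This system uses 8 bytes per array element."
ARRAY_SIZE = 10_000_000     # "Array size = 10000000 (elements)"
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}


def best_rate_mb_s(kernel, min_time_s):
    """Bandwidth in MB/s (1 MB = 10**6 bytes) for one kernel."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * BYTES_PER_ELEMENT * ARRAY_SIZE
    return 1.0e-6 * bytes_moved / min_time_s


# Min times copied from the host run above.
host_min_times = {
    "Copy": 0.018024,
    "Scale": 0.018094,
    "Add": 0.023272,
    "Triad": 0.023729,
}

for kernel, t in host_min_times.items():
    print(f"{kernel + ':':<8}{best_rate_mb_s(kernel, t):10.1f} MB/s")
```

The results should land within a few tenths of a MB/s of the printed column; small differences are expected because the min times are shown rounded to six decimals.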
qemu 2.4.50 x64 build
==>netbsd-guest NetBSD 6.1.5 SPARC64 (pure 64bit) running from ramdisk
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 42 microseconds.
Each test below will take on the order of 330428 microseconds.
(= 7867 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:             771.5     0.214717     0.207377     0.244214
Scale:            288.1     0.573320     0.555401     0.660161
Add:              423.5     0.633523     0.566661     1.092067
Triad:            242.9     1.053032     0.987970     1.499563
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
==>debian-guest 7.8.0 SPARC64 (mixed 32/64bit) running from ramdisk
!!32bit version!!
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 41 microseconds.
Each test below will take on the order of 394519 microseconds.
(= 9622 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:             629.4     0.280860     0.254224     0.401105
Scale:            231.7     0.733338     0.690452     0.868741
Add:              346.9     0.747893     0.691890     0.889102
Triad:            201.4     1.239293     1.191786     1.394918
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
!!64bit version!!
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 40 microseconds.
Each test below will take on the order of 395364 microseconds.
(= 9884 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:             651.3     0.251320     0.245668     0.274346
Scale:            240.3     0.694808     0.665834     0.770982
Add:              353.0     0.690291     0.679792     0.715228
Triad:            201.5     1.207881     1.191054     1.256001
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
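
Since the question was about the host/guest ratio, here is a small sketch that divides the host "Best Rate" by each guest's "Best Rate", using only the numbers printed in the four runs above. The dictionary layout, the run labels, and the helper name `slowdown` are just choices for this illustration.

```python
# Sketch only: host-to-guest bandwidth ratio per STREAM kernel.
# All figures are the "Best Rate MB/s" values from the runs in this mail.
host = {"Copy": 8877.1, "Scale": 8842.7, "Add": 10312.9, "Triad": 10114.3}

guests = {
    "NetBSD 6.1.5 (64bit)": {"Copy": 771.5, "Scale": 288.1, "Add": 423.5, "Triad": 242.9},
    "Debian 7.8.0 (32bit)": {"Copy": 629.4, "Scale": 231.7, "Add": 346.9, "Triad": 201.4},
    "Debian 7.8.0 (64bit)": {"Copy": 651.3, "Scale": 240.3, "Add": 353.0, "Triad": 201.5},
}


def slowdown(host_rates, guest_rates):
    """Return host rate divided by guest rate for every kernel."""
    return {kernel: host_rates[kernel] / guest_rates[kernel] for kernel in host_rates}


for name, rates in guests.items():
    ratios = slowdown(host, rates)
    line = "  ".join(f"{kernel} {ratio:5.1f}x" for kernel, ratio in ratios.items())
    print(f"{name:<22}{line}")
```

Copy comes out roughly an order of magnitude slower than on the host, while the kernels that do floating-point arithmetic (Scale, Add, Triad) fall much further behind, so the slowdown is not uniform across kernels.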