Hi Andrew, hello world,
Now with AMD Instinct MI200 data - see below.
And a better look at the numbers. In terms of USM,
there does not seem to be a clear winner between the two
approaches. If we want to draw conclusions, more runs are
definitely needed (statistics):
The runs below show that the differences between runs
can be larger than the effect of mapping vs. USM.
In addition, the MI210 result, where OG13's USM was ~40% slower
(compared with mainline or OG13 'map') while mainline's USM
was about as fast as 'map' (OG13 or mainline), is not
consistent with the MI250X result, where both USM variants are
slower, with mainline's USM being much slower (~30%)
than OG13's (12%).
Tobias Burnus wrote:
I have now tried it on my laptop with
BabelStream, https://github.com/UoB-HPC/BabelStream
Compiling with:
echo "#pragma omp requires unified_shared_memory" > omp-usm.h
cmake -DMODEL=omp -DCMAKE_CXX_COMPILER=$HOME/projects/gcc-trunk-offload/bin/g++ \
  -DCXX_EXTRA_FLAGS="-g -include ../omp-usm.h -foffload=nvptx-none -fopenmp" \
  -DOFFLOAD=ON ..
(and the variants: without -include (→ map), with -DOFFLOAD=OFF (= host), and
with host fallback, either via env var or, for usm-14, due to lacking USM
support.)
For mainline, I get (either with libgomp.so of mainline or GCC 14, i.e. w/o USM
support):
host-14.log 195.84user 0.94system 0:11.20elapsed 1755%CPU
(0avgtext+0avgdata 1583268maxresident)k
host-mainline.log 200.16user 1.00system 0:11.89elapsed 1691%CPU
(0avgtext+0avgdata 1583272maxresident)k
hostfallback-mainline.log 288.99user 4.57system 0:19.39elapsed 1513%CPU
(0avgtext+0avgdata 1583972maxresident)k
usm-14.log 279.91user 5.38system 0:19.57elapsed 1457%CPU
(0avgtext+0avgdata 1590168maxresident)k
map-14.log 4.17user 0.45system 0:03.58elapsed 129%CPU
(0avgtext+0avgdata 1691152maxresident)k
map-mainline.log 4.15user 0.44system 0:03.58elapsed 128%CPU
(0avgtext+0avgdata 1691260maxresident)k
usm-mainline.log 3.63user 1.96system 0:03.88elapsed 144%CPU
(0avgtext+0avgdata 1692068maxresident)k
Thus: GPU is faster than host; host fallback takes 40% longer than a
host-only build.
USM is 15% faster than mapping.
Correction: I shouldn't look at user time but at elapsed time. For the
latter, USM is 8% slower on mainline; hostfallback is ~70% slower than
host execution.
With OG13, the pattern is similar, except that USM is only 3% faster.
Here, USM (elapsed) is 2.5% slower. It is hard to compare the
results because OG13 is faster for mapping and USM, which blurs the
distinction between OG13 and mainline performance and between the two
USM approaches.
host-og13.log 191.51user 0.70system 0:09.80elapsed 1960%CPU
(0avgtext+0avgdata 1583280maxresident)k
map-hostfallback-og13.log 205.12user 1.09system 0:10.82elapsed 1905%CPU
(0avgtext+0avgdata 1585092maxresident)k
usm-hostfallback-og13.log 338.82user 4.60system 0:19.34elapsed 1775%CPU
(0avgtext+0avgdata 1584580maxresident)k
map-og13.log 4.43user 0.42system 0:03.59elapsed 135%CPU
(0avgtext+0avgdata 1692692maxresident)k
usm-og13.log 4.31user 1.18system 0:03.68elapsed 149%CPU
(0avgtext+0avgdata 1686256maxresident)k
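
The elapsed-time comparisons quoted above can be rechecked with a small script. This is only a sketch: it assumes the default GNU time summary format (`M:SS.ss` before `elapsed`), and the helper names `elapsed_seconds` and `percent_slower` are made up for illustration. The timing strings are copied from the laptop logs.

```python
import re

# Matches the default GNU time summary, e.g.
# "3.63user 1.96system 0:03.88elapsed 144%CPU"
TIME_RE = re.compile(
    r"(?P<user>[\d.]+)user\s+(?P<sys>[\d.]+)system\s+"
    r"(?:(?P<h>\d+):)?(?P<m>\d+):(?P<s>[\d.]+)elapsed"
)

def elapsed_seconds(line):
    """Return wall-clock seconds from one GNU time summary line."""
    match = TIME_RE.search(line)
    if match is None:
        raise ValueError(f"not a GNU time line: {line!r}")
    hours = int(match.group("h") or 0)
    return hours * 3600 + int(match.group("m")) * 60 + float(match.group("s"))

def percent_slower(candidate, baseline):
    """Positive result: candidate took longer than baseline."""
    return (candidate / baseline - 1.0) * 100.0

# Laptop (nvptx) timings copied from the logs.
logs = {
    "host-mainline": "200.16user 1.00system 0:11.89elapsed 1691%CPU",
    "hostfallback-mainline": "288.99user 4.57system 0:19.39elapsed 1513%CPU",
    "map-mainline": "4.15user 0.44system 0:03.58elapsed 128%CPU",
    "usm-mainline": "3.63user 1.96system 0:03.88elapsed 144%CPU",
    "map-og13": "4.43user 0.42system 0:03.59elapsed 135%CPU",
    "usm-og13": "4.31user 1.18system 0:03.68elapsed 149%CPU",
}
elapsed = {name: elapsed_seconds(line) for name, line in logs.items()}

# Each pair is (candidate, baseline).
pairs = [
    ("usm-mainline", "map-mainline"),
    ("usm-og13", "map-og13"),
    ("hostfallback-mainline", "host-mainline"),
]
for candidate, baseline in pairs:
    delta = percent_slower(elapsed[candidate], elapsed[baseline])
    print(f"{candidate} vs {baseline}: {delta:+.1f}% elapsed")
```

A positive value means the first entry of the pair took longer. With single runs per configuration, differences of a few percent should not be over-interpreted.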
* * *
As IT issues are now solved:
(A) On AMD Instinct MI210 (gfx90a)
The host fallback is very slow here, with an elapsed time of 24s vs. 1.6s for
host execution.
map and USM seem to be in the same ballpark.
For two 'map' runs, I see a difference of 8%; the USM times are between those
map results.
I see similar results for OG13 as for mainline, except for OG13's USM, which
is ~40% slower (elapsed time)
than map (OG13 or mainline) or mainline's USM.
host-mainline-2.log 194.00user 7.21system 0:01.44elapsed 13954%CPU
(0avgtext+0avgdata 1320960maxresident)k
host-mainline.log 221.53user 5.58system 0:01.78elapsed 12716%CPU
(0avgtext+0avgdata 1318912maxresident)k
hostfallback-mainline-1.log 3073.35user 146.22system 0:24.25elapsed
13272%CPU (0avgtext+0avgdata 1644544maxresident)k
hostfallback-mainline-2.log 2268.62user 146.13system 0:23.39elapsed
10320%CPU (0avgtext+0avgdata 1650544maxresident)k
map-mainline-1.log 5.38user 16.16system 0:03.00elapsed 716%CPU
(0avgtext+0avgdata 1714936maxresident)k
map-mainline-2.log 5.12user 15.93system 0:02.74elapsed 768%CPU
(0avgtext+0avgdata 1714932maxresident)k
usm-mainline-1.log 7.61user 2.30system 0:02.89elapsed 342%CPU
(0avgtext+0avgdata 1716984maxresident)k
usm-mainline-2.log 7.75user 2.92system 0:02.89elapsed 369%CPU
(0avgtext+0avgdata 1716980maxresident)k
host-og13-1.log 213.69user 6.37system 0:01.56elapsed 14026%CPU
(0avgtext+0avgdata 1316864maxresident)k
hostfallback-map-og13-1.log 3026.68user 123.77system 0:23.69elapsed
13295%CPU (0avgtext+0avgdata 1642496maxresident)k
hostfallback-map-og13-2.log 3118.71user 123.81system 0:24.49elapsed
13235%CPU (0avgtext+0avgdata 1628160maxresident)k
hostfallback-usm-og13-1.log 3070.33user 116.23system 0:23.86elapsed
13354%CPU (0avgtext+0avgdata 1648632maxresident)k
hostfallback-usm-og13-2.log 3112.34user 125.54system 0:24.39elapsed
13273%CPU (0avgtext+0avgdata 1622012maxresident)k
map-og13-1.log 5.61user 7.13system 0:02.69elapsed 472%CPU
(0avgtext+0avgdata 1716984maxresident)k
map-og13-2.log 5.39user 16.25system 0:02.83elapsed 764%CPU
(0avgtext+0avgdata 1716984maxresident)k
usm-og13-1.log 7.23user 3.13system 0:04.37elapsed 237%CPU
(0avgtext+0avgdata 1716964maxresident)k
usm-og13-2.log 7.31user 3.15system 0:03.98elapsed 262%CPU
(0avgtext+0avgdata 1716964maxresident)k
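
To judge whether a map vs. USM difference exceeds the run-to-run noise, the two MI210 runs per configuration can be compared as follows. This is a crude sketch, not real statistics: two samples per configuration are far too few, and the helper names and the "ranges do not overlap" criterion are assumptions chosen for illustration. The elapsed values are copied from the MI210 logs.

```python
from statistics import mean

# Elapsed seconds on MI210, two runs per configuration, copied from the logs.
runs = {
    "map-mainline": [3.00, 2.74],
    "usm-mainline": [2.89, 2.89],
    "map-og13": [2.69, 2.83],
    "usm-og13": [4.37, 3.98],
}

def spread_percent(samples):
    """Run-to-run spread: (max - min) relative to the mean, in percent."""
    return (max(samples) - min(samples)) / mean(samples) * 100.0

def effect_percent(candidate, baseline):
    """Difference of the means relative to the baseline mean, in percent."""
    return (mean(candidate) / mean(baseline) - 1.0) * 100.0

def is_distinguishable(candidate, baseline):
    """Crude check: the two sample ranges do not overlap."""
    return min(candidate) > max(baseline) or max(candidate) < min(baseline)

for variant in ("mainline", "og13"):
    map_runs = runs[f"map-{variant}"]
    usm_runs = runs[f"usm-{variant}"]
    print(
        f"{variant}: map spread {spread_percent(map_runs):.1f}%, "
        f"usm spread {spread_percent(usm_runs):.1f}%, "
        f"usm vs map {effect_percent(usm_runs, map_runs):+.1f}%, "
        f"distinguishable: {is_distinguishable(usm_runs, map_runs)}"
    )
```

With more runs per configuration, the range check could be replaced by a proper comparison of means and standard deviations.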
* * *
Running it on MI250X:
USM is in the same ballpark as map, but here USM is actually 30% (mainline)
or 12% (OG13) slower than map.
omp-stream-mainline-map
7.24user 0.71system 0:01.18elapsed 672%CPU (0avgtext+0avgdata
1728852maxresident)k
omp-stream-mainline-usm
2.48user 1.07system 0:01.44elapsed 247%CPU (0avgtext+0avgdata
1728916maxresident)k
omp-stream-og13-map
7.14user 0.72system 0:01.10elapsed 712%CPU (0avgtext+0avgdata
1728708maxresident)k
omp-stream-og13-usm
2.32user 0.91system 0:01.23elapsed 262%CPU (0avgtext+0avgdata
1991180maxresident)k
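
Since single runs are not enough to draw conclusions, repeated runs could be collected with something like the following. It is a minimal sketch: the `time_runs` and `summarize` helpers are made up for illustration, and the command is a stand-in that only starts the Python interpreter, so it has to be replaced by the benchmark binary under test.

```python
import statistics
import subprocess
import sys
import time

def time_runs(command, repeats):
    """Run `command` `repeats` times; return the wall-clock seconds of each run."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(command, check=True, stdout=subprocess.DEVNULL)
        samples.append(time.perf_counter() - start)
    return samples

def summarize(samples):
    """Return (mean, sample standard deviation) of the timings."""
    return statistics.mean(samples), statistics.stdev(samples)

# Stand-in command so the sketch runs anywhere; replace it with the
# benchmark binary under test (hypothetical path: ["./omp-stream"]).
command = [sys.executable, "-c", "pass"]
samples = time_runs(command, repeats=5)
avg, dev = summarize(samples)
print(f"{len(samples)} runs: mean {avg:.3f}s, stdev {dev:.3f}s")
```

Comparing the means of two configurations against their standard deviations gives a first indication of whether a map vs. USM difference is more than noise.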
Tobias