
Re: The two GPU stats.  I believe square running on an APU will only have one 
kernel since it does not need to DMA. The stats sections are most likely (1) 
system boot until the end of the first and only kernel, due to the path Matt 
pointed out, and (2) the stat dump gem5 does at exit. The second section only 
captures the application between the last kernel ending and gem5 exiting.  
Usually this is just application cleanup / teardown and maybe a verification 
step, all of which run on the CPU, so the GPU stats would be zeros.
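
One rough way to check which stats section is which is to split stats.txt on 
its dump delimiters and see whether any CU-related stat is nonzero in each 
section. The sketch below assumes the usual "Begin/End Simulation Statistics" 
delimiter lines; the stat names in the sample text and the "CUs" substring are 
hypothetical placeholders, so substitute whatever names appear in your own 
stats.txt.

```python
import math

BEGIN = "Begin Simulation Statistics"
END = "End Simulation Statistics"


def split_stat_dumps(text):
    """Return one {stat_name: value_string} dict per dump section."""
    dumps, current = [], None
    for line in text.splitlines():
        if BEGIN in line:
            current = {}
        elif END in line:
            if current is not None:
                dumps.append(current)
            current = None
        elif current is not None:
            # Drop the trailing "# description" and keep name + first value.
            fields = line.split("#", 1)[0].split()
            if len(fields) >= 2:
                current[fields[0]] = fields[1]
    return dumps


def has_activity(dump, needle):
    """True if any stat whose name contains `needle` is a nonzero number."""
    for name, value in dump.items():
        if needle not in name:
            continue
        try:
            number = float(value)
        except ValueError:
            continue  # non-numeric entry, skip it
        if not math.isnan(number) and number != 0.0:
            return True
    return False


# Hypothetical sample; real stat names will differ.
sample = """
---------- Begin Simulation Statistics ----------
simTicks                              1000   # ticks simulated
system.cpu3.CUs0.exampleInstCount       42   # hypothetical CU stat
---------- End Simulation Statistics   ----------

---------- Begin Simulation Statistics ----------
simTicks                               200   # ticks simulated
system.cpu3.CUs0.exampleInstCount        0   # hypothetical CU stat
---------- End Simulation Statistics   ----------
"""

dumps = split_stat_dumps(sample)
for index, dump in enumerate(dumps):
    print(index, "CU activity" if has_activity(dump, "CUs") else "no CU activity")
```

To use it on a real run, read the file with open("m5out/stats.txt").read() 
(assuming the default output directory) and pass the result to 
split_stat_dumps.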


-Matt

From: Matt Sinclair <mattdsinclair.w...@gmail.com>
Sent: Saturday, October 5, 2024 5:58 PM
To: The gem5 Users mailing list <gem5-users@gem5.org>
Cc: Beser, Nicholas D. <nick.be...@jhuapl.edu>; Poremba, Matthew 
<matthew.pore...@amd.com>
Subject: Re: [gem5-users] Question about running GPU emulation in gem5


Hi Nicholas,

Really glad to hear these GPU tests are useful for your class!  I am not in 
front of a terminal, so I can't confirm every single thing, but here is what I 
think is happening:

- You mention there are 2 sets of stats.  This is potentially because a recent 
commit (https://github.com/gem5/gem5/pull/1217) added support for each GPU 
kernel to dump and reset the stats.

- Why are there 2 sets of stats if only 1 kernel seems to be launched?  Well, 
GPUs have special kernels that are not visible to users that do things like DMA 
operations (e.g., hipMemcpy's), copying kernel code, etc.  This is what is 
happening in your case.  Specifically, there is a Blit/SDMA kernel happening 
(probably doing a DMA operation).  The above commit made it so these special 
kernels keep stats separately because otherwise the hits and misses would not 
line up for the "real" kernels (e.g., DMA operations would affect the L2, but 
would not have any activity on the CUs).

- However, you mentioned that the second set of stats was the "empty" one (with 
no activity on the CUs).  This is slightly surprising, as I would have expected 
the first one to be empty (e.g., because it was copying the kernel to the GPU). 
 But perhaps in your case the second set of stats is for a hipMemcpy after the 
"real" kernel... or it's just the CPU portion of the program after the kernel 
completes (e.g., 
https://github.com/gem5/gem5-resources/blob/stable/src/gpu/square/square.cpp#L79).
  Based on the context you provided, the latter sounds more likely.  In any 
event, the stats for the phase without CU activity can effectively be ignored 
if you only want to look at the GPU phase.  You could also consider putting in 
m5_work_begin and m5_work_end markers in the code to help ensure stats from 
outside the ROI are not included (e.g., 
https://github.com/gem5/gem5-resources/blob/stable/src/gpu/pannotia/color/coloring_maxmin.cpp#L183).
  Also, to verify how many GPU phases are actually happening, you could run 
with the GPUKernelInfo debug flag, which prints only when a new GPU kernel 
launches (either a "real" GPU kernel or a Blit/SDMA kernel).  If there is a 
Blit/SDMA kernel, there should be (at least) 2 kernels launched.

- Finally, in terms of the input size and baseline GPU configuration.  You are 
right that the baseline GPU configuration is not a particularly large GPU.  My 
group has an artifact we're releasing with an upcoming MICRO paper that models 
something more substantive that I can point you to, but in the meantime let me 
explain what is happening.  Increasing the number of threads by itself is not 
going to increase the amount of work being done in square.  Instead, the GPU 
kernel's work depends on the size of the input array 
(https://github.com/gem5/gem5-resources/blob/stable/src/gpu/square/square.cpp#L45),
 which gets set here 
(https://github.com/gem5/gem5-resources/blob/stable/src/gpu/square/square.cpp#L54).
  Increasing the number of threads without increasing the work the kernel is 
doing will just result in threads that almost immediately exit the GPU kernel 
because they are attempting to access indices out of bounds, which the above 
loop I linked ignores.  Also, square in its default state is really sized for 
running very quick validation tests in gem5's daily regression.  So, instead, 
you'd need to change line 54 to increase N (and then increase the number of 
work groups) if you want to make square run something larger.  Probably we 
could update the code that determines work groups 
(https://github.com/gem5/gem5-resources/blob/stable/src/gpu/square/square.cpp#L72)
 to be more directly configured with input size too.

- One last thing you may consider: when I have run the GPU portion of the gem5 
tutorials and bootcamps in the past, I've used other architectural features 
such as register allocation to demonstrate performance impact for applications 
with very short runtimes.  For example, you may consider the example here 
(https://youtu.be/1a9Yj-QaQoo?t=5388) from the 2022 bootcamp with square, which 
I subsequently updated to run with MFMA (AMD's equivalent to TensorCore) 
operations in the 2024 bootcamp 
(https://github.com/gem5bootcamp/2024/blob/main/slides/04-GPU-model/gpu-slides.pdf,
 slide 58).
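
To make the point about threads versus input size concrete, here is a small 
Python model of the work split. It assumes the kernel uses a grid-stride loop 
(each thread starts at its global index and steps by the total thread count), 
which is how I read the loop linked above. The sizes are hypothetical and are 
not the values in square.cpp.

```python
def elements_per_thread(n, total_threads):
    """Elements each thread handles under a grid-stride loop:
    thread t touches indices t, t + total_threads, ... while below n."""
    return [len(range(t, n, total_threads)) for t in range(total_threads)]


def summarize(n, total_threads):
    """Total work, idle thread count, and the busiest thread's load."""
    work = elements_per_thread(n, total_threads)
    return {
        "total_elements": sum(work),
        "idle_threads": sum(1 for w in work if w == 0),
        "max_per_thread": max(work),
    }


# Hypothetical input size, chosen only for illustration.
N = 1024
for threads in (256, 1024, 4096):
    print(threads, summarize(N, threads))
```

In this model the total number of elements squared stays at N no matter how 
many threads are launched. Once the thread count exceeds N, the extra threads 
find their starting index already out of bounds and exit without doing any 
work, so only a larger N gives the additional CUs something to do.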

Hope this helps,
Matt

On Sat, Oct 5, 2024 at 10:50 AM Beser, Nicholas D. via gem5-users 
<gem5-users@gem5.org> wrote:
I am teaching an advanced computer architecture class and had the class run the 
GPU example that was run in the 2024 bootcamp:

docker run --volume $(pwd):$(pwd) -w $(pwd) ghcr.io/gem5/gcn-gpu:v24-0 \
    gem5/build/VEGA_X86/gem5.opt gem5/configs/example/apu_se.py -n 3 \
    --gfx-version=gfx902 -c gem5-resources/src/gpu/square/bin/square

The example ran, however the stats.txt file had two Simulation statistics runs. 
The second did not appear to have any activity on the CUs. Can someone tell me 
why the simulation had two runs? We are using the first run as the GPU 
simulation statistics.

We also ran the simulation while varying the number of CUs. We did not see 
much change in performance. I thought it was due to the benchmark that was run. 
One of my students modified the benchmark to use more threads, but we did not 
see much change. My thought was that this was again due to the benchmark: it 
did not stress the 4 CUs, and raising the CU count did not change that.

Nick


_______________________________________________
gem5-users mailing list -- gem5-users@gem5.org
To unsubscribe send an email to gem5-users-le...@gem5.org