Hi Nicholas,

Really glad to hear these GPU tests are useful for your class! I am not in front of a terminal, so I can't confirm every single thing, but here is what I think is happening:
- You mention there are 2 sets of stats. This is potentially because a recent commit (https://github.com/gem5/gem5/pull/1217) added support for dumping and resetting the stats for each GPU kernel.

- Why are there 2 sets of stats if only 1 kernel seems to be launched? GPUs have special kernels that are not visible to users and that do things like DMA operations (e.g., hipMemcpy calls), copying kernel code, etc. That is what is happening in your case: a Blit/SDMA kernel is running (probably doing a DMA operation). The above commit made these special kernels keep their stats separately, because otherwise the hits and misses would not line up for the "real" kernels (e.g., DMA operations would affect the L2 but would not cause any activity on the CUs).

- However, you mentioned that the second set of stats was the "empty" one (with no activity on the CUs). This is slightly surprising, as I would have expected the first one to be empty (e.g., because it was copying the kernel to the GPU). Perhaps in your case the second set of stats is for a hipMemcpy after the "real" kernel, or it is just the CPU portion of the program after the kernel completes (e.g., https://github.com/gem5/gem5-resources/blob/stable/src/gpu/square/square.cpp#L79). Based on the context you provided, the latter sounds more likely. In any event, the stats for the phase without CU activity can effectively be ignored if you only want to look at the GPU phase. You could also consider putting m5_work_begin and m5_work_end markers in the code to help ensure stats from outside the ROI are not included (e.g., https://github.com/gem5/gem5-resources/blob/stable/src/gpu/pannotia/color/coloring_maxmin.cpp#L183); see the sketch after this list. Also, to verify how many GPU phases are actually happening, you could run with the GPUKernelInfo debug flag -- it prints a line for each kernel that launches, whether a "real" GPU kernel or a Blit/SDMA kernel (see the example command after this list). If a Blit/SDMA kernel is involved, you should see (at least) 2 kernels launched.

- Finally, regarding the input size and the baseline GPU configuration: you are right that the baseline GPU configuration is not a particularly large GPU. My group has an artifact we're releasing with an upcoming MICRO paper that models something more substantive, which I can point you to, but in the meantime let me explain what is happening. Increasing the number of threads by itself is not going to increase the amount of work done in square. Instead, the GPU kernel's work depends on the size of the input array (https://github.com/gem5/gem5-resources/blob/stable/src/gpu/square/square.cpp#L45), which gets set here (https://github.com/gem5/gem5-resources/blob/stable/src/gpu/square/square.cpp#L54). Increasing the number of threads without increasing the kernel's work just results in threads that almost immediately exit the GPU kernel, because they would access out-of-bounds indices and the loop I linked above skips them. Also, square in its default state is really sized for running very quick validation tests in gem5's daily regression. So instead you'd need to change line 54 to increase N (and then increase the number of work groups) if you want square to run something larger; see the sketch after this list. We could probably also update the code that determines the work groups (https://github.com/gem5/gem5-resources/blob/stable/src/gpu/square/square.cpp#L72) to be configured more directly from the input size.

- One last thing you may consider: when I have run the GPU portion of the gem5 tutorials and bootcamps in the past, I've used other architectural features, such as register allocation, to demonstrate performance impact for applications with very short runtimes. For example, see the example here (https://youtu.be/1a9Yj-QaQoo?t=5388) from the 2022 bootcamp with square, which I subsequently updated to run with MFMA (AMD's equivalent to Tensor Cores) operations in the 2024 bootcamp (https://github.com/gem5bootcamp/2024/blob/main/slides/04-GPU-model/gpu-slides.pdf, slide 58).
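To make the ROI-marker suggestion concrete, here is a minimal sketch of what it could look like around the kernel launch in square.cpp. I'm writing this from memory, so treat the names (vector_square, A_d, C_d, N, blocks, threadsPerBlock, and the launch_square_roi helper itself) as illustrative rather than the exact code in the file, and it assumes you build and link gem5's m5ops library (include/gem5/m5ops.h plus the x86 libm5 from util/m5) -- cross-check against the pannotia example I linked above:

    #include <hip/hip_runtime.h>
    #include <gem5/m5ops.h>   // gem5's magic-instruction hooks; link with libm5

    // Kernel from the square example (declaration only here).
    __global__ void vector_square(float *C_d, const float *A_d, size_t N);

    // Hypothetical helper: wrap just the GPU kernel in an ROI so that stats
    // from the CPU-only phases can be excluded.
    void launch_square_roi(float *C_d, const float *A_d, size_t N,
                           unsigned blocks, unsigned threadsPerBlock)
    {
        m5_work_begin(0, 0);   // begin ROI (work item 0, thread 0)

        hipLaunchKernelGGL(vector_square, dim3(blocks), dim3(threadsPerBlock),
                           0, 0, C_d, A_d, N);
        hipDeviceSynchronize();   // ensure the kernel finished before closing the ROI

        m5_work_end(0, 0);     // end ROI
    }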
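For the debug flag, you can reuse the exact command you already ran and just add --debug-flags before the config script, e.g.:

    docker run --volume $(pwd):$(pwd) -w $(pwd) ghcr.io/gem5/gcn-gpu:v24-0 \
        gem5/build/VEGA_X86/gem5.opt --debug-flags=GPUKernelInfo \
        gem5/configs/example/apu_se.py -n 3 --gfx-version=gfx902 \
        -c gem5-resources/src/gpu/square/bin/square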
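And here is a rough sketch of the sizing change I mean in square.cpp. Again, the exact constants and variable names in the current file may differ from what I remember, so treat the specific values below as assumptions:

    // Kernel (paraphrased from square.cpp): a grid-stride loop over N elements.
    // Threads whose starting offset is already >= N fall straight through the
    // loop and exit, which is why adding threads alone does not add work.
    __global__ void vector_square(float *C_d, const float *A_d, size_t N)
    {
        size_t offset = blockIdx.x * blockDim.x + threadIdx.x;
        size_t stride = blockDim.x * gridDim.x;
        for (size_t i = offset; i < N; i += stride) {
            C_d[i] = A_d[i] * A_d[i];
        }
    }

    // In main(): grow N (the value hard-coded around line 54) and derive the
    // number of work groups from it instead of using a fixed block count.
    size_t N = 64 * 1024 * 1024;   // illustrative: much larger than the default
    const unsigned threadsPerBlock = 256;
    const unsigned blocks = (N + threadsPerBlock - 1) / threadsPerBlock;

With the grid size tied to N like this, scaling the input up should actually add work per CU, which should make your CU-count sweep more meaningful.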
Hope this helps,
Matt

On Sat, Oct 5, 2024 at 10:50 AM Beser, Nicholas D. via gem5-users <gem5-users@gem5.org> wrote:

> I am teaching an advanced computer architecture class and had the class
> run the GPU example that was run in the 2024 bootcamp:
>
> docker run --volume $(pwd):$(pwd) -w $(pwd) ghcr.io/gem5/gcn-gpu:v24-0
> gem5/build/VEGA_X86/gem5.opt gem5/configs/example/apu_se.py -n 3
> --gfx-version=gfx902 -c gem5-resources/src/gpu/square/bin/square
>
> The example ran; however, the stats.txt file had two sets of simulation
> statistics. The second did not appear to have any activity on the CUs.
> Can someone tell me why the simulation had two runs? We are using the
> first run as the GPU simulation statistics.
>
> We also ran the simulation while varying the number of CUs. We did not
> see much change in performance. I thought it was due to the benchmark
> that was run. One of my students modified the benchmark to use more
> threads, but we did not see much change. My thoughts were that this was
> again due to the benchmark: the resources required were not stressed by
> the 4 CUs, and changing the number to a larger one also did not stress
> the CUs.
>
> Nick