Hi Nicholas,

Really glad to hear these GPU tests are useful for your class! I am not in front of a terminal, so I can't confirm every single thing, but here is what I think is happening:
- You mention there are 2 sets of stats. This is potentially because a recent commit (https://github.com/gem5/gem5/pull/1217) added support for dumping and resetting the stats for each GPU kernel.

- Why are there 2 sets of stats if only 1 kernel seems to be launched? GPUs have special kernels that are not visible to users and that do things like DMA operations (e.g., hipMemcpy calls), copying kernel code, etc. That is what is happening in your case: a Blit/SDMA kernel is running (probably doing a DMA operation). The above commit made these special kernels keep their stats separately, because otherwise the hits and misses would not line up for the "real" kernels (e.g., DMA operations would affect the L2 but would not cause any activity on the CUs).

- However, you mentioned that the second set of stats was the "empty" one (with no activity on the CUs). This is slightly surprising, as I would have expected the first one to be empty (e.g., because it was copying the kernel to the GPU). Perhaps in your case the second set of stats is for a hipMemcpy after the "real" kernel, or it is just the CPU portion of the program after the kernel completes (e.g., https://github.com/gem5/gem5-resources/blob/stable/src/gpu/square/square.cpp#L79). Based on the context you provided, the latter sounds more likely. In any event, the stats for the phase without CU activity can effectively be ignored if you only want to look at the GPU phase. You could also consider putting m5_work_begin and m5_work_end markers in the code to help ensure stats from outside the ROI are not included (e.g., https://github.com/gem5/gem5-resources/blob/stable/src/gpu/pannotia/color/coloring_maxmin.cpp#L183); see the sketch after this list. Also, to verify how many GPU phases are actually happening, you could run with the GPUKernelInfo debug flag -- it prints a line for each kernel that launches, whether a "real" GPU kernel or a Blit/SDMA kernel (see the example command after this list). If a Blit/SDMA kernel is involved, you should see (at least) 2 kernels launched.

- Finally, regarding the input size and the baseline GPU configuration: you are right that the baseline GPU configuration is not a particularly large GPU. My group has an artifact we're releasing with an upcoming MICRO paper that models something more substantive, which I can point you to, but in the meantime let me explain what is happening. Increasing the number of threads by itself is not going to increase the amount of work done in square. Instead, the GPU kernel's work depends on the size of the input array (https://github.com/gem5/gem5-resources/blob/stable/src/gpu/square/square.cpp#L45), which gets set here (https://github.com/gem5/gem5-resources/blob/stable/src/gpu/square/square.cpp#L54). Increasing the number of threads without increasing the kernel's work just results in threads that almost immediately exit the GPU kernel, because they would access out-of-bounds indices and the loop I linked above skips them. Also, square in its default state is really sized for running very quick validation tests in gem5's daily regression. So instead you'd need to change line 54 to increase N (and then increase the number of work groups) if you want square to run something larger; see the sketch after this list. We could probably also update the code that determines the work groups (https://github.com/gem5/gem5-resources/blob/stable/src/gpu/square/square.cpp#L72) to be configured more directly from the input size.

- One last thing you may consider: when I have run the GPU portion of the gem5 tutorials and bootcamps in the past, I've used other architectural features, such as register allocation, to demonstrate performance impact for applications with very short runtimes. For example, see the example here (https://youtu.be/1a9Yj-QaQoo?t=5388) from the 2022 bootcamp with square, which I subsequently updated to run with MFMA (AMD's equivalent to Tensor Cores) operations in the 2024 bootcamp (https://github.com/gem5bootcamp/2024/blob/main/slides/04-GPU-model/gpu-slides.pdf, slide 58).
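To make the ROI-marker suggestion concrete, here is a minimal sketch of what it could look like around the kernel launch in square.cpp. I'm writing this from memory, so treat the names (vector_square, A_d, C_d, N, blocks, threadsPerBlock, and the launch_square_roi helper itself) as illustrative rather than the exact code in the file, and it assumes you build and link gem5's m5ops library (include/gem5/m5ops.h plus the x86 libm5 from util/m5) -- cross-check against the pannotia example I linked above:

    #include <hip/hip_runtime.h>
    #include <gem5/m5ops.h>   // gem5's magic-instruction hooks; link with libm5

    // Kernel from the square example (declaration only here).
    __global__ void vector_square(float *C_d, const float *A_d, size_t N);

    // Hypothetical helper: wrap just the GPU kernel in an ROI so that stats
    // from the CPU-only phases can be excluded.
    void launch_square_roi(float *C_d, const float *A_d, size_t N,
                           unsigned blocks, unsigned threadsPerBlock)
    {
        m5_work_begin(0, 0);   // begin ROI (work item 0, thread 0)

        hipLaunchKernelGGL(vector_square, dim3(blocks), dim3(threadsPerBlock),
                           0, 0, C_d, A_d, N);
        hipDeviceSynchronize();   // ensure the kernel finished before closing the ROI

        m5_work_end(0, 0);     // end ROI
    }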
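For the debug flag, you can reuse the exact command you already ran and just add --debug-flags before the config script, e.g.:

    docker run --volume $(pwd):$(pwd) -w $(pwd) ghcr.io/gem5/gcn-gpu:v24-0 \
        gem5/build/VEGA_X86/gem5.opt --debug-flags=GPUKernelInfo \
        gem5/configs/example/apu_se.py -n 3 --gfx-version=gfx902 \
        -c gem5-resources/src/gpu/square/bin/square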
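And here is a rough sketch of the sizing change I mean in square.cpp. Again, the exact constants and variable names in the current file may differ from what I remember, so treat the specific values below as assumptions:

    // Kernel (paraphrased from square.cpp): a grid-stride loop over N elements.
    // Threads whose starting offset is already >= N fall straight through the
    // loop and exit, which is why adding threads alone does not add work.
    __global__ void vector_square(float *C_d, const float *A_d, size_t N)
    {
        size_t offset = blockIdx.x * blockDim.x + threadIdx.x;
        size_t stride = blockDim.x * gridDim.x;
        for (size_t i = offset; i < N; i += stride) {
            C_d[i] = A_d[i] * A_d[i];
        }
    }

    // In main(): grow N (the value hard-coded around line 54) and derive the
    // number of work groups from it instead of using a fixed block count.
    size_t N = 64 * 1024 * 1024;   // illustrative: much larger than the default
    const unsigned threadsPerBlock = 256;
    const unsigned blocks = (N + threadsPerBlock - 1) / threadsPerBlock;

With the grid size tied to N like this, scaling the input up should actually add work per CU, which should make your CU-count sweep more meaningful.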
Hope this helps,
Matt

On Sat, Oct 5, 2024 at 10:50 AM Beser, Nicholas D. via gem5-users <gem5-users@gem5.org> wrote:

> I am teaching an advanced computer architecture class and had the class
> run the GPU example that was run in the 2024 bootcamp:
>
> docker run --volume $(pwd):$(pwd) -w $(pwd) ghcr.io/gem5/gcn-gpu:v24-0
> gem5/build/VEGA_X86/gem5.opt gem5/configs/example/apu_se.py -n 3
> --gfx-version=gfx902 -c gem5-resources/src/gpu/square/bin/square
>
> The example ran; however, the stats.txt file had two sets of simulation
> statistics. The second did not appear to have any activity on the CUs.
> Can someone tell me why the simulation had two runs? We are using the
> first run as the GPU simulation statistics.
>
> We also ran the simulation while varying the number of CUs. We did not
> see much change in performance. I thought it was due to the benchmark
> that was run. One of my students modified the benchmark to use more
> threads, but we did not see much change. My thoughts were that this was
> again due to the benchmark: the resources required were not stressed by
> the 4 CUs, and changing the number to a larger one also did not stress
> the CUs.
>
> Nick