Hi David,

The instructions executed in each CU can change from run to run, depending on 
how the workgroup scheduler distributes work. So in your runs, CU2 may be 
getting different workgroups each time, which would explain your observations 
#1, #2, and #3. You would want to look at the total number of instructions 
executed by the GPU (the sum over all CUs) to check that the total count 
remains the same.
Further, in certain cases the total instruction count of the GPU can also 
differ if there are atomic operations within the workload. Changing the latency 
can change how long some CUs spend spinning (and thus how many instructions 
they execute).
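
One way to check this is to sum numInstrExecuted over every CU in stats.txt 
rather than reading CU2 alone. Below is a minimal sketch, assuming stat names 
follow the system.cpu3.CUs<N>.<stat> pattern quoted in this thread and that 
the file holds a single stats dump. The helper name is illustrative, and the 
CU0 values in the sample are invented; only the CU2 values come from the 
thread.

```python
import re

# Stat names are assumed to look like "system.cpu3.CUs2.numInstrExecuted",
# as in the stats quoted in this thread.
CU_STAT_NAME = re.compile(r"system\.cpu\d+\.CUs(\d+)\.(.+)")


def sum_stat_over_cus(stats_text, stat_name):
    """Return (total, per_cu) for one integer per-CU stat.

    Assumes a single stats dump; with several dumps the last value wins.
    """
    per_cu = {}
    for line in stats_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        match = CU_STAT_NAME.fullmatch(fields[0])
        if match and match.group(2) == stat_name:
            per_cu[int(match.group(1))] = int(fields[1])
    return sum(per_cu.values()), per_cu


# CU2 values are from the thread; CU0 values are invented so that the
# totals match, to illustrate work shifting between CUs.
RUN_50 = """
system.cpu3.CUs0.numInstrExecuted               68000
system.cpu3.CUs2.numInstrExecuted               71838
"""
RUN_40 = """
system.cpu3.CUs0.numInstrExecuted               70763
system.cpu3.CUs2.numInstrExecuted               69075
"""

total_50, per_cu_50 = sum_stat_over_cus(RUN_50, "numInstrExecuted")
total_40, per_cu_40 = sum_stat_over_cus(RUN_40, "numInstrExecuted")
print("latency 50:", per_cu_50, "total", total_50)
print("latency 40:", per_cu_40, "total", total_40)
```

On a real run, the text of each stats.txt would replace the sample strings. 
If the totals agree while the per-CU values differ, the difference is most 
likely just workgroup placement.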

For your observation #4, it is possible that the bottleneck is somewhere else 
in the system, so changing mem_req_latency/mem_resp_latency does not result in 
any performance difference.
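
The global counters in the stats dumps quoted further down this thread agree 
with this: simTicks is identical in both runs. Below is a small sketch for 
diffing two dumps, using three of those quoted values. The function names are 
illustrative and not part of gem5.

```python
def parse_stats(stats_text):
    """Map stat name -> numeric value, skipping non-stat lines."""
    stats = {}
    for line in stats_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            value = float(fields[1]) if "." in fields[1] else int(fields[1])
        except ValueError:
            continue  # section markers or wrapped description text
        stats[fields[0]] = value
    return stats


def diff_stats(base, other):
    """Return {name: other - base} for stats present in both dumps."""
    return {name: other[name] - base[name] for name in base if name in other}


# Values copied from the two stats dumps quoted in this thread.
RUN_50 = """
simTicks                                 126229990500
simInsts                                     38011242
simOps                                       72305276
"""
RUN_40 = """
simTicks                                 126229990500
simInsts                                     38010420
simOps                                       72303566
"""

delta = diff_stats(parse_stats(RUN_50), parse_stats(RUN_40))
for name, change in delta.items():
    print(name, change)
```

A simTicks delta of 0 means the simulated runtime did not change at all 
between the two latency settings.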

Thanks,
Srikant

From: David Fong via gem5-users <gem5-users@gem5.org>
Sent: Friday, February 25, 2022 1:47 PM
To: David Fong via gem5-users <gem5-users@gem5.org>
Cc: David Fong <da...@chronostech.com>
Subject: [gem5-users] gem5 + GCN3 questions

Hi,

For those familiar with gem5 + GCN3 simulations, I have a few questions.

I downloaded and followed instructions at

https://gem5.googlesource.com/public/gem5-resources/+/refs/heads/stable/src/gpu/DNNMark/

to build gem5 + GCN3.

I ran the DNNMark test test_fwd_softmax twice to check how latency affects the 
run.

The original value for mem_req_latency and mem_resp_latency was 50, and I also 
ran with 40.
gem5/build/GCN3_X86/gpu-compute/GPU.py
    mem_req_latency = Param.Int(40, "Latency for request from the cu to ruby. "\
                                "Represents the pipeline to reach the TCP "\
                                "and specified in GPU clock cycles")
    mem_resp_latency = Param.Int(40, "Latency for responses from ruby to the "\
                                 "cu. Represents the pipeline between the "\
                                 "TCP and cu as well as TCP data array "\
                                 "access. Specified in GPU clock cycles")

  1.  Why does stats.txt for "40" show a reduced number of instructions?

system.cpu3.CUs2.numInstrExecuted               71838    (50)

system.cpu3.CUs2.numInstrExecuted               69075    (40)



  2.  Does the GPU kernel perform optimizations on the instructions due to less 
waiting time?



  3.  Do stats like the ones below make sense (first line is "50", second line 
is "40")?



system.cpu3.CUs2.ScheduleStage.dispNrdyStalls::Ready       347264

system.cpu3.CUs2.ScheduleStage.dispNrdyStalls::Ready       351634



system.cpu3.CUs2.instCyclesLdsPerSimd::0          800

system.cpu3.CUs2.instCyclesLdsPerSimd::0          696



system.cpu3.CUs2.tlbRequests                   104000

system.cpu3.CUs2.tlbRequests                   100000



system.cpu3.CUs2.tlbCycles                10101232000

system.cpu3.CUs2.tlbCycles                  9141288000



system.cpu3.CUs2.numInstrExecuted               71838

system.cpu3.CUs2.numInstrExecuted               69075



system.cpu3.CUs2.headTailLatency::mean   68651.850962

system.cpu3.CUs2.headTailLatency::mean   64891.258333



system.cpu3.CUs2.headTailLatency::stdev   157090.173635

system.cpu3.CUs2.headTailLatency::stdev   155057.054245



  4.  The runtime is the same.

Is there a way to end the simulation upon completion of all instructions 
instead of at a fixed time?

This would give me another way to see whether the run with latency = 40 ends 
sooner.



---------- Begin Simulation Statistics ---------- "50"
simSeconds                                   0.126230                       # Number of seconds simulated (Second)
simTicks                                 126229990500                       # Number of ticks simulated (Tick)
finalTick                                126229990500                       # Number of ticks from beginning of simulation (restored from checkpoints and never reset) (Tick)
simFreq                                  1000000000000                       # The number of ticks per simulated second ((Tick/Second))
hostSeconds                                    199.21                       # Real time elapsed on the host (Second)
hostTickRate                                633665289                       # The number of ticks simulated per host second (ticks/s) ((Tick/Second))
hostMemory                                    3596200                       # Number of bytes of host memory used (Byte)
simInsts                                     38011242                       # Number of instructions simulated (Count)
simOps                                       72305276                       # Number of ops (including micro ops) simulated (Count)
hostInstRate                                   190811                       # Simulator instruction rate (inst/s) ((Count/Second))
hostOpRate                                     362962                       # Simulator op (including micro ops) rate (op/s) ((Count/Second))



---------- Begin Simulation Statistics ---------- "40"
simSeconds                                   0.126230                       # Number of seconds simulated (Second)
simTicks                                 126229990500                       # Number of ticks simulated (Tick)
finalTick                                126229990500                       # Number of ticks from beginning of simulation (restored from checkpoints and never reset) (Tick)
simFreq                                  1000000000000                       # The number of ticks per simulated second ((Tick/Second))
hostSeconds                                    199.32                       # Real time elapsed on the host (Second)
hostTickRate                                633294503                       # The number of ticks simulated per host second (ticks/s) ((Tick/Second))
hostMemory                                    3598508                       # Number of bytes of host memory used (Byte)
simInsts                                     38010420                       # Number of instructions simulated (Count)
simOps                                       72303566                       # Number of ops (including micro ops) simulated (Count)
hostInstRate                                   190696                       # Simulator instruction rate (inst/s) ((Count/Second))
hostOpRate                                     362742                       # Simulator op (including micro ops) rate (op/s) ((Count/Second))
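
As a quick arithmetic check on the CU2 counters in question 3, the ratio 
between the two runs can be reduced to lowest terms. The sketch below uses 
only values quoted above. numInstrExecuted and tlbRequests shrink by the same 
factor, which would be consistent with CU2 receiving a smaller share of the 
work in the second run; that reading is an interpretation, not something the 
stats prove.

```python
from fractions import Fraction

# (latency 50, latency 40) values for CU2, copied from the stats above.
cu2 = {
    "numInstrExecuted": (71838, 69075),
    "tlbRequests": (104000, 100000),
    "instCyclesLdsPerSimd::0": (800, 696),
}

# Fraction reduces each after/before ratio to lowest terms.
ratios = {name: Fraction(after, before) for name, (before, after) in cu2.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio} (~{float(ratio):.4f})")
```

Not every counter follows the same ratio (instCyclesLdsPerSimd::0 does not), 
so this is only a hint about where to look, not a full explanation.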



Thanks,



David

