[AMD Official Use Only]

Hi David,
The instructions executed in each CU can change from run to run depending on the workgroup scheduler. So in your runs it is possible that CU2 is getting different workgroups each time, which would explain your observations #1, #2, and #3. You would want to look at the total number of instructions executed by the GPU (the sum over all CUs) to check that the total count remains the same.

Further, in certain cases the total instruction count of the GPU can also differ if there are atomic operations within the workload: changing the latency could mean that some CUs spend more time spinning (and thus execute more instructions).

For your observation #4, it is possible that the bottleneck is somewhere else in the system, so changing mem_req/resp_latency does not result in any performance difference.
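As a quick way to check the GPU-wide count, something along the following lines could sum numInstrExecuted over all CUs in a stats.txt and print one total per run. This is only a minimal sketch: the stat-name pattern is taken from the counters quoted below (system.cpuN.CUsM.numInstrExecuted), and the file paths in the example invocation are placeholders.

    import re
    import sys

    # Per-CU instruction counters in a gem5 stats.txt look like:
    #   system.cpu3.CUs2.numInstrExecuted    71838
    CU_INSTR_RE = re.compile(r"\.CUs\d+\.numInstrExecuted\s+(\d+)")

    def total_gpu_instructions(stats_path):
        """Sum numInstrExecuted across every CU found in a gem5 stats.txt."""
        total = 0
        with open(stats_path) as stats_file:
            for line in stats_file:
                match = CU_INSTR_RE.search(line)
                if match:
                    total += int(match.group(1))
        return total

    if __name__ == "__main__":
        # Placeholder paths, e.g.:
        #   python sum_cu_insts.py lat50/stats.txt lat40/stats.txt
        for path in sys.argv[1:]:
            print(path, total_gpu_instructions(path))

If those totals match between the latency-50 and latency-40 runs, the per-CU differences are just down to workgroup placement.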
Thanks,
Srikant

From: David Fong via gem5-users <gem5-users@gem5.org>
Sent: Friday, February 25, 2022 1:47 PM
To: David Fong via gem5-users <gem5-users@gem5.org>
Cc: David Fong <da...@chronostech.com>
Subject: [gem5-users] gem5 + GCN3 questions

Hi,

For those familiar with gem5 + GCN3 simulations, I need some answers to a few questions.

I downloaded gem5 and followed the instructions at
https://gem5.googlesource.com/public/gem5-resources/+/refs/heads/stable/src/gpu/DNNMark/
to build gem5 + GCN3.

I ran the DNN test test_fwd_softmax twice to check how latency affects the run. The original value for mem_req_latency and mem_resp_latency was 50, but I also ran with 40.

gem5/build/GCN3_X86/gpu-compute/GPU.py:

    mem_req_latency = Param.Int(40, "Latency for request from the cu to ruby. "\
                                "Represents the pipeline to reach the TCP "\
                                "and specified in GPU clock cycles")
    mem_resp_latency = Param.Int(40, "Latency for responses from ruby to the "\
                                 "cu. Represents the pipeline between the "\
                                 "TCP and cu as well as TCP data array "\
                                 "access. Specified in GPU clock cycles")

1. Why does stats.txt for "40" show a reduced number of instructions?

   system.cpu3.CUs2.numInstrExecuted 71838   (50)
   system.cpu3.CUs2.numInstrExecuted 69075   (40)

2. Does the GPU kernel perform optimizations on the instructions due to the shorter waiting time?

3. Do stats like the ones below make sense (in each pair, the first line is "50" and the second line is "40")?

   system.cpu3.CUs2.ScheduleStage.dispNrdyStalls::Ready 347264
   system.cpu3.CUs2.ScheduleStage.dispNrdyStalls::Ready 351634
   system.cpu3.CUs2.instCyclesLdsPerSimd::0 800
   system.cpu3.CUs2.instCyclesLdsPerSimd::0 696
   system.cpu3.CUs2.tlbRequests 104000
   system.cpu3.CUs2.tlbRequests 100000
   system.cpu3.CUs2.tlbCycles 10101232000
   system.cpu3.CUs2.tlbCycles 9141288000
   system.cpu3.CUs2.numInstrExecuted 71838
   system.cpu3.CUs2.numInstrExecuted 69075
   system.cpu3.CUs2.headTailLatency::mean 68651.850962
   system.cpu3.CUs2.headTailLatency::mean 64891.258333
   system.cpu3.CUs2.headTailLatency::stdev 157090.173635
   system.cpu3.CUs2.headTailLatency::stdev 155057.054245

4. The runtime is the same. Is there a way to end the simulation when all instructions have completed instead of at a fixed time? This would be another way for me to know that the run with latency = 40 should end sooner.

---------- Begin Simulation Statistics ----------  ("50")
simSeconds     0.126230        # Number of seconds simulated (Second)
simTicks       126229990500    # Number of ticks simulated (Tick)
finalTick      126229990500    # Number of ticks from beginning of simulation (restored from checkpoints and never reset) (Tick)
simFreq        1000000000000   # The number of ticks per simulated second ((Tick/Second))
hostSeconds    199.21          # Real time elapsed on the host (Second)
hostTickRate   633665289       # The number of ticks simulated per host second (ticks/s) ((Tick/Second))
hostMemory     3596200         # Number of bytes of host memory used (Byte)
simInsts       38011242        # Number of instructions simulated (Count)
simOps         72305276        # Number of ops (including micro ops) simulated (Count)
hostInstRate   190811          # Simulator instruction rate (inst/s) ((Count/Second))
hostOpRate     362962          # Simulator op (including micro ops) rate (op/s) ((Count/Second))

---------- Begin Simulation Statistics ----------  ("40")
simSeconds     0.126230        # Number of seconds simulated (Second)
simTicks       126229990500    # Number of ticks simulated (Tick)
finalTick      126229990500    # Number of ticks from beginning of simulation (restored from checkpoints and never reset) (Tick)
simFreq        1000000000000   # The number of ticks per simulated second ((Tick/Second))
hostSeconds    199.32          # Real time elapsed on the host (Second)
hostTickRate   633294503       # The number of ticks simulated per host second (ticks/s) ((Tick/Second))
hostMemory     3598508         # Number of bytes of host memory used (Byte)
simInsts       38010420        # Number of instructions simulated (Count)
simOps         72303566        # Number of ops (including micro ops) simulated (Count)
hostInstRate   190696          # Simulator instruction rate (inst/s) ((Count/Second))
hostOpRate     362742          # Simulator op (including micro ops) rate (op/s) ((Count/Second))

Thanks,
David
_______________________________________________
gem5-users mailing list -- gem5-users@gem5.org
To unsubscribe send an email to gem5-users-le...@gem5.org