[gem5-users] Re: Error in an application running on gem5 GCN3 (with apu_se.py)

Poremba, Matthew via gem5-users Fri, 08 Sep 2023 07:53:10 -0700

Hi Anoop,



Based on that register count, I am going to guess you built the application 
with -O0 or some other debugging flags?  If you do this, the compiler requests 
a very large number of registers. I assume that is so that a real GPU will not 
run any other applications simultaneously.

Similarly, if you are seeing s_sendmsg, I am going to guess there is a printf() 
in your GPU kernel.  These aren’t currently supported in gem5, but they are 
something that would be very nice to have.

If these are true you will need to remove any printfs and compile with at least 
-O1 to run in gem5.
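
As a rough pre-flight check, the two conditions above can be tested on the 
build inputs before running in gem5. This is only a sketch: check_gpu_build is 
a hypothetical helper, not part of gem5 or ROCm, and the printf match is a 
crude text search that does not distinguish host code from kernel code.

```python
import re


def check_gpu_build(compile_flags, kernel_source):
    """Return warnings for a GPU kernel build (hypothetical helper)."""
    warnings = []
    # Compilers such as gcc and clang honor the last -O flag on the command line.
    opt_flags = [flag for flag in compile_flags if flag.startswith("-O")]
    if opt_flags and opt_flags[-1] == "-O0":
        warnings.append("compile with at least -O1")
    # Crude text search: matches printf( but not fprintf( or snprintf(.
    if re.search(r"\bprintf\s*\(", kernel_source):
        warnings.append("remove printf() calls")
    return warnings


print(check_gpu_build(["-g", "-O0"], 'printf("tid %d\\n", tid);'))
```

An empty list means neither condition was detected; it does not guarantee the 
kernel will run.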


-Matt

From: Anoop Mysore <mysan...@gmail.com>
Sent: Friday, September 8, 2023 7:33 AM
To: Matt Sinclair <mattdsinclair.w...@gmail.com>
Cc: The gem5 Users mailing list <gem5-users@gem5.org>; Poremba, Matthew 
<matthew.pore...@amd.com>
Subject: Re: [gem5-users] Re: Error in an application running on gem5 GCN3 
(with apu_se.py)

Hi Matt,
I'm facing a few other problems:
1. `panic: panic condition (numWfs * vregDemandPerWI) > (numVectorALUs * 
numVecRegsPerSimd) occurred: WG with 1 WFs and 29285 VGPRs per WI can not be 
allocated to CU that has 8192 VGPRs`
The corresponding line of the code in gem5: 
https://github.com/gem5/gem5/blob/f29bfc0640c88a79eb7f94454ce31b3237ec0066/src/gpu-compute/compute_unit.cc#L565
One of the variables (vregDemandPerWI) is ultimately derived from reading the 
executable for the kernel code. Is it possible to reduce this VGPR demand 
somehow, or is increasing the VGPR count (to what seems like an unrealistically 
high value) the only solution? I see a similar error for SGPRs as well.
2. Some kernels (compiled for gfx801/3) have instructions such as ds_add_u32 
(a Data Share instruction, page 12-161 of the GCN3 ISA manual 
<https://www.amd.com/system/files/TechDocs/gcn3-instruction-set-architecture.pdf>) 
and s_sendmsg (send message to host CPU), whose decoding code is not 
implemented 
<https://github.com/gem5/gem5/blob/48a40cf2f5182a82de360b7efa497d82e06b1631/src/arch/amdgpu/gcn3/insts/instructions.cc#L30929>. 
Is this intentional, or was this just punted for later? Is there anything to 
keep in mind when coding for these?
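
The check behind the panic in item 1 is a plain comparison, which can be 
reproduced with the numbers from the message. This is a sketch only: the 
function name is made up for illustration, and the split of the reported 8192 
VGPRs into 4 SIMDs x 2048 registers is an assumption, not something the panic 
message states.

```python
def vgpr_demand_exceeds_cu(num_wfs, vreg_demand_per_wi,
                           num_vector_alus, num_vec_regs_per_simd):
    """Mirror the comparison quoted in the panic message."""
    demand = num_wfs * vreg_demand_per_wi
    capacity = num_vector_alus * num_vec_regs_per_simd
    return demand > capacity


# Values from the panic: 1 WF, 29285 VGPRs per WI, 8192 VGPRs in the CU.
# The 4 x 2048 split of the 8192 VGPRs is an assumption.
print(vgpr_demand_exceeds_cu(1, 29285, 4, 2048))
```

With these inputs the demand is more than three times the capacity, so the 
register demand would have to come down by that factor for the workgroup to 
fit the CU as configured.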




On Thu, Aug 17, 2023 at 5:13 PM Matt Sinclair <mattdsinclair.w...@gmail.com> 
wrote:
Hi Anoop,

I'm glad that increasing -n helped.  It's hard to say what exactly the problem 
is without digging in further, but often the ROCm stack will launch additional 
processes to do a variety of things (e.g., check which version of LLVM is being 
used).  In gem5, each of these requires a separate CPU thread context -- which 
increasing -n handles in SE mode.  So if I had to guess, I would say that this 
is what is happening.
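
For illustration, the run command with a larger -n could be assembled as 
below. This is a sketch under assumptions: only -n comes from this thread; the 
-c option and the binary path bin/bfs are placeholders that should be checked 
against apu_se.py's own help output.

```python
import shlex


def apu_se_command(num_cpus, binary,
                   gem5="build/GCN3_X86/gem5.opt",
                   config="configs/example/apu_se.py"):
    """Build an apu_se.py argument list (paths and -c are assumptions)."""
    # -n sets the number of CPU thread contexts available in SE mode.
    return [gem5, config, "-n", str(num_cpus), "-c", binary]


cmd = apu_se_command(10, "bin/bfs")  # "bin/bfs" is a placeholder path
print(shlex.join(cmd))
```

The right value for -n depends on how many threads the application starts and 
how many helper processes the ROCm stack launches, so it may take some trial 
and error.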

If you added gdb locally to your docker, and you built the docker properly, 
then I would expect gdb to work with gem5.

Thanks,
Matt

On Wed, Aug 16, 2023 at 11:41 PM Anoop Mysore <mysan...@gmail.com> wrote:
Thank you, Matt. Having 10 CPUs (up from the previous 3) in the simulated 
system seems to make it work! (At least, I don't see that error at that point 
anymore.) Is "resource temporarily unavailable" commonly due to CPU count? 
I'm curious to know how you made that connection.

Re gdb: I am indeed using a local docker build (gem5/util/dockerfiles/gcn-gpu) 
with an added gdb installation -- is that what you meant?

Will send in a PR to the repo as soon as I'm done :)

On Wed, Aug 16, 2023, 5:03 PM Matt Sinclair <mattdsinclair.w...@gmail.com> 
wrote:
Hi Anoop,

A few things here:

- Regarding the original failure (at least the !FS part), this normally 
happens either because the GPU target ISA (e.g., gfx900) you used in your 
Makefile is not supported or because you didn't properly specify 
what GPU ISA you are using when running the program.  So, what is your command 
line for running this application and what ISA are you specifying in your 
Makefile?
- If the "what()" is the real source of the error, then I think this could be 
related to the number of CPU thread contexts you are running with gem5.  What 
did you set "-n" to?
- Regarding gdb, @Matt P: did you remove gdb from what is installed in the 
Docker a while back?  If so, I think Anoop would need to add it back and create 
a local docker or something like that.
- Setting aside the above, it would be wonderful if you contributed the CHAI 
benchmarks to gem5-resources once you get them working!  Please let us know if 
we can do anything to help with that.

Thanks,
Matt

On Wed, Aug 16, 2023 at 9:51 AM Anoop Mysore via gem5-users 
<gem5-users@gem5.org> wrote:
Curiously, running the gem5.debug executable with gdb within docker results in:
Reading symbols from gem5/build/GCN3_X86/gem5.debug...
(gdb) quit
(I did not enter the quit command; gdb just quits automatically.) Is gdb 
working with gem5 GCN3 in Docker?

I ran gem5.opt with ExecAll and SyscallAll debug flags, the debug tail and the 
simerr logs are attached.
I don't see anything peculiar other than a tgkill syscall sending SIGABRT 
to a thread, which then halts within a few instructions.

On Tue, Aug 15, 2023 at 9:00 PM Anoop Mysore <mysan...@gmail.com> wrote:
I am trying to port the CHAI benchmarks <https://github.com/chai-benchmarks/chai> 
to gem5, similarly to gem5-resources/src/gpu/pannotia 
<https://github.com/gem5/gem5-resources/tree/stable/src/gpu/pannotia>. 
I was able to HIPify (through the perl script + some manual changes) all the 
code files, and ran the BFS program. I see the following error message at the 
point of launching the CPU threads (in my fork of HIPified CHAI): 
<https://github.com/mysoreanoop/chai/blob/678c18fd551fbf12f4abbb05ab7164f1b588be68/HIP-U-gem5/BFS/main.cpp#L273> 
I do not see any of the prints from the CPU threads, which leads me to believe 
the error has to do with the threads not being launched, or a related error.

(This looks related; I incorporated its suggestion of linking against -pthread: 
https://stackoverflow.com/a/6485728)

The stderr log is below; any help is appreciated.
_________
....
AM: Launching CPU
terminate called after throwing an instance of 'std::system_error'
what():  Resource temporarily unavailable
build/GCN3_X86/sim/faults.cc:60: panic: panic condition !FullSystem occurred: 
fault (General-Protection) detected @ PC (0x7ffff6afa941=>0x7ffff6afa942).(0=>1)
Memory Usage: 19704072 KBytes
Program aborted at tick 441590522500
--- BEGIN LIBC BACKTRACE ---
gem5/build/GCN3_X86/gem5.opt(+0x550200)[0x55a709b31200]
gem5/build/GCN3_X86/gem5.opt(+0x57d46e)[0x55a709b5e46e]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f18881a0420]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f188734800b]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f1887327859]
gem5/build/GCN3_X86/gem5.opt(+0x4be295)[0x55a709a9f295]
gem5/build/GCN3_X86/gem5.opt(+0x5f6169)[0x55a709bd7169]
gem5/build/GCN3_X86/gem5.opt(+0x9fd9ed)[0x55a709fde9ed]
gem5/build/GCN3_X86/gem5.opt(+0x15b1d10)[0x55a70ab92d10]
gem5/build/GCN3_X86/gem5.opt(+0x15b2fd5)[0x55a70ab93fd5]
gem5/build/GCN3_X86/gem5.opt(+0x15b5620)[0x55a70ab96620]
gem5/build/GCN3_X86/gem5.opt(+0x15b6348)[0x55a70ab97348]
gem5/build/GCN3_X86/gem5.opt(+0x15c2954)[0x55a70aba3954]
gem5/build/GCN3_X86/gem5.opt(+0x56a082)[0x55a709b4b082]
gem5/build/GCN3_X86/gem5.opt(+0x59e2c4)[0x55a709b7f2c4]
gem5/build/GCN3_X86/gem5.opt(+0x59e8a3)[0x55a709b7f8a3]
gem5/build/GCN3_X86/gem5.opt(+0x4ed462)[0x55a709ace462]
gem5/build/GCN3_X86/gem5.opt(+0x4af427)[0x55a709a90427]
/lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x2a8738)[0x7f1888459738]
/lib/x86_64-linux-gnu/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x8dd8)[0x7f188822ef48]
/lib/x86_64-linux-gnu/libpython3.8.so.1.0(_PyEval_EvalCodeWithName+0x8fb)[0x7f188837be3b]
/lib/x86_64-linux-gnu/libpython3.8.so.1.0(_PyFunction_Vectorcall+0x94)[0x7f1888459114]
/lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x74d6d)[0x7f1888225d6d]
/lib/x86_64-linux-gnu/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x7d86)[0x7f188822def6]
/lib/x86_64-linux-gnu/libpython3.8.so.1.0(_PyEval_EvalCodeWithName+0x8fb)[0x7f188837be3b]
/lib/x86_64-linux-gnu/libpython3.8.so.1.0(PyEval_EvalCodeEx+0x42)[0x7f188837c1c2]
/lib/x86_64-linux-gnu/libpython3.8.so.1.0(PyEval_EvalCode+0x1f)[0x7f188837c5af]
/lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x1cfbf1)[0x7f1888380bf1]
/lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x25f537)[0x7f1888410537]
/lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x74d6d)[0x7f1888225d6d]
/lib/x86_64-linux-gnu/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x12fd)[0x7f188822746d]
/lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x8006b)[0x7f188823106b]
--- END LIBC BACKTRACE ---
Failed to execute default signal handler!
_________
_______________________________________________
gem5-users mailing list -- gem5-users@gem5.org
To unsubscribe send an email to gem5-users-le...@gem5.org
