[gem5-users] Re: Error in an application running on gem5 GCN3 (with apu_se.py)

Matt Sinclair via gem5-users Mon, 11 Sep 2023 12:17:14 -0700

Yeah, I haven't tried CHAI but I believe gfx902 would work with it (if you
need APUs).


Matt S.

On Mon, Sep 11, 2023 at 12:56 PM Poremba, Matthew <matthew.pore...@amd.com>
wrote:

>
> Hi Anoop,
>
> That instruction was recently added to gem5, but for the Vega ISA only:
> https://gem5-review.googlesource.com/c/public/gem5/+/67072 . It could
> probably be ported to GCN3 by copying the code exactly into the
> corresponding GCN3 files. You’ll notice, however, that the relation chain
> for that change contains many more instructions implemented for Vega only,
> so you are likely to hit similar issues. Alternatively, I think there is a
> working Vega APU (gfx902?). MattS would know more about the status of that.
> I am not sure of your use case, but if you can use a dGPU, Vega with the
> gfx900 version or full system mode is another way to use the Vega ISA.
>
>
>
> For the docker automatically quitting, you will have to do
> `docker run -it …` to start an interactive session.
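A minimal sketch of the interactive invocation, assuming the image tag (gem5:new), paths, and gem5 options from Anoop's command quoted later in this thread. The only functional change is -it: without -i Docker does not keep the container's stdin open, so gdb reads end-of-file at its first prompt and exits, which is consistent with the automatic quit in the log. The helper name gem5_gdb and the DOCKER override variable are hypothetical, added here so the command can be printed before it is run; `id -u`/`id -g` stand in for $UID:$GID.

```shell
# Hypothetical helper: start gdb on gem5.debug inside the container.
# Assumption: image tag, paths, and gem5 options are copied from the
# non-interactive command in this thread; only "-it" is new.
gem5_gdb() {
  # ${DOCKER:-docker}: set DOCKER=echo to print the arguments instead of running.
  # -i keeps stdin open for gdb, -t allocates a terminal.
  "${DOCKER:-docker}" run -it \
    -u "$(id -u):$(id -g)" \
    --volume "$(pwd):$(pwd)" -w "$(pwd)" \
    gem5:new \
    gdb --args gem5/build/GCN3_X86/gem5.debug \
    gem5/configs/example/apu_se.py \
    --cpu-type=DerivO3CPU --num-cpus=4 --mem-size=1GB --ruby \
    --mem-type=SimpleMemory \
    -c gem5-resources/src/gpu/chai/HIP-U-gem5/HSTO/bin/hsto.gem5
}
```

Calling `gem5_gdb` starts the session; `DOCKER=echo gem5_gdb` prints the argument list for review without touching Docker.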
>
> -Matt
>
>
>
> *From:* Anoop Mysore <mysan...@gmail.com>
> *Sent:* Monday, September 11, 2023 10:33 AM
> *To:* Poremba, Matthew <matthew.pore...@amd.com>
> *Cc:* Matt Sinclair <mattdsinclair.w...@gmail.com>; The gem5 Users
> mailing list <gem5-users@gem5.org>
> *Subject:* Re: [gem5-users] Re: Error in an application running on gem5
> GCN3 (with apu_se.py)
>
>
>
> Thanks, Matt. Yes, the printfs in the GPU kernel code were the issue for
> s_sendmsg.
>
> However, the ds_add_u32 instruction is still an issue. I am already
> compiling with -O1 like so:
>
> /opt/rocm/hip/bin/hipcc --amdgpu-target=gfx801,gfx803 \
>     main.cpp kernel.cu kernel.cpp \
>     -o ./bin/hsto.gem5 \
>     -I/home/anoop/new/gem5-resources/src/gpu/chai/HIP-U-gem5/HSTO/../.gem5/include \
>     -lz -lm -lc -lpthread -O1 \
>     -L/home/anoop/new/gem5-resources/src/gpu/chai/HIP-U-gem5/HSTO/../.gem5/util/m5/build/x86/out \
>     -lm5
>
>
>
> The exact error is:
> src/gpu-compute/scoreboard_check_stage.cc:158: panic: next instruction:
> ds_add_u32 v7, v8 is of unknown type
>
>
>
> Here is the corresponding line in the simulator
> <https://github.com/gem5/gem5/blob/48a40cf2f5182a82de360b7efa497d82e06b1631/src/gpu-compute/scoreboard_check_stage.cc#L158>
> and the relevant decoder section
> <https://github.com/gem5/gem5/blob/48a40cf2f5182a82de360b7efa497d82e06b1631/src/arch/amdgpu/gcn3/insts/instructions.cc#L30929>.
> Because the LDS/GDS is involved, I'm unsure how to implement this
> -- any help would be appreciated.
>
>
>
> Also, GDB still doesn't seem to be working with my gem5. And without
> prints in the kernel, it's cumbersome to get any useful insight into
> failing programs.
>
> I added within the Dockerfile: RUN apt install -y gdb
>
> I am invoking gdb with:
>
> docker run -u $UID:$GID --volume $(pwd):$(pwd) -w $(pwd) gem5:new \
>     gdb --args gem5/build/GCN3_X86/gem5.debug \
>     gem5/configs/example/apu_se.py \
>     --cpu-type=DerivO3CPU --num-cpus=4 --mem-size=1GB --ruby \
>     --mem-type=SimpleMemory \
>     -c gem5-resources/src/gpu/chai/HIP-U-gem5/HSTO/bin/hsto.gem5
>
>
>
> Log:
>
> GNU gdb (Ubuntu 9.2-0ubuntu1~20.04.1) 9.2
> Copyright (C) 2020 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.
> Type "show copying" and "show warranty" for details.
> This GDB was configured as "x86_64-linux-gnu".
> Type "show configuration" for configuration details.
> For bug reporting instructions, please see:
> <http://www.gnu.org/software/gdb/bugs/>.
> Find the GDB manual and other documentation resources online at:
>     <http://www.gnu.org/software/gdb/documentation/>.
>
> For help, type "help".
> Type "apropos word" to search for commands related to "word"...
> Reading symbols from gem5/build/GCN3_X86/gem5.debug...
> (gdb) quit
>
>
>
> PS: The `quit` was entered automatically; I did not type it.
>
> Am I doing anything wrong here?
>
> On Fri, Sep 8, 2023 at 4:50 PM Poremba, Matthew <matthew.pore...@amd.com>
> wrote:
>
> Hi Anoop,
>
>
>
>
>
> Based on that register count, I am going to guess you built the
> application with -O0 or some other debugging flags?  If you do this, the
> compiler requests a very large number of registers. I assume that is so a
> real GPU will not run any other applications simultaneously.
>
>
>
> Similarly, if you are seeing s_sendmsg I am going to guess there is a
> printf() in your GPU kernel.  These aren’t currently supported in gem5, but
> support for them would be very nice to have.
>
>
>
> If both are true, you will need to remove any printfs and compile with at
> least -O1 to run in gem5.
>
>
>
>
>
> -Matt
>
>
>
> *From:* Anoop Mysore <mysan...@gmail.com>
> *Sent:* Friday, September 8, 2023 7:33 AM
> *To:* Matt Sinclair <mattdsinclair.w...@gmail.com>
> *Cc:* The gem5 Users mailing list <gem5-users@gem5.org>; Poremba, Matthew
> <matthew.pore...@amd.com>
> *Subject:* Re: [gem5-users] Re: Error in an application running on gem5
> GCN3 (with apu_se.py)
>
>
>
> Hi Matt,
> I'm facing a few other problems:
>
> 1. `panic: panic condition (numWfs * vregDemandPerWI) > (numVectorALUs *
> numVecRegsPerSimd) occurred: WG with 1 WFs and 29285 VGPRs per WI can not
> be allocated to CU that has 8192 VGPRs`
>
> The corresponding line of the code in gem5:
> https://github.com/gem5/gem5/blob/f29bfc0640c88a79eb7f94454ce31b3237ec0066/src/gpu-compute/compute_unit.cc#L565
>
> One of the variables (vregDemandPerWI) is ultimately derived from reading
> the executable for the kernel code. Is it possible to reduce this VGPR
> demand somehow, or is increasing the VGPRs (to what seems like an
> unrealistically high value) the only solution? I see a similar error for
> SGPRs as well.
>
> 2. Some kernels (compiled for gfx801/3) have instructions such as
> ds_add_u32
> <https://www.amd.com/system/files/TechDocs/gcn3-instruction-set-architecture.pdf>
> (a Data Share instruction, page 12-161) and s_sendmsg (send message to the
> host CPU), whose relevant decoding code is not available
> <https://github.com/gem5/gem5/blob/48a40cf2f5182a82de360b7efa497d82e06b1631/src/arch/amdgpu/gcn3/insts/instructions.cc#L30929>.
> Is this intentional, or was it just punted for later? Is there anything
> to keep in mind when coding for these?
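The check behind the panic in item 1 can be worked through by hand. The sketch below plugs in the numbers from the panic message. The split of 8192 into 4 vector ALUs with 2048 registers each is an assumption (the message reports only the total), and the shell variable names are illustrative.

```shell
# Values taken from the panic message.
num_wfs=1
vreg_demand_per_wi=29285

# Assumption: 8192 = 4 vector ALUs x 2048 VGPRs each; only the total is reported.
num_vector_alus=4
num_vec_regs_per_simd=2048

demand=$((num_wfs * vreg_demand_per_wi))
capacity=$((num_vector_alus * num_vec_regs_per_simd))

# Same comparison as the panic condition: demand > capacity means the
# workgroup cannot be allocated to the compute unit.
if [ "$demand" -gt "$capacity" ]; then
  echo "cannot allocate: demand=$demand capacity=$capacity"
else
  echo "fits: demand=$demand capacity=$capacity"
fi
```

With these numbers the demand is more than three times the capacity, so enlarging the register file enough to fit would be far from a realistic configuration; reducing the kernel's register demand is the more plausible route.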
>
> On Thu, Aug 17, 2023 at 5:13 PM Matt Sinclair <
> mattdsinclair.w...@gmail.com> wrote:
>
> Hi Anoop,
>
>
>
> I'm glad that increasing -n helped.  It's hard to say what exactly the
> problem is without digging in further, but often the ROCm stack will launch
> additional processes to do a variety of things (e.g., check which version
> of LLVM is being used).  In gem5, each of these requires a separate CPU
> thread context -- which increasing -n handles in SE mode.  So if I had to
> guess, I would say that this is what is happening.
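To make the -n remedy concrete, here is a minimal sketch of a wrapper that passes the CPU count to apu_se.py. The helper name run_apu_se and the GEM5 override variable are hypothetical; the gem5.opt path, the apu_se.py path, and the -c option are taken from other messages in this thread, and any further options a given run needs are left out.

```shell
# Hypothetical helper: run_apu_se <num_cpus> <binary>
# Each extra process the ROCm stack launches needs its own CPU thread
# context in SE mode, so num_cpus should leave headroom beyond the
# application's own threads.
run_apu_se() {
  num_cpus=$1
  binary=$2
  # ${GEM5:-...}: set GEM5=echo to print the arguments instead of simulating.
  "${GEM5:-gem5/build/GCN3_X86/gem5.opt}" gem5/configs/example/apu_se.py \
    -n "$num_cpus" \
    -c "$binary"
}
```

For example, `run_apu_se 10 <path-to-binary>` uses the count of 10 that Anoop reported as working.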
>
>
>
> If you added gdb locally to your docker, and you built the docker
> properly, then I would expect gdb to work with gem5.
>
>
>
> Thanks,
>
> Matt
>
>
>
> On Wed, Aug 16, 2023 at 11:41 PM Anoop Mysore <mysan...@gmail.com> wrote:
>
> Thank you, Matt, having 10 CPUs (up from previous 3) in the simulated
> system seems to make it work! (At least, I don't see that error at that
> point anymore). Is "resource temporarily unavailable" commonly due to CPU
> count? Curious to know how you made that connection.
>
>
>
> Re gdb: I am indeed using a local docker build
> (gem5/util/dockerfiles/gcn-gpu) with an added gdb installation -- is that
> what you meant?
>
>
>
> Will send in a PR to the repo as soon as I'm done :)
>
> On Wed, Aug 16, 2023, 5:03 PM Matt Sinclair <mattdsinclair.w...@gmail.com>
> wrote:
>
> Hi Anoop,
>
>
>
> A few things here:
>
>
>
> - Regarding the original failure (at least the !FS part), this normally
> happens either because of the GPU target ISA (e.g., gfx900) you used in
> your Makefile (e.g., it is not supported) or because you didn't properly
> specify what GPU ISA you are using when running the program.  So, what is
> your command line for running this application and what ISA are you
> specifying in your Makefile?
>
> - If the "what()" is the real source of the error, then I think this could
> be related to the number of CPU thread contexts you are running with gem5.
> What did you set "-n" to?
>
> - Regarding gdb, @Matt P: did you remove gdb from what is installed in the
> Docker a while back?  If so, I think Anoop would need to add it back and
> create a local docker or something like that.
>
> - Setting aside the above, it would be wonderful if you contribute the
> CHAI benchmarks to gem5-resources once you get them working!  Please let us
> know if we can do anything to help with that.
>
>
>
> Thanks,
>
> Matt
>
>
>
> On Wed, Aug 16, 2023 at 9:51 AM Anoop Mysore via gem5-users <
> gem5-users@gem5.org> wrote:
>
> Curiously, running the gem5.debug executable with gdb within docker
> results in:
>
> Reading symbols from gem5/build/GCN3_X86/gem5.debug...
> (gdb) quit
> (the quit wasn't a command I provided, it just quits automatically). Is
> gdb working with gem5 GCN3 in Docker?
>
>
>
> I ran gem5.opt with ExecAll and SyscallAll debug flags, the debug tail and
> the simerr logs are attached.
>
> I don't see anything peculiar other than a tgkill syscall with a SIGABRT
> sent to a thread, after which execution halts within a few instructions.
>
>
>
> On Tue, Aug 15, 2023 at 9:00 PM Anoop Mysore <mysan...@gmail.com> wrote:
>
> I am trying to port the CHAI benchmarks
> <https://github.com/chai-benchmarks/chai> similarly to
> gem5-resources/src/gpu/pannotia
> <https://github.com/gem5/gem5-resources/tree/stable/src/gpu/pannotia>. I
> was able to HIPify (through the perl script + some manual changes) all the
> code files, and ran the BFS program. I see the following error message at
> the point of launching the CPU threads here
> <https://github.com/mysoreanoop/chai/blob/678c18fd551fbf12f4abbb05ab7164f1b588be68/HIP-U-gem5/BFS/main.cpp#L273>
> (fork of HIPified CHAI). I do not see any of the prints from the CPU
> threads, which leads me to believe the error has to do with the threads
> not being launched, or something related.
>
>
>
> (This looks related; I incorporated the suggestion of linking against
> -pthread: https://stackoverflow.com/a/6485728)
>
>
>
> The stderr log is below; any help is appreciated.
>
> _________
>
> ....
>
> AM: Launching CPU
>
> terminate called after throwing an instance of 'std::system_error'
>
> what():  Resource temporarily unavailable
>
> build/GCN3_X86/sim/faults.cc:60: panic: panic condition !FullSystem
> occurred: fault (General-Protection) detected @ PC
> (0x7ffff6afa941=>0x7ffff6afa942).(0=>1)
> Memory Usage: 19704072 KBytes
>
> Program aborted at tick 441590522500
>
> --- BEGIN LIBC BACKTRACE ---
> gem5/build/GCN3_X86/gem5.opt(+0x550200)[0x55a709b31200]
> gem5/build/GCN3_X86/gem5.opt(+0x57d46e)[0x55a709b5e46e]
> /lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f18881a0420]
> /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f188734800b]
> /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f1887327859]
> gem5/build/GCN3_X86/gem5.opt(+0x4be295)[0x55a709a9f295]
> gem5/build/GCN3_X86/gem5.opt(+0x5f6169)[0x55a709bd7169]
> gem5/build/GCN3_X86/gem5.opt(+0x9fd9ed)[0x55a709fde9ed]
> gem5/build/GCN3_X86/gem5.opt(+0x15b1d10)[0x55a70ab92d10]
> gem5/build/GCN3_X86/gem5.opt(+0x15b2fd5)[0x55a70ab93fd5]
> gem5/build/GCN3_X86/gem5.opt(+0x15b5620)[0x55a70ab96620]
> gem5/build/GCN3_X86/gem5.opt(+0x15b6348)[0x55a70ab97348]
> gem5/build/GCN3_X86/gem5.opt(+0x15c2954)[0x55a70aba3954]
> gem5/build/GCN3_X86/gem5.opt(+0x56a082)[0x55a709b4b082]
> gem5/build/GCN3_X86/gem5.opt(+0x59e2c4)[0x55a709b7f2c4]
> gem5/build/GCN3_X86/gem5.opt(+0x59e8a3)[0x55a709b7f8a3]
> gem5/build/GCN3_X86/gem5.opt(+0x4ed462)[0x55a709ace462]
> gem5/build/GCN3_X86/gem5.opt(+0x4af427)[0x55a709a90427]
> /lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x2a8738)[0x7f1888459738]
>
> /lib/x86_64-linux-gnu/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x8dd8)[0x7f188822ef48]
>
> /lib/x86_64-linux-gnu/libpython3.8.so.1.0(_PyEval_EvalCodeWithName+0x8fb)[0x7f188837be3b]
>
> /lib/x86_64-linux-gnu/libpython3.8.so.1.0(_PyFunction_Vectorcall+0x94)[0x7f1888459114]
> /lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x74d6d)[0x7f1888225d6d]
>
> /lib/x86_64-linux-gnu/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x7d86)[0x7f188822def6]
>
> /lib/x86_64-linux-gnu/libpython3.8.so.1.0(_PyEval_EvalCodeWithName+0x8fb)[0x7f188837be3b]
>
> /lib/x86_64-linux-gnu/libpython3.8.so.1.0(PyEval_EvalCodeEx+0x42)[0x7f188837c1c2]
>
> /lib/x86_64-linux-gnu/libpython3.8.so.1.0(PyEval_EvalCode+0x1f)[0x7f188837c5af]
> /lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x1cfbf1)[0x7f1888380bf1]
> /lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x25f537)[0x7f1888410537]
> /lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x74d6d)[0x7f1888225d6d]
>
> /lib/x86_64-linux-gnu/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x12fd)[0x7f188822746d]
> /lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x8006b)[0x7f188823106b]
> --- END LIBC BACKTRACE ---
> Failed to execute default signal handler!
>
> _________
>
> _______________________________________________
> gem5-users mailing list -- gem5-users@gem5.org
> To unsubscribe send an email to gem5-users-le...@gem5.org
>
>

