Hi,
I have some bad news: I reverted that patch, but I hit a deadlock again on the
develop branch.
The key error message is as follows:
```
build/GCN3_X86/sim/syscall_emul.cc:74: warn: ignoring syscall mprotect(...)
build/GCN3_X86/mem/ruby/system/Sequencer.cc:240: panic: Possible Deadlock
detected. Aborting!
version: 2 request.paddr: 0x6f2d0 m_readRequestTable: 1 current time:
1733054487000 issue_time: 1732717515000 difference: 336972000
Memory Usage: 19958768 KBytes
Program aborted at tick 1733054487000
--- BEGIN LIBC BACKTRACE ---
/home/ubuntu/lmy/gem5-gcn3/gem5-dev/build/GCN3_X86/gem5.opt(+0x50cab0)[0x55a1a0b03ab0]
/home/ubuntu/lmy/gem5-gcn3/gem5-dev/build/GCN3_X86/gem5.opt(+0x53af4e)[0x55a1a0b31f4e]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x143c0)[0x7fea0cb303c0]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7fea0bcd803b]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7fea0bcb7859]
/home/ubuntu/lmy/gem5-gcn3/gem5-dev/build/GCN3_X86/gem5.opt(+0x524275)[0x55a1a0b1b275]
/home/ubuntu/lmy/gem5-gcn3/gem5-dev/build/GCN3_X86/gem5.opt(+0x104ee09)[0x55a1a1645e09]
/home/ubuntu/lmy/gem5-gcn3/gem5-dev/build/GCN3_X86/gem5.opt(+0x527632)[0x55a1a0b1e632]
/home/ubuntu/lmy/gem5-gcn3/gem5-dev/build/GCN3_X86/gem5.opt(+0x5650a4)[0x55a1a0b5c0a4]
.....................
/home/ubuntu/lmy/gem5-gcn3/gem5-dev/build/GCN3_X86/gem5.opt(+0x4a311e)[0x55a1a0a9a11e]
--- END LIBC BACKTRACE ---
finished time: 2022-11-09 11:07:30 , duration: 43m32s
```
Do you see why this might happen? I'll try to debug it as well.
Thanks
------------------ Original ------------------
From: "The gem5 Users mailing list" <gem5-users@gem5.org>;
Date: Tue, Nov 8, 2022 7:04
To: "The gem5 Users mailing list"<gem5-users@gem5.org>;
Cc: "1575883782"<1575883...@qq.com>;"Poremba,
Matthew"<matthew.pore...@amd.com>;"Matt Sinclair"<sincl...@cs.wisc.edu>;
Subject: [gem5-users] Re: Re: Re: Re: Re: Gem5 GCN3 (GPUCoalescer
detected deadlock when running pagerank.)
Thanks Matt P, I hadn't gotten a chance to try reverting that patch. I
agree reverting it and running SE mode or using FS mode is the simplest
solution in the meantime.
In terms of the deadlock: I think it's just that many ticks because the
threshold for deadlocks is very long/big. I wasn't reading too much into
that, and anyway, since the develop branch seems to not have that failure, I
don't think debugging it is the top priority?
We'll definitely need to dig further into the CPUID stuff as you mentioned.
Thanks,
Matt S.
From: Poremba, Matthew via gem5-users <gem5-users@gem5.org>
Sent: Monday, November 7, 2022 5:01 PM
To: The gem5 Users mailing list <gem5-users@gem5.org>
Cc: 1575883782 <1575883...@qq.com>; Poremba, Matthew
<matthew.pore...@amd.com>
Subject: [gem5-users] Re: Re: Re: Re: Re: Gem5 GCN3 (GPUCoalescer detected
deadlock when running pagerank.)
Hi,
The rocclr, panic, and unimplemented-instruction errors/warnings seem to be
caused by this patch:
https://gem5-review.googlesource.com/c/public/gem5/+/64831. It is likely that
the ROCm stack takes a different code path with the different processor
vendor string and does things that gem5 doesn't handle properly yet. I
don't know exactly what the problem is, but pagerank runs with this change
reverted. It (probably) won't be easy to track down the exact problem
without more directed tests for the new processor features.
Your options are: (1) revert this patch, or do the equivalent by setting the
string back in your Python config, but I think that means you cannot run this
benchmark on Ubuntu 22.04. Alternatively, (2) this application is known to
work in the GPU full-system environment on the develop branch with the default
parameters (4 CUs as well). You could try that if you don't require SE
mode. I am testing 16 CUs in full-system mode and it is slowly making progress,
so that appears to be working too.
Regarding the coalescer timeouts, I don't have much advice. Based on the tick
number being in the trillions and deadlock being triggered with a timeout in
the hundreds of millions, it looks like the simulation is *very slowly* making
progress until you eventually get unlucky with a deadlock. One thing to
check when you are increasing the CU count is to ensure the rest of the system
is balanced along with it, such as increasing the number of memory channels
accordingly to get the bandwidth needed to avoid deadlock.
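To make the tick arithmetic above concrete, here is a minimal sketch that pulls the timing fields out of the Sequencer panic text quoted in this thread and converts them to simulated time. It assumes gem5's default tick of 1 ps; the function name `summarize_deadlock` and the dictionary keys are made up for illustration and are not part of gem5.

```python
import re

# Assumption: gem5's default tick resolution of 1 ps (10**12 ticks per second).
TICKS_PER_SECOND = 10**12

# Matches the timing fields of the Sequencer "Possible Deadlock" panic,
# tolerating the line wrapping that mail clients introduce.
PANIC_RE = re.compile(
    r"current time:\s*(\d+)\s+issue_time:\s*(\d+)\s+difference:\s*(\d+)"
)


def summarize_deadlock(message):
    """Return the stall length of the request that triggered the panic."""
    match = PANIC_RE.search(message)
    if match is None:
        raise ValueError("no deadlock timing fields found")
    current, issued, reported = (int(group) for group in match.groups())
    stalled = current - issued
    return {
        "stalled_ticks": stalled,
        # Sanity check: the log's own 'difference' should equal current - issue.
        "matches_reported": stalled == reported,
        "stalled_us": stalled / TICKS_PER_SECOND * 1e6,
        "simulated_s": current / TICKS_PER_SECOND,
    }


# Text copied from the panic message in this thread.
sample = (
    "version: 2 request.paddr: 0x6f2d0 m_readRequestTable: 1 current time:\n"
    "1733054487000 issue_time: 1732717515000 difference: 336972000"
)
info = summarize_deadlock(sample)
print(info)
```

Under the 1 ps assumption, the request had been outstanding for roughly 337 microseconds of simulated time when the panic fired, about 1.73 simulated seconds into the run.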
-MattP
From: 1575883782 via gem5-users <gem5-users@gem5.org>
Sent: Sunday, November 6, 2022 4:57 PM
To: The gem5 Users mailing list <gem5-users@gem5.org>
Cc: 1575883782 <1575883...@qq.com>
Subject: [gem5-users] Re: Re: Re: Re: Gem5 GCN3 (GPUCoalescer detected
deadlock when running pagerank.)
Hi, Matt
I didn't change the Makefile for PageRank. In fact, I use the same PageRank
binary with the different gem5 builds (it was compiled only once).
Looking forward to your good news. Please let me know if there's anything I
can do.
Thanks.
------------------ Original ------------------
From: "The gem5 Users mailing list" <gem5-users@gem5.org>;
Date: Mon, Nov 7, 2022 6:16
To: "The gem5 Users mailing list"<gem5-users@gem5.org>;
Cc: "1575883782"<1575883...@qq.com>;"Matt
Sinclair"<sincl...@cs.wisc.edu>;
Subject: [gem5-users] Re: Re: Re: Gem5 GCN3 (GPUCoalescer detected
deadlock when running pagerank.)
Thanks, this is helpful. Regarding the trace: if this is the failure on
develop, then I don't think you need to get a trace, as the failure is
different here. But yes, ProtocolTrace would be the flag to use for this.
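As a rough sketch of how the ProtocolTrace flag could be applied to a long run like this one, the helper below assembles a gem5 command line that only starts tracing shortly before the tick of interest and writes the trace to a file. The helper name `gem5_trace_command`, the start tick, and the trace file name are hypothetical; the `--debug-start` and `--debug-file` spellings should be confirmed against `gem5.opt --help` for your build. The benchmark options from the original command line are omitted for brevity.

```python
import shlex


def gem5_trace_command(gem5_bin, config, config_args, start_tick,
                       trace_file="protocol_trace.out"):
    """Build a gem5 command that enables ProtocolTrace from start_tick on.

    gem5's own options must come before the config script; everything after
    the script is passed to the script itself.
    """
    return [
        gem5_bin,
        "--debug-flags=ProtocolTrace",
        f"--debug-start={start_tick}",   # skip tracing the long warm-up phase
        f"--debug-file={trace_file}",    # written under gem5's output directory
        config,
        *config_args,
    ]


cmd = gem5_trace_command(
    "build/GCN3_X86/gem5.opt",
    "configs/example/apu_se.py",
    ["-n", "3", "--num-compute-units", "16", "--mem-size=8GB"],
    start_tick=1732600000000,  # illustrative: a little before the stalled requests
)
print(shlex.join(cmd))
```

Delaying the trace start matters here because the failure shows up more than a trillion ticks into the run, and a protocol trace of the whole run would likely be far too large to inspect.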
Regarding PageRank, I am running just the PageRank SPMV variant from the weekly
tests in isolation, to validate if that is working. If that works and
what you ran doesn't, then perhaps there is something wrong with the docker
config (TBD though). In terms of the error, I don't think it's a
problem with the compilation. The error comes from:
https://github.com/ROCm-Developer-Tools/HIP/blob/rocm-4.0.x/rocclr/hip_global.cpp#L69,
which is happening because the kernel is not being found by HIP. I
don't know exactly why this is happening yet, but unless you changed the
Makefile for PageRank I don't see why it would be the HIP version failing.
More once I can dig further into this.
Matt
From: 1575883782 via gem5-users <gem5-users@gem5.org>
Sent: Sunday, November 6, 2022 11:37 AM
To: The gem5 Users mailing list <gem5-users@gem5.org>
Cc: 1575883782 <1575883...@qq.com>
Subject: [gem5-users] Re: Re: Gem5 GCN3 (GPUCoalescer detected deadlock when
running pagerank.)
Hi, Matt
I tried to run pagerank on the develop branch
(5d0a7b6a6cca0dc20e8b8c366db2ccc150c7480a, Thu Nov 3 16:42:53 2022), but I hit
a new error (details are below).
The error message:
```
/HIP/rocclr/hip_global.cpp:69: guarantee(false && "Cannot find Symbol")
build/GCN3_X86/sim/faults.cc:60: panic: panic condition !FullSystem occurred:
fault (General-Protection) detected @ PC
(0x7ffff6afa941=>0x7ffff6afa942).(0=>1)
Memory Usage: 19719528 KBytes
Program aborted at tick 1904281529500
```
It seems the HIP version is not correct. I wonder whether this problem occurs
because my docker image version is old (I used gcn-gpu v22-0).
The good news is that pagerank executes more instructions and prints more output
than before (although it still did not run to completion). I am not sure whether
this is random, but for now I think it's good news.
Finally, I'm relatively new to gem5 debugging. Could you give me some tips on
debugging with the trace? For example, which debug flag should I use:
--debug-flags=ProtocolTrace, or another flag more specific to the GPU?
Thanks.
------------------ Original ------------------
From: "The gem5 Users mailing list" <gem5-users@gem5.org>;
Date: Sun, Nov 6, 2022 2:15
To: "The gem5 Users mailing list"<gem5-users@gem5.org>;
Cc: "1575883782"<1575883...@qq.com>;"Matt
Sinclair"<sincl...@cs.wisc.edu>;
Subject: [gem5-users] Re: Gem5 GCN3 (GPUCoalescer detected deadlock when
running pagerank.)
Can you please try the develop branch as well? While it is good to know that
it doesn't pass on stable, it would also be good to know whether develop
already solves this.
Matt
Sent from my iPhone
On Nov 5, 2022, at 10:51 PM, 1575883782 via gem5-users <gem5-users@gem5.org>
wrote:
Thanks. I will try `--reg-alloc-policy=dynamic` (I didn't specify a
policy, so I was using the default one), and I will read further into the
trace.
I am using the stable branch. The commit is:
```
commit 39f85b7a3be1ee0ff6e375c9791dd62d23eb8a3e (HEAD -> stable, tag: v22.0.0.1, origin/stable, origin/master, origin/HEAD)
Author: Bobby R. Bruce <bbr...@ucdavis.edu>
Date: Sat Jun 18 04:59:02 2022 -0700

    misc: Update version info to v22.0.0.1
```
------------------ Original ------------------
From: "The gem5 Users mailing list" <gem5-users@gem5.org>;
Date: Sun, Nov 6, 2022 02:55 AM
To: "The gem5 Users mailing list"<gem5-users@gem5.org>;
Cc: "1575883782"<1575883...@qq.com>;"Matt
Sinclair"<sincl...@cs.wisc.edu>;
Subject: [gem5-users] Re: Gem5 GCN3 (GPUCoalescer detected deadlock when
running pagerank.)
Hi,
Ultimately this message is telling you there is a deadlock in the cache
coherence protocol when running PageRank with the configuration you used.
To fix it, you would need to get a trace
(https://www.gem5.org/documentation/learning_gem5/part3/MSIdebugging/) and
look through it to see what the problem is. If you do this and find a fix,
we definitely welcome any patches to help with this!
Having said that, I've been trying to replicate your problem. However,
the input size you are running means that gem5 will be running for a while, so
it will take a while before I can say something more definitive. We do
test PageRank as part of the weekly tests, but not specifically for 16
CUs. What branch (stable vs. develop) are you using? Also, I
recommend using --reg-alloc-policy=dynamic, as this is a more realistic
register allocation policy than the simple one (I can't tell whether you are
using it or not). In the meantime, if you can answer the above questions,
that may help us debug.
Thanks,
Matt
From: 1575883782 via gem5-users <gem5-users@gem5.org>
Sent: Saturday, November 5, 2022 3:58 AM
To: gem5-users <gem5-users@gem5.org>
Cc: 1575883782 <1575883...@qq.com>
Subject: [gem5-users] Gem5 GCN3 (GPUCoalescer detected deadlock when running
pagerank.)
Hi,
I was trying to run the PageRank benchmark with the GCN3 GPU model. I
succeeded in running PageRank with 4 CUs, but when I run it with 16 CUs, I run
into problems. The key error message is
"build/GCN3_X86/mem/ruby/system/GPUCoalescer.cc:292: warn: GPUCoalescer 10
Possible deadlock detected!" Am I missing something? I don't know how to solve
it. Could someone help me?
4 CUs command line (the default CU number is 4):
```
command line: build/GCN3_X86/gem5.opt -n 3 --mem-size=8GB
--benchmark-root=/home/ubuntu/lmy/gem5-gcn3/gem5-resources/src/gpu/pannotia -c
pagerank/bin/pagerank_spmv
'--options=/home/ubuntu/lmy/gem5-gcn3/gem5-resources/src/gpu/pannotia/pagerank/coAuthorsDBLP.graph
1'
```
16 CUs command line:
```
command line: build/GCN3_X86/gem5.opt
configs/example/apu_se.py -n 3 --num-compute-units 16 --mem-size=8GB
--benchmark-root=/home/ubuntu/lmy/gem5-gcn3/gem5-resources/src/gpu/pannotia -c
pagerank/bin/pagerank_spmv
'--options=/home/ubuntu/lmy/gem5-resources/src/gpu/pannotia/pagerank/coAuthorsDBLP.graph
1'
```
gem5 version:
```
gem5 version 22.0.0.1
gem5 compiled Jun 29 2022 10:34:02
gem5 started Nov 3 2022 14:32:39
gem5 executing on 1bcbbec61aaf, pid 1287240
```
Error message:
```
build/GCN3_X86/mem/ruby/system/GPUCoalescer.cc:292: warn: GPUCoalescer 10 Possible deadlock detected!
Printing out 763 outstanding requests in the coalesced table
Addr: [0x3b8b1c0, line 0x3b8b1c0]
  Instruction sequence number: 16871
  Type: LD
  Number of associated packets: 2
  Issue time: 1732620214000
  Difference from current tick: 280298000
Addr: [0x3b8b300, line 0x3b8b300]
  Instruction sequence number: 16871
  Type: LD
  Number of associated packets: 3
  Issue time: 1732620214000
  Difference from current tick: 280298000
Addr: [0x3b8b380, line 0x3b8b380]
  Instruction sequence number: 16871
  Type: LD
  Number of associated packets: 1
  Issue time: 1732620214000
  Difference from current tick: 280298000
Addr: [0x3b8b3c0, line 0x3b8b3c0]
  Instruction sequence number: 16871
  Type: LD
  Number of associated packets: 3
  Issue time: 1732620214000
  Difference from current tick: 280298000
Addr: [0x3b8b440, line 0x3b8b440]
  Instruction sequence number: 16871
  Type: LD
  Number of associated packets: 1
  Issue time: 1732620214000
  Difference from current tick: 280298000
Addr: [0x3b8b480, line 0x3b8b480]
  Instruction sequence number: 16871
  Type: LD
  Number of associated packets: 2
  Issue time: 1732620214000
  Difference from current tick: 280298000
Addr: [0x3b8b4c0, line 0x3b8b4c0]
  Instruction sequence number: 16871
  Type: LD
  Number of associated packets: 1
  Issue time: 1732620214000
  Difference from current tick: 280298000
Addr: [0x3b8b540, line 0x3b8b540]
  Instruction sequence number: 16871
  Type: LD
  Number of associated packets: 1
  Issue time: 1732620214000
  Difference from current tick: 280298000
Addr: [0x3b8b5c0, line 0x3b8b5c0]
  Instruction sequence number: 16871
  Type: LD
  Number of associated packets: 2
  Issue time: 1732620214000
  Difference from current tick: 280298000
Addr: [0x3b8b680, line 0x3b8b680]
  Instruction sequence number: 16871
  Type: LD
  Number of associated packets: 1
  Issue time: 1732620214000
  Difference from current tick: 280298000
Addr: [0x3b8b740, line 0x3b8b740]
  Instruction sequence number: 16871
  Type: LD
  Number of associated packets: 3
  Issue time: 1732620214000
  Difference from current tick: 280298000
Addr: [0x3b8b7c0, line 0x3b8b7c0]
...................................
  Difference from current tick: 17915000
Addr: [0x4c60b40, line 0x4c60b40]
  Instruction sequence number: 16552
  Type: LD
  Number of associated packets: 1
  Issue time: 1732882652000
  Difference from current tick: 17860000
Listing pending packets from 0 instructions
build/GCN3_X86/mem/ruby/system/GPUCoalescer.cc:294: panic: Aborting due to deadlock!
Memory Usage: 19939216 KBytes
Program aborted at tick 1732900512000
--- BEGIN LIBC BACKTRACE ---
/home/ubuntu/lmy/gem5-gcn3/gem5/build/GCN3_X86/gem5.opt(+0x4fb330)[0x55f2ea122330]
/home/ubuntu/lmy/gem5-gcn3/gem5/build/GCN3_X86/gem5.opt(+0x5297ee)[0x55f2ea1507ee]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x143c0)[0x7fe799cb63c0]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7fe798e5e03b]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7fe798e3d859]
/home/ubuntu/lmy/gem5-gcn3/gem5/build/GCN3_X86/gem5.opt(+0x512b15)[0x55f2ea139b15]
/home/ubuntu/lmy/gem5-gcn3/gem5/build/GCN3_X86/gem5.opt(+0xffa194)[0x55f2eac21194]
/home/ubuntu/lmy/gem5-gcn3/gem5/build/GCN3_X86/gem5.opt(+0x515ed2)[0x55f2ea13ced2]
/home/ubuntu/lmy/gem5-gcn3/gem5/build/GCN3_X86/gem5.opt(+0x553944)[0x55f2ea17a944]
/home/ubuntu/lmy/gem5-gcn3/gem5/build/GCN3_X86/gem5.opt(+0x55469e)[0x55f2ea17b69e]
/home/ubuntu/lmy/gem5-gcn3/gem5/build/GCN3_X86/gem5.opt(+0x1c5b422)[0x55f2eb882422]
/home/ubuntu/lmy/gem5-gcn3/gem5/build/GCN3_X86/gem5.opt(+0x4a3e27)[0x55f2ea0cae27]
/lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x2a8738)[0x7fe799f6f738]
/lib/x86_64-linux-gnu/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x8dd8)[0x7fe799d44f48]
/lib/x86_64-linux-gnu/libpython3.8.so.1.0(_PyEval_EvalCodeWithName+0x8fb)[0x7fe799e91e3b]
/lib/x86_64-linux-gnu/libpython3.8.so.1.0(_PyFunction_Vectorcall+0x94)[0x7fe799f6f114]
/lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x74d6d)[0x7fe799d3bd6d]
/lib/x86_64-linux-gnu/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x7d86)[0x7fe799d43ef6]
/lib/x86_64-linux-gnu/libpython3.8.so.1.0(_PyEval_EvalCodeWithName+0x8fb)[0x7fe799e91e3b]
/lib/x86_64-linux-gnu/libpython3.8.so.1.0(PyEval_EvalCodeEx+0x42)[0x7fe799e921c2]
/lib/x86_64-linux-gnu/libpython3.8.so.1.0(PyEval_EvalCode+0x1f)[0x7fe799e925af]
/lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x1cfbf1)[0x7fe799e96bf1]
/lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x25f537)[0x7fe799f26537]
/lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x74d6d)[0x7fe799d3bd6d]
/lib/x86_64-linux-gnu/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x12fd)[0x7fe799d3d46d]
/lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x8006b)[0x7fe799d4706b]
/lib/x86_64-linux-gnu/libpython3.8.so.1.0(PyVectorcall_Call+0x60)[0x7fe799f6f830]
/home/ubuntu/lmy/gem5-gcn3/gem5/build/GCN3_X86/gem5.opt(+0x52b704)[0x55f2ea152704]
/home/ubuntu/lmy/gem5-gcn3/gem5/build/GCN3_X86/gem5.opt(+0x423666)[0x55f2ea04a666]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7fe798e3f0b3]
/home/ubuntu/lmy/gem5-gcn3/gem5/build/GCN3_X86/gem5.opt(+0x492f0e)[0x55f2ea0b9f0e]
--- END LIBC BACKTRACE ---
```
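With 763 outstanding requests in the dump, it can help to summarize the table rather than read it entry by entry. The following is a minimal sketch that parses entries in the format shown above, counts requests per instruction sequence number, and finds the longest-stalled request. The function name `summarize_coalescer_dump` and the field names are hypothetical, and the regular expression is written against the text as it appears in this message, not against gem5's source.

```python
import re
from collections import Counter

# One coalesced-table entry; \s* / \s+ absorb the line wrapping seen in mail.
ENTRY_RE = re.compile(
    r"Addr:\s*\[(0x[0-9a-fA-F]+),\s*line\s+(0x[0-9a-fA-F]+)\]\s*"
    r"Instruction sequence number:\s*(\d+)\s*"
    r"Type:\s*(\w+)\s*"
    r"Number of\s+associated packets:\s*(\d+)\s*"
    r"Issue time:\s*(\d+)\s*"
    r"Difference from current tick:\s*(\d+)"
)


def summarize_coalescer_dump(text):
    """Parse the dump and return (entries, requests per instruction, oldest)."""
    entries = [
        {
            "addr": int(addr, 16),
            "seq": int(seq),
            "type": req_type,
            "packets": int(packets),
            "issue": int(issue),
            "stalled": int(stalled),
        }
        for addr, _line, seq, req_type, packets, issue, stalled
        in ENTRY_RE.findall(text)
    ]
    per_instruction = Counter(entry["seq"] for entry in entries)
    oldest = max(entries, key=lambda e: e["stalled"]) if entries else None
    return entries, per_instruction, oldest


# Three entries copied from the dump above, including its line wrapping.
sample = """Addr: [0x3b8b1c0, line 0x3b8b1c0]
Instruction sequence number: 16871
Type: LD
Number of
associated packets: 2
Issue time: 1732620214000
Difference from current tick: 280298000 Addr: [0x3b8b300,
line 0x3b8b300] Instruction sequence number: 16871
Type: LD
Number of
associated packets: 3
Issue time: 1732620214000
Difference from current tick: 280298000 Addr: [0x4c60b40, line 0x4c60b40]
Instruction sequence number: 16552
Type: LD
Number of
associated packets: 1
Issue time: 1732882652000
Difference from current tick: 17860000Listing pending
packets from 0 instructions"""

entries, per_instruction, oldest = summarize_coalescer_dump(sample)
print(len(entries), dict(per_instruction), hex(oldest["addr"]))
```

If most entries share one instruction sequence number and issue time, as the visible part of this dump suggests, that points at a single wide load whose responses never came back rather than many unrelated stalls.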
_______________________________________________
gem5-users mailing list -- gem5-users@gem5.org
To unsubscribe send an email to gem5-users-le...@gem5.org