If you cannot use Docker, then I recommend using the commands Kyle had in
the old Dockerfiles when installing ROCm.  Manually building the way you
are is extremely error prone.  I don't know exactly what the problem(s) is,
but I'm pretty sure the variable is HCC_AMDGPU_TARGET, not
HSA_AMDGPU_GPU_TARGET.  CMake most likely just ignored
HSA_AMDGPU_GPU_TARGET when you specified it, since it's not a variable the
build actually reads.  Kyle, do you have a pointer to the commit that
updated the Dockerfile to 4.0?  (Maybe this:
https://github.com/KyleRoarty/gem5_docker/blob/ml/Dockerfile?  Or is it out
of date relative to what was used before the ROCm 4.0 update?)
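For example, the build command you posted below would become something like
this (the install prefix and ROCM_ROOT paths are your own; the only change
is the variable name):

```shell
# Same cmake invocation as below, but with the variable name HCC's
# build actually reads (HCC_AMDGPU_TARGET) instead of the ignored one.
cmake -DCMAKE_INSTALL_PREFIX=rocm/hcc \
      -DROCM_ROOT=rocm \
      -DHCC_AMDGPU_TARGET="gfx801" \
      -DCMAKE_BUILD_TYPE=Release ..
```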

You can certainly try cherry-picking those commits onto your older branch.
VIPER is not frequently updated, so it's reasonably likely the patches will
apply cleanly.
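Roughly, the workflow would look like this (the commit hashes are
placeholders; grab the real ones from the two Gerrit changes I linked in my
earlier mail):

```shell
# Fetch the upstream branch that contains the two VIPER TCC fixes
# (Gerrit changes 51367 and 51368), then apply them to your tree.
git fetch https://gem5.googlesource.com/public/gem5 develop
# <hash-51367> and <hash-51368> are placeholders for the real commit IDs.
git cherry-pick <hash-51367> <hash-51368>
```

If either commit conflicts with your protocol changes, git will stop and
let you resolve the conflict before continuing.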

Matt

On Sat, Oct 9, 2021 at 12:12 PM Sampad Mohapatra via gem5-users <
gem5-users@gem5.org> wrote:

> Hi Matt,
>
> Thanks for the quick reply.
>
> I am running the benchmarks on research clusters where running Docker is
> not permitted, so I have to build and install everything locally.
> I have made modifications to the coherence protocol, and porting them to
> a newer gem5 version may take some time, so I am stuck with v21.0.0 for
> now.
> The modifications are basically flags to identify certain packet types,
> so I am assuming that I haven't broken the protocol.
> Also, I have run the *square* benchmark and *2DConvolution*, *FDTD-2D* to
> completion (compared with the CPU execution results) for smaller input
> sizes.
> If this version of gem5 supports anything higher than ROCm 1.6.x, I will
> try to build and use it.
>
> To build hcc, I used the following command. I looked at the
> CMakeLists.txt of the other dependencies, but they don't seem to use the
> HSA_AMDGPU_GPU_TARGET variable:
> cmake -DCMAKE_INSTALL_PREFIX=rocm/hcc -DROCM_ROOT=rocm
> -DHSA_AMDGPU_GPU_TARGET="gfx801" -DCMAKE_BUILD_TYPE=Release ..
>
> And I build Polybench using:
> hipcc --amdgpu-target=gfx801 -O2 2DConvolution.cpp -Igem5/include
> -Lgem5/util/m5/build/x86/out -Lgcc/lib64 -o 2DConvolution.exe -lm5
>
> I do remember that while compiling HCC, the *bin/cmake-tests* build was
> failing because it was using the generated *clang++*, which was unable to
> find *libstdc++.so*.
> Maybe LIBRARY_PATH is ignored (at compile time) by the generated clang++.
> So, I modified the generated CMake file to add "-Lgcc/lib64" to it so
> that *make* and *make install* complete. The downside is that I have to
> explicitly pass "-Lgcc/lib64" while compiling benchmarks with hipcc.
> Also, *square* completes, so I think LD_LIBRARY_PATH works (at runtime).
>
> I did see the commits you recently merged, but I wasn't sure whether I
> could retroactively add them to v21.0.0, which also has my own
> modifications. Should I go ahead and make the VIPER_TCC changes?
>
> Also, I will definitely try to submit the benchmarks if they work out.
>
> Regards,
> Sampad
>
> On Sat, Oct 9, 2021 at 12:34 PM Matt Sinclair via gem5-users <
> gem5-users@gem5.org> wrote:
>
>> Hi Sampad,
>>
>> I have not seen anyone attempt to run workloads the way you are
>> attempting, so I can't offer every solution, but here are a few things I
>> noticed:
>>
>> - Why are you still using ROCm 1.6.x?  And why did you build it from
>> source?  I strongly recommend using the built-in Docker support (which
>> supports ROCm 4.0 now).  Error #4 is almost certainly because something
>> you built from source was not built correctly.  The possible causes of
>> this error are disparate, though, so I can't suggest anything specific
>> about how to fix it.  Basically, that error means something went wrong
>> when running the application, which almost always (in my experience) is
>> due to not installing ROCm correctly.  If you need to continue with
>> ROCm 1.6.x, I would recommend looking at the old commits before ROCm 4.0
>> support was added and using the Docker support there.
>>
>> - Error #3 likely comes from how you are compiling the program with
>> hipcc/hcc.  Depending on which commit you are using, you can only use
>> gfx801, gfx803, gfx900, or gfx902.  Since you seem to be using a
>> slightly older setup, the issue is probably that you are compiling for
>> something other than gfx801 (also, if you are compiling for gfx803 or
>> gfx900, did you use the -dgpu flag on the command line?).  Error #1 is
>> likely related to this too.
>>
>> - Error #2 will require getting a Ruby trace and looking at what's
>> happening with those addresses (ProtocolTrace debug flag is the most
>> important flag to use).  You may find the following useful:
>> https://www.gem5.org/documentation/learning_gem5/part3/MSIdebugging/.
>> Having said that, note that I recently merged two fixes to the VIPER TCC
>> that may be relevant/useful:
>> https://gem5-review.googlesource.com/c/public/gem5/+/51368,
>> https://gem5-review.googlesource.com/c/public/gem5/+/51367
>>
>> Finally, Polybench is not officially supported.  If you get the
>> benchmarks working, it would be great if you submitted them to
>> gem5-resources (
>> resources.gem5.org/) to allow others to use them too!
>>
>> Thanks,
>> Matt
>>
>> On Sat, Oct 9, 2021 at 9:47 AM Sampad Mohapatra via gem5-users <
>> gem5-users@gem5.org> wrote:
>>
>>> Hi All,
>>>
>>> I am running gem5 v21.0.0.0 with ROCm v1.6.x (built from source). The
>>> simulations use one host CPU (its paired core runs a tiny binary and
>>> exits quickly) to launch a GPU benchmark (hipified Polybench GPU), and
>>> one CPU of a separate core pair (its second core runs a lightweight
>>> binary and exits quickly) to launch a SPEC CPU 2017 benchmark, all on a
>>> 3x3 mesh network. I am facing four different kinds of errors and would
>>> appreciate some help with them. The GPU benchmarks do mallocs of sizes
>>> ranging from 2 GB to 10 GB. The errors appear on various combinations
>>> of CPU and GPU benchmarks.
>>>
>>> (1) The below error appears and disappears on different simulation
>>> runs:
>>> """""
>>> fdtd2d: ../ROCR-Runtime/src/core/runtime/amd_gpu_agent.cpp:577: virtual
>>> void amd::GpuAgent::InitDma(): Assertion `queues_[QueueBlitOnly] != __null
>>> && "Queue creation failed"' failed.
>>> """""
>>>
>>> (2) Similar errors appear with varying values:
>>> """""
>>> panic: Possible Deadlock detected. Aborting!
>>> version: 4 request.paddr: 0x190b80c uncoalescedTable: 4 current time:
>>> 12393604096000 issue_time: 12393350811000 difference: 253285000
>>> Request Tables:
>>>
>>> Listing pending packets from 4 instructions     Addr: [0x2379b, line
>>> 0x23780] with 0 pending packets
>>>         Addr: [0x237ae, line 0x23780] with 64 pending packets
>>>         Addr: [0x237b0, line 0x23780] with 56 pending packets
>>>         Addr: [0x237b5, line 0x23780] with 61 pending packets
>>> Memory Usage: 57420616 KBytes
>>> """""
>>>
>>> (3) The below error appears and disappears on different simulation runs:
>>> """""
>>> There is no device can be used to do the computation
>>> """""
>>>
>>> (4) The below error appears and disappears on different simulation runs:
>>> """""
>>> fatal: syscall mincore (#27) unimplemented.
>>> """""
>>>
>>> Thanks and Regards,
>>> Sampad Mohapatra
>>> _______________________________________________
>>> gem5-users mailing list -- gem5-users@gem5.org
>>> To unsubscribe send an email to gem5-users-le...@gem5.org
>>
>