Answers:

1. Yes, I believe so. However, I have never personally tried using the O3 model with the GPU. Matt P has, I believe, so he may have better feedback there.

2. I have not followed the chain of events all the way through here, but I *believe* that the builtin you highlighted is used at the compiler level by HIPCC/LLVM to generate the appropriate assembly for a given AMD GPU. In this case (gfx900), I believe there is a 1-to-1 correspondence between this builtin and an s_sleep assembly instruction (maybe with the addition of a v_mov-type instruction before it to set the register to the appropriate sleep value). I am not aware of the s_sleep builtin requiring OS calls (or emulation). What you have described, though, is the more general issue with SE mode (CPU, GPU, etc.): because SE mode does not model OS calls, the fidelity of anything involving the OS will be lower. Perhaps a trite way to answer this is: if the fidelity of the OS calls is important for the applications you are studying, then I strongly recommend using FS mode.
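For concreteness, here is a rough sketch of the pattern in question (the lock layout and sleep amount are made up, not HeteroSync's actual code); assuming HIP targeting gfx900, the builtin should lower to a single s_sleep instruction with no OS involvement:

    // Hypothetical sleep-backoff mutex -- a sketch of the code pattern
    // being asked about, not HeteroSync's actual implementation.
    #include <hip/hip_runtime.h>

    __device__ void sleep_mutex_lock(unsigned int *lock)
    {
        // Try to take the lock with a CAS; on failure, back off with
        // s_sleep instead of hammering the cache line.
        while (atomicCAS(lock, 0u, 1u) != 0u) {
            // Lowers to the s_sleep instruction (the operand must be a
            // compile-time constant); entirely in hardware, no syscall.
            __builtin_amdgcn_s_sleep(2);
        }
    }

    __device__ void sleep_mutex_unlock(unsigned int *lock)
    {
        __threadfence();        // publish critical-section writes
        atomicExch(lock, 0u);   // release the lock
    }

If you disassemble the compiled kernel (e.g., with llvm-objdump), you should be able to see the s_sleep encoding directly in the instruction stream, which is why SE vs. FS mode should not matter for this particular primitive.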
Hope this helps,
Matt S.

On Tue, Jul 4, 2023 at 6:01 AM Anoop Mysore <mysan...@gmail.com> wrote:

> Thank you so much for the kind and detailed explanations!
>
> Just to clarify: I can use the APU config (apu_se.py) and switch to an O3 CPU, and I would still have the detailed GPU model and the disconnected Ruby model that synchronizes between CPU and GPU at the system-level directory -- is that correct?
>
> Last question: when using the APU config to simulate HeteroSync, which, for example, has a sleep mutex primitive that invokes __builtin_amdgcn_s_sleep(), is there any OS involvement? If yes, would SE mode's emulation of those syscalls inexorably sacrifice fidelity in a way that could be argued to lead to inaccurate evaluations of heterogeneous coherence implementations? Or are there any other factors of insufficient fidelity that might be important in this regard?
>
> On Fri, Jun 30, 2023 at 7:40 PM Matt Sinclair <mattdsinclair.w...@gmail.com> wrote:
>
>> Just to follow up on 4 and 5:
>>
>> 4. The synchronization should happen at the directory level here, since this is the first level of the memory system where both the CPU and GPU are connected. However, if the programmer sets the GLC bit (which should perform the atomic at the GPU's LLC), I have not tested whether Ruby has the functionality to send invalidations as appropriate to allow this. I suspect it would work as is, but I would have to check.
>>
>> 5. Yeah, for the reasons Matt P already stated, O3 is not currently supported in GPUFS, so GPUSE would be a better option here. Yes, you can use the apu_se.py script as the base script for running GPUSE experiments. There are a number of examples on gem5-resources for how to get started with this (including HeteroSync), but I normally recommend starting with square if you haven't used the GPU model before: https://gem5.googlesource.com/public/gem5-resources/+/refs/heads/develop/src/gpu/square/. In terms of support for synchronization at different levels of the memory hierarchy, by default the GPU VIPER coherence protocol assumes that all synchronization happens at the system level (at the directory, in the current implementation). However, one of my students will be pushing updates (hopefully today) that allow non-system-level support (e.g., the GPU LLC "GLC" level as mentioned above).
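To make those two synchronization levels concrete, here is a minimal sketch, assuming HIP's CUDA-style *_system atomics; which cache level services each scope is a property of the coherence protocol (VIPER here), not of the source code:

    // Sketch of device-scope vs. system-scope atomics, assuming HIP's
    // CUDA-compatible *_system variants are available.
    #include <hip/hip_runtime.h>

    __global__ void scoped_counters(unsigned int *dev_ctr,
                                    unsigned int *sys_ctr)
    {
        // Device scope: only other GPU threads need to observe it, so
        // it can complete at the GPU LLC (the "GLC" path noted above).
        atomicAdd(dev_ctr, 1u);

        // System scope: the CPU must observe it too, so in the current
        // VIPER implementation it is performed at the directory, the
        // first level shared by CPU and GPU.
        atomicAdd_system(sys_ctr, 1u);
    }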
>> It sounds like you want to change the cache hierarchy and coherence protocol to add another level of cache (the L3) before the directory and after the CPU/GPU LLCs? If so, you would need to change the current Ruby support to add this additional level and the appropriate transitions. However, if you instead meant that you are thinking of the directory level as synchronizing between the CPU and GPU, then you could use the support as is without any changes (I think).
>>
>> Hope this helps,
>> Matt S.
>>
>> On Fri, Jun 30, 2023 at 12:05 PM Poremba, Matthew via gem5-users <gem5-users@gem5.org> wrote:
>>
>>> Hi,
>>>
>>> No worries about the questions! I will try to answer them all, so this will be a long email 😊:
>>>
>>> The disconnected (or disjoint) Ruby network is essentially the same as the APU Ruby network used in SE mode -- that is, it combines two Ruby protocols in one protocol (MOESI_AMD_base and GPU_VIPER). They are disjoint because there are no paths / network links between the GPU and CPU side, simulating a discrete GPU. These protocols work together because they use the same network messages / virtual channels to the directory -- basically, you cannot simply drop in another CPU protocol and have it work.
>>>
>>> Atomic CPU started working **very** recently -- as in this week. It is on the review board right now and I believe it might be part of the gem5 v23.0 release. The reason Atomic and KVM CPUs are required is that they use the atomic_noncaching memory mode and basically bypass the CPU cache. The timing CPUs (Timing and O3) try to generate routes to the GPU side, which causes deadlocks. I have not had any time to look into this further, but that is the status.
>>>
>>> | are the GPU applications run on KVM?
>>>
>>> The CPU portion of GPU applications runs on KVM. The GPU is simulated in timing mode, so the compute units, cache, memory, etc. are all simulated with events. For an application that simply launches GPU kernels, the CPU is just waiting for the kernels to finish.
>>>
>>> For your other questions:
>>>
>>> 1. Unfortunately no, it is not that easy. There is a still-outstanding bug with timing CPUs -- we focused on the Atomic CPU recently as a way to allow users who aren't able to use KVM to use the GPU model.
>>>
>>> 2. KVM exits whenever there is a memory request outside of its VM range. The PCI address range is outside the VM range, so, for example, when the CPU writes to PCI space it will trigger an event for the GPU. The only Ruby involvement here is that Ruby will send all requests outside of its memory range to the IO bus (KVM or not).
>>>
>>> 3. The MMIO trace is only used to load the GPU driver and is not used in applications. It basically contains some reasonable register values for anything that is not modeled in gem5, so that we do not need to model them (e.g., graphics, power management, video encode/decode, etc.). This is not required for compute-only GPU variants, but that is a different topic.
>>>
>>> 4. I'm not familiar enough with this particular application to answer this question.
>>>
>>> 5. I think you will need to use SE mode to do what you are trying to do. Full system mode uses the real GPU driver, ROCm stack, etc., which currently does not support any APU-like devices. SE mode is able to do this by making use of an emulated driver.
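As a schematic illustration of point 2 above (not gem5 code; the doorbell pointer is hypothetical and would come from the driver's mapping of a PCI BAR in practice): from the CPU's point of view, kicking the GPU is just an ordinary store, and it is the target address range, not the instruction, that triggers the KVM exit:

    // Schematic host-side doorbell write. The pointer is assumed to
    // come from the driver's PCI BAR mapping; the names are made up.
    #include <cstdint>

    void ring_doorbell(volatile uint32_t *doorbell, uint32_t wptr)
    {
        // A plain store, but the physical address lies in PCI MMIO
        // space, outside the KVM memory slots. The write therefore
        // exits to gem5, which routes it via the IO bus to the GPU
        // device model as an event.
        *doorbell = wptr;
    }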
>>> -Matt
>>>
>>> *From:* Anoop Mysore via gem5-users <gem5-users@gem5.org>
>>> *Sent:* Friday, June 30, 2023 8:43 AM
>>> *To:* The gem5 Users mailing list <gem5-users@gem5.org>
>>> *Cc:* Anoop Mysore <mysan...@gmail.com>
>>> *Subject:* [gem5-users] Re: Replacing CPU model in GPU-FS
>>>
>>> It appears the host part of GPU applications is indeed executed on KVM, from: https://www.gem5.org/assets/files/workshop-isca-2023/slides/improving-gem5s-gpufs-support.pdf.
>>>
>>> A few more questions:
>>>
>>> 1. I missed that it isn't mentioned whether O3 CPU models are supported -- would switching be as easy as changing the `cpu_type` in the config file and running? I intend to run with the latest O3 CPU config I have (an Intel CPU).
>>>
>>> 2. The Ruby network that's used -- is it intercepting (perhaps just MMIO) memory operations from the KVM CPU? Could you please briefly describe how Ruby works with both KVM and the GPU (or point me to any document)?
>>>
>>> 3. The GPU MMIO trace we pass during simulator invocation -- what exactly is this? If it's a trace of the kernel driver/CPU's MMIO calls into the GPU, how is it portable across different programs within a benchmark suite -- HeteroSync, for example?
>>>
>>> 4. In HeteroSync, there's fine-grained synchronization between CPU and GPU in many apps. If I use vega10_kvm.py, which has a discrete GPU with a KVM CPU, where do the synchronizations happen?
>>>
>>> 5. If I want to move to an integrated GPU model with an O3 CPU (the only requirement is the shared LLC) -- are there any resources that can help me? I do see a bootcamp that uses apu_se.py -- can this be utilized at least partially to support a full-system O3 CPU + integrated GPU? Are there any modifications that need to be made to support synchronization in the L3?
>>>
>>> Please excuse the jumbled questions; I am in the process of gaining more clarity.
>>>
>>> On Fri, Jun 30, 2023 at 12:10 PM Anoop Mysore <mysan...@gmail.com> wrote:
>>>
>>> According to the GPU-FS blog <https://www.gem5.org/2023/02/13/moving-to-full-system-gpu.html>:
>>>
>>> "*Currently KVM and X86 are required to run full system. Atomic and Timing CPUs are not yet compatible with the disconnected Ruby network required for GPUFS and is a work in progress.*"
>>>
>>> My understanding is that KVM is used to boot Ubuntu; so, are the GPU applications run on KVM? Also, what does "disconnected" Ruby network mean there?
>>>
>>> If so, is there any work in progress that I can use to develop on, or (noob-friendly) documentation of what needs to be done to extend the support to Atomic/O3 CPUs?
>>>
>>> For a project I'm working on, I need complete visibility into the CPU+GPU cache hierarchy + perhaps a few more custom probes; could you comment on whether this would be restrictive if going with KVM in the meantime, given that it leverages the host for the virtualized HW?
>>>
>>> Please let me know if I have got any of this wrong or if there are other details you think would be useful.