I understand; thanks again for the details.

On Wed, Jul 5, 2023 at 7:10 PM Matt Sinclair <mattdsinclair.w...@gmail.com> wrote:
> Answers:
>
> 1. Yes, I believe so. However, I have never personally tried using the O3 model with the GPU. Matt P has, I believe, so he may have better feedback there.
>
> 2. I have not followed the chain of events all the way through here, but I *believe* that the builtin you highlighted is used at the compiler level by HIPCC/LLVM to generate the appropriate assembly for a given AMD GPU. In this case (gfx900), I believe there is a 1-1 correlation with this builtin becoming an s_sleep assembly instruction (maybe with the addition of a v_mov-type instruction before it to set the register to the appropriate sleep value). I am not aware of the s_sleep builtin requiring OS calls (or emulation). But what you have described is more generally the issue with SE mode (CPU, GPU, etc.) -- because SE mode does not model OS calls, the fidelity of anything involving the OS will be lower. Perhaps a trite way to answer this is: if the fidelity of the OS calls is important for the applications you are studying, then I strongly recommend using FS mode.
>
> Hope this helps,
> Matt S.
>
> On Tue, Jul 4, 2023 at 6:01 AM Anoop Mysore <mysan...@gmail.com> wrote:
>
>> Thank you so much for the kind and detailed explanations!
>>
>> Just to clarify: I can use the APU config (apu_se.py) and switch out to an O3 CPU, and I would still have the detailed GPU model, and the disconnected Ruby model that synchronizes between CPU and GPU at the system-level directory -- is that correct?
>>
>> Last question: when using the APU config for simulating HeteroSync which, for example, has a sleep mutex primitive that invokes __builtin_amdgcn_s_sleep(), is there any OS involvement? If yes, would SE mode's emulation of those syscalls inexorably sacrifice any fidelity that could be argued leads to inaccurate evaluations of heterogeneous coherence implementations?
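[Editor's note: as context for the sleep-mutex exchange above, the pattern can be sketched roughly as below. This is a hedged, portable stand-in, not HeteroSync's actual code: the function names and the host-side fallback are assumptions. On an AMD GPU the backoff would be `__builtin_amdgcn_s_sleep(n)`, which HIPCC/LLVM lowers directly to an `s_sleep` instruction (a hardware wait of roughly `n*64` cycles, no OS call); a `yield` is substituted here so the sketch compiles with an ordinary host compiler.]

```cpp
#include <atomic>
#include <thread>

// Hypothetical sketch of a sleep-backoff mutex in the spirit of
// HeteroSync's sleep mutex (not its actual code). On an AMD GPU the
// backoff would be __builtin_amdgcn_s_sleep(2), lowered straight to an
// s_sleep instruction -- a hardware wait with no OS involvement.
#if defined(__AMDGCN__)
#define BACKOFF() __builtin_amdgcn_s_sleep(2)  // wait roughly 2*64 cycles in hardware
#else
#define BACKOFF() std::this_thread::yield()    // host-side stand-in for illustration
#endif

inline void sleep_mutex_lock(std::atomic<unsigned> &lock) {
    unsigned expected = 0;
    // Spin until the lock word flips from 0 (free) to 1 (held),
    // backing off between attempts instead of hammering the CAS.
    while (!lock.compare_exchange_weak(expected, 1u,
                                       std::memory_order_acquire)) {
        expected = 0;
        BACKOFF();
    }
}

inline void sleep_mutex_unlock(std::atomic<unsigned> &lock) {
    lock.store(0u, std::memory_order_release);
}
```

The point relevant to SE-mode fidelity is that the lock/unlock path above touches only atomics and a hardware sleep instruction, so no syscall emulation is on the critical path.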
>> Or are there any other factors of insufficient fidelity that might be important in this regard?
>>
>> On Fri, Jun 30, 2023 at 7:40 PM Matt Sinclair <mattdsinclair.w...@gmail.com> wrote:
>>
>>> Just to follow up on 4 and 5:
>>>
>>> 4. The synchronization should happen at the directory level here, since this is the first level of the memory system where both the CPU and GPU are connected. However, if the programmer sets the GLC bit (which should perform the atomic at the GPU's LLC), I have not tested whether Ruby has the functionality to send invalidations as appropriate to allow this. I suspect it would work as is, but would have to check ...
>>>
>>> 5. Yeah, for the reasons Matt P already stated, O3 is not currently supported in GPUFS. So GPUSE would be a better option here. Yes, you can use the apu_se.py script as the base script for running GPUSE experiments. There are a number of examples on gem5-resources for how to get started with this (including HeteroSync), but I normally recommend starting with square if you haven't used the GPU model before: https://gem5.googlesource.com/public/gem5-resources/+/refs/heads/develop/src/gpu/square/. In terms of support for synchronization at different levels of the memory hierarchy, by default the GPU VIPER coherence protocol assumes that all synchronization happens at the system level (at the directory, in the current implementation). However, one of my students will be pushing updates (hopefully today) that allow non-system-level support (e.g., the GPU LLC "GLC" level as mentioned above). It sounds like you want to change the cache hierarchy and coherence protocol to add another level of cache (the L3) before the directory and after the CPU/GPU LLCs? If so, you would need to change the current Ruby support to add this additional level and the appropriate transitions to do so.
>>> However, if you instead meant that you are thinking of the directory level as synchronizing between the CPU and GPU, then you could use the support as is without any changes (I think).
>>>
>>> Hope this helps,
>>> Matt S.
>>>
>>> On Fri, Jun 30, 2023 at 12:05 PM Poremba, Matthew via gem5-users <gem5-users@gem5.org> wrote:
>>>
>>>> Hi,
>>>>
>>>> No worries about the questions! I will try to answer them all, so this will be a long email 😊:
>>>>
>>>> The disconnected (or disjoint) Ruby network is essentially the same as the APU Ruby network used in SE mode -- that is, it combines two Ruby protocols in one protocol (MOESI_AMD_base and GPU_VIPER). They are disjoint because there are no paths / network links between the GPU and CPU side, simulating a discrete GPU. These protocols work together because they use the same network messages / virtual channels to the directory -- basically, you cannot simply drop in another CPU protocol and have it work.
>>>>
>>>> The Atomic CPU is working **very** recently -- as in this week. It is on the review board right now and I believe might be part of the gem5 v23.0 release. However, the reason the Atomic and KVM CPUs are required is that they use the atomic_noncaching memory mode and basically bypass the CPU cache. The timing CPUs (Timing and O3) try to generate routes to the GPU side, which causes deadlocks. I have not had any time to look into this further, but that is the status.
>>>>
>>>> | are the GPU applications run on KVM?
>>>>
>>>> The CPU portion of GPU applications runs on KVM. The GPU is simulated in timing mode, so the compute units, cache, memory, etc. are all simulated with events. For an application that simply launches GPU kernels, the CPU is just waiting for the kernels to finish.
>>>>
>>>> For your other questions:
>>>>
>>>> 1.
>>>> Unfortunately no, it is not this easy. There is an issue with timing CPUs that is still an outstanding bug -- we focused on the Atomic CPU recently as a way to allow users who aren't able to use KVM to be able to use the GPU model.
>>>>
>>>> 2. KVM exits whenever there is a memory request outside of its VM range. The PCI address range is outside the VM range, so, for example, when the CPU writes to PCI space it will trigger an event for the GPU. The only Ruby involvement here is that Ruby will send all requests outside of its memory range to the IO bus (KVM or not).
>>>>
>>>> 3. The MMIO trace is only used to load the GPU driver and is not used in applications. It basically contains some reasonable register values for anything that is not modeled in gem5 so that we do not need to model them (e.g., graphics, power management, video encode/decode, etc.). This is not required for compute-only GPU variants, but that is a different topic.
>>>>
>>>> 4. I'm not familiar enough with this particular application to answer this question.
>>>>
>>>> 5. I think you will need to use SE mode to do what you are trying to do. Full system mode uses the real GPU driver, ROCm stack, etc., which currently does not support any APU-like devices. SE mode is able to do this by making use of an emulated driver.
>>>>
>>>> -Matt
>>>>
>>>> *From:* Anoop Mysore via gem5-users <gem5-users@gem5.org>
>>>> *Sent:* Friday, June 30, 2023 8:43 AM
>>>> *To:* The gem5 Users mailing list <gem5-users@gem5.org>
>>>> *Cc:* Anoop Mysore <mysan...@gmail.com>
>>>> *Subject:* [gem5-users] Re: Replacing CPU model in GPU-FS
>>>> It appears the host part of GPU applications is indeed executed on KVM, from: https://www.gem5.org/assets/files/workshop-isca-2023/slides/improving-gem5s-gpufs-support.pdf
>>>>
>>>> A few more questions:
>>>>
>>>> 1. I notice it isn't mentioned that O3 CPU models aren't supported -- would getting one to work be as easy as changing the `cpu_type` in the config file and running? I intend to run with the latest O3 CPU config I have (an Intel CPU).
>>>>
>>>> 2. The Ruby network that's used -- is it intercepting (perhaps just MMIO) memory operations from the KVM CPU? Could you please briefly describe how Ruby works with both KVM and the GPU (or point me to any document)?
>>>>
>>>> 3. The GPU MMIO trace we pass during simulator invocation -- what exactly is this? If it's a trace of the kernel driver/CPU's MMIO calls into the GPU, how is it portable across different programs within a benchmark suite -- HeteroSync, for example?
>>>>
>>>> 4. In HeteroSync, there's fine-grain synchronization between the CPU and GPU in many apps. If I use vega10_kvm.py, which has a discrete GPU with a KVM CPU, where do the synchronizations happen?
>>>>
>>>> 5. If I want to move to an integrated GPU model with an O3 CPU (the only requirement is the shared LLC) -- are there any resources that can help me? I do see a bootcamp that uses apu_se.py -- can this be utilized at least partially to support a full-system O3 CPU + integrated GPU? Are there any modifications that need to be made to support synchronizations in the L3?
>>>>
>>>> Please excuse the jumbled questions, I am in the process of gaining more clarity.
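[Editor's note: for readers following along, a typical GPUSE run of the square sample recommended earlier in the thread looks roughly like the following. The build target name (GCN3_X86 in older releases vs. VEGA_X86 in newer ones), paths, and flags vary by gem5 version, and the path to the square binary is an assumption, so treat this as an illustrative sketch rather than an exact recipe.]

```shell
# Build gem5 with the GPU-enabled X86 target (target name varies by
# gem5 version: GCN3_X86 in older releases, VEGA_X86 in newer ones).
scons -j"$(nproc)" build/GCN3_X86/gem5.opt

# Run the square sample from gem5-resources in SE mode via apu_se.py.
# -n 3 provides the CPU threads the emulated ROCm driver expects; the
# binary path below is illustrative, not a fixed location.
build/GCN3_X86/gem5.opt configs/example/apu_se.py -n 3 \
    -c gem5-resources/src/gpu/square/bin/square
```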
>>>> On Fri, Jun 30, 2023 at 12:10 PM Anoop Mysore <mysan...@gmail.com> wrote:
>>>>
>>>> According to the GPU-FS blog <https://www.gem5.org/2023/02/13/moving-to-full-system-gpu.html>:
>>>>
>>>> "*Currently KVM and X86 are required to run full system. Atomic and Timing CPUs are not yet compatible with the disconnected Ruby network required for GPUFS and is a work in progress*."
>>>>
>>>> My understanding is that KVM is used to boot Ubuntu; so, are the GPU applications run on KVM? Also, what does "disconnected" Ruby network mean there?
>>>>
>>>> If so, is there any work in progress that I can use to develop on, or (noob-friendly) documentation of what needs to be done to extend the support to the Atomic/O3 CPU?
>>>>
>>>> For a project I'm working on, I need complete visibility into the CPU+GPU cache hierarchy + perhaps a few more custom probes; could you comment on whether this would be restrictive if going with KVM in the meantime, given that it leverages the host for the virtualized HW?
>>>>
>>>> Please let me know if I have got any of this wrong or if there are other details you think would be useful.
>>>>
>>>> _______________________________________________
>>>> gem5-users mailing list -- gem5-users@gem5.org
>>>> To unsubscribe send an email to gem5-users-le...@gem5.org