I understand; thanks again for the details.

On Wed, Jul 5, 2023 at 7:10 PM Matt Sinclair <mattdsinclair.w...@gmail.com> wrote:
> Answers:
>
> 1. Yes, I believe so. However, I have never personally tried using the O3 model with the GPU. Matt P has, I believe, so he may have better feedback there.
>
> 2. I have not followed the chain of events all the way through here, but I *believe* that the builtin you highlighted is used at the compiler level by HIPCC/LLVM to generate the appropriate assembly for a given AMD GPU. In this case (gfx900), I believe there is a 1-1 correlation with this builtin becoming an s_sleep assembly instruction (maybe with the addition of a v_mov-type instruction before it to set the register to the appropriate sleep value). I am not aware of the s_sleep builtin requiring OS calls (or emulation). But what you have described is more generally the issue with SE mode (CPU, GPU, etc.) -- because SE mode does not model OS calls, the fidelity of anything involving the OS will be lower. Perhaps a trite way to answer this is: if the fidelity of the OS calls is important for the applications you are studying, then I strongly recommend using FS mode.
>
> Hope this helps,
> Matt S.
>
> On Tue, Jul 4, 2023 at 6:01 AM Anoop Mysore <mysan...@gmail.com> wrote:
>
>> Thank you so much for the kind and detailed explanations!
>>
>> Just to clarify: I can use the APU config (apu_se.py) and switch out to an O3 CPU, and I would still have the detailed GPU model, and the disconnected Ruby model that synchronizes between CPU and GPU at the system-level directory -- is that correct?
>>
>> Last question: when using the APU config for simulating HeteroSync which, for example, has a sleep mutex primitive that invokes __builtin_amdgcn_s_sleep(), is there any OS involvement? If yes, would SE mode's emulation of those syscalls inexorably sacrifice any fidelity that could be argued leads to inaccurate evaluations of heterogeneous coherence implementations?
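[Editor's note: as context for the sleep-mutex exchange above, the pattern can be sketched roughly as below. This is a hedged, portable stand-in, not HeteroSync's actual code: the function names and the host-side fallback are assumptions. On an AMD GPU the backoff would be `__builtin_amdgcn_s_sleep(n)`, which HIPCC/LLVM lowers directly to an `s_sleep` instruction (a hardware wait of roughly `n*64` cycles, no OS call); a `yield` is substituted here so the sketch compiles with an ordinary host compiler.]

```cpp
#include <atomic>
#include <thread>

// Hypothetical sketch of a sleep-backoff mutex in the spirit of
// HeteroSync's sleep mutex (not its actual code). On an AMD GPU the
// backoff would be __builtin_amdgcn_s_sleep(2), lowered straight to an
// s_sleep instruction -- a hardware wait with no OS involvement.
#if defined(__AMDGCN__)
#define BACKOFF() __builtin_amdgcn_s_sleep(2)  // wait roughly 2*64 cycles in hardware
#else
#define BACKOFF() std::this_thread::yield()    // host-side stand-in for illustration
#endif

inline void sleep_mutex_lock(std::atomic<unsigned> &lock) {
    unsigned expected = 0;
    // Spin until the lock word flips from 0 (free) to 1 (held),
    // backing off between attempts instead of hammering the CAS.
    while (!lock.compare_exchange_weak(expected, 1u,
                                       std::memory_order_acquire)) {
        expected = 0;
        BACKOFF();
    }
}

inline void sleep_mutex_unlock(std::atomic<unsigned> &lock) {
    lock.store(0u, std::memory_order_release);
}
```

The point relevant to SE-mode fidelity is that the lock/unlock path above touches only atomics and a hardware sleep instruction, so no syscall emulation is on the critical path.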
>> Or are there any other factors of insufficient fidelity that might be important in this regard?
>>
>> On Fri, Jun 30, 2023 at 7:40 PM Matt Sinclair <mattdsinclair.w...@gmail.com> wrote:
>>
>>> Just to follow up on 4 and 5:
>>>
>>> 4. The synchronization should happen at the directory level here, since this is the first level of the memory system where both the CPU and GPU are connected. However, if the programmer sets the GLC bit (which should perform the atomic at the GPU's LLC), I have not tested whether Ruby has the functionality to send invalidations as appropriate to allow this. I suspect it would work as is, but would have to check ...
>>>
>>> 5. Yeah, for the reasons Matt P already stated, O3 is not currently supported in GPUFS. So GPUSE would be a better option here. Yes, you can use the apu_se.py script as the base script for running GPUSE experiments. There are a number of examples on gem5-resources for how to get started with this (including HeteroSync), but I normally recommend starting with square if you haven't used the GPU model before: https://gem5.googlesource.com/public/gem5-resources/+/refs/heads/develop/src/gpu/square/. In terms of support for synchronization at different levels of the memory hierarchy, by default the GPU VIPER coherence protocol assumes that all synchronization happens at the system level (at the directory, in the current implementation). However, one of my students will be pushing updates (hopefully today) that allow non-system-level support (e.g., the GPU LLC "GLC" level as mentioned above). It sounds like you want to change the cache hierarchy and coherence protocol to add another level of cache (the L3) before the directory and after the CPU/GPU LLCs? If so, you would need to change the current Ruby support to add this additional level and the appropriate transitions to do so.
>>> However, if you instead meant that you are thinking of the directory level as synchronizing between the CPU and GPU, then you could use the support as is without any changes (I think).
>>>
>>> Hope this helps,
>>> Matt S.
>>>
>>> On Fri, Jun 30, 2023 at 12:05 PM Poremba, Matthew via gem5-users <gem5-users@gem5.org> wrote:
>>>
>>>> Hi,
>>>>
>>>> No worries about the questions! I will try to answer them all, so this will be a long email 😊:
>>>>
>>>> The disconnected (or disjoint) Ruby network is essentially the same as the APU Ruby network used in SE mode -- that is, it combines two Ruby protocols in one protocol (MOESI_AMD_base and GPU_VIPER). They are disjoint because there are no paths / network links between the GPU and CPU side, simulating a discrete GPU. These protocols work together because they use the same network messages / virtual channels to the directory -- basically, you cannot simply drop in another CPU protocol and have it work.
>>>>
>>>> The Atomic CPU is working **very** recently -- as in this week. It is on the review board right now and I believe might be part of the gem5 v23.0 release. However, the reason the Atomic and KVM CPUs are required is that they use the atomic_noncaching memory mode and basically bypass the CPU cache. The timing CPUs (Timing and O3) try to generate routes to the GPU side, which causes deadlocks. I have not had any time to look into this further, but that is the status.
>>>>
>>>> | are the GPU applications run on KVM?
>>>>
>>>> The CPU portion of GPU applications runs on KVM. The GPU is simulated in timing mode, so the compute units, cache, memory, etc. are all simulated with events. For an application that simply launches GPU kernels, the CPU is just waiting for the kernels to finish.
>>>>
>>>> For your other questions:
>>>>
>>>> 1.
>>>> Unfortunately no, it is not this easy. There is an issue with timing CPUs that is still an outstanding bug -- we focused on the Atomic CPU recently as a way to allow users who aren't able to use KVM to be able to use the GPU model.
>>>>
>>>> 2. KVM exits whenever there is a memory request outside of its VM range. The PCI address range is outside the VM range, so, for example, when the CPU writes to PCI space it will trigger an event for the GPU. The only Ruby involvement here is that Ruby will send all requests outside of its memory range to the IO bus (KVM or not).
>>>>
>>>> 3. The MMIO trace is only used to load the GPU driver and is not used in applications. It basically contains some reasonable register values for anything that is not modeled in gem5 so that we do not need to model them (e.g., graphics, power management, video encode/decode, etc.). This is not required for compute-only GPU variants, but that is a different topic.
>>>>
>>>> 4. I'm not familiar enough with this particular application to answer this question.
>>>>
>>>> 5. I think you will need to use SE mode to do what you are trying to do. Full system mode uses the real GPU driver, ROCm stack, etc., which currently does not support any APU-like devices. SE mode is able to do this by making use of an emulated driver.
>>>>
>>>> -Matt
>>>>
>>>> *From:* Anoop Mysore via gem5-users <gem5-users@gem5.org>
>>>> *Sent:* Friday, June 30, 2023 8:43 AM
>>>> *To:* The gem5 Users mailing list <gem5-users@gem5.org>
>>>> *Cc:* Anoop Mysore <mysan...@gmail.com>
>>>> *Subject:* [gem5-users] Re: Replacing CPU model in GPU-FS
>>>> It appears the host part of GPU applications is indeed executed on KVM, from: https://www.gem5.org/assets/files/workshop-isca-2023/slides/improving-gem5s-gpufs-support.pdf
>>>>
>>>> A few more questions:
>>>>
>>>> 1. I notice it isn't mentioned that O3 CPU models aren't supported -- would getting one to work be as easy as changing the `cpu_type` in the config file and running? I intend to run with the latest O3 CPU config I have (an Intel CPU).
>>>>
>>>> 2. The Ruby network that's used -- is it intercepting (perhaps just MMIO) memory operations from the KVM CPU? Could you please briefly describe how Ruby works with both KVM and the GPU (or point me to any document)?
>>>>
>>>> 3. The GPU MMIO trace we pass during simulator invocation -- what exactly is this? If it's a trace of the kernel driver/CPU's MMIO calls into the GPU, how is it portable across different programs within a benchmark suite -- HeteroSync, for example?
>>>>
>>>> 4. In HeteroSync, there's fine-grain synchronization between the CPU and GPU in many apps. If I use vega10_kvm.py, which has a discrete GPU with a KVM CPU, where do the synchronizations happen?
>>>>
>>>> 5. If I want to move to an integrated GPU model with an O3 CPU (the only requirement is the shared LLC) -- are there any resources that can help me? I do see a bootcamp that uses apu_se.py -- can this be utilized at least partially to support a full-system O3 CPU + integrated GPU? Are there any modifications that need to be made to support synchronizations in the L3?
>>>>
>>>> Please excuse the jumbled questions, I am in the process of gaining more clarity.
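[Editor's note: for readers following along, a typical GPUSE run of the square sample recommended earlier in the thread looks roughly like the following. The build target name (GCN3_X86 in older releases vs. VEGA_X86 in newer ones), paths, and flags vary by gem5 version, and the path to the square binary is an assumption, so treat this as an illustrative sketch rather than an exact recipe.]

```shell
# Build gem5 with the GPU-enabled X86 target (target name varies by
# gem5 version: GCN3_X86 in older releases, VEGA_X86 in newer ones).
scons -j"$(nproc)" build/GCN3_X86/gem5.opt

# Run the square sample from gem5-resources in SE mode via apu_se.py.
# -n 3 provides the CPU threads the emulated ROCm driver expects; the
# binary path below is illustrative, not a fixed location.
build/GCN3_X86/gem5.opt configs/example/apu_se.py -n 3 \
    -c gem5-resources/src/gpu/square/bin/square
```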
>>>> On Fri, Jun 30, 2023 at 12:10 PM Anoop Mysore <mysan...@gmail.com> wrote:
>>>>
>>>> According to the GPU-FS blog <https://www.gem5.org/2023/02/13/moving-to-full-system-gpu.html>:
>>>>
>>>> "*Currently KVM and X86 are required to run full system. Atomic and Timing CPUs are not yet compatible with the disconnected Ruby network required for GPUFS and is a work in progress*."
>>>>
>>>> My understanding is that KVM is used to boot Ubuntu; so, are the GPU applications run on KVM? Also, what does "disconnected" Ruby network mean there?
>>>>
>>>> If so, is there any work in progress that I can use to develop on, or (noob-friendly) documentation of what needs to be done to extend the support to the Atomic/O3 CPU?
>>>>
>>>> For a project I'm working on, I need complete visibility into the CPU+GPU cache hierarchy + perhaps a few more custom probes; could you comment on whether this would be restrictive if going with KVM in the meantime, given that it leverages the host for the virtualized HW?
>>>>
>>>> Please let me know if I have got any of this wrong or if there are other details you think would be useful.
>>>>
>>>> _______________________________________________
>>>> gem5-users mailing list -- gem5-users@gem5.org
>>>> To unsubscribe send an email to gem5-users-le...@gem5.org