Answers:

1. Yes, I believe so. However, I have never personally tried using the O3 model with the GPU. Matt P has, I believe, so he may have better feedback there.

2. I have not followed the chain of events all the way through here, but I *believe* that the builtin you highlighted is used at the compiler level by HIPCC/LLVM to generate the appropriate assembly for a given AMD GPU. In this case (gfx900), I believe there is a 1-to-1 correspondence between this builtin and an s_sleep assembly instruction (maybe with the addition of a v_mov-type instruction before it to set the register to the appropriate sleep value). I am not aware of the s_sleep builtin requiring OS calls (or emulation). What you have described, though, is the more general issue with SE mode (CPU, GPU, etc.): because SE mode does not model OS calls, the fidelity of anything involving the OS will be lower. Perhaps a trite way to answer this is: if the fidelity of the OS calls is important for the applications you are studying, then I strongly recommend using FS mode.
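For concreteness, here is a rough sketch of the pattern in question (the lock layout and sleep amount are made up, not HeteroSync's actual code); assuming HIP targeting gfx900, the builtin should lower to a single s_sleep instruction with no OS involvement:

    // Hypothetical sleep-backoff mutex -- a sketch of the code pattern
    // being asked about, not HeteroSync's actual implementation.
    #include <hip/hip_runtime.h>

    __device__ void sleep_mutex_lock(unsigned int *lock)
    {
        // Try to take the lock with a CAS; on failure, back off with
        // s_sleep instead of hammering the cache line.
        while (atomicCAS(lock, 0u, 1u) != 0u) {
            // Lowers to the s_sleep instruction (the operand must be a
            // compile-time constant); entirely in hardware, no syscall.
            __builtin_amdgcn_s_sleep(2);
        }
    }

    __device__ void sleep_mutex_unlock(unsigned int *lock)
    {
        __threadfence();        // publish critical-section writes
        atomicExch(lock, 0u);   // release the lock
    }

If you disassemble the compiled kernel (e.g., with llvm-objdump), you should be able to see the s_sleep encoding directly in the instruction stream, which is why SE vs. FS mode should not matter for this particular primitive.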
Hope this helps,
Matt S.

On Tue, Jul 4, 2023 at 6:01 AM Anoop Mysore <mysan...@gmail.com> wrote:

> Thank you so much for the kind and detailed explanations!
>
> Just to clarify: I can use the APU config (apu_se.py) and switch to an O3 CPU, and I would still have the detailed GPU model and the disconnected Ruby model that synchronizes between CPU and GPU at the system-level directory -- is that correct?
>
> Last question: when using the APU config to simulate HeteroSync, which, for example, has a sleep mutex primitive that invokes __builtin_amdgcn_s_sleep(), is there any OS involvement? If yes, would SE mode's emulation of those syscalls inexorably sacrifice fidelity in a way that could be argued to lead to inaccurate evaluations of heterogeneous coherence implementations? Or are there any other factors of insufficient fidelity that might be important in this regard?
>
> On Fri, Jun 30, 2023 at 7:40 PM Matt Sinclair <mattdsinclair.w...@gmail.com> wrote:
>
>> Just to follow up on 4 and 5:
>>
>> 4. The synchronization should happen at the directory level here, since this is the first level of the memory system where both the CPU and GPU are connected. However, if the programmer sets the GLC bit (which should perform the atomic at the GPU's LLC), I have not tested whether Ruby has the functionality to send invalidations as appropriate to allow this. I suspect it would work as is, but I would have to check.
>>
>> 5. Yeah, for the reasons Matt P already stated, O3 is not currently supported in GPUFS, so GPUSE would be a better option here. Yes, you can use the apu_se.py script as the base script for running GPUSE experiments. There are a number of examples on gem5-resources for how to get started with this (including HeteroSync), but I normally recommend starting with square if you haven't used the GPU model before: https://gem5.googlesource.com/public/gem5-resources/+/refs/heads/develop/src/gpu/square/. In terms of support for synchronization at different levels of the memory hierarchy, by default the GPU VIPER coherence protocol assumes that all synchronization happens at the system level (at the directory, in the current implementation). However, one of my students will be pushing updates (hopefully today) that allow non-system-level support (e.g., the GPU LLC "GLC" level as mentioned above).
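To make those two synchronization levels concrete, here is a minimal sketch, assuming HIP's CUDA-style *_system atomics; which cache level services each scope is a property of the coherence protocol (VIPER here), not of the source code:

    // Sketch of device-scope vs. system-scope atomics, assuming HIP's
    // CUDA-compatible *_system variants are available.
    #include <hip/hip_runtime.h>

    __global__ void scoped_counters(unsigned int *dev_ctr,
                                    unsigned int *sys_ctr)
    {
        // Device scope: only other GPU threads need to observe it, so
        // it can complete at the GPU LLC (the "GLC" path noted above).
        atomicAdd(dev_ctr, 1u);

        // System scope: the CPU must observe it too, so in the current
        // VIPER implementation it is performed at the directory, the
        // first level shared by CPU and GPU.
        atomicAdd_system(sys_ctr, 1u);
    }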
>> It sounds like you want to change the cache hierarchy and coherence protocol to add another level of cache (the L3) before the directory and after the CPU/GPU LLCs? If so, you would need to change the current Ruby support to add this additional level and the appropriate transitions. However, if you instead meant that you are thinking of the directory level as synchronizing between the CPU and GPU, then you could use the support as is without any changes (I think).
>>
>> Hope this helps,
>> Matt S.
>>
>> On Fri, Jun 30, 2023 at 12:05 PM Poremba, Matthew via gem5-users <gem5-users@gem5.org> wrote:
>>
>>> Hi,
>>>
>>> No worries about the questions! I will try to answer them all, so this will be a long email 😊:
>>>
>>> The disconnected (or disjoint) Ruby network is essentially the same as the APU Ruby network used in SE mode -- that is, it combines two Ruby protocols in one protocol (MOESI_AMD_base and GPU_VIPER). They are disjoint because there are no paths / network links between the GPU and CPU side, simulating a discrete GPU. These protocols work together because they use the same network messages / virtual channels to the directory -- basically, you cannot simply drop in another CPU protocol and have it work.
>>>
>>> Atomic CPU started working **very** recently -- as in this week. It is on the review board right now and I believe it might be part of the gem5 v23.0 release. The reason Atomic and KVM CPUs are required is that they use the atomic_noncaching memory mode and basically bypass the CPU cache. The timing CPUs (Timing and O3) try to generate routes to the GPU side, which causes deadlocks. I have not had any time to look into this further, but that is the status.
>>>
>>> | are the GPU applications run on KVM?
>>>
>>> The CPU portion of GPU applications runs on KVM. The GPU is simulated in timing mode, so the compute units, cache, memory, etc. are all simulated with events. For an application that simply launches GPU kernels, the CPU is just waiting for the kernels to finish.
>>>
>>> For your other questions:
>>>
>>> 1. Unfortunately no, it is not that easy. There is a still-outstanding bug with timing CPUs -- we focused on the Atomic CPU recently as a way to allow users who aren't able to use KVM to use the GPU model.
>>>
>>> 2. KVM exits whenever there is a memory request outside of its VM range. The PCI address range is outside the VM range, so, for example, when the CPU writes to PCI space it will trigger an event for the GPU. The only Ruby involvement here is that Ruby will send all requests outside of its memory range to the IO bus (KVM or not).
>>>
>>> 3. The MMIO trace is only used to load the GPU driver and is not used in applications. It basically contains some reasonable register values for anything that is not modeled in gem5, so that we do not need to model them (e.g., graphics, power management, video encode/decode, etc.). This is not required for compute-only GPU variants, but that is a different topic.
>>>
>>> 4. I'm not familiar enough with this particular application to answer this question.
>>>
>>> 5. I think you will need to use SE mode to do what you are trying to do. Full system mode uses the real GPU driver, ROCm stack, etc., which currently does not support any APU-like devices. SE mode is able to do this by making use of an emulated driver.
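As a schematic illustration of point 2 above (not gem5 code; the doorbell pointer is hypothetical and would come from the driver's mapping of a PCI BAR in practice): from the CPU's point of view, kicking the GPU is just an ordinary store, and it is the target address range, not the instruction, that triggers the KVM exit:

    // Schematic host-side doorbell write. The pointer is assumed to
    // come from the driver's PCI BAR mapping; the names are made up.
    #include <cstdint>

    void ring_doorbell(volatile uint32_t *doorbell, uint32_t wptr)
    {
        // A plain store, but the physical address lies in PCI MMIO
        // space, outside the KVM memory slots. The write therefore
        // exits to gem5, which routes it via the IO bus to the GPU
        // device model as an event.
        *doorbell = wptr;
    }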
>>> -Matt
>>>
>>> *From:* Anoop Mysore via gem5-users <gem5-users@gem5.org>
>>> *Sent:* Friday, June 30, 2023 8:43 AM
>>> *To:* The gem5 Users mailing list <gem5-users@gem5.org>
>>> *Cc:* Anoop Mysore <mysan...@gmail.com>
>>> *Subject:* [gem5-users] Re: Replacing CPU model in GPU-FS
>>>
>>> It appears the host part of GPU applications is indeed executed on KVM, from: https://www.gem5.org/assets/files/workshop-isca-2023/slides/improving-gem5s-gpufs-support.pdf.
>>>
>>> A few more questions:
>>>
>>> 1. I missed that it isn't mentioned whether O3 CPU models are supported -- would switching be as easy as changing the `cpu_type` in the config file and running? I intend to run with the latest O3 CPU config I have (an Intel CPU).
>>>
>>> 2. The Ruby network that's used -- is it intercepting (perhaps just MMIO) memory operations from the KVM CPU? Could you please briefly describe how Ruby works with both KVM and the GPU (or point me to any document)?
>>>
>>> 3. The GPU MMIO trace we pass during simulator invocation -- what exactly is this? If it's a trace of the kernel driver/CPU's MMIO calls into the GPU, how is it portable across different programs within a benchmark suite -- HeteroSync, for example?
>>>
>>> 4. In HeteroSync, there's fine-grained synchronization between CPU and GPU in many apps. If I use vega10_kvm.py, which has a discrete GPU with a KVM CPU, where do the synchronizations happen?
>>>
>>> 5. If I want to move to an integrated GPU model with an O3 CPU (the only requirement is the shared LLC) -- are there any resources that can help me? I do see a bootcamp that uses apu_se.py -- can this be utilized at least partially to support a full-system O3 CPU + integrated GPU? Are there any modifications that need to be made to support synchronization in the L3?
>>>
>>> Please excuse the jumbled questions; I am in the process of gaining more clarity.
>>>
>>> On Fri, Jun 30, 2023 at 12:10 PM Anoop Mysore <mysan...@gmail.com> wrote:
>>>
>>> According to the GPU-FS blog <https://www.gem5.org/2023/02/13/moving-to-full-system-gpu.html>:
>>>
>>> "*Currently KVM and X86 are required to run full system. Atomic and Timing CPUs are not yet compatible with the disconnected Ruby network required for GPUFS and is a work in progress.*"
>>>
>>> My understanding is that KVM is used to boot Ubuntu; so, are the GPU applications run on KVM? Also, what does "disconnected" Ruby network mean there?
>>>
>>> If so, is there any work in progress that I can use to develop on, or (noob-friendly) documentation of what needs to be done to extend the support to Atomic/O3 CPUs?
>>>
>>> For a project I'm working on, I need complete visibility into the CPU+GPU cache hierarchy + perhaps a few more custom probes; could you comment on whether this would be restrictive if going with KVM in the meantime, given that it leverages the host for the virtualized HW?
>>>
>>> Please let me know if I have got any of this wrong or if there are other details you think would be useful.