[AMD Public Use] Hi Paul,
Checkpoints (and similarly switchcpu) in GCN3 do not work once the GPU has launched a kernel, because the drain functions are not implemented in the GPU model. In general, there is a lot of state to be drained for the GPU. Draining/serializing between kernels would probably be the easiest approach.

This has been low priority for us since we've found it sufficient to use KVM fast-forward to get to the first kernel and then switch to a timing CPU. The use case for most users is to simulate a single kernel or a handful of kernels at a time, so for that starting from the beginning of the application and exiting after a certain number of kernels is an unreasonable amount of time.

If the only concern is to avoid the time spent simulating the runtime initialization (before the first kernel launch), there is a trick you can do to checkpoint or switchcpus (e.g., from KVM) before it launches.

Matthew Poremba
MTS Silicon Design Engineer | AMD
AMD Research

From: Jason Lowe-Power via gem5-dev <[email protected]>
Sent: Thursday, April 22, 2021 8:34 AM
To: [email protected]
Cc: Tschirhart, Paul K [US] (MS) <[email protected]>; Jason Lowe-Power <[email protected]>
Subject: [gem5-dev] Re: Gem5-GCN3 Checkpointing Support

Hi Paul,

I've included the gem5-dev mailing list here. If you're not subscribed, I would suggest sending these kinds of questions there, where the gem5-gcn/AMD developers will see them.

To answer your question, adding checkpointing should be relatively straightforward. The main changes are exactly as you described: saving the architectural state of the GPU threads.
There are probably a few other pieces of GPU state that need to be saved too (e.g., the control processor). Hopefully one of the devs at AMD can reply with more details.

You'll have to add the serialize/unserialize functions, and you'll probably also have to implement a drain() function to flush out the in-progress instructions. I imagine you'll at least want to finish the in-progress wavefronts, and you may want to wait until the currently scheduled workgroups are finished as well.

As far as Ruby goes, as long as the protocol you're using supports checkpointing, it will "just work". To support checkpointing, I believe all the protocol needs to do is implement the flush RubyRequestMsg. I'm not sure whether VIPER supports this.

Finally, the only other detail is whether the current implementation of SE mode fully supports checkpointing. Right now, this isn't something we regularly test, so it's possible that some details are broken.

Hopefully the current devs at AMD will correct me where I'm wrong! Let us know on gem5-dev if there are any questions we can answer. We would greatly appreciate this contribution!

Cheers,
Jason

On Wed, Apr 21, 2021 at 6:55 PM Tschirhart, Paul K [US] (MS) <[email protected]> wrote:

Hello Professor Lowe-Power,

Thank you for your helpful replies to the emails from some other members of my group regarding various aspects of the Gem5 simulator. I have been working with Gem5-GCN3 and I was wondering if you knew anything about the status of checkpointing support for that model. I saw in a post that adding checkpoints was planned, but I have not seen anything since.

Is there a significant technical challenge involved in expanding Gem5's checkpointing mechanism to support the GCN3 model, or is this just a matter of writing the necessary serialize/unserialize functions?
In other words, is this something that someone with experience making significant modifications to Gem5 might be able to tackle by implementing an approach similar to the one used in the O3 model?

If the modifications should be mostly straightforward, do you know of anything that needs to be done besides adding the functions to serialize/unserialize threads in the GCN3 model? It seems like modifications might be required to support checkpointing for the VIPER protocol in Ruby, but I don't see where I need to make the changes. Am I missing something, either in Ruby or elsewhere in the simulator?

Thanks again for all of your help.

Paul

--
Jason Lowe-Power (he/him/his)
Assistant Professor, Computer Science Department
University of California, Davis
3049 Kemper Hall
https://arch.cs.ucdavis.edu/
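
Editorial sketch of the drain-then-serialize ordering discussed in the thread above: let in-progress wavefronts finish, then save only the architectural state. This is a toy Python model, not gem5 code. Every name in it (Wavefront, ToyGpu, remaining_insts, kernels_completed, max_ticks) is hypothetical and chosen for illustration only; the real GPU model would need its own drain() and serialize/unserialize implementations and has far more state to handle.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Wavefront:
    """Hypothetical stand-in for one wavefront's architectural state."""
    wf_id: int
    pc: int = 0
    remaining_insts: int = 0  # instructions still in flight (toy notion)

    def done(self) -> bool:
        return self.remaining_insts == 0


@dataclass
class ToyGpu:
    """Toy model only; none of these names come from gem5."""
    wavefronts: List[Wavefront] = field(default_factory=list)
    kernels_completed: int = 0

    def tick(self) -> None:
        # Retire one instruction per still-active wavefront.
        for wf in self.wavefronts:
            if not wf.done():
                wf.remaining_insts -= 1
                wf.pc += 4

    def drain(self, max_ticks: int = 1000) -> int:
        """Run until no wavefront has work in flight; return ticks used."""
        ticks = 0
        while any(not wf.done() for wf in self.wavefronts):
            if ticks >= max_ticks:
                raise RuntimeError("drain did not converge")
            self.tick()
            ticks += 1
        return ticks

    def serialize(self) -> Dict:
        # Refuse to checkpoint while in-flight state still exists.
        if any(not wf.done() for wf in self.wavefronts):
            raise RuntimeError("serialize() called before drain()")
        return {
            "kernels_completed": self.kernels_completed,
            "wavefronts": [
                {"wf_id": wf.wf_id, "pc": wf.pc} for wf in self.wavefronts
            ],
        }

    @classmethod
    def unserialize(cls, cp: Dict) -> "ToyGpu":
        # Rebuild the model from the saved architectural state only.
        wfs = [Wavefront(wf_id=w["wf_id"], pc=w["pc"]) for w in cp["wavefronts"]]
        return cls(wavefronts=wfs, kernels_completed=cp["kernels_completed"])
```

In this sketch, calling serialize() before drain() raises an error, which mirrors why checkpointing is simplest at a point where nothing is in flight, such as between kernels.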
_______________________________________________
gem5-dev mailing list -- [email protected]
To unsubscribe send an email to [email protected]
