[AMD Public Use]

Hi Paul,


Checkpoints (and similarly switchcpu) in GCN3 do not work once the GPU has 
launched a kernel because the drain functions are not implemented in the GPU 
model.  In general, the GPU holds a lot of in-flight state that would have to 
be drained.  Draining and serializing between kernels would probably be the 
easiest approach.  This has been low priority for us since we've found it 
sufficient to use KVM fast-forward to get to the first kernel and then switch 
to a timing CPU.
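
A minimal sketch of the "checkpoint only between kernels" idea follows. It is
a toy Python model, not gem5 code: ToyGpu, try_checkpoint and the
take_checkpoint callback are hypothetical names used only for illustration,
and the real GPU model tracks far more state than a kernel counter.

```python
class ToyGpu:
    """Hypothetical stand-in for a GPU model (not a gem5 class)."""

    def __init__(self):
        self.kernels_in_flight = 0
        self.kernels_completed = 0

    def launch_kernel(self):
        self.kernels_in_flight += 1

    def complete_kernel(self):
        if self.kernels_in_flight == 0:
            raise RuntimeError("no kernel in flight")
        self.kernels_in_flight -= 1
        self.kernels_completed += 1

    def at_kernel_boundary(self):
        # Between kernels there is no wavefront state left to drain.
        return self.kernels_in_flight == 0


def try_checkpoint(gpu, take_checkpoint):
    """Call take_checkpoint() only when no kernel is in flight.

    Returns True if the checkpoint was taken, False if it was refused.
    """
    if not gpu.at_kernel_boundary():
        return False
    take_checkpoint()
    return True
```

The guard refuses a checkpoint request while a kernel is running and accepts
it again once the kernel has completed, which is the only point where this
toy model has nothing in flight.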

The use case for most users is to simulate a single kernel or a handful of 
kernels at a time, so for that, starting from the beginning of the application 
and exiting after a certain number of kernels does not take an unreasonable 
amount of time.

If the only concern is to avoid the time spent simulating the runtime 
initialization (before the first kernel launch), there is a trick you can do 
to checkpoint or switch CPUs (e.g., from KVM) before the first kernel launches.
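
The trick itself is not spelled out in this message. As a generic
illustration only, the script side of "stop before the first kernel, then
checkpoint or switch CPUs" might look like the sketch below. It uses
m5.simulate(), getCause(), m5.checkpoint() and m5.switchCpus() from gem5's
Python interface. The function name, its parameters, and the assumption that
something in the workload raises an exit event with cause "checkpoint" just
before the first launch are all hypothetical. The m5 module is passed in as
an argument so the function can be exercised outside gem5.

```python
def stop_before_first_kernel(m5, system, ckpt_dir=None, switch_pairs=None,
                             cause="checkpoint"):
    """Simulate to the next exit event and act on it if it is the expected one.

    m5           -- the gem5 Python module (or a stand-in with the same calls)
    system       -- the System object from the config script
    ckpt_dir     -- if given, write a checkpoint into this directory
    switch_pairs -- if given, list of (old_cpu, new_cpu) tuples to switch
    cause        -- exit cause expected just before the first kernel launch
                    (an assumption; it depends on how the exit is triggered)
    """
    exit_event = m5.simulate()
    if exit_event.getCause() != cause:
        # Some other exit (e.g. the program finished); do nothing.
        return False
    if ckpt_dir is not None:
        m5.checkpoint(ckpt_dir)
    if switch_pairs:
        m5.switchCpus(system, switch_pairs)
    return True
```

After this returns True, the config script would call m5.simulate() again to
continue into the first kernel.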



MATTHEW POREMBA
MTS Silicon Design Engineer  |  AMD
AMD Research



From: Jason Lowe-Power via gem5-dev <[email protected]>
Sent: Thursday, April 22, 2021 8:34 AM
To: [email protected]
Cc: Tschirhart, Paul K [US] (MS) <[email protected]>; Jason Lowe-Power 
<[email protected]>
Subject: [gem5-dev] Re: Gem5-GCN3 Checkpointing Support

Hi Paul,

I've included gem5-dev mailing list here. If you're not subscribed, I would 
suggest sending these kinds of questions there where the gem5-gcn/AMD 
developers will see things.

To answer your question, adding checkpointing should be relatively 
straightforward. The main changes are exactly as you described: saving the 
architectural state of the GPU threads. There are probably a few other pieces 
of GPU state that need to be saved too (e.g., the control processor, etc.). 
Hopefully one of the devs at AMD can reply with more details.

You'll have to add the serialize/unserialize functions, and you'll probably 
have to also implement a drain() function to flush out the current in-progress 
instructions. I imagine you'll at least want to finish the in-progress 
wavefronts, and you may want to wait until the currently scheduled workgroups 
are finished as well.
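
To make the drain-then-serialize order concrete, here is a toy Python model.
It is only a sketch: the real GPU model is C++, and ToyDispatcher,
drain_fully and the two-tick workgroup latency are hypothetical. The method
names loosely mirror gem5's drain(), drainResume(), serialize() and
unserialize(). Draining stops new workgroups from being scheduled and lets
the in-flight ones finish, so only not-yet-scheduled work and a completion
count need to be saved.

```python
from enum import Enum


class DrainState(Enum):
    DRAINING = "draining"
    DRAINED = "drained"


class ToyDispatcher:
    """Hypothetical workgroup dispatcher (not gem5 code)."""

    def __init__(self, pending=(), completed=0):
        self.pending = list(pending)   # workgroup ids not yet scheduled
        self.in_flight = {}            # workgroup id -> ticks remaining
        self.completed = completed
        self.draining = False

    def tick(self):
        # Schedule at most one new workgroup per tick unless draining.
        if not self.draining and self.pending:
            self.in_flight[self.pending.pop(0)] = 2
        for wg in list(self.in_flight):
            self.in_flight[wg] -= 1
            if self.in_flight[wg] == 0:
                del self.in_flight[wg]
                self.completed += 1

    def drain(self):
        # Stop scheduling; report whether in-flight work remains.
        self.draining = True
        return DrainState.DRAINING if self.in_flight else DrainState.DRAINED

    def drain_resume(self):
        self.draining = False

    def serialize(self):
        if self.in_flight:
            raise RuntimeError("serialize() called before drain completed")
        return {"pending": list(self.pending), "completed": self.completed}

    @classmethod
    def unserialize(cls, state):
        return cls(pending=state["pending"], completed=state["completed"])


def drain_fully(obj, max_ticks=1000):
    """Tick obj until drain() reports DRAINED; returns ticks used."""
    ticks = 0
    while obj.drain() is DrainState.DRAINING:
        if ticks >= max_ticks:
            raise RuntimeError("drain did not complete")
        obj.tick()
        ticks += 1
    return ticks
```

A typical use would be drain_fully(d), then state = d.serialize(), and later
ToyDispatcher.unserialize(state) to resume with the remaining workgroups.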

As far as Ruby goes, as long as the protocol you're using supports 
checkpointing, it will "just work". To support checkpointing, I believe all the 
protocol needs to do is implement the flush RubyRequestMsg. I'm not sure if 
VIPER supports this or not.

Finally, the only other detail is whether or not the current implementation of 
SE mode fully supports checkpointing. Right now, this isn't something we 
regularly test, so it's possible that there are some details that are broken.

Hopefully the current devs at AMD will correct me where I'm wrong! Let us know 
on gem5-dev if there are any questions we can answer, etc. We would greatly 
appreciate this contribution!

Cheers,
Jason

On Wed, Apr 21, 2021 at 6:55 PM Tschirhart, Paul K [US] (MS) 
<[email protected]> wrote:
Hello Professor Lowe-Power,

Thank you for your helpful replies to the emails from some other members of my 
group regarding various aspects of the Gem5 simulator.

I have been working with Gem5-GCN3 and I was wondering if you knew anything 
about the status of checkpointing support for that model. I saw in a post that 
adding checkpoints was something that was planned but I have not seen anything 
since.

Is there a significant technical challenge involved with expanding Gem5's 
checkpointing mechanism to support the GCN3 model or is this just a matter of 
writing the necessary serialize/unserialize functions? In other words, is this 
something that someone with experience making significant modifications to Gem5 
might be able to tackle by implementing an approach that is similar to the one 
used in the O3 model?

If the modifications should be mostly straightforward, do you know of anything 
that needs to be done besides adding the functions to serialize/unserialize 
threads in the GCN3 model? It seems like modifications might be required to 
support checkpointing for the VIPER protocol in Ruby but I don't see where I 
need to make the changes. Am I missing something either in Ruby or elsewhere in 
the simulator?

Thanks again for all of your help.

Paul



--
Jason Lowe-Power (he/him/his)
Assistant Professor, Computer Science Department
University of California, Davis
3049 Kemper Hall
https://arch.cs.ucdavis.edu/
_______________________________________________
gem5-dev mailing list -- [email protected]
To unsubscribe send an email to [email protected]