Hello,

I want to evaluate MatMult on GPUs.  I took a 2M x 2M matrix and ran it
with 6 MPI ranks and 6 GPUs.  MatMult took about 0.9 seconds, while a
kernel launch or a stream synchronization took about 10 us.  Compared with
MatMult, those overheads are tiny.  Does that mean we can ignore them?
What is a proper problem size for evaluating MatMult?  I heard it is a few
thousand rows per MPI rank.  Why?
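
To make the comparison concrete, here is a rough sketch of the arithmetic
behind my question.  It assumes the 0.9 s is the time of the MatMult being
compared against, and that each MatMult issues some number of kernel
launches and synchronizations; the count of 10 is a hypothetical
placeholder, not something I measured.

```python
# Back-of-the-envelope overhead estimate.
# Only 0.9 s and 10 us come from the run described above;
# calls_per_matmult is a hypothetical placeholder.
matmult_time_s = 0.9          # MatMult time observed in the run
overhead_per_call_s = 10e-6   # one kernel launch or stream synchronization
calls_per_matmult = 10        # ASSUMED number of launches + syncs per MatMult


def overhead_fraction(total_s, per_call_s, n_calls):
    """Return the share of total_s spent in n_calls fixed-cost calls."""
    return (per_call_s * n_calls) / total_s


frac = overhead_fraction(matmult_time_s, overhead_per_call_s, calls_per_matmult)
rows_per_rank = 2_000_000 // 6  # rows owned by each of the 6 MPI ranks

print(f"rows per rank:     {rows_per_rank}")
print(f"overhead fraction: {frac:.2e}")
```

The fixed overhead does not shrink with the matrix, so its share should grow
as the number of rows per rank drops, which is presumably why the problem
size matters.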
Thanks.
--Junchao Zhang
