Hello,
I want to evaluate MatMult on GPUs. I took a 2M x 2M matrix and ran it with 6 MPI ranks and 6 GPUs. MatMult took about 0.9 seconds, while a kernel launch or a stream synchronization took about 10 us. Compared with MatMult, those overheads are tiny. Does that mean we can ignore them?

What is a proper matrix size for evaluating MatMult? I heard it is a few thousand rows per MPI rank. Why?

Thanks.
--Junchao Zhang
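
For concreteness, the comparison above works out as follows. This is a minimal sketch that only restates the numbers in the message. It assumes the 0.9 s is the MatMult time and the 10 us is the cost of a single kernel launch or stream synchronization. The variable names are illustrative, not taken from any library.

```python
# Numbers quoted in the message above (assumed to describe one MatMult).
matmult_time_s = 0.9       # reported MatMult time, in seconds
overhead_s = 10e-6         # one kernel launch or stream sync, in seconds
global_rows = 2_000_000    # 2M x 2M matrix
ranks = 6                  # 6 MPI ranks, one GPU each


def overhead_fraction(overhead, total):
    """Return the overhead as a fraction of the total time."""
    return overhead / total


# Rows owned by each rank, assuming an even row distribution.
rows_per_rank = global_rows // ranks
frac = overhead_fraction(overhead_s, matmult_time_s)

print(f"rows per rank: {rows_per_rank}")
print(f"overhead / MatMult time: {frac:.2e}")
```

With these inputs, each rank holds about 333 thousand rows, and a single launch or synchronization is roughly one part in 10^5 of the reported MatMult time.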