Hi Bill,

In order to shut down the Slurm process on the compute node, is it fine to kill /usr/sbin/slurm? Or is there a better and safer way to do that?
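For example, I was thinking of something like the following (assuming the daemon runs under systemd as slurmd; the drain step is just my guess at the safe order):

  # drain the node first so no new jobs land on it
  scontrol update NodeName=compute-0-0 State=DRAIN Reason="timing test"

  # then stop the compute-node daemon
  systemctl stop slurmd

Would that be the right sequence, or is killing the process directly acceptable?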
Regards,
Mahmood

On Sun, Apr 22, 2018 at 5:44 PM, Bill Barth <bba...@tacc.utexas.edu> wrote:
> Mahmood,
>
> If you have exclusive control of this system and can afford to have
> compute-0-0 out of production for a while, you can do a simple test:
>
> Shut Slurm down on compute-0-0
> Log in directly to compute-0-0
> Run the timing experiment there
> Compare the results to both of the other experiments you have already run on
> this node and the head node.
>
> The big deal here is to make sure that Slurm is stopped during one of your
> experiments, and you didn't say whether you did that or not. If you did,
> then maybe you have something to worry about.
>
> This takes Slurm out of the loop. It's possible that something else about
> compute-0-0 will show itself after you do this test, but this way you can
> eliminate the overhead of the running Slurm processes. One possibility that
> comes to mind is that if compute-0-0 is a multi-socket node, then you may
> have no or incorrect task and memory binding under Slurm (i.e., your processes
> may be unbound, with memory allocated on one socket while Linux lets them
> run on the other), which could easily lead to large performance
> differences. We don't require or let Slurm do bindings for us; instead we
> require our users to use numactl or the MPI runtime to handle it for them.
> Maybe you should look into that after you eliminate direct interference
> from Slurm.
>
> Best,
> Bill.
>
> --
> Bill Barth, Ph.D., Director, HPC
> bba...@tacc.utexas.edu | Phone: (512) 232-7069
> Office: ROC 1.435 | Fax: (512) 475-9445
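P.S. Regarding the numactl suggestion: to make sure I understand, do you mean pinning both the tasks and their memory to one socket, e.g. something like this (NUMA node 0 and ./myprog are placeholders)?

  # bind execution and memory allocation to NUMA node 0
  numactl --cpunodebind=0 --membind=0 ./myprog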