Hi Bill,
In order to shut down the Slurm process on the compute node, is it fine
to kill /usr/sbin/slurm? Or is there a better and safer way to do that?
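
For example, I was considering something like this (assuming the node runs
systemd and the daemon is the standard slurmd service; adjust if the node
uses an init script instead):

    # stop the Slurm compute daemon cleanly instead of killing it
    sudo systemctl stop slurmd
    # or, on an older init-based node:
    sudo /etc/init.d/slurm stop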

Regards,
Mahmood

On Sun, Apr 22, 2018 at 5:44 PM, Bill Barth <bba...@tacc.utexas.edu> wrote:
> Mahmood,
>
> If you have exclusive control of this system and can afford to have 
> compute-0-0 out of production for a while, you can do a simple test (a 
> command sketch follows the steps below):
>
> Shut Slurm down on compute-0-0
> Login directly to compute-0-0
> Run the timing experiment there
> Compare the results to both of the other experiments you have already run on 
> this node and the head node.
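>
> As a rough sketch of those steps (assuming slurmd runs under systemd;
> "./timing_test" is just a placeholder for whatever binary you are timing):
>
>     # keep new jobs off the node while you test
>     scontrol update NodeName=compute-0-0 State=DRAIN Reason="timing test"
>     ssh compute-0-0
>     sudo systemctl stop slurmd    # stop the compute daemon cleanly
>     time ./timing_test            # run the experiment directly on the node
>     sudo systemctl start slurmd   # restore the daemon when finished
>     scontrol update NodeName=compute-0-0 State=RESUME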
>
> The big deal here is to make sure that Slurm is stopped during one of your 
> experiments, and you didn’t say whether you did that or not. If you did, 
> then maybe you have something to worry about.
>
> This takes Slurm out of the loop. It’s possible that something else about 
> compute-0-0 will show itself after you do this test, but this way you can 
> eliminate the overhead of the running Slurm processes. One possibility that 
> comes to my mind is that if compute-0-0 is a multi-socket node, then you may 
> have no or incorrect task and memory binding under Slurm (i.e. your processes 
> may be unbound with memory being allocated on one socket but Linux letting 
> them run on the other), which could easily lead to large performance 
> differences. We neither require nor let Slurm do the binding for us; instead we 
> require our users to use numactl or the MPI runtime to handle it. Maybe you 
> should look into that after you eliminate direct interference from Slurm.
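>
> For instance (the numactl flags below are standard; "./my_app" is a
> placeholder for your own code):
>
>     numactl --hardware                            # inspect the NUMA layout
>     numactl --cpunodebind=0 --membind=0 ./my_app  # bind CPUs and memory to socket 0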
>
> Best,
> Bill.
>
> --
> Bill Barth, Ph.D., Director, HPC
> bba...@tacc.utexas.edu        |   Phone: (512) 232-7069
> Office: ROC 1.435            |   Fax:   (512) 475-9445
>
