[slurm-dev] Re: slurm database purge

2017-10-23 Thread Mehdi Denou
Hello Veronique, what is the value of innodb_buffer_pool_size in my.cnf? (Assuming you're using MariaDB.) Don't hesitate to set it to a few GB, ideally a little more than the size of your DB, if you have enough memory on the server. This improves the overall performance of the database […]
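As a minimal sketch of such a setting (the file path and the 4G figure are assumptions; size the pool to your own database):

    # /etc/my.cnf.d/server.cnf  (path varies by distribution)
    [mysqld]
    innodb_buffer_pool_size = 4G

    # apply the change afterwards, e.g.:
    #   systemctl restart mariadb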

[slurm-dev] Re: Running jobs are stopped and reqeued when adding new nodes

2017-10-23 Thread Merlin Hartley
A workaround is to pre-configure future nodes and mark them as down; then, when you actually add them, you can simply mark them as up (see the DownNodes parameter). Hope this helps! Merlin -- Merlin Hartley Computer Officer MRC Mitochondrial Biology Unit Cambridge, CB2 0XY United Kingdom […]
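A slurm.conf sketch of this idea (node names, counts, and hardware figures are hypothetical):

    # Define all nodes now, including hardware that has not arrived yet:
    NodeName=compute[001-016] CPUs=16 RealMemory=64000
    # Keep the future nodes out of service until they exist:
    DownNodes=compute[009-016] State=DOWN Reason="not yet installed"

Once a new node is physically in place, it can be brought up with something like: scontrol update NodeName=compute009 State=RESUME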

[slurm-dev] Restriction by users resources askings

2017-10-23 Thread David WALTER
Hello SLURM aficionados, I would like to know whether it's possible to restrict node/partition utilization depending on the resources users request, and to restrict which resources users may request. For example, I would like each user to be able to request only a certain amount of RAM for their jobs, without specifying other […]
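Two common ways to impose such caps, as a sketch (the partition name, node list, and limit values are assumptions):

    # slurm.conf: cap how much memory a job may request on a partition (MB):
    PartitionName=batch Nodes=compute[001-016] MaxMemPerNode=32000 Default=YES

    # or, with accounting enabled, cap per-user totals through a QOS:
    #   sacctmgr modify qos normal set MaxTRESPerUser=mem=32G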

[slurm-dev] Re: Running jobs are stopped and reqeued when adding new nodes

2017-10-23 Thread Ole Holm Nielsen
I have added nodes to an existing partition several times using the same procedure which you describe, and no bad side effects have been noticed. This is a very normal kind of operation in a cluster, where hardware may be added or retired from time to time, while the cluster of course continues […]

[slurm-dev] Re: Running jobs are stopped and reqeued when adding new nodes

2017-10-23 Thread Bjørn-Helge Mevik
Ole Holm Nielsen writes: > I have added nodes to an existing partition several times using the same > procedure which you describe, and no bad side effects have been noticed. This > is a very normal kind of operation in a cluster, where hardware may be added > or retired from time to time, while […]

[slurm-dev] Re: Running jobs are stopped and reqeued when adding new nodes

2017-10-23 Thread Ole Holm Nielsen
Hi Jin, Your slurmctld.log says "Node compute004 appears to have a different slurm.conf than the slurmctld", etc. This will happen if you didn't correctly copy slurm.conf to all the nodes. Please correct this potential error. Also, please specify which version of Slurm you're running. /Ole
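One way to push an identical slurm.conf to every node and verify it, as a sketch (hostnames and paths are assumptions; adjust to your site):

    # distribute the file:
    for h in compute001 compute002 compute003 compute004; do
        scp /etc/slurm/slurm.conf root@${h}:/etc/slurm/slurm.conf
    done
    # verify the copies are identical:
    md5sum /etc/slurm/slurm.conf
    ssh compute004 md5sum /etc/slurm/slurm.conf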

[slurm-dev] Re: Running jobs are stopped and reqeued when adding new nodes

2017-10-23 Thread JinSung Kang
Hi, Thanks everyone for your responses. I have also tested removing nodes from the cluster, and the same thing happens. *To answer some of the previous questions:* The "Node compute004 appears to have a different slurm.conf than the slurmctld" error comes up when I replace slurm.conf on all the […]

[slurm-dev] Re: Running jobs are stopped and reqeued when adding new nodes

2017-10-23 Thread Ole Holm Nielsen
Hi Jin, I think that I always do your steps 3 and 4 in the opposite order, restarting slurmctld first and then slurmd on the nodes: > 3. Restart the slurmd on all nodes > 4. Restart the slurmctld Since you run a very old Slurm 15.08, perhaps you should upgrade 15.08 -> 16.05 -> 17.02. Soon there will be a 17.[…]

[slurm-dev] Re: Running jobs are stopped and reqeued when adding new nodes

2017-10-23 Thread Ole Holm Nielsen
The reason for restarting slurmctld before slurmd on the nodes is Moe Jette's advice in http://thread.gmane.org/gmane.comp.distributed.slurm.devel/3039 I would recommend:
1. Stop slurmctld
2. Update slurm.conf on all nodes
3. Restart slurmctld
4. Start slurmd on the new nodes
/Ole […]
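A sketch of that sequence in shell form (systemd unit names and the example hostname are assumptions; sites on older init systems would use service scripts instead):

    systemctl stop slurmctld                  # 1. on the controller
    # 2. distribute the updated slurm.conf to all nodes (scp, pdcp, ...)
    systemctl start slurmctld                 # 3. on the controller
    ssh compute009 systemctl start slurmd     # 4. on each new node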

[slurm-dev] Re: How can I run multi job on one gpu

2017-10-23 Thread Wensheng Deng
There is an earlier thread related to this: https://groups.google.com/forum/#!searchin/slurm-devel/gres$20gpu$20oversubscribe%7Csort:date/slurm-devel/WPmkNPedKeM/r7EDvX7jujgJ On Sat, Oct 21, 2017 at 10:58 PM, Chaofeng Zhang wrote: > CUDA supports it; the GPU is in shared mode by default, so we can have more […]
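For reference, a sketch of inspecting and setting the CUDA compute mode with nvidia-smi (this is outside Slurm; changing the mode requires root, and the GPU index is hypothetical):

    # 'Default' compute mode lets several processes share one GPU:
    nvidia-smi --query-gpu=compute_mode --format=csv
    # switch GPU 0 back to shared mode if it was set exclusive:
    nvidia-smi -i 0 -c DEFAULT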

[slurm-dev] Re: Tasks distribution

2017-10-23 Thread Jeffrey T Frey
The deeper I dig into the select/cons_res plugin, the more of a mess it appears to be: inconsistencies with the documentation, etc. The primary issue seems to be that select/cons_res node selection lacks support for "--ntasks-per-node" et al. By default, the algorithm selects "--nodes=N" nodes, the […]
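To make the behaviour concrete, a hypothetical submission that exercises these options (the counts and script name are assumptions):

    # ask for 8 tasks spread evenly over 2 nodes:
    sbatch --nodes=2 --ntasks=8 --ntasks-per-node=4 job.sh
    # the discussion above concerns whether cons_res honours the
    # per-node task count when it picks the nodes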