Re: [slurm-users] Power9 ACC922

2019-04-16 Thread Sergi More
Hi, Yes, HW identification and task affinity are working as expected. We have SMT=4, and Slurm is able to detect it without problems. See the output from "slurmd -C":
[root@node01 ~]# slurmd -C
NodeName=node01 CPUs=160 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=4 RealMemory=583992
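For reference, the line slurmd -C prints is in the same format slurm.conf uses to describe a node, so it can be pasted into the node definition more or less as-is. A minimal sketch built only from the values above (the State=UNKNOWN field is an assumption, not part of Sergi's output):

  # slurm.conf node definition matching the slurmd -C report above
  NodeName=node01 CPUs=160 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=4 RealMemory=583992 State=UNKNOWN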

Re: [slurm-users] How to apply for multiple GPU cards from different worker nodes?

2019-04-16 Thread Christopher Samuel
On 4/16/19 1:15 AM, Ran Du wrote: > And another question is: how to apply for a number of cards that cannot be divided evenly by 8? For example, to apply for 10 GPU cards: 8 cards on one node and 2 cards on another node? There are new features coming in 19.05 to better support GPUs…
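For what it is worth, 19.05 adds GPU-specific request options (e.g. --gpus / -G for a total count, plus --gpus-per-node and friends), so a request like Ran's could in principle be written without per-node arithmetic. A sketch only, assuming Slurm >= 19.05 and a made-up script name; whether the controller actually splits the cards 8+2 depends on what is free on each node:

  # Slurm 19.05+: ask for 10 GPUs in total across 2 nodes
  sbatch --nodes=2 --gpus=10 train.sh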

Re: [slurm-users] How to apply for multiple GPU cards from different worker nodes?

2019-04-16 Thread Ran Du
Dear Marcus, Thanks a lot for your reply. I will add it to our User Manual and let users know how to apply for multiple GPU cards. Best regards, Ran On Tue, Apr 16, 2019 at 5:40 PM Marcus Wagner wrote: > Dear Ran, > you can only ask for GPUs PER NODE, as GRES are resources per node…

Re: [slurm-users] Power9 ACC922

2019-04-16 Thread Fulcomer, Samuel
We went straight to ESSL. It also has FFTs and selected LAPACK routines, some with GPU support ( https://www-01.ibm.com/common/ssi/ShowDoc.wss?docURL=/common/ssi/rep_sm/1/872/ENUS5765-L61/index.html&lang=en&request_locale=en ). I also try to push people to use MKL on Intel, as it has multi-code-path execution…
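For readers who have not linked against ESSL before: it is consumed like any other BLAS/LAPACK library at link time. A minimal sketch, assuming ESSL's lib directory is already on the linker search path (the compiler choice and file name are illustrative; -lesslsmp selects the threaded variant):

  # link a Fortran app against serial ESSL
  gfortran -O2 myapp.f90 -lessl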

Re: [slurm-users] Power9 ACC922

2019-04-16 Thread Prentice Bisbal
Thanks for the info. Did you try building/using any of the open-source math libraries for Power9, like OpenBLAS, or did you just use ESSL for everything? Prentice On 4/16/19 1:12 PM, Fulcomer, Samuel wrote: We had an AC921 and an AC922 for a while as loaners. We had no problems with SLURM. Getting…

Re: [slurm-users] Power9 ACC922

2019-04-16 Thread Fulcomer, Samuel
We had an AC921 and an AC922 for a while as loaners. We had no problems with SLURM. Getting POWERAI running correctly (bugs since fixed in a newer release) and getting apps properly built and linked to ESSL was the long march. regards, s On Tue, Apr 16, 2019 at 12:59 PM Prentice Bisbal wrote: > Sergi, …

Re: [slurm-users] Effect of PriorityMaxAge on job throughput

2019-04-16 Thread Michael Gutteridge
(sorry, kind of fell asleep on you there...) I wouldn't expect backfill to be a problem, since it shouldn't be starting jobs that won't complete before the priority reservations start. We allow jobs to run past their limit (OverTimeLimit), so in our case it can be a problem. On one of our cloud clusters we have…
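For context, the two knobs in play here are plain slurm.conf settings; the values below are illustrative only, not Michael's actual configuration:

  # slurm.conf (illustrative values)
  SchedulerType=sched/backfill
  PriorityType=priority/multifactor
  PriorityMaxAge=7-0      # queue wait at which the age factor maxes out (days-hours)
  OverTimeLimit=30        # minutes a job may run past its time limit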

Re: [slurm-users] Power9 ACC922

2019-04-16 Thread Prentice Bisbal
Sergi, I'm working with Bill on this project. Is all the hardware identification/mapping and task affinity working as expected/desired on the Power9? I assume your answer implies "yes", but I just want to make sure. Prentice On 4/16/19 10:37 AM, Sergi More wrote: Hi, We have a Power9 cluster…

Re: [slurm-users] Power9 ACC922

2019-04-16 Thread Sergi More
Hi, We have a Power9 cluster (AC922) working without problems, currently on 18.08, but it has run fine with 17.11 as well. No Power9-specific steps or problems came up during installation. Thank you, Sergi. On 16/04/2019 16:05, Bill Wichser wrote: Does anyone on this list run Slurm on the…

[slurm-users] Having issue in running Job using tensorflow

2019-04-16 Thread sudhagar s
sh-4.3# srun -N 2 -n 40 -t 24:00:00 job.sh
srun: error: timeout waiting for task launch, started 0 of 40 tasks
srun: Job step 13.0 aborted before step completely launched.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 13.0 ON ozd2485u CANCELLED…
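"started 0 of 40 tasks" generally means the step never launched on any node, which points at the compute nodes or their daemons rather than at the job script itself. A few standard first checks with stock Slurm commands (the node name and job ID are taken from the log above):

  sinfo -R                      # nodes that are down/drained, with reasons
  scontrol show node ozd2485u   # state of the node named in the error
  scontrol show job 13          # the job behind step 13.0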

[slurm-users] Power9 ACC922

2019-04-16 Thread Bill Wichser
Does anyone on this list run Slurm on the Sierra-like machines from IBM? I believe they are the ACC922 nodes. We are looking to purchase a small cluster of these nodes but have concerns about the scheduler. Just looking for a nod that, yes, it works fine, as well as any issues seen during deployment…

Re: [slurm-users] How to apply for multiple GPU cards from different worker nodes?

2019-04-16 Thread Marcus Wagner
Dear Ran, you can only ask for GPUs PER NODE, as GRES are resources per node. So, you can ask for 5 GPUs and then get 5 GPUs on each of the two nodes. At the moment it is not possible to ask for 8 GPUs on one node and 2 on another. That MIGHT change with Slurm 19.05, since SchedMD is overhauling…
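Concretely, the per-node semantics Marcus describes look like this in a submission; a sketch only, assuming a GRES named "gpu" is defined on the nodes and using a made-up script name:

  # pre-19.05 GRES request: 2 nodes x 5 GPUs each = 10 GPUs total
  sbatch --nodes=2 --gres=gpu:5 train.sh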

Re: [slurm-users] How to apply for multiple GPU cards from different worker nodes?

2019-04-16 Thread Ran Du
Dear Antony, It worked! I checked the allocation, and here is the record:
Nodes=gpu012 CPU_IDs=0-2 Mem=3072 GRES_IDX=gpu:v100(IDX:0-7)
Nodes=gpu013 CPU_IDs=0 Mem=1024 GRES_IDX=gpu:v100(IDX:0-7)
The job got what it asked for. And another question is: how to apply for a number of cards that cannot be divided evenly by 8?…
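For anyone who wants to see the same breakdown for their own jobs: the per-node Nodes=/CPU_IDs=/GRES_IDX= lines above are the detail that "scontrol show job -d" prints for a running allocation (the job ID below is hypothetical):

  scontrol show job -d 12345   # -d adds the per-node CPU_IDs / GRES_IDX detail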