Hi,
Yes, HW identification and task affinity are working as expected. We
have SMT=4, and Slurm detects it without problems. See the output
from "slurmd -C":
[root@node01 ~]# slurmd -C
NodeName=node01 CPUs=160 Boards=1 SocketsPerBoard=2 CoresPerSocket=20
ThreadsPerCore=4 RealMemory=583992
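For reference, that output can be dropped more or less verbatim into slurm.conf
as the node definition; a minimal sketch (State=UNKNOWN is my addition, not part
of the slurmd -C output above):

NodeName=node01 CPUs=160 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=4 RealMemory=583992 State=UNKNOWN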
On 4/16/19 1:15 AM, Ran Du wrote:
And another question: how do we request a number of GPU cards that
cannot be divided evenly by 8? For example, to request 10 GPU cards, 8
cards on one node and 2 cards on another node?
There are new features coming in 19.05 for GPUs to better support them
Dear Marcus,
Thanks a lot for your reply. I will add it to our User Manual and let
users know how to request multiple GPU cards.
Best regards,
Ran
On Tue, Apr 16, 2019 at 5:40 PM Marcus Wagner wrote:
> Dear Ran,
>
> you can only ask for GPUS PER NODE, as GRES are resources per node.
We went straight to ESSL. It also has FFTs and selected LAPACK routines, some with
GPU support (
https://www-01.ibm.com/common/ssi/ShowDoc.wss?docURL=/common/ssi/rep_sm/1/872/ENUS5765-L61/index.html&lang=en&request_locale=en
).
I also try to push people to use MKL on Intel, as it has multi-code-path
execution.
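For anyone following along, a rough sketch of linking against ESSL as mentioned
above; the library name is the standard one, but the exact set of runtime
libraries needed varies with the compiler, so treat this as an assumption rather
than a recipe:

gcc -O2 -o myapp myapp.c -lessl      # serial ESSL; the SMP variant is -lesslsmp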
Thanks for the info. Did you try building/using any of the open-source
math libraries for Power9, like OpenBLAS, or did you just use ESSL for
everything?
Prentice
On 4/16/19 1:12 PM, Fulcomer, Samuel wrote:
We had an AC921 and an AC922 for a while as loaners.
We had no problems with Slurm.
Getting PowerAI running correctly (bugs since fixed in a newer release) and
getting apps properly built and linked against ESSL was the long march.
regards,
s
On Tue, Apr 16, 2019 at 12:59 PM Prentice Bisbal wrote:
(sorry, kind of fell asleep on you there...)
I wouldn't expect backfill to be a problem, since it shouldn't be starting
jobs that won't complete before the priority reservations start. We do allow
jobs to run past their limit (OverTimeLimit; see the sketch below), so in our
case it can be a problem.
On one of our cloud clusters we ha
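For anyone unfamiliar with it: OverTimeLimit is set in slurm.conf as the number
of minutes a job may run past its time limit before being killed; a minimal
sketch with a made-up value, not our actual setting:

OverTimeLimit=10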
Sergi,
I'm working with Bill on this project. Is all the hardware
identification/mapping and task affinity working as expected/desired
with the Power9? I assume your answer implies "yes", but I just want to
make sure.
Prentice
On 4/16/19 10:37 AM, Sergi More wrote:
Hi,
We have a Power9 cluster (AC922) working without problems. It is currently
on 18.08, but it has also run 17.11. We found no extra steps or problems
during installation because of the Power9.
Thank you,
Sergi.
sh-4.3# srun -N 2 -n 40 -t 24:00:00 job.sh
srun: error: timeout waiting for task launch, started 0 of 40 tasks
srun: Job step 13.0 aborted before step completely launched.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 13.0 ON ozd2485u CANCELLE
Does anyone on this list run Slurm on the Sierra-like machines from IBM?
I believe they are the AC922 nodes. We are looking to purchase a
small cluster of these nodes but have concerns about the scheduler.
Just looking for a nod that, yes, it works fine, as well as any issues
seen during deployment.
Dear Ran,
you can only ask for GPUS PER NODE, as GRES are resources per node.
So, you can ask for 5 GPUs and then get 5 GPUs on each of the two nodes.
At the moment it is not possible to ask for 8 GPUs on one node and 2 on
another.
That MIGHT change with Slurm 19.05, since SchedMD is overhauling GRES
handling.
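As a concrete illustration (my own example, not from Marcus's mail), a
per-node request such as

#SBATCH --nodes=2
#SBATCH --gres=gpu:5

gives 5 GPUs on each of the two nodes, i.e. 10 in total; an 8+2 split cannot
be expressed this way.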
Dear Antony,
It worked!
I checked the allocation, and here is the record:
Nodes=gpu012 CPU_IDs=0-2 Mem=3072 GRES_IDX=gpu:v100(IDX:0-7)
Nodes=gpu013 CPU_IDs=0 Mem=1024 GRES_IDX=gpu:v100(IDX:0-7)
The job got exactly what it requested.
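For others reading along: that per-node CPU/GRES breakdown can be pulled from
the detailed job view (my assumption about how it was obtained here), e.g.:

scontrol -d show job <jobid>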