Re: [slurm-users] jobs stuck in ReqNodeNotAvail,

2017-11-29 Thread Chris Samuel
On Thursday, 30 November 2017 2:21:36 AM AEDT Christian Anthon wrote: > The nodes are fully allocated in terms of memory, but not all cpu > resources are consumed I suspect that's your problem, the job wants 16 cores on a single node and 32GB of RAM free. If you've got no RAM free it's not goi

Re: [slurm-users] Problem with slurmctl communication with clurmdbd

2017-11-29 Thread Chris Samuel
On Thursday, 30 November 2017 5:28:26 PM AEDT Chris Samuel wrote: > Are you starting it with systemctl? If so it might be taking too long for > systemd's liking to upgrade the tables and it might kill it. Ignore that - I skimmed your logs too quickly! [2017-11-29T16:15:22.086] slurmdbd version

Re: [slurm-users] Problem with slurmctl communication with clurmdbd

2017-11-29 Thread Chris Samuel
On Thursday, 30 November 2017 3:26:25 AM AEDT Bruno Santos wrote: > Managed to make some more progress on this. The problem seems to be related to > somehow the service still linking to an older version of slurmdbd I had > installed with apt. I have now hopefully fully cleaned the old version but >

[slurm-users] Strange problem with Slurm 17.11.0: "batch job complete failure"

2017-11-29 Thread Andy Riebs
We've just installed 17.11.0 on our 100+ node x86_64 cluster running CentOS 7.4 this afternoon, and periodically see a single node (perhaps the first node in an allocation?) get drained with the message "batch job complete failure". On one node in question, slurmd.log reports pam_unix(slur

[slurm-users] Slurm version 17.11.0 is now available [PMIx with UCX]

2017-11-29 Thread Artem Polyakov
Dear friends and colleagues, On behalf of Mellanox HPC R&D I would like to emphasize a feature that we introduced in Slurm 17.11 that has been shown [1] to significantly improve the speed and scalability of Slurm jobstart. Starting from this release PMIx plugin supports: (a) Direct point-to-point c

Re: [slurm-users] Show job command after completion with sacct

2017-11-29 Thread Chris Samuel
On 30/11/17 8:57 am, Jacob Chappell wrote: Using "scontrol show jobid X" I can see info about running jobs, including the command used to launch the job, the user's working directory, values of stdout, stdin, stderr, etc. Note that the announcement for 17.11.0 mentions that the job script wil

[slurm-users] Show job command after completion with sacct

2017-11-29 Thread Jacob Chappell
All, Using "scontrol show jobid X" I can see info about running jobs, including the command used to launch the job, the user's working directory, values of stdout, stdin, stderr, etc. With Slurm accounting configured, sacct seems to show *some* of this information about jobs that have completed. H

Re: [slurm-users] '--x11' or no '--x11' when using srun when both methods work for X11 graphical applications

2017-11-29 Thread Matthieu Hautreux
Hi Kevin, Based on my understanding and a discussion with the SLURM dev team on that subject, here is some information about the new X11 support in slurm-17.11: - slurm's native support of X11 forwarding is based on libssh2 - slurm's native support of X11 can be disabled at configure/compila

Re: [slurm-users] jobs stuck in ReqNodeNotAvail,

2017-11-29 Thread Christian Anthon
Thanks, I believe the user must have resubmitted the job, hence the updated id. Cheers, Christian JobId=6986 JobName=Morgens UserId=ferro(2166) GroupId=ferro(22166) MCS_label=N/A Priority=1031 Nice=0 Account=rth QOS=normal JobState=PENDING Reason=ReqNodeNotAvail,_UnavailableNodes: Depen

Re: [slurm-users] fail when trying to set up selection=con_res

2017-11-29 Thread Ethan Van Matre
Here is some more data: Changed slurm.conf to have SelectType=select/cons_res SelectTypeParameters=CR_CPU Then restarted sudo systemctl restart slurmctld.service The log on the host said: [2017-11-29T12:23:56.384] error: we don't have select plugin type 101 [2017-11-29T12:23:56.384] erro

[slurm-users] '--x11' or no '--x11' when using srun when both methods work for X11 graphical applications

2017-11-29 Thread Kevin Manalo
Hello SLURM users, I was reviewing the X11 documentation https://slurm.schedmd.com/faq.html#terminal https://slurm.schedmd.com/faq.html#x11 15. Can tasks be launched with a remote terminal? In Slurm version 1.3 or higher, use srun's --pty option. Until then, you can accomplish this by starting

Re: [slurm-users] fail when trying to set up selection=con_res

2017-11-29 Thread Ethan Van Matre
We do have hyperthreading enabled. Here are some log extracts from various attempts to get it working. [2017-11-28T15:52:30.466] error: we don't have select plugin type 101 [2017-11-28T15:52:30.466] error: select_g_select_jobinfo_unpack: unpack error [2017-11-28T15:52:30.466] error: Malformed

Re: [slurm-users] slurm conf with single machine with multi cores.

2017-11-29 Thread Le Biot, Pierre-Marie
Hello David, So linuxcluster is the Head node and also a Compute node ? Is slurmd running ? What does /var/log/slurm/slurmd.log say ? Regards, Pierre-Marie Le Biot From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of david vilanova Sent: Wednesday, November 29, 2017

Re: [slurm-users] slurm conf with single machine with multi cores.

2017-11-29 Thread Benjamin Redling
On 11/29/17 4:32 PM, david vilanova wrote: Hi, I have updated the slurm.conf as follows: SelectType=select/cons_res SelectTypeParameters=CR_CPU NodeName=linuxcluster CPUs=2 PartitionName=testq Nodes=linuxcluster Default=YES MaxTime=INFINITE State=UP Still get testq node in down status ??? Any

Re: [slurm-users] Problem with slurmctl communication with clurmdbd

2017-11-29 Thread Bruno Santos
Managed to make some more progress on this. The problem seems to be related to the service somehow still linking to an older version of slurmdbd I had installed with apt. I have now hopefully fully cleaned the old version but when I try to start the service it is getting killed somehow. Any suggestio

Re: [slurm-users] jobs stuck in ReqNodeNotAvail,

2017-11-29 Thread Merlin Hartley
damn autocorrect - I meant: # scontrol show job 6982 -- Merlin Hartley Computer Officer MRC Mitochondrial Biology Unit Cambridge, CB2 0XY United Kingdom > On 29 Nov 2017, at 16:08, Merlin Hartley > wrote: > > Can you give us the output of > # control show job 6982 > > Could be an issue wi

Re: [slurm-users] jobs stuck in ReqNodeNotAvail,

2017-11-29 Thread Merlin Hartley
Can you give us the output of # control show job 6982 Could be an issue with requesting too many CPUs or something… Merlin -- Merlin Hartley Computer Officer MRC Mitochondrial Biology Unit Cambridge, CB2 0XY United Kingdom > On 29 Nov 2017, at 15:21, Christian Anthon wrote: > > Hi, > > I ha

Re: [slurm-users] Problem with slurmctl communication with clurmdbd

2017-11-29 Thread Philip Kovacs
Step back from slurm and confirm that MariaDB is up and responsive. # mysql -uroot -p Enter password: Welcome to the MariaDB monitor. Commands end with ; or \g. Your MariaDB connection id is 8 Server version: 10.2.9-MariaDB MariaDB Server Copyright (c) 2000, 2017, Oracle, MariaDB Corporation Ab and

Re: [slurm-users] slurm conf with single machine with multi cores.

2017-11-29 Thread david vilanova
Hi, I have updated the slurm.conf as follows: SelectType=select/cons_res SelectTypeParameters=CR_CPU NodeName=linuxcluster CPUs=2 PartitionName=testq Nodes=linuxcluster Default=YES MaxTime=INFINITE State=UP Still get testq node in down status ??? Any idea ? Below log from db and controller: ==>

[slurm-users] jobs stuck in ReqNodeNotAvail,

2017-11-29 Thread Christian Anthon
Hi, I have a problem with a newly setup slurm-17.02.7-1.el6.x86_64 that jobs seems to be stuck in ReqNodeNotAvail:   6982 panic  Morgens    ferro PD   0:00 1 (ReqNodeNotAvail, UnavailableNodes:)   6981 panic SPEC    ferro PD   0:00 1 (ReqNodeNotAva

Re: [slurm-users] Problem with slurmctl communication with clurmdbd

2017-11-29 Thread Bruno Santos
Hi Barbara, This is a fresh install. I have installed slurm from source on Debian stretch and am now trying to set it up correctly. MariaDB is running but I am confused about the database configuration. I followed a tutorial (I can no longer find it) that showed me how to create the database and

Re: [slurm-users] slurm conf with single machine with multi cores.

2017-11-29 Thread Steffen Grunewald
Hi David, On Wed, 2017-11-29 at 14:45:06 +, david vilanova wrote: > Hello, > I have installed latest 7.11 release and my node is shown as down. > I have a single physical server with 12 cores so not sure the conf below is > correct ?? can you help ?? > > In slurm.conf the node is configured as

Re: [slurm-users] Problem with slurmctl communication with clurmdbd

2017-11-29 Thread Barbara Krašovec
Did you upgrade SLURM or is it a fresh install? Are there any associations set? For instance, did you create the cluster with sacctmgr? sacctmgr add cluster Is mariadb/mysql server running, is slurmdbd running? Is it working? Try a simple test, such as: sacctmgr show user -s If it was an upgra

[slurm-users] slurm conf with single machine with multi cores.

2017-11-29 Thread david vilanova
Hello, I have installed latest 7.11 release and my node is shown as down. I have a single physical server with 12 cores so not sure the conf below is correct ?? can you help ?? In slurm.conf the node is configured as follows: NodeName=linuxcluster CPUs=1 RealMemory=991 Sockets=12 CoresPerSocket=1

Re: [slurm-users] Problem with slurmctl communication with clurmdbd

2017-11-29 Thread Bruno Santos
Thank you Barbara, Unfortunately, it does not seem to be a munge problem. Munge can successfully authenticate with the nodes. I have increased the verbosity level and restarted the slurmctld and now I am getting more information about this: > Nov 29 14:08:16 plantae slurmctld[30340]: Registering

Re: [slurm-users] Problem with slurmctl communication with clurmdbd

2017-11-29 Thread Barbara Krašovec
Hello, does munge work? Check whether decode works locally: munge -n | unmunge Check whether decode works remotely: munge -n | ssh unmunge It seems the munge keys do not match... See comments inline.. > On 29 Nov 2017, at 14:40, Bruno Santos wrote: > > I actually just managed to figure that one out. > > T

Re: [slurm-users] Problem with slurmctl communication with clurmdbd

2017-11-29 Thread Barbara Krašovec
I was struggling like crazy with this one a while ago. Then I saw this in the slurm.conf man page: AccountingStoragePass The password used to gain access to the database to store the accounting data. Only used for database type storage plugins, ignored otherwise. In the case of

Re: [slurm-users] Problem with slurmctl communication with clurmdbd

2017-11-29 Thread Bruno Santos
I actually just managed to figure that one out. The problem was that I had set up AccountingStoragePass=magic in the slurm.conf file, while after re-reading the documentation it seems this is only needed if I have a different munge instance controlling the logins to the database, which I don't. So c

Re: [slurm-users] Problem with slurmctl communication with clurmdbd

2017-11-29 Thread Andy Riebs
It looks like you don't have the munged daemon running. On 11/29/2017 08:01 AM, Bruno Santos wrote: Hi everyone, I have set up slurm to use slurm_db and all was working fine. However I had to change the slurm.conf to play with user priority and upon restarting the slurmctl it fails with the f

[slurm-users] Problem with slurmctl communication with clurmdbd

2017-11-29 Thread Bruno Santos
Hi everyone, I have set up slurm to use slurm_db and all was working fine. However I had to change the slurm.conf to play with user priority and upon restarting the slurmctl it fails with the following messages below. It seems that somehow it is trying to use the mysql password as a munge socket? Any