I think that it depends on your kernel and the way the cluster is booted
(for instance the initrd size). You can check the memory used by the kernel
in the dmesg output - search for the line starting with "Memory:". This
amount is fixed. It may also be a good idea to "reserve" some space for
cache and buffers -
check h
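For example, a quick way to find that line (a small sketch):

    # Look for the kernel's boot-time memory accounting in the kernel log
    dmesg | grep 'Memory:'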
We’re using cgroups to limit the memory of jobs, but in our slurm.conf the
total node memory capacity is currently specified. Because of this, there
could be times when physical memory is oversubscribed (the physical
allocation per job plus the kernel's memory requirements) and then swapping
will occur.
Is ther
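One common way to handle this (a sketch; the node name and values are
hypothetical, and MemSpecLimit enforcement depends on the task/cgroup
plugin) is to report less than the physical total to Slurm:

    # slurm.conf: leave headroom for the kernel, cache and buffers by
    # setting RealMemory below physical RAM, optionally reserving an
    # explicit amount for system use with MemSpecLimit (values in MB)
    NodeName=compute01 CPUs=16 RealMemory=15000 MemSpecLimit=1024 State=UNKNOWN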
Matthieu,
> I would bet on something like LDAP requests taking too much time
> because of a missing sssd cache.
Good point! It's easy to forget to check something as "simple" as user
look-up when something is taking "too long".
John DeSantis
On Tue, 16 Jan 2018 19:13:06 +0100, Matthieu Hautreux wrote:
Ciao Elisabetta,
On Tue, Jan 16, 2018 at 04:32:47PM +0100, Elisabetta Falivene wrote:
> being again able to launch slurmctld on the master and slurmd on the nodes.
great!
> NodeName=node[01-08] CPUs=16 RealMemory=16000 State=UNKNOWN
> to
> NodeName=node[01-08] CPUs=16 RealMemory=15999 State=UNKNOWN
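As a side note (not from the original mail), slurmd can report what it
actually detects, which is handy for picking a safe RealMemory value:

    # Run on a compute node: print the detected hardware configuration,
    # including RealMemory, without starting the daemon
    slurmd -C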
Hi,
In this kind of issue, one good thing to do is to get a backtrace of
slurmctld during the slowdown. You should then be able to easily identify
the subcomponent responsible for the issue.
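For instance, one way to grab such a backtrace (a sketch, assuming gdb is
installed on the controller node):

    # Dump the stacks of all slurmctld threads while the slowdown is happening
    gdb -p $(pidof slurmctld) -batch -ex 'thread apply all bt'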
I would bet on something like LDAP requests taking too much time because of
a missing sssd cache.
Regards
Matthieu
Ciao Alessandro,
> setting MessageTimeout to 20 didn't solve it :(
>
> looking at slurmctld logs I noticed many warning like these
>
> Jan 16 05:11:00 r000u17l01 slurmctld[22307]: Warning: Note very large
> processing time from _slurm_rpc_dump_partitions: usec=42850604
On 01/14/2018 09:11 PM, Lachlan Musicman wrote:
> Hi all,
>
> As part of both Munge and SLURM, time synchronised servers are
> necessary.
>
> I keep finding chrony installed and running and ntpd stopped. I
> turn chrony off and restart/enable ntpd bu
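If you want ntpd rather than chrony, something like this should do it
(a sketch; unit names vary by distro - chrony's unit is often chronyd):

    # Stop chrony and keep it from coming back at boot, then enable ntpd
    systemctl disable --now chronyd
    systemctl enable --now ntpd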
Here is the solution and another (minor) problem!
Investigating in the direction of the pid problem, I found that the
configuration contained

SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid

but the pid was being looked for in /var/run/slurm-llnl, so I changed this
in slurm.conf.
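Presumably the corrected entries look something like this (a sketch based
on the /var/run/slurm-llnl path mentioned above):

    SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
    SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid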
> It seems like the pidfile in systemd and slurm.conf are different. Check
> if they are the same and if not adjust the slurm.conf pid files. That
> should prevent systemd from killing slurm.
>
Ehm, sorry, how can I do this?
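One way to compare the two (a sketch; the slurm.conf path assumes a
Debian-style slurm-llnl install):

    # What systemd expects
    systemctl cat slurmctld | grep PIDFile
    systemctl cat slurmd | grep PIDFile
    # What slurm.conf specifies
    grep -i pidfile /etc/slurm-llnl/slurm.conf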
> On Mon, 15 Jan 2018, 18:24 Elisabetta Falivene wrote:
>> The de
>> slurmd: debug2: _slurm_connect failed: Connection refused
>> slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817:
>> Connection refused
> This sounds like the compute node cannot connect back to
> slurmctld on the management node; you should check that the
> IP address
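A quick way to test that from the compute node (using the address and port
from the error message above):

    # Can we reach the slurmctld port at all?
    nc -vz 192.168.1.1 6817
    # Does Slurm think the controller is responding?
    scontrol ping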
Hi,
setting MessageTimeout to 20 didn't solve it :(
looking at the slurmctld logs I noticed many warnings like these:
Jan 16 05:11:00 r000u17l01 slurmctld[22307]: Warning: Note very large
processing time from _slurm_rpc_dump_partitions: usec=42850604
began=05:10:17.289
Jan 16 05:20:58 r000u17l01 slu
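One general way to see which RPCs are eating the time (not mentioned in the
thread, but worth a look) is slurmctld's built-in statistics:

    # Print slurmctld scheduling and RPC statistics, including per-RPC
    # call counts and processing times
    sdiag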
I ended up with a simpler solution: I tweaked the program executable
(a bash script) so that it inspects which partition it is running on,
and if it's the wrong one, it exits. I just added the following lines:
if [ "$SLURM_JOB_PARTITION" = 'big' ]; then
    exit_code=126
    exit "$exit_code"
fi
Hi Trevor
thank you very much
we'll give it a try
ale
- Original Message -
> From: "Trevor Cooper"
> To: "Slurm User Community List"
> Sent: Tuesday, January 16, 2018 12:10:21 AM
> Subject: Re: [slurm-users] slurm 17.11.2: Socket timed out on send/recv
> operation
>
> Alessandro,
>
Hi,
Before we started using QOS for jobs, I could restrict the number of
jobs for an individual user with, say,
sacctmgr modify user where name=alice account=physics set maxjobs=1
However, now that we have configured QOS for jobs, if a user requests a QOS
which has MaxJobs set to, say, 10, the abov
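If the goal is a per-user cap that still holds under a QOS, one option
(a sketch; the QOS name and limit value are hypothetical) is to put the
per-user limit on the QOS itself:

    # Cap each user at 1 running job within the 'normal' QOS
    sacctmgr modify qos where name=normal set MaxJobsPerUser=1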