> Hi Chris
>
> You are right in pointing out that the job actually runs, despite the
> error from sbatch. The customer mentions that:
> === start ===
> The problem had the usual scenario - the job script was submitted and
> executed, but the sbatch command returned a non-zero exit status to
> ecflow, which thus ...
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of
Christopher Harrop - NOAA Affiliate
Sent: Thursday, 13 June 2019 16:47
To: Slurm User Community List
Subject: Re: [slurm-users] Random "sbatch" failure: "Socket timed out on
send/recv operation"
Hi,
My ...
Christopher Benjamin Coffey writes:
> Hi, you may want to look into increasing the sssd cache length on the
> nodes,
We have thought about that, but it will not solve the problem, only make
it less frequent, I think.
> and improving the network connectivity to your ldap
> directory.
That is so ...
> ...
>> One way I'm using to work around this is to inject a long random string
>> into the --comment option. Then, if I see the socket timeout, I use squeue
>> to look for that job and retrieve its ID. It's not ideal, but it can work.
>
> I would have expected a different approach: use a unique ...
I agree with Christopher Coffey - look at the sssd caching.
I have had experience with sssd and can help a bit.
Also, if you are seeing long waits, could you have nested groups?
sssd is notorious for not handling these well, and there are settings in
the configuration file which you can experiment with.
On Thu, 13 Jun 2019, Christopher Harrop - NOAA Affiliate wrote:
...
One way I'm using to work around this is to inject a long random string
into the --comment option. Then, if I see the socket timeout, I use squeue
to look for that job and retrieve its ID. It's not ideal, but it can work.
I wo ...
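The --comment tagging workaround quoted above can be sketched as a small wrapper. This is a hedged illustration, not Harrop's actual script: `job.sh` is a placeholder, and the squeue `%k` format field (which prints a job's comment) is used to find the job by its token.

```shell
#!/bin/sh
# Sketch of the workaround: tag the submission with a random token so the
# job can be found again if sbatch times out after the controller already
# accepted it. "job.sh" and the recovery policy are placeholders.
TOKEN=$(od -An -tx1 -N16 /dev/urandom | tr -d ' \n')   # 32 random hex chars

if command -v sbatch >/dev/null 2>&1; then
    if ! sbatch --comment="$TOKEN" job.sh; then
        # A non-zero exit does not prove the job was rejected; check by token.
        JOBID=$(squeue --noheader --format='%i %k' | awk -v t="$TOKEN" '$2 == t {print $1}')
        if [ -n "$JOBID" ]; then
            echo "submission actually succeeded: job $JOBID"
        else
            echo "job really was not queued; safe to resubmit" >&2
        fi
    fi
fi
```

The token has to be long enough that a collision with another job's comment is implausible; 16 random bytes is plenty.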
The error message cited is associated with SLURM_PROTOCOL_SOCKET_IMPL_TIMEOUT,
which is only ever raised by slurm_send_timeout() and slurm_recv_timeout().
Those functions raise that error when a generic socket-based send/receive
operation exceeds an arbitrary time limit imposed by the caller.
Hi,
My group is struggling with this also.
The worst part of this, which no one has brought up yet, is that the sbatch
command does not necessarily fail to submit the job in this situation. In
fact, most of the time (for us), it succeeds. There appears to be some sort of
race condition or ...
Hi,
we hit the same issue, up to 30,000 entries per day in the slurmctld log.
When we first used SL6 (Scientific Linux), we had massive
problems with sssd, which was often crashing.
We therefore decided to get rid of sssd and manually fill /etc/passwd
and /etc/group via a cron job.
So, yes, we have ...
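A minimal sketch of such a cron job, assuming the directory is still reachable through NSS via `getent`. The snapshot path is illustrative; a real version would merge the result with local system accounts rather than replace the files wholesale.

```shell
#!/bin/sh
# Illustrative snapshot job: dump the current NSS view of users and groups
# to local files, atomically, so later lookups never block on a live LDAP
# query. A production version must preserve local system accounts too.
OUT="${NSS_SNAPSHOT_DIR:-/tmp/nss-snapshot}"
mkdir -p "$OUT"
getent passwd > "$OUT/passwd.tmp" && mv "$OUT/passwd.tmp" "$OUT/passwd"
getent group  > "$OUT/group.tmp"  && mv "$OUT/group.tmp"  "$OUT/group"
```

Writing to a temporary file and renaming keeps readers from ever seeing a half-written passwd file.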
Hi, you may want to look into increasing the sssd cache length on the nodes,
and improving the network connectivity to your ldap directory. I recall when
playing with sssd in the past that it wasn't actually caching. Verify with
tcpdump, and "ls -l" through a directory. Once the uid/gid is resolved ...
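For reference, sssd's cache lifetime is set per domain in sssd.conf; a hedged example of the kind of tuning meant here (the domain name and values are illustrative, not recommendations):

```ini
# /etc/sssd/sssd.conf fragment -- illustrative values only
[domain/example.com]
# keep resolved entries cached for 24h (the default is 5400 seconds)
entry_cache_timeout = 86400
# refresh cached entries in the background before they expire
refresh_expired_interval = 4050
```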
Another possible cause (we currently see it on one of our clusters):
delays in ldap lookups.
We have sssd on the machines, and occasionally, when sssd contacts the
ldap server, it takes 5 or 10 seconds (or even 15) before it gets an
answer. If that happens because slurmctld is trying to look up s ...
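One quick way to see whether directory lookups are the bottleneck is to time an NSS query with a cold and a warm cache. A sketch (`root` is used only because it always exists; in practice you would time a real LDAP user):

```shell
#!/bin/sh
# Time user lookups through NSS. On a host backed by sssd, the first
# lookup of an uncached LDAP user may take seconds; cached lookups
# should return instantly.
command -v sss_cache >/dev/null 2>&1 && sss_cache -E  # flush sssd cache if present
time getent passwd root >/dev/null   # substitute an LDAP user here
time getent passwd root >/dev/null   # second call should hit the cache
```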
-----Original Message-----
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of
Steffen Grunewald
Sent: Tuesday, 11 June 2019 16:28
To: Slurm User Community List
Subject: Re: [slurm-users] Random "sbatch" failure: "Socket timed out on
send/recv operation"
On Tue, 20 ...
I had similar problems in the past.
The 2 most common issues were:
1. Controller load - if the slurmctld was in heavy use, it
sometimes didn't respond in a timely manner, exceeding the timeout
limit.
2. Topology and msg forwarding and aggregation.
For ...
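For the first case, the client-side timeout window can be widened in slurm.conf. A hedged fragment; the values are illustrative, and TreeWidth only matters where tree message forwarding is in play:

```
# slurm.conf fragment -- illustrative; MessageTimeout defaults to 10 seconds
MessageTimeout=30
# TreeWidth controls the message-forwarding fan-out (default 50); one slow
# node in the tree can delay the aggregated reply
TreeWidth=100
```

Raising MessageTimeout papers over slow lookups rather than fixing them, but it buys headroom while the underlying LDAP or load problem is addressed.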
On Tue, 2019-06-11 at 13:56:34 +, Marcelo Garcia wrote:
> Hi
>
> Since mid-March 2019 we have been having a strange problem with slurm.
> Sometimes, the command "sbatch" fails:
>
> + sbatch -o /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.1 -p
> operw /home2/mma002/ecf/home/Aos/Prod/
Hi
Since mid-March 2019 we have been having a strange problem with slurm. Sometimes,
the command "sbatch" fails:
+ sbatch -o /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.1 -p
operw /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.job1
sbatch: error: Batch job submission failed: Socket timed out on send/recv operation