Hi Jürgen,
I would take a look at the various *KmemSpace options in cgroup.conf;
they can certainly help with this.
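Something along these lines in cgroup.conf, for example (parameter names are from
the cgroup.conf man page; the values are purely illustrative, so check the defaults
for your Slurm version):

cgroup.conf:
ConstrainKmemSpace=yes
AllowedKmemSpace=2147483648    # absolute kernel-memory limit, in bytes (example value)
MaxKmemPercent=100             # cap as a percentage of the job's allocated RAM
MinKmemSpace=30                # lower bound, in MB

Setting ConstrainKmemSpace=no instead is also a valid answer if enforcing
kernel-memory limits causes more trouble than it solves.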
Cheers,
--
Kilian
On Thu, Jun 13, 2019 at 2:41 PM Juergen Salk wrote:
>
> Dear all,
>
> I'm just starting to get used to Slurm and play around with it in a small test
> environment within our old cluster. ...
Dear all,
I'm just starting to get used to Slurm and play around with it in a small test
environment within our old cluster.
For our next system we will probably have to abandon our current exclusive user
node access policy in favor of a shared user policy, i.e. jobs from different
users will the ...
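For reference, node sharing is usually built on consumable-resource scheduling plus
cgroup enforcement; a rough sketch of the pieces involved, assuming 18.08/19.05-era
plugin names (adjust to your version):

slurm.conf:
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
TaskPlugin=task/cgroup
ProctrackType=proctrack/cgroup

cgroup.conf:
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes

With that in place each job is confined to the cores and memory it requested, which
is what makes it reasonably safe to put jobs from different users on the same node.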
> ...
>> One way I'm using to work around this is to inject a long random string
>> into the --comment option. Then, if I see the socket timeout, I use squeue
>> to look for that job and retrieve its ID. It's not ideal, but it can work.
>
> I would have expected a different approach: use a unique ...
I agree with Christopher Coffey - look at the sssd caching.
I have had experience with sssd and can help a bit.
Also, if you are seeing long waits, could you have nested groups?
sssd is notorious for not handling these well, and there are settings in
the configuration file which you can experiment with ...
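A few sssd.conf settings that are often worth experimenting with here (illustrative
values; the domain name is a placeholder, and sssd.conf(5) / sssd-ldap(5) have the
exact semantics for your version):

[domain/example.com]
enumerate = false
ignore_group_members = true
entry_cache_timeout = 5400
ldap_group_nesting_level = 1

ignore_group_members and a low ldap_group_nesting_level tend to matter most when
large or nested groups are involved.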
On Thu, 13 Jun 2019, Christopher Harrop - NOAA Affiliate wrote:
...
One way I'm using to work around this is to inject a long random string
into the --comment option. Then, if I see the socket timeout, I use squeue
to look for that job and retrieve its ID. It's not ideal, but it can work.
I wo...
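One way to script that workaround, assuming bash and a job script called job.sh
(both just placeholders here); squeue's %k format specifier prints the job's
comment field:

TOKEN=$(date +%s)-$RANDOM-$RANDOM
JOBID=$(sbatch --parsable --comment="$TOKEN" job.sh) || {
    # sbatch reported a failure (e.g. the socket timeout), but the job may
    # still have been accepted, so look for our token among our queued jobs.
    JOBID=$(squeue -u "$USER" -h -o "%i %k" | awk -v t="$TOKEN" '$2 == t {print $1}')
}
echo "submitted job id: $JOBID"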
Thanks. I had no problem setting an individual element of the array.
Just thought that it worked differently in the past! Memory apparently
isn't what it used to be!
Thanks again,
Bill
On 6/13/19 10:25 AM, Jacob Jenson wrote:
scontrol show job ...
The error message cited is associated with SLURM_PROTOCOL_SOCKET_IMPL_TIMEOUT,
which is only ever raised by slurm_send_timeout() and slurm_recv_timeout().
Those functions raise that error when a generic socket-based send/receive
operation exceeds an arbitrary time limit imposed by the caller.
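For what it's worth, in many of those code paths the limit the caller passes in is
derived from MessageTimeout in slurm.conf (default 10 seconds), so that is one knob
to experiment with, e.g.:

slurm.conf:
MessageTimeout=20

Raising it only hides whatever is making slurmctld slow to answer, of course, but it
can cut down on spurious client-side timeouts.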
Hi,
My group is struggling with this also.
The worst part of this, which no one has brought up yet, is that the sbatch
command does not necessarily fail to submit the job in this situation. In
fact, most of the time (for us), it succeeds. There appears to be some sort of
race condition or ...
Bill,
You can always set the time limit on a job array to a specific value:
# scontrol update jobid=123 timelimit=45
You can also increment the time limit on a job array that is still in a
single job record. Separate job records are split off as needed, say when a
task starts or an attempt is made ...
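A couple of concrete forms that do work (the job and task IDs are illustrative):

# scontrol update jobid=123 timelimit=2-00:00:00
# scontrol update jobid=123_7 timelimit=2-00:00:00

The first sets an absolute limit on the whole array, the second on a single array
task; when an array is involved, compute the new value yourself and set it
absolutely instead of using the += form.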
# scontrol update jobid=3136818 timelimit+=30-00:00:00
scontrol: error: TimeLimit increment/decrement not supported for job arrays
This appears to be new in 18.08.7. Am I just missing something here?
Bill