I'm a little confused about how this would work. For example, where
does slurmctld run? And if on each submit host, why aren't the control
daemons stepping all over each other?
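For concreteness, here is how I currently picture the setup (host name is made
up); is the idea simply that the workstations never start slurmctld?

    # slurm.conf, shared verbatim by the controller, the compute nodes and
    # every submit host (host name is hypothetical)
    SlurmctldHost=head01   # the one and only place slurmctld runs
    # A workstation acting as a submit host would then only need this file,
    # munge, and the client commands (sbatch, squeue, ...), and no daemon of
    # its own unless it is also meant to run jobs (slurmd).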
On 11/22/18 6:38 AM, Stu Midgley wrote:
> indeed.
>
> All our workstations are submit hosts and in the queue, so peo
I believe that fragmentation only happens on routers when passing traffic from
one subnet to another. Since this traffic was all on a single subnet, there was
no router involved to fragment the packets.
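A quick way to test for that sort of problem, for what it's worth (host name is
just an example):

    # 1472 = 1500 minus 28 bytes of IP + ICMP headers; -M do sets the
    # don't-fragment bit on Linux ping
    ping -c 3 -M do -s 1472 master
    # if this fails while a plain "ping master" works, oversized packets are
    # being dropped somewhere instead of being fragmented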
Mike
On 11/26/18 1:49 PM, Kenneth Roberts wrote:
> D’oh!
> The compute nodes had different MTU o
D'oh!
The compute nodes had a different MTU on their network interfaces than the
master. Once all were set to 1500, it works!
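For anyone else who hits this, the check and the (non-persistent) fix were
basically the following (interface name is just an example):

    ip link show eth0                # the current MTU is printed on the first line
    ip link set dev eth0 mtu 1500    # match the master; make it persistent in
                                     # your distro's network configuration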
So ... any ideas why that was a problem? Maybe the interfaces had a
no-fragmentation flag set and packets were being dropped?
Thanks for listening.
Ken
On Thu, 22 Nov 2018 01:51:59 +0800 (GMT+08:00)
宋亚磊 wrote:
> Hello everyone,
>
> How can I check the percent CPU of a job in Slurm? I tried sacct, sstat,
> and squeue, but I can't find how to do it. Can someone help me?
I've written a small tool, jobload, that takes a jobid and outputs
current per
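Roughly the same information can also be pulled by hand from sstat and sacct
(the job id is just an example; the percentage has to be computed yourself from
TotalCPU and CPUTime):

    # running job: per-step usage
    sstat -j 12345 --format=JobID,AveCPU,AveRSS
    # finished job: TotalCPU (CPU time actually consumed) vs. CPUTime
    # (allocated cores * elapsed) gives a rough percent-CPU figure
    sacct -j 12345 --format=JobID,Elapsed,AllocCPUS,TotalCPU,CPUTime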
I posted about the local display issue a while back ("Built in X11
forwarding in 17.11 won't work on local displays").
I agree that having some locally managed workstations that can also act as
submit nodes is not so uncommon. However, we also ran into this on our
official "login nodes" because we us
I wasn't looking closely enough at the times in the log file.
c2: [2018-11-26T10:09:40.963] debug3: in the service_connection
c2: [2018-11-26T10:10:00.983] debug: slurm_recv_timeout at 0 of 9589, timeout
c2: [2018-11-26T10:10:00.983] error: slurm_receive_msg_and_forward: Socket timed out on se
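That 20-second gap looks suspiciously like a configured timeout expiring; one
way to see what is actually configured (parameter names as reported by
scontrol):

    scontrol show config | grep -iE 'MessageTimeout|TCPTimeout'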
Here is the debug log on a node (c2) when the job fails:
c2: [2018-11-26T07:35:56.261] debug3: in the service_connection
c2: [2018-11-26T07:36:16.281] debug: slurm_recv_timeout at 0 of 9680, timeout
c2: [2018-11-26T07:36:16.282] error: slurm_receive_msg_and_forward: Socket timed out on s
Hi, All,
I have a heterogeneous cluster in which some users need to submit
socket-exclusive jobs. All of the nodes have enough cores on a single socket
for the jobs to run. Is there a way to submit a job that is socket-exclusive
without specifying the core count?
Something like this, but with rea
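One direction that might fit, purely as a sketch and assuming select/cons_res
is in use (my understanding of CR_Socket, not tested here):

    # slurm.conf (sketch): make the whole socket the allocation unit
    SelectType=select/cons_res
    SelectTypeParameters=CR_Socket
    # with sockets as the consumable resource, even a 1-CPU request is
    # allocated (and held by that job) as a whole socket, so the job script
    # never has to spell out the per-node core count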
I'm either misunderstanding how to configure the limit "MaxCPUsPerNode" or
how it behaves. My desired end state is that if a user submits a job to a
partition requesting more resources (CPUs) than are available on any node in
that partition, the job will be immediately rejected, rather than pending
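For reference, a sketch of the pieces I understand to be involved (node and
partition names are made up):

    # slurm.conf (sketch)
    EnforcePartLimits=ALL    # reject jobs that exceed partition limits at
                             # submit time instead of leaving them pending
    PartitionName=batch Nodes=c[1-4] MaxCPUsPerNode=16 State=UP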
Steve,
This doesn't really address your question, and I am guessing you are
aware of this; however, since you did not mention it: "scontrol show
job <jobid>" will give you a lot of detail about a job (a lot more
than squeue). Its "Reason" is the same as in sinfo and squeue, though.
So no help there. I'v
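For example (job id is made up):

    scontrol show job 12345                    # the full record for the job
    scontrol show job 12345 | grep -o 'Reason=[^ ]*'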
Hi Chris,
I really think it is not that uncommon, but in a different way than Tina
explained.
We HAVE dedicated login nodes for the cluster; no institute can submit from
their workstations, they have to log in to our login nodes.
BUT they can do it not only by logging in via ssh, but also via FastX,
The numerical values were used first, and the symbolic values were added later.
Perhaps you could just look at the slurmctld.log output to see what the
maximum log level reported there is?
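For example (log path is just an example):

    grep -oE 'debug[0-9]?' /var/log/slurm/slurmctld.log | sort | uniq -c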
--
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
I'm also interested in this. Another example: "Reason=(ReqNodeNotAvail)" is
all that a user sees when his/her job's walltime runs into a system
maintenance reservation.
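The underlying detail can at least be dug out by hand (job id is made up):

    squeue -j 12345 -o '%i %r %S'   # job id, reason, expected start time
    scontrol show reservation       # shows the maintenance window the job
                                    # is colliding with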
* on Friday, 2018-11-23 09:55 -0500, Steven Dick wrote:
> I'm looking for a tool that will tell me why a spec