Just in case, increase SlurmdTimeout in slurm.conf (so that when the
controller is back, you will have time to fix issues with the
communication between slurmd and slurmctld, if there are any).
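For example, something like this (the value is just an illustration, use
whatever gives you enough headroom):
SlurmdTimeout=600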
Otherwise it should not affect running and pending jobs. First stop the
controller, then slur
Hello,
check sshd settings (here are ours):
X11Forwarding yes
X11DisplayOffset 10
*X11UseLocalhost no*
Add PrologFlags in slurm.conf:
PrologFlags=x11
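With that in place, requesting forwarding for an interactive job should
be as simple as, for example:
srun --x11 --pty bash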
Cheers,
Barbara
On 11/16/20 7:20 PM, Russell Jones wrote:
Here are some debug logs from the compute node after launching an
interactive shel
A rewound credential error means that the credential appears to have been
encoded more than TTL seconds in the future (the default munge TTL is 5
minutes). So the clock on the decoding host is slower than on the
encoding host. You can try to run munge with a different TTL (munge -t)
just to verify if it i
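For example (the hostname is a placeholder, the TTL just needs to be
large enough to cover the skew):
date; ssh <other-node> date
munge -n -t 600 | ssh <other-node> unmunge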
AFAIK, there were some problems with certain versions of UCX, where UCX
expected OPAL memory hooks from OMPI, but they were disabled and the
physical pages went out of sync. But I don't know if this is the case.
Maybe you could run dynamic debug to see if there is something useful in
dmesg:
ech
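For example, something along these lines (the module name is only an
example, substitute the driver you suspect):
echo 'module mlx5_core +p' > /sys/kernel/debug/dynamic_debug/control
dmesg -w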
I solved the problem by creating a symlink:
ln -s /usr/lib64/libmariadbclient.a /usr/lib64/libmariadb.a
Cheers,
Barbara
> On 11 Nov 2019, at 21:23, William Brown wrote:
>
> I have in fact found the answer by looking harder.
>
> The config.log clearly showed that the build of the test MySQL pro
We have SLURM 19.05 and implemented the cons_tres scheduling type.
It only works when you specify --gpus-per-node when submitting the
job, and there are many more options.
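For example, with something like this in slurm.conf (plus the usual
GresTypes=gpu and gres.conf setup):
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
a submission could then look like (the numbers are just placeholders):
sbatch --gpus-per-node=2 --cpus-per-gpu=4 job.sh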
I found this presentation to be quite informative:
https://slurm.schedmd.com/SLUG18/cons_tres.pdf
We still have the gr
What if you try to run ldconfig manually before building the rpm?
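For example (the tarball name is just whatever release you are building
from):
sudo ldconfig
rpmbuild -ta slurm-19.05.1-2.tar.bz2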
Cheers,
Barbara
On 8/8/19 5:57 PM, Lou Nicotra wrote:
> I am running into an error while trying to
> install slurm-19.05.1-2.el7.centos.x86_64... Error is as follows:
> root@panther02 x86_64# rpm -Uvh slurm-19.05.1-2.el7.centos.x8
You could limit the resources with a QOS. It is not per node, but you
have some options:
https://slurm.schedmd.com/qos.html#limits
Otherwise you could just enforce the limits per partition and put weight
on the nodes, so that the CPU nodes are allocated before the GPU nodes.
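For example (names and limits are placeholders): a per-user TRES limit
on a QOS, and node weights in slurm.conf so that the lower-weight CPU
nodes are allocated first:
sacctmgr modify qos normal set MaxTRESPerUser=cpu=64,gres/gpu=2
NodeName=cpu[001-010] CPUs=32 Weight=10
NodeName=gpu[001-004] CPUs=32 Gres=gpu:4 Weight=100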
Have you checked t
It could be a problem with ARP cache.
If the number of devices approaches 512, there is a kernel limitation in
dynamic ARP-cache size and it can result in the loss of connectivity
between nodes.
By default, the garbage collector will not run if the number of entries
in the cache is below the minimum threshold of 128 (gc_thresh1).
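Raising the neighbour-table thresholds usually helps, for example (the
values are just a starting point, scale them to the size of your cluster):
net.ipv4.neigh.default.gc_thresh1 = 8192
net.ipv4.neigh.default.gc_thresh2 = 16384
net.ipv4.neigh.default.gc_thresh3 = 32768
in /etc/sysctl.conf (or a file under /etc/sysctl.d/), then reload with
sysctl --system.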
Resources are limited with cgroups in SLURM. Check the documentation:
https://slurm.schedmd.com/cgroups.html
You simply specify ProctrackType=proctrack/cgroup and/or
TaskPlugin=task/cgroup in slurm.conf and then configure which resources
are limited, and by how much, in cgroup.conf:
https://slurm
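A minimal sketch (which constraints you enable, and the limits
themselves, are up to you):
# slurm.conf
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
# cgroup.conf
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes
AllowedRAMSpace=100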
> correctly. Please check your database
> connection and try again.
>
> The problem seems to somehow be related to slurmdbd?
> I am a bit lost at this point, to be honest.
>
> Best,
> Bruno
>
> On 29 November 2017 at 14:06, Barbara Krašovec wrote:
Hello,
does munge work?
Check if decoding works locally:
munge -n | unmunge
Check if decoding works remotely:
munge -n | ssh <remote-host> unmunge
It seems as if the munge keys do not match...
See comments inline..
> On 29 Nov 2017, at 14:40, Bruno Santos wrote:
>
> I actually just managed to figure that one out.
>
> T
I was struggling like crazy with this one a while ago.
Then I saw this in the slurm.conf man page:
AccountingStoragePass
The password used to gain access to the database to store the accounting
data. Only used for database type storage plugins, ignored otherwise. In the
case of
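In practice the database password usually lives in slurmdbd.conf rather
than in slurm.conf; a minimal sketch (user, password and host are
placeholders):
# slurmdbd.conf
StorageType=accounting_storage/mysql
StorageUser=slurm
StoragePass=<db-password>
# slurm.conf
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=<dbd-host>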