Hi,
I am looking for a stepwise guide to set up a multi-cluster implementation.
We want to set up 3 clusters and one login node, and run jobs using the
-M <cluster> option.
Does anybody have such a setup and could share some insight into how it works,
and whether it is really a stable solution?
Regards
Navin.
Hi Navin,
Well, I have two clusters & login nodes that allow access to both. Will that
do? I don't think a third would make any difference to the setup.
They need to share a database. As long as they share a database, the
clusters have 'knowledge' of each other.
So if you set up one database server (
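Very roughly, and assuming a dedicated accounting host (the host and cluster
names below are only placeholders), each cluster's slurm.conf points at the
same slurmdbd:

    # slurm.conf on cluster A's controller (cluster B is analogous,
    # only ClusterName/SlurmctldHost differ)
    ClusterName=clusterA
    SlurmctldHost=ctl-a
    AccountingStorageType=accounting_storage/slurmdbd
    AccountingStorageHost=dbhost

Once both controllers have registered with that slurmdbd, 'sacctmgr show
clusters' should list them both, and from a login node you can target either
one, e.g. 'sbatch -M clusterB job.sh' or 'squeue -M clusterA,clusterB'.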
Thank you Tina.
So if I understood correctly, the database is global to both clusters and
running on the login node?
Or is the database running on one of the master nodes and shared with the
other master node?
But as far as I have read, the slurm database can also be separate on both
the master and ju
Hello,
I have the database on a separate server (it runs the database and the
database only). The login nodes run nothing SLURM-related; they simply
have the binaries installed & a SLURM config.
I've never looked into having multiple databases & using
AccountingStorageExternalHost (in fact I
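For reference, a minimal sketch of what such a dedicated database server might
run (the values are placeholders, not our actual configuration):

    # slurmdbd.conf on the database server (which runs slurmdbd plus
    # MariaDB/MySQL and nothing else)
    DbdHost=dbhost
    SlurmUser=slurm
    StorageType=accounting_storage/mysql
    StorageHost=localhost
    StorageUser=slurm
    StoragePass=changeme
    StorageLoc=slurm_acct_db

The login nodes then only need the Slurm client binaries and the same
slurm.conf as the cluster they submit to; no daemons run there.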
Thank you Tina.
It will really help
Regards
Navin
On Thu, Oct 28, 2021, 22:01 Tina Friedrich
wrote:
> Hello,
>
> I have the database on a separate server (it runs the database and the
> database only). The login nodes run nothing SLURM related, they simply
> have the binaries installed & a SLUR
Hello,
I just noticed today that when I run "sinfo --states=idle", I get all the
idle nodes, plus an additional node that is in the "DRAIN" state (notice
how xavier6 is showing up below, even though it's not in the idle state):
(! 807)-> sinfo --states=idle
PARTITION AVAIL TIMELIMIT NODES STATE
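If it helps anyone comparing, the full node state including flags like DRAIN
can be checked with something along these lines:

    scontrol show node xavier6 | grep -i state
    sinfo -N -o "%N %T"

A drained node with no running jobs reports IDLE+DRAIN, which is presumably
why it still matches the idle filter.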
Found my problem. I had synced the /etc/slurm/* files on all controllers
and compute hosts - but not the submit host. Making note of it here in
case this helps anyone else.
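In case it helps, a simple way to keep the config in sync is to push it from
the controller to every other host, submit host included (hostnames here are
just placeholders):

    for h in submit1 compute01 compute02; do
        rsync -a /etc/slurm/ "$h":/etc/slurm/
    done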
~~ bnacar
On 10/26/21 11:10 AM, Benjamin Nacar wrote:
Hi,
I'm setting up a slurm cluster where some subset of compute n
Hello all
Since yesterday we’ve been having some trouble with slurm where it crashes and
isn’t able to recover.
I’ve managed to track the fault to a zero-sized file; launching slurmctld -D
shows:
slurmctld: File /mnt/nfs/lobo/IMM-NFS/slurm/hash.4/job.2044004/environment has
zero size
That’s the S
You may have space, but do you have enough inodes?
Two different things to look at when trying to see why you cannot write
to a disk.
Also verify that it is writeable by SlurmUser.
If something happened and it automatically remounted itself as
read-only, that can do it too.
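For example, using the path from the earlier message (and assuming SlurmUser
is 'slurm'):

    df -h /mnt/nfs/lobo/IMM-NFS/slurm     # free space
    df -i /mnt/nfs/lobo/IMM-NFS/slurm     # free inodes
    mount | grep IMM-NFS                  # did it remount read-only?
    sudo -u slurm touch /mnt/nfs/lobo/IMM-NFS/slurm/.write_test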
Brian Andrus
O