[slurm-users] Slurm Multi-cluster implementation

2021-10-28 Thread navin srivastava
Hi, I am looking for a step-by-step guide to setting up a multi-cluster implementation. We want to set up 3 clusters and one login node, and run jobs using the -M cluster option. Does anybody have such a setup, and can you share some insight into how it works and whether it is really a stable solution? Regards, Navin.

Re: [slurm-users] Slurm Multi-cluster implementation

2021-10-28 Thread Tina Friedrich
Hi Navin, well, I have two clusters & login nodes that allow access to both. That do? I don't think a third would make any difference in setup. They need to share a database. As long as they share a database, the clusters have 'knowledge' of each other. So if you set up one database server (

Re: [slurm-users] Slurm Multi-cluster implementation

2021-10-28 Thread navin srivastava
Thank you Tina. So, if I understood correctly, the database is global to both clusters and runs on the login node? Or is the database running on one of the master nodes and shared with the other master node? But as far as I have read, the slurm database can also be separate on both the master and ju

Re: [slurm-users] Slurm Multi-cluster implementation

2021-10-28 Thread Tina Friedrich
Hello, I have the database on a separate server (it runs the database and the database only). The login nodes run nothing SLURM related, they simply have the binaries installed & a SLURM config. I've never looked into having multiple databases & using AccountingStorageExternalHost (in fact I

Re: [slurm-users] Slurm Multi-cluster implementation

2021-10-28 Thread navin srivastava
Thank you Tina. It will really help. Regards, Navin
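
Editor's note: as a rough illustration of the setup described in this thread (a login node that only has the Slurm binaries and a config, with the clusters sharing one database), the sketch below builds command lines that use the -M / --clusters option. The helper name and the cluster names are hypothetical and not from the thread; this is a minimal sketch, not a tested deployment recipe.

```python
from typing import List, Sequence


def build_cluster_command(
    tool: str, clusters: Sequence[str], extra_args: Sequence[str] = ()
) -> List[str]:
    """Return an argv list that targets the given clusters.

    `tool` is a Slurm client command such as "sbatch" or "squeue".
    The cluster names are whatever the shared database knows about;
    the ones used below are placeholders.
    """
    if not clusters:
        raise ValueError("at least one cluster name is required")
    # --clusters is the long form of -M and takes a comma-separated list.
    return [tool, "--clusters=" + ",".join(clusters), *extra_args]


if __name__ == "__main__":
    # Hypothetical cluster names and job script, for illustration only.
    print(" ".join(build_cluster_command("sbatch", ["cluster_a"], ["job.sh"])))
    print(" ".join(build_cluster_command("squeue", ["cluster_a", "cluster_b"])))
```

The resulting list can be passed to subprocess.run on a login node where the Slurm client tools are installed.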

[slurm-users] Bug when I run "sinfo --states=idle"

2021-10-28 Thread David Henkemeyer
Hello, I just noticed today that when I run "sinfo --states=idle", I get all the idle nodes, plus an additional node that is in the "DRAIN" state (notice how xavier6 is showing up below, even though it's not in the idle state): (! 807)-> sinfo --states=idle PARTITION AVAIL TIMELIMIT NODES STATE

Re: [slurm-users] errors requesting gpus

2021-10-28 Thread Benjamin Nacar
Found my problem. I had synced the /etc/slurm/* files on all controllers and compute hosts - but not the submit host. Making note of it here in case this helps anyone else. ~~ bnacar On 10/26/21 11:10 AM, Benjamin Nacar wrote: Hi, I'm setting up a slurm cluster where some subset of compute n

[slurm-users] Slurm Crashing - File has zero size

2021-10-28 Thread Pedro Luiz de Castro
Hello all, Since yesterday we’ve been having some trouble with slurm, where it crashes and isn’t able to recover. I’ve managed to track the fault to a zero-sized file by launching slurmctld -D: slurmctld: File /mnt/nfs/lobo/IMM-NFS/slurm/hash.4/job.2044004/environment has zero size That’s the S
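
Editor's note: to see whether other job files are affected in the same way, one option is to walk the directory tree named in the log line and list every zero-size file. The sketch below only reports such files and does not modify anything. The function name is hypothetical, and the directory to scan is an assumption to be supplied by the reader.

```python
import os
import sys
from typing import List


def find_zero_size_files(root: str) -> List[str]:
    """Return a sorted list of regular files under `root` whose size is 0."""
    empty = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.isfile(path) and os.path.getsize(path) == 0:
                    empty.append(path)
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
    return sorted(empty)


if __name__ == "__main__":
    # Usage: python find_empty.py <directory to scan>
    for path in find_zero_size_files(sys.argv[1]):
        print(path)
```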

Re: [slurm-users] Slurm Crashing - File has zero size

2021-10-28 Thread Brian Andrus
You may have space, but do you have enough inodes? Two different things to look at when trying to see why you cannot write to a disk. Also verify that it is writeable by SlurmUser. If something happened and it automatically remounted itself as read-only, that can do it too. Brian Andrus O
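
Editor's note: the checks suggested above (free space versus free inodes, write permission, read-only remount) can be gathered in one place. The following is a minimal sketch for a Unix-like system; the function name is hypothetical. Note that os.access tests the user running the script, so it would need to be run as SlurmUser to answer the permission question, and some filesystems report zero total inodes.

```python
import os
import sys
from typing import Dict


def check_state_dir(path: str) -> Dict[str, object]:
    """Collect space, inode, and writability facts for the filesystem at `path`."""
    st = os.statvfs(path)
    return {
        # Bytes available to unprivileged users.
        "free_bytes": st.f_bavail * st.f_frsize,
        # Inode counts; a filesystem can have free bytes but no free inodes.
        "total_inodes": st.f_files,
        "free_inodes": st.f_favail,
        # True if the filesystem is currently mounted read-only.
        "read_only": bool(st.f_flag & os.ST_RDONLY),
        # Whether the *current* user may create entries in the directory.
        "writable": os.access(path, os.W_OK | os.X_OK),
    }


if __name__ == "__main__":
    # Usage: python check_dir.py <directory to check>
    for key, value in check_state_dir(sys.argv[1]).items():
        print(f"{key}: {value}")
```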