[slurm-users] Re: Re: how to set slurmdbd.conf if using two slurmdbd nodes with an HA database?

2025-02-20 Thread Brian Andrus via slurm-users
Daniel, One way to set up a true HA is to configure master-master SQL instances on both head nodes. Then have each slurmdbd point to the other SQL instance as the backup host. This is likely not necessary as all data going to slurmdbd is cached if slurmdbd is unavailable. In the real world,
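A minimal sketch of the cross-pointing setup described above, assuming master-master MySQL replication already exists between the two head nodes (all host names and the password are invented):

```ini
# /etc/slurm/slurmdbd.conf on head node 1 (mirror with hosts swapped on head node 2)
DbdHost=head1
DbdBackupHost=head2
StorageType=accounting_storage/mysql
StorageHost=db1
StorageBackupHost=db2
StorageUser=slurm
StoragePass=changeme
```

As the post notes, slurmctld caches accounting data while slurmdbd is unreachable, so the backup database is often unnecessary in practice.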

[slurm-users] Re: How to clean up?

2025-02-04 Thread Brian Andrus via slurm-users
Steven, Looks like you may have had a secondary controller that took over and changed your StateSave files. IF you don't need the job info AND no jobs are running, you can just rename/delete your StateSaveLocation directory and things will be recreated. Job numbers will start over (unless y
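The rename-and-recreate step can be sketched as a small helper (the path and the systemctl calls are illustrative; only do this when no jobs are running and the job history is expendable):

```shell
# Sketch of the recovery described above; only safe when no jobs are
# running and job history is expendable. The path is an example
# StateSaveLocation.
reset_statesave() {
  local dir="$1"
  [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
  mv "$dir" "${dir}.bak.$(date +%Y%m%d%H%M%S)"  # slurmctld recreates it on start
}
# systemctl stop slurmctld
# reset_statesave /var/spool/slurmctld
# systemctl start slurmctld     # job numbering starts over
```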

[slurm-users] Re: Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions

2025-01-04 Thread Brian Andrus via slurm-users
Run 'sinfo -R' to see the reason any nodes may be down. It may be as simple as running 'scontrol update state=resume nodename=' to bring them back, if they are down. It depends on the reason they went down (if that is the issue). Otherwise, check the job requirements to see what it is ask

[slurm-users] Re: All GPUs are Usable if no Gres is Defined

2025-01-04 Thread Brian Andrus via slurm-users
Ensure cgroups is working and configured to limit access to devices (which includes gpus). Check your cgroup.conf to see that there is an entry for:     ConstrainDevices=yes Brian Andrus On 1/3/2025 10:49 AM, Groner, Rob via slurm-users wrote: I'm not entirely sure, and I can't vouch for di

[slurm-users] multiple conf-server entries for sackd

2024-12-03 Thread Brian Andrus via slurm-users
Not sure anyone would know, but... If you are running slurm in HA mode (multiple SlurmctldHost entries) is it possible to point sackd to more than one using the --conf-server option? So either specify --conf-server more than once, or have a comma-delimited list of them? The docs are a little
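Assuming sackd follows the same comma-delimited --conf-server syntax documented for slurmd (which is exactly what the post is asking; verify against your version's sackd man page), the invocation would look like:

```shell
# Assumption: sackd accepts a comma-separated controller list like
# slurmd's --conf-server; host names and port are examples.
sackd --conf-server ctl1:6817,ctl2:6817
```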

[slurm-users] Re: sinfo not listing any partitions

2024-12-02 Thread Brian Andrus via slurm-users
You only have one partition, named 'default'. You are not allowed to name it that. Name it something else and you should be good. Brian Andrus On 11/28/2024 6:52 AM, Patrick Begou via slurm-users wrote: Hi Kent, on your management node could you run: systemctl status slurmctld and check your

[slurm-users] Re: Change primary alloc node

2024-11-03 Thread Brian Andrus via slurm-users
On Friday, November 1, 2024, 1:12 AM, Brian Andrus via slurm-users wrote: Likely ma

[slurm-users] Re: Change primary alloc node

2024-10-31 Thread Brian Andrus via slurm-users
Likely many ways to do this, but if you have some code that is dependent on something, that check could be in the code itself. So instead of process 0 being the required process to run, it would be whichever process meets the requirements. eg: case hostname: harold)     Run harold's stuff he
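The hostname-dispatch idea above can be sketched like this (host names and roles are invented examples):

```shell
# Sketch: pick the "primary" role by hostname instead of relying on a
# fixed process 0; "harold" is an invented example host.
role_for() {
  case "$1" in
    harold) echo "primary" ;;   # harold runs the coordinator work
    *)      echo "worker"  ;;   # every other host is a worker
  esac
}
role_for "$(hostname -s)"
```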

[slurm-users] Re: How do you guys track which GPU is used by which job ?

2024-10-16 Thread Brian Andrus via slurm-users
Looks like there is a step you would need to do to create the required job mapping files: The DCGM-exporter can include High-Performance Computing (HPC) job information into its metric labels. To achieve this, HPC environment administrators must configure their HPC environment to generate fil

[slurm-users] Re: what updates NODEADDR

2024-09-21 Thread Brian Andrus via slurm-users
IIRC, you need to ensure reverse lookup for DNS matches your nodename Brian Andrus On 9/20/2024 4:55 PM, Jakub Szarlat via slurm-users wrote: Hi I'm using dynamic nodes with "slurmd -Z" with slurm 23.11.1. Firstly I find that when you do "scontrol show node" it shows the NODEADDR as ip rather

[slurm-users] Re: salloc not starting shell despite LaunchParameters=use_interactive_step

2024-09-06 Thread Brian Andrus via slurm-users
Folks have addressed the obvious config settings, but also check your prolog/epilog scripts/settings as well as the .bashrc/.bash_profile and stuff in /etc/profile.d/ That may be hanging it up. Brian Andrus On 9/5/2024 5:17 AM, Loris Bennett via slurm-users wrote: Hi, With $ salloc --ver

[slurm-users] Re: Bug? sbatch not respecting MaxMemPerNode setting

2024-09-04 Thread Brian Andrus via slurm-users
Angel, Unless you are using cgroups and constraints, there is no limit imposed. The numbers are used by slurm to track what is available, not what you may/may not use. So you could tell slurm the node only has 1GB and it will not let you request more than that, but if you do request only 1GB,
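To turn those numbers into enforced limits rather than bookkeeping, cgroup constraints are needed; a minimal sketch of the relevant fragments (option names are from the stock slurm.conf/cgroup.conf man pages):

```ini
# slurm.conf (fragment)
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup

# cgroup.conf
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
```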

[slurm-users] Re: playing with --nodes=

2024-08-30 Thread Brian Andrus via slurm-users
to "--nodes=1,5,9,13". ... The job will be allocated as many nodes as possible within the range specified and without delaying the initiation of the job. ____ From: Brian Andrus via slurm-users Sent: Thursday, August 29, 2024 7:27:44 PM To:slurm-users@lists.sche

[slurm-users] Re: playing with --nodes=

2024-08-29 Thread Brian Andrus via slurm-users
there is this example for : --nodes=1,5,9,13 so either one specifies [-maxnodes] OR . I checked the logs, and there are no reported errors about wrong or ignored options. MG From: Brian Andrus via slurm-users Sent: Thursday, August 29, 2024 4:11:25 PM To:

[slurm-users] Re: playing with --nodes=

2024-08-29 Thread Brian Andrus via slurm-users
Your --nodes line is incorrect: -N, --nodes=&lt;minnodes&gt;[-maxnodes] Request that a minimum of minnodes nodes be allocated to this job. A maximum node count may also be specified with maxnodes. Looks like it ignored that and used ntasks with ntasks-per-node as 1, giving you 3 nodes. Check your

[slurm-users] Re: Unable to run sequential jobs simultaneously on the same node

2024-08-19 Thread Brian Andrus via slurm-users
IIRC, slurm parses the batch file as options until it hits the first non-comment line, which includes blank lines. You may want to double-check some of the gaps in the option section of your batch script. That being said and you say you removed the '&' at the end of the command, which would

[slurm-users] Re: Upgrade compute node to 24.05.2

2024-08-15 Thread Brian Andrus via slurm-users
It sounds like the new version was built with different options, and/or an install was not done via packages. If you do use rpms, you could try:     dnf provides /usr/lib64/slurm/mpi_none.so If that shows a package that is installed, remove it. If it shows nothing, move the file elsewhere and

[slurm-users] Re: Find out submit host of past job?

2024-08-07 Thread Brian Andrus via slurm-users
If you need it, you could add it to either prologue or epilogue to store the info somewhere. I do that for the scripts themselves and keep the past two weeks backed up so we can debug if/when there is an issue. Brian Andrus On 8/7/2024 6:29 AM, Steffen Grunewald via slurm-users wrote: On We
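A hypothetical epilog fragment for recording submit hosts as suggested above; it assumes SLURM_JOB_ID and SLURM_SUBMIT_HOST are set in the epilog environment, which varies by prolog/epilog type, so check your Slurm version's documentation:

```shell
# Hypothetical epilog fragment: append each job's submit host to a log.
# Assumes SLURM_JOB_ID and SLURM_SUBMIT_HOST are exported in the epilog
# environment; verify against your Slurm version's Prolog/Epilog docs.
record_submit_host() {
  local log="$1"
  printf '%s job=%s submit_host=%s\n' \
    "$(date -Is)" "${SLURM_JOB_ID:-unknown}" "${SLURM_SUBMIT_HOST:-unknown}" >> "$log"
}
# record_submit_host /var/log/slurm/submit_hosts.log
```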

[slurm-users] Re: LRMS error: (-1) Job missing from SLURM."

2024-08-06 Thread Brian Andrus via slurm-users
Felix, Finished jobs roll off the list shown in squeue, so that may be no surprise (depending on settings). If there was a power failure that caused the nodes to restart, it could also be that the job had not been written to slurmdbd, making it unavailable to sacct as well. Your logs look to

[slurm-users] Re: Background tasks in Slurm scripts?

2024-07-26 Thread Brian Andrus via slurm-users
Generally speaking, when the batch script exits, slurm will clean up (ie kill) any stray processes. So, I would expect that executable to be killed at cleanup. Brian Andrus On 7/26/2024 2:45 AM, Steffen Grunewald via slurm-users wrote: On Fri, 2024-07-26 at 10:42:45 +0300, Slurm users wrote:

[slurm-users] Re: CLOUD nodes with unknown IP addresses

2024-07-19 Thread Brian Andrus via slurm-users
Martin, In a nutshell, when slurmd starts, it tells that info to slurmctld. That is the "registration" event mentioned. Brian Andrus On 7/19/2024 5:44 AM, Martin Lee via slurm-users wrote: I've read the following in the slurm power saving docs: https://slurm.schedmd.com/power_save.html *cl

[slurm-users] Re: SLURM noob administrator question

2024-07-11 Thread Brian Andrus via slurm-users
You probably want to look at scontrol show node and scontrol show job for that node and the jobs on it. Compare those and you may find someone requested most all the resources, but are not running them properly. Look at the job itself to see what it is trying to do. Brian Andrus On 7/11/202

[slurm-users] Re: Nodes TRES double what is requested

2024-07-10 Thread Brian Andrus via slurm-users
Jack, To make sure things are set right, run 'slurmd -C' on the node and use that output in your config. It can also give you insight as to what is being seen on the node versus what you may expect. Brian Andrus On 7/10/2024 1:25 AM, jack.mellor--- via slurm-users wrote: Hi, We are runni

[slurm-users] Re: Using sharding

2024-07-04 Thread Brian Andrus via slurm-users
PU=16 DefMemPerGPU=128985 And in the compute node /etc/slurm/gres.conf is: Name=gpu File=/dev/nvidia[0-7] Name=shard Count=32 Thank you! -- Ricardo Cruz - https://rpmcruz.github.io <https://rpmcruz.github.io/> Brian Andrus via slurm-users escreveu (quinta, 4/07/2024 à(s) 17:16): To

[slurm-users] Re: Using sharding

2024-07-04 Thread Brian Andrus via slurm-users
To help dig into it, can you paste the full output of scontrol show node compute01 while the job is pending? Also 'sinfo' would be good. It is basically telling you there aren't enough resources in the partition to run the job. Often this is because all the nodes are in use at that moment. B

[slurm-users] Re: How can I tell the OS that was used to build SLURM?

2024-06-20 Thread Brian Andrus via slurm-users
Carl, You cannot tell from the binary alone. It looks like you just did an apt-get install slurm or such under Ubuntu. Would that be right? You may be able to look at the package and see info about the build environment. Generally, it is best to build slurm yourself for the environment it is

[slurm-users] Re: Can Not Use A Single GPU for Multiple Jobs

2024-06-20 Thread Brian Andrus via slurm-users
Well, if I am reading this right, it makes sense. Every job will need at least 1 core just to run and if there are only 4 cores on the machine, one would expect a max of 4 jobs to run. Brian Andrus On 6/20/2024 5:24 AM, Arnuld via slurm-users wrote: I have a machine with a quad-core CPU and a

[slurm-users] Re: slurmdbd not connecting to mysql (mariadb)

2024-05-30 Thread Brian Andrus via slurm-users
That SIGTERM message means something is telling slurmdbd to quit. Check your cron jobs, maintenance scripts, etc. Slurmdbd is being told to shutdown. If you are running in the foreground, a ^C does that. If you run a kill or killall on it, you will get that same message. Brian Andrus On 5/30

[slurm-users] Re: slurmdbd archive format

2024-05-28 Thread Brian Andrus via slurm-users
Oh, to address the passed train: Restore the archive data with "sacctmgr archive load", then you can do as you need. From man sacctmgr: archive {dump|load}     Write database information to a flat file or load information that has previously been written to a file. Brian Andrus Setup

[slurm-users] Re: slurmdbd archive format

2024-05-28 Thread Brian Andrus via slurm-users
Instead of using the archive files, couldn't you query the db directly for the info you need? I would recommend sacct/sreport if those can get the info you need. Brian Andrus On 5/28/2024 9:59 AM, O'Neal, Doug (NIH/NCI) [C] via slurm-users wrote: My organization needs to access historic job

[slurm-users] Re: Building Slurm debian package vs building from source

2024-05-23 Thread Brian Andrus via slurm-users
On 5/23/2024 6:16 AM, Christopher Samuel via slurm-users wrote: On 5/22/24 3:33 pm, Brian Andrus via slurm-users wrote: A simple example is when you have nodes with and without GPUs. You can build slurmd packages without for those nodes and with for the ones that have them. FWIW we have bot

[slurm-users] Re: Building Slurm debian package vs building from source

2024-05-22 Thread Brian Andrus via slurm-users
Not that I recommend it much, but you can build them for each environment and install the ones needed in each. A simple example is when you have nodes with and without GPUs. You can build slurmd packages without for those nodes and with for the ones that have them. Generally, so long as versi

[slurm-users] Re: Submitting from an untrusted node

2024-05-14 Thread Brian Andrus via slurm-users
Rike, Assuming the data, scripts and other dependencies are already on the cluster, you could just ssh and execute the sbatch command in a single shot: ssh submitnode sbatch some_script.sh It will ask for a password if appropriate and could use ssh keys to bypass that need. Brian Andrus O

[slurm-users] Re: Integrating Slurm with WekaIO

2024-04-19 Thread Brian Andrus via slurm-users
/...). Wouldn't Slurm pick up that one? Thanks! Jeff On Fri, Apr 19, 2024 at 1:11 PM Brian Andrus via slurm-users wrote: This is because you have no slurm.conf in /etc/slurm, so it is trying 'configless' which queries DNS to find out where to get the config. It is

[slurm-users] Re: Integrating Slurm with WekaIO

2024-04-19 Thread Brian Andrus via slurm-users
This is because you have no slurm.conf in /etc/slurm, so it is trying 'configless' which queries DNS to find out where to get the config. It is failing because you do not have DNS configured to tell nodes where to ask about the config. Simple solution: put a copy of slurm.conf in /etc/slurm
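For the DNS route, configless mode looks up an SRV record; a sketch of the zone entry (host name and TTL are examples):

```
_slurmctld._tcp 3600 IN SRV 10 0 6817 ctl1.cluster.example.com.
```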

[slurm-users] Re: Slurm.conf and workers

2024-04-15 Thread Brian Andrus via slurm-users
Xaver, If you look at your slurmctld log, you likely end up seeing messages about each node's slurm.conf not being the same as that on the master. So, yes, it can work temporarily, but unless there are some very specific settings done, issues will arise. The state you are in now, you will wa

[slurm-users] Re: Upgrading nodes

2024-04-10 Thread Brian Andrus via slurm-users
Yes. You can build the 8 rpms on 9. Look at 'mock' to do so. I did similar when I still had to support EL7 Fairly generic plan, the devil is in the details and verifying each step, but those are the basic bases you need to touch. Brian Andrus On 4/10/2024 1:48 PM, Steve Berg via slurm-users

[slurm-users] Re: Elastic Computing: Is it possible to incentivize grouping power_up calls?

2024-04-08 Thread Brian Andrus via slurm-users
Xaver, You may want to look at the ResumeRate option in slurm.conf: ResumeRate The rate at which nodes in power save mode are returned to normal operation by ResumeProgram. The value is a number of nodes per minute and it can be used to prevent power surges if a large number of no
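A sketch of the relevant slurm.conf power-save fragment (all values illustrative):

```ini
# slurm.conf power-save fragment (values are examples)
SuspendTime=600        # idle seconds before a node is suspended
SuspendProgram=/etc/slurm/suspend.sh
ResumeProgram=/etc/slurm/resume.sh
ResumeRate=4           # power up at most 4 nodes per minute
ResumeTimeout=300
```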

[slurm-users] Re: controller backup slurmctld error while takeover

2024-03-25 Thread Brian Andrus via slurm-users
Brian Andrus via slurm-users wrote: Quick correction, it is StateSaveLocation not SlurmSaveState. Brian Andrus On 3/25/2024 8:11 AM, Miriam Olmi via slurm-users wrote: Dear all, I am having trouble finalizing the configuration of the backup controller for my slurm

[slurm-users] Re: controller backup slurmctld error while takeover

2024-03-25 Thread Brian Andrus via slurm-users
Quick correction, it is StateSaveLocation not SlurmSaveState. Brian Andrus On 3/25/2024 8:11 AM, Miriam Olmi via slurm-users wrote: Dear all, I am having trouble finalizing the configuration of the backup controller for my slurm cluster. In principle, if no job is running everything seems f

[slurm-users] Re: controller backup slurmctld error while takeover

2024-03-25 Thread Brian Andrus via slurm-users
Miriam, You need to ensure the SlurmSaveState directory is the same for both. And by 'the same', I mean all contents are exactly the same. This is usually achieved by using a shared drive or replication. Brian Andrus On 3/25/2024 8:11 AM, Miriam Olmi via slurm-users wrote: Dear all, I am hav

[slurm-users] Re: We're Live! Check out the new SchedMD.com now!

2024-03-13 Thread Brian Andrus via slurm-users
Wow, snazzy! Looks very good. My compliments. Brian Andrus On 3/12/2024 11:24 AM, Victoria Hobson via slurm-users wrote: Our website has gone through some much needed change and we'd love for you to explore it! The new SchedMD.com is equipped with the latest information about Slurm, your favo

[slurm-users] Re: Slurm billback and sreport

2024-03-04 Thread Brian Andrus via slurm-users
Chip, I use 'sacct' rather than sreport and get individual job data. That is ingested into a db and PowerBI, which can then aggregate as needed. sreport is pretty general and likely not the best for accurate chargeback data. Brian Andrus On 3/4/2024 6:09 AM, Chip Seraphine via slurm-users
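A sketch of the kind of per-job sacct pull that feeds such a pipeline (date range and output file are examples):

```shell
# One allocation record per job (-X), all users (-a), machine-readable
# output (--parsable2); dates and filename are illustrative.
sacct -a -X -S 2024-03-01 -E 2024-03-31 \
  --format=JobID,User,Account,Partition,ElapsedRaw,AllocTRES%50 \
  --parsable2 > usage_2024-03.psv
```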

[slurm-users] Re: Is SWAP memory mandatory for SLURM

2024-03-04 Thread Brian Andrus via slurm-users
Joseph, You will likely get many perspectives on this. I disable swap completely on our compute nodes. I can be draconian that way. For the workflow supported, this works and is a good thing. Other workflows may benefit from swap. Brian Andrus On 3/3/2024 11:04 PM, John Joseph via slurm-user

[slurm-users] Re: [ext] Re: canonical way to run longer shell/bash interactive job (instead of srun inside of screen/tmux at front-end)?

2024-02-28 Thread Brian Andrus via slurm-users
Brian Andrus On 2/28/2024 12:54 PM, Dan Healy wrote: Are most of us using HAProxy or something else? On Wed, Feb 28, 2024 at 3:38 PM Brian Andrus via slurm-users wrote: Magnus, That is a feature of the load balancer. Most of them have that these days. Brian Andrus

[slurm-users] Re: [ext] Re: canonical way to run longer shell/bash interactive job (instead of srun inside of screen/tmux at front-end)?

2024-02-28 Thread Brian Andrus via slurm-users
Magnus, That is a feature of the load balancer. Most of them have that these days. Brian Andrus On 2/28/2024 12:10 AM, Hagdorn, Magnus Karl Moritz via slurm-users wrote: On Tue, 2024-02-27 at 08:21 -0800, Brian Andrus via slurm-users wrote: for us, we put a load balancer in front of the

[slurm-users] Re: canonical way to run longer shell/bash interactive job (instead of srun inside of screen/tmux at front-end)?

2024-02-27 Thread Brian Andrus via slurm-users
Josef, for us, we put a load balancer in front of the login nodes with session affinity enabled. This makes them land on the same backend node each time. Also, for interactive X sessions, users start a desktop session on the node and then use vnc to connect there. This accommodates disconnect
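Since a later reply in this thread mentions HAProxy, here is a sketch of source-IP session affinity there (names and addresses invented):

```
# haproxy.cfg fragment: pin each client IP to the same login node
frontend ssh_in
    mode tcp
    bind *:22
    default_backend login_nodes

backend login_nodes
    mode tcp
    balance source          # affinity by source IP
    server login01 10.0.0.11:22 check
    server login02 10.0.0.12:22 check
```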

[slurm-users] Re: [INTERNET] Re: question on sbatch --prefer

2024-02-10 Thread Brian Andrus via slurm-users
I imagine you could create a reservation for the node and then when you are completely done, remove the reservation. Each helper could then target the reservation for the job. Brian Andrus On 2/9/2024 5:52 PM, Alan Stange via slurm-users wrote: Chip, Thank you for your prompt response.  We c