[slurm-users] Re: how to set slurmdbd.conf if using two slurmdbd nodes with an HA database?

2025-02-21 Thread Daniel Letai via slurm-users
kupHost will be totally duplicated.   Zhang Tianyang, Network Information Center, Computing Services Department   From: Daniel Letai Sent: 21 Feb 2025 14:04 To: taleinterve...@sjtu.edu.cn Cc: slurm-users@lists.sche

[slurm-users] Re: how to set slurmdbd.conf if using two slurmdbd nodes with an HA database?

2025-02-20 Thread Daniel Letai via slurm-users
a proxy that forwards requests to the DbdBackupHost and returns the data from there to slurmctld?     From: Daniel Letai Sent: 20 Feb 2025 21:56 To: taleinterve...@sjtu.edu.cn

[slurm-users] Re: how to set slurmdbd.conf if using two slurmdbd nodes with an HA database?

2025-02-20 Thread Daniel Letai via slurm-users
DbdBackupHost and returns the data from there to slurmctld?     From: Daniel Letai Sent: 20 Feb 2025 21:56 To: taleinterve...@sjtu.edu.cn Cc: slurm-users

[slurm-users] Re: how to set slurmdbd.conf if using two slurmdbd nodes with an HA database?

2025-02-20 Thread Daniel Letai via slurm-users
BackupHost option and how it works?     From: Daniel Letai Sent: 19 Feb 2025 18:21 To: slurm-users@lists.schedmd.com Subject: [slurm-users] Re: how to set slurmdbd.conf if using t

[slurm-users] Re: how to set slurmdbd.conf if using two slurmdbd nodes with an HA database?

2025-02-19 Thread Daniel Letai via slurm-users
I'm not sure it will work (I didn't test it), but could you just set `dbdhost=localhost` to solve this? On 18/02/2025 11:59, hermes via slurm-users wrote: The deployment scenario is as follows:
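For readers landing on this thread: the standard primary/backup slurmdbd layout being debated here looks roughly like the sketch below. Hostnames and credentials are placeholders, and the `dbdhost=localhost` idea above remains untested.

    # slurmdbd.conf (same file on both dbd hosts; assumes each host also runs a node of the HA database)
    DbdHost=dbd1
    DbdBackupHost=dbd2
    StorageType=accounting_storage/mysql
    StorageHost=localhost      # each slurmdbd talks to its local database node
    StorageUser=slurm
    StoragePass=change_me

    # slurm.conf on the controller
    AccountingStorageType=accounting_storage/slurmdbd
    AccountingStorageHost=dbd1
    AccountingStorageBackupHost=dbd2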

[slurm-users] Re: Run only one time on a node

2025-02-19 Thread Daniel Letai via slurm-users
There are a couple of options here, not exactly convenient but they will get the job done: 1. Use a job array, with `-N 1 -w <nodename>` defined for each array task. You can do the same without an array, using a for loop to submit separate sbatch jobs (see the sketch below). 2. Use `scontrol reboot`. Set the reb
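A minimal sketch of the loop variant, assuming a partition named mypart and a script job.sh (both placeholders):

    # submit one single-node job per node in the partition
    for n in $(sinfo -N -h -p mypart -o '%N' | sort -u); do
        sbatch -N 1 -w "$n" job.sh
    done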

[slurm-users] Re: REST API - get_user_environment

2024-08-29 Thread Daniel Letai via slurm-users
.month.minor version system a long time ago. The major releases are (now) every 6 months, so the most recent ones have been: * 23.02.0 * 23.11.0 (old 9 month system) * 24.05.0 (new 6 month system) Next major release should be in November: * 24.11.0 All the best, Chris -- Regards, Daniel Letai

[slurm-users] Re: REST API - get_user_environment

2024-08-23 Thread Daniel Letai via slurm-users
https://github.com/SchedMD/slurm/blob/ffae59d9df69aa42a090044b867be660be259620/src/plugins/openapi/v0.0.38/jobs.c#L136 but no longer in https://github.com/SchedMD/slurm/blob/slurm-23.02/src/plugins/openapi/v0.0.39/jobs.c Which underwent major revision In the next openapi version On 22/0

[slurm-users] Re: With slurm, how to allocate a whole node for a single multi-threaded process?

2024-08-05 Thread Daniel Letai via slurm-users
I think the issue is more severe than you describe. Slurm juggles the needs of many jobs. Just because some resources are available at the exact second a job starts doesn't mean those resources are not pre-allocated for some future job waiting for e
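For the question that started this thread, the usual way to request a whole node for one multi-threaded process looks roughly like this sketch (the core count and program name are placeholders to adjust per site):

    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --exclusive        # take the entire node
    #SBATCH --ntasks=1         # a single process
    #SBATCH --cpus-per-task=32 # match the node's core count
    srun ./my_threaded_app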

[slurm-users] Re: slurmrestd 24.05.1: crashes when GET on /slurm/v0.0.41/nodes : unsorted double linked list corrupted

2024-07-24 Thread Daniel Letai via slurm-users
slurmserver2.koios.lan slurmrestd[1502900]: debug4: xsignal: Swap signal PIPE[13] to 0x1 from 0x408376 čec 24 14:37:55 slurmserver2.koios.lan slurmrestd[1502900]: debug4: xsignal: Swap signal PIPE[13] to 0x408376 from 0x1 čec 24 14:37:55 slurmserv

[slurm-users] Re: Custom Plugin Integration

2024-07-19 Thread Daniel Letai via slurm-users
input) to Slurm as a simple string of sbatch flags, and just let Slurm do its thing. It sounds simpler than forcing all other users of the cluster to adhere to your particular needs without introducing unnecessary complexity to the cluster. Regards, Bhaskar. Regards, --Dani_L. O

[slurm-users] Re: Custom Plugin Integration

2024-07-17 Thread Daniel Letai via slurm-users
o believe Slurm would also have some possibilities.) Regards, Bhaskar. -- Regards, Daniel Letai +972 (0)505 870 456 -- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-le...@lists.schedmd.com

[slurm-users] Re: Custom Plugin Integration

2024-07-12 Thread Daniel Letai via slurm-users
ery similar) is already answered, please point to the relevant thread then. Thanks in advance for any pointers. Regards, Bhaskar. -- Regards, Daniel Letai +972 (0)505 870 456 -- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-le...@lists.schedmd.com

[slurm-users] Replacing MUNGE with SACK (auth/slurm)

2024-07-11 Thread Daniel Letai via slurm-users
Does SACK replace MUNGE? As in - MUNGE is not required when building Slurm or on compute? If so, can the Requires and BuildRequires for munge be made optional on bcond_without_munge in the spec file? Or is there a reason MUNGE must remain a hard require for Slurm? Thanks, --Dani_L. -- sl
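For reference, the spec-file conditional being asked about is the standard RPM bcond pattern; a sketch only, not the actual upstream slurm.spec:

    # default build is with munge; disable with `rpmbuild --without munge`
    %bcond_without munge
    %if %{with munge}
    BuildRequires: munge-devel
    Requires: munge
    %endif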

[slurm-users] Re: Convergence of Kube and Slurm?

2024-05-06 Thread Daniel Letai via slurm-users
There is a kubeflow offering that might be of interest: https://www.dkube.io/post/mlops-on-hpc-slurm-with-kubeflow I have not tried it myself, no idea how well it works. Regards, --Dani_L. On 05/05/2024 0:05, Dan Healy via slurm-us

Re: [slurm-users] Usage of particular GPU out of 4 GPUs while submitting jobs to DGX Server

2023-11-20 Thread Daniel Letai
Hi Ravi, On 20/11/2023 6:36, Ravi Konila wrote: Hello Everyone   My question is related to submission of jobs to those GPUs. How does a student submit a job to a particular GPU

Re: [slurm-users] stopping job array after N failed jobs in row

2023-08-01 Thread Daniel Letai
Not sure about automatically canceling a job array, except perhaps by submitting 2 consecutive arrays - first of size 20, and the other with the rest of the elements and a dependency of afterok. That said, a single job in a job array in Slurm documentation is refe
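The two-consecutive-arrays workaround could be wired up roughly like this (script name and array bounds are placeholders):

    # first batch of 20 elements
    first=$(sbatch --parsable --array=1-20 array_job.sh)
    # the rest only becomes eligible if the first batch finished OK
    sbatch --dependency=afterok:$first --array=21-1000 array_job.sh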

Re: [slurm-users] Slurmdbd High Availability

2023-04-15 Thread Daniel Letai
My go-to solution is setting up a Galera cluster using 2 slurmdbd servers (each pointing to its local DB) and a 3rd quorum server. It's fairly easy to set up and doesn't rely on block-level duplication, HA semantics or shared storage. Just my 2 cents
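A rough sketch of the Galera side of that layout; hostnames and the provider path are illustrative, and the third host can run garbd as a quorum arbitrator instead of a full MariaDB node:

    # /etc/my.cnf.d/galera.cnf on each of the two DB hosts
    [galera]
    wsrep_on=ON
    wsrep_provider=/usr/lib64/galera-4/libgalera_smm.so
    wsrep_cluster_name=slurm_acct
    wsrep_cluster_address=gcomm://dbd1,dbd2,arbiter
    binlog_format=ROW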

Re: [slurm-users] Jobs can grow in RAM usage surpassing MaxMemPerNode

2023-01-12 Thread Daniel Letai
varro -- Regards, Daniel Letai +972 (0)505 870 456

Re: [slurm-users] unused job data fields?

2022-10-04 Thread Daniel Letai
te: Hello, are there additional job data fields in slurm besides the job name which can be used for additional information? The information should not be used by slurm, only included in the database for external evaluation. Thanks Mike -- Regards, Daniel Letai +972 (0)505 870 456

Re: [slurm-users] srun using infiniband

2022-09-03 Thread Daniel Letai
Hello Anne, On 01/09/2022 02:01:53, Anne Hammond wrote: We have a CentOS 8.5 cluster, Slurm 20.11, Mellanox ConnectX-6 HDR IB and a Mellanox 32-port switch. Our application is not scaling. I

Re: [slurm-users] do oversubscription with algorithm other than least-loaded?

2022-03-03 Thread Daniel Letai
the number of nodes we need to run and reduce costs. Is there a way to get this behavior somehow? Herc -- Regards, Daniel Letai +972 (0)505 870 456

Re: [slurm-users] How to determine (on the ControlMachine) which cores/gpus are assigned to a job?

2021-02-18 Thread Daniel Letai
I don't have access to a cluster right now so I can't test this, but possibly tres_alloc: squeue -O JobID,Partition,Name,tres_alloc,NodeList -j <jobid> might give some more info. On 04/02/2021 17:01, Thomas Zeiser wrot
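Not from this thread, but another way to see the per-node CPU IDs and GRES actually assigned to a running job is the detailed job view:

    scontrol show job -d <jobid>   # the Nodes=... CPU_IDs=... GRES=... lines show the allocation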

Re: [slurm-users] Use gres to handle permissions of /dev/dri/card* and /dev/dri/renderD*?

2020-10-21 Thread Daniel Letai
Just a quick addendum - rsmi_dev_drm_render_minor_get used in the plugin references the ROCM-SMI lib from https://github.com/RadeonOpenCompute/rocm_smi_lib/blob/2e8dc4f2a91bfa7661f4ea289736b12153ce23c2/src/rocm_smi.cc#L1689 So the library (as an .so file) should be installe

Re: [slurm-users] Use gres to handle permissions of /dev/dri/card* and /dev/dri/renderD*?

2020-10-21 Thread Daniel Letai
Take a look at https://github.com/SchedMD/slurm/search?q=dri%2F If the ROCM-SMI API is present, using AutoDetect=rsmi in gres.conf might be enough, if I'm reading this right. Of course, this assumes the cards in question are AMD and not NVIDIA.
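In practice that would amount to something like the following sketch (GPU count and node names are illustrative):

    # gres.conf on the GPU nodes
    AutoDetect=rsmi

    # slurm.conf
    GresTypes=gpu
    NodeName=gpu[01-04] Gres=gpu:4 ...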

Re: [slurm-users] how to restrict jobs

2020-05-07 Thread Daniel Letai
On 06/05/2020 20:44, Mark Hahn wrote: Is there no way to set or define a custom variable like that at node level? You could use a per-node Feature for this, but a partition would also work. A bit of an ugly hack,
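A per-node Feature amounts to a tag on the node definition that jobs then request with --constraint; a sketch with made-up names:

    # slurm.conf
    NodeName=node[01-10] Features=bigscratch ...

    # job submission
    sbatch --constraint=bigscratch job.sh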

Re: [slurm-users] not allocating jobs even resources are free

2020-05-03 Thread Daniel Letai
PriorityUsageResetPeriod=DAILY PriorityWeightFairshare=50 PriorityFlags=FAIR_TREE Regards Navin. On Mon, Apr 27, 2020 at 9:37 P

Re: [slurm-users] not allocating jobs even resources are free

2020-04-27 Thread Daniel Letai
-- Regards, Daniel Letai +972 (0)505 870 456

[slurm-users] Assigning gpu freq values manually

2020-04-21 Thread Daniel Letai
Is it possible to assign GPU freq values without use of a specialized plugin? Currently GPU freqs can be assigned by use of AutoDetect=nvml or AutoDetect=rsmi in gres.conf, but I can't find any reference to assigning freq values manually
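For reference, the knobs that do exist are the cluster-wide default and the per-job option; whether they work without NVML/RSMI autodetection is exactly the open question above. Values below are illustrative:

    # slurm.conf
    GpuFreqDef=medium

    # per-job request
    srun --gpu-freq=high ./gpu_app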

Re: [slurm-users] Alternative to munge for use with slurm?

2020-04-18 Thread Daniel Letai
In v20.02 you can use JWT, as per https://slurm.schedmd.com/jwt.html; the only issue is getting libjwt for most RPM-based distros. The current libjwt `configure; make dist-all` doesn't work. I had to cd into dist and run 'make rpm' to create the spec file, then rpm
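In recent Slurm releases the resulting configuration is small; the key path below is just the conventional location under the state save directory:

    # slurm.conf and slurmdbd.conf
    AuthAltTypes=auth/jwt
    AuthAltParameters=jwt_key=/var/spool/slurm/statesave/jwt_hs256.key

    # issue a token for a user
    scontrol token username=someuser lifespan=3600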

Re: [slurm-users] Need to execute a binary with arguments on a node

2019-12-18 Thread Daniel Letai
Use sbatch's wrapper command: sbatch --wrap='ls -l /tmp' Note that the output will be in the working directory on the execution node, by default with the name slurm-<jobid>.out On 12/18/19 8:40 PM, William Brown wrote: Sometim
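If the default location is inconvenient, the output can be redirected to shared storage in the same call (the path is illustrative):

    sbatch --wrap='ls -l /tmp' --output=/shared/logs/ls-%j.out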

Re: [slurm-users] Limiting the number of CPU

2019-11-14 Thread Daniel Letai
lly run on node cn110, so you may want to check that out with sinfo. A quick "sinfo -R" can list any down machines and the reasons. Brian Andrus -- Regards, Daniel Letai +972 (0)505 870 456

Re: [slurm-users] RPM build error - accounting_storage_mysql.so

2019-11-11 Thread Daniel Letai
On 11/12/19 9:34 AM, Ole Holm Nielsen wrote: On 11/11/19 10:14 PM, Daniel Letai wrote: Why would you need galera-4 as a build require? This is the MariaDB recommendation in https://mariadb.com/kb/en

Re: [slurm-users] RPM build error - accounting_storage_mysql.so

2019-11-11 Thread Daniel Letai
Why would you need galera-4 as a build require? If it's required by any of the mariadb packages, it'll get pulled automatically. If not, you don't need it on the build system. On 11/11/19 10:56 PM, Ole Holm Nielsen wrote: Hi William,

Re: [slurm-users] How to find core count per job per node

2019-10-21 Thread Daniel Letai
I can't test this right now, but possibly squeue -j <jobid> -O 'name,nodes,tres-per-node,sct' From the squeue man page https://slurm.schedmd.com/squeue.html: sct     Number of requested sockets, cores, and threads (S:C:T) per node for the job. When (S:C:T

[slurm-users] Using swap for gang mode suspended jobs only

2019-10-13 Thread Daniel Letai
Hi, I'd like to allow job suspension in my cluster, without the "penalty" of RAM utilization. The jobs are sometimes very big and can require ~100GB mem on each node. Suspending such a job would usually mean almost nothing else can run on the same node, ex

Re: [slurm-users] How can jobs request a minimum available (free) TmpFS disk space?

2019-09-14 Thread Daniel Letai
Make tmpfs a TRES, and have NHC update that as in: scontrol update nodename=... gres=tmpfree:$(stat -f /tmp -c "%f*%S" | bc). Replace /tmp with your tmpfs mount. You'll have to define that TRES in slurm.conf and gres.conf as usual (start with count=1 and
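A sketch of the pieces that suggestion implies; node names and paths are illustrative, and the tmpfree GRES name comes from the command above:

    # slurm.conf
    GresTypes=tmpfree
    NodeName=node[01-10] Gres=tmpfree:1 ...

    # gres.conf on each node
    Name=tmpfree Count=1

    # periodic NHC / cron hook publishing the free bytes under /tmp
    scontrol update nodename=$(hostname -s) gres=tmpfree:$(stat -f /tmp -c '%f*%S' | bc)

Jobs would then request the space they need with --gres=tmpfree:<bytes>.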

Re: [slurm-users] Different Memory Nodes

2019-09-08 Thread Daniel Letai
Just a quick FYI - using gang mode preemption would mean the available memory would be lower, so if the preempting job requires the entire node memory, this will be an issue. On 9/4/19 8:51 PM, Tina Fora wrote: Thanks Brian! I'll take a

Re: [slurm-users] Usage splitting

2019-09-01 Thread Daniel Letai
Wouldn't fairshare with a 90/10 split achieve this? This will require that accounting is set up in your cluster, with the following parameters: In slurm.conf set AccountingStorageEnforce=associations # And possibly '...,limits,qos,safe' as require
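Setting the 90/10 split itself is then a pair of association changes, for example (account names are placeholders):

    sacctmgr modify account where name=groupA set fairshare=90
    sacctmgr modify account where name=groupB set fairshare=10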

Re: [slurm-users] sacctmgr dump question - how can I dump entities other than cluster?

2019-08-12 Thread Daniel Letai
onfig and load it. If you don't want to do that,  then just use the sacctmgr modify option. Cheers, Barbara On 8/5/19 12:02 PM, Daniel Letai wrote: The documentati

[slurm-users] sacctmgr dump question - how can I dump entities other than cluster?

2019-08-05 Thread Daniel Letai
The documentation clearly states: "dump: Dump cluster data to the specified file. If the filename is not specified it uses the clustername.cfg filename by default." However, the only entity sacctmgr dump seems to a

Re: [slurm-users] Slurm configuration

2019-08-05 Thread Daniel Letai
Hi. On 8/3/19 12:37 AM, Sistemas NLHPC wrote: Hi all, Currently we have two types of nodes, one with 192 GB and another with 768 GB of RAM; it is required that on the 768 GB nodes it is not allowed to execute tasks

Re: [slurm-users] Unexpected MPI process distribution with the --exclusive flag

2019-07-30 Thread Daniel Letai
On 7/30/19 6:03 PM, Brian Andrus wrote: I think this may be more on how you are calling mpirun and the mapping of processes. With the "--exclusive" option, the processes are given access to all the cores on each box, so mpirun has a choic

Re: [slurm-users] Can I use the manager as compute node

2019-07-30 Thread Daniel Letai
Yes, just add it to the Nodes= list of the partition. You will have to install slurm-slurmd on it as well, and enable and start slurmd as on any compute node, or the node will show as DOWN. HTH, --Dani_L. On 7/30/19 3:45 PM, wodel youchi wrote:
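Concretely, the controller host then appears both as the controller and as a node; a sketch with placeholder names and sizes:

    # slurm.conf
    SlurmctldHost=head01
    NodeName=head01 CPUs=16 RealMemory=64000 State=UNKNOWN
    PartitionName=work Nodes=head01,node[01-10] Default=YES State=UP

    # on head01, in addition to slurmctld:
    systemctl enable --now slurmd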

Re: [slurm-users] Weekend Partition

2019-07-23 Thread Daniel Letai
I would use a partition with very low priority and preemption. General cluster conf: PreemptType=preempt/partition_prio PreemptMode=Cancel # Anything except 'Off' Partition definition: PartitionName=weekend PreemptMode=Cancel MaxTime=Unlimited
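Since preempt/partition_prio decides by PriorityTier, a fuller sketch would look like this (partition names and node lists are illustrative):

    # slurm.conf
    PreemptType=preempt/partition_prio
    PreemptMode=CANCEL
    PartitionName=normal  Nodes=node[01-10] PriorityTier=10 Default=YES
    PartitionName=weekend Nodes=node[01-10] PriorityTier=1  PreemptMode=CANCEL MaxTime=UNLIMITED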

Re: [slurm-users] [pmix] [Cross post - Slurm, PMIx, UCX] Using srun with SLURM_PMIX_DIRECT_CONN_UCX=true fails with input/output error

2019-07-10 Thread Daniel Letai
and get back to you. Best regards, Artem Y. Polyakov, PhD, Senior Architect, SW, Mellanox Technologies From: p...@googlegroups.com on behalf of Daniel Letai Sent: Tuesday, July 9, 2019 3:25:22

[slurm-users] [Cross post - Slurm, PMIx, UCX] Using srun with SLURM_PMIX_DIRECT_CONN_UCX=true fails with input/output error

2019-07-09 Thread Daniel Letai
Cross posting to Slurm, PMIx and UCX lists. Trying to execute a simple openmpi (4.0.1) mpi-hello-world via Slurm (19.05.0) compiled with both PMIx (3.1.2) and UCX (1.5.0) results in: [root@n1 ~]# SLURM_PMIX_DIRECT_CONN_UCX=true SLURM_PMI

Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-11 Thread Daniel Letai
I had similar problems in the past. The 2 most common issues were: 1. Controller load - if the slurmctld was in heavy use, it sometimes didn't respond in a timely manner, exceeding the timeout limit. 2. Topology and message forwarding and aggregation. For
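If controller load turns out to be the cause, the first knob usually reached for is the message timeout in slurm.conf (the value below is illustrative):

    # slurm.conf
    MessageTimeout=30    # default is 10 seconds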

Re: [slurm-users] SLURM heterogeneous jobs, a little help needed plz

2019-03-21 Thread Daniel Letai
Hi Loris, On 3/21/19 6:21 PM, Loris Bennett wrote: Chris, maybe you should look at EasyBuild (https://easybuild.readthedocs.io/en/latest/). That way you can install all the dependencies (such as zlib) as modules and be pretty much independent of

Re: [slurm-users] Sharing a node with non-gres and gres jobs

2019-03-21 Thread Daniel Letai
Hi Peter, On 3/20/19 11:19 AM, Peter Steinbach wrote: [root@ernie /]# scontrol show node -dd g1 NodeName=g1 CoresPerSocket=4    CPUAlloc=3 CPUTot=4 CPULoad=N/A    AvailableFeatures=(null)    ActiveFeat

Re: [slurm-users] problems with slurm and openmpi

2019-03-12 Thread Daniel Letai
Hi. On 12/03/2019 22:53:36, Riccardo Veraldi wrote: Hello, after trynig hard for over 10 days I am forced to write to the list.

[slurm-users] pmix and ucx versions compatibility with slurm

2019-02-26 Thread Daniel Letai
Hi all, Is there any issue regarding which versions of PMIx or UCX Slurm is compiled with? Should I require installation of the same versions on the compute nodes? I couldn't find any documentation regarding which API from PMIx or UCX Slurm is us

Re: [slurm-users] Visualisation -- Slurm and (Turbo)VNC

2019-01-03 Thread Daniel Letai
, David -- Regards, Daniel Letai +972 (0)505 870 456

Re: [slurm-users] Can frequent hold-release adversely affect slurm?

2018-10-19 Thread Daniel Letai
On 18/10/2018 20:34, Eli V wrote: On Thu, Oct 18, 2018 at 1:03 PM Daniel Letai wrote: Hello all, To solve a requirement where a large number of job arrays (~10k arrays, each with at most 8M elements) with same priority should be executed

[slurm-users] Can frequent hold-release adversely affect slurm?

2018-10-18 Thread Daniel Letai
Hello all, To solve a requirement where a large number of job arrays (~10k arrays, each with at most 8M elements) with the same priority should be executed with minimal starvation of any array - we don't want to wait for each array to complete before
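The hold/release traffic in question is just scontrol calls against whole arrays, for example (job IDs are placeholders):

    scontrol hold 1001       # park the pending elements of array 1001
    scontrol release 1002    # make array 1002 eligible for scheduling again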

Re: [slurm-users] Is it possible to select the BatchHost for a job through some sort of prolog script?

2018-07-09 Thread Daniel Letai
On 06/07/2018 10:22, Steffen Grunewald wrote: On Fri, 2018-07-06 at 07:47:16 +0200, Loris Bennett wrote: Hi Tim, Tim Lin writes: As the title suggests, I’m searching for a way to have tighter control of which node the batch script gets executed on. In my case it’s very hard to know which