You'd have to do this within e.g. the system's bashrc infrastructure. The
simplest idea would be to add a file like /etc/profile.d/zzz-slurmstats.sh and
have some canned commands/scripts run from it. That does introduce load on the
system and Slurm on every login, though, and slows the startup of logins.
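A minimal sketch of what such a snippet could look like (the squeue call and
its format string are purely illustrative, not anything we actually run):

# /etc/profile.d/zzz-slurmstats.sh -- hypothetical example
# Only bother for interactive shells, and fail quietly if Slurm is unreachable.
if [ -n "$PS1" ] && command -v squeue >/dev/null 2>&1; then
    echo "Your current jobs:"
    squeue -u "$USER" -o "%.12i %.9P %.20j %.2t %.10M" 2>/dev/null | head -n 10
fi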
I can confirm on a freshly-installed RockyLinux 9.4 system, the dbus-devel
package was not installed by default. The Development Tools group doesn't pull
it in, either:
# dnf repoquery --groupmember dbus-devel
Last metadata expiration check: 2:04:16 ago on Tue 16 Jul 2024 12:02:50 PM EDT.
dbus-devel-1:1.12.20-8.el9.i686
dbus-devel-1:1.12.20-8.el9.x86_64
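So you have to ask for it explicitly, e.g.:

# install the D-Bus development headers (dnf picks the arch-appropriate build)
dnf install dbus-devel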
> AFAIK, the fs.file-max limit is a node-wide limit, whereas "ulimit -n"
> is per user.
The ulimit is a frontend to rusage limits, which are per-process restrictions
(not per-user).
The fs.file-max is the kernel's limit on how many file descriptors can be open
in aggregate across the whole node. You'd have to edit it via sysctl (e.g. in
/etc/sysctl.conf or a drop-in under /etc/sysctl.d) to change it.
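For example, to compare the two limits and raise the node-wide one (the number
here is purely illustrative):

# per-process soft limit on open file descriptors (RLIMIT_NOFILE)
ulimit -n
# node-wide aggregate limit on open file descriptors
sysctl fs.file-max
# raise the node-wide limit for the running kernel
sysctl -w fs.file-max=2097152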
https://github.com/dun/munge/issues/94
The NEWS file claims this was fixed in 0.5.15. Since your log doesn't show the
additional strerror() output you're definitely running an older version,
correct?
If you go on one of the affected nodes and do an `lsof -p ` I'm
betting you'll find a long list of open file descriptors.
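Something along these lines should show it (assuming munged is the daemon in
question on those nodes):

# count the open descriptors held by the munged process
lsof -p "$(pidof munged)" | wc -l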
to get this information, but this seems a bit unclean. Anyway, if I
> find some time I will try it out.
> Best,
> Tim
> On 2/6/24 16:30, Jeffrey T Frey wrote:
>> Most of my ideas have revolved around creating file systems on-the-fly as
>> part of the job prolog and destroyi
Most of my ideas have revolved around creating file systems on-the-fly as part
of the job prolog and destroying them in the epilog. The issue with that
mechanism is that formatting a file system (e.g. mkfs.<fstype>) can be
time-consuming. E.g. formatting your local scratch SSD as an LVM PV+VG and
then carving out and formatting a per-job LV on it all takes time.
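As a rough illustration of that prolog/epilog idea (the VG name, size, and
mount point are made up, and all of the error handling you'd really want is
omitted):

# --- job prolog (runs as root on the node) ---
lvcreate -L 100G -n "job_${SLURM_JOB_ID}" scratchvg      # carve an LV from a pre-built VG
mkfs.xfs -q "/dev/scratchvg/job_${SLURM_JOB_ID}"         # the time-consuming part
mkdir -p "/scratch/job_${SLURM_JOB_ID}"
mount "/dev/scratchvg/job_${SLURM_JOB_ID}" "/scratch/job_${SLURM_JOB_ID}"

# --- job epilog ---
umount "/scratch/job_${SLURM_JOB_ID}"
lvremove -f "/dev/scratchvg/job_${SLURM_JOB_ID}"
rmdir "/scratch/job_${SLURM_JOB_ID}"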
> On the automation part, it would be pretty easy to do regularly(daily?) stats
> of jobs for that period of time and dump them into an sql database.
> Then a select statement where cpu_efficiency is less than desired value and
> get the list of not so nice users on which you can apply whatever
In case you're developing the plugin in C and not LUA, behind the scenes the
LUA mechanism is concatenating all log_user() strings into a single variable
(user_msg). When the LUA code completes, the C code sets the *err_msg argument
of the job_submit()/job_modify() function to that string, and that message is
then returned to the submitting client and displayed to the user.
> I get that these correspond
>
> --exclusive=user    export SBATCH_EXCLUSIVE=user
> --exclusive=mcs export SBATCH_EXCLUSIVE=mcs
> But --exclusive has a default behavior if I don't assign it a value. What do
> I set SBATCH_EXCLUSIVE to, to get the same default behavior?
Try setting
ll
> " and get the apptainer prompt. If I prefix that command with "srun",
> then it just hangs and I never get the prompt. So that seems to be the
> sticking point. I'll have to do some experiments running singularity with
> srun.
>
> From: slurm-users
> The remaining issue then is how to put them into an allocation that is
> actually running a singularity container. I don't get how what I'm doing now
> is resulting in an allocation where I'm in a container on the submit node
> still!
Try prefixing the singularity command with "srun" e.g.
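(the image path here is just a placeholder):

# from inside the allocation, run the container shell as a job step on a node
srun --pty singularity shell /path/to/image.sif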
If you examine the process hierarchy, that "sleep 1" process is
probably the child of a "slurmstepd: [.extern]" process. This is a
housekeeping step launched for the job by slurmd -- in older Slurm releases it
would handle the X11 forwarding, for example. It should have no impact on the
job itself.
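Something like this on the compute node will show that parentage:

# full process tree; the sleep should appear under "slurmstepd: [<jobid>.extern]"
ps -eo pid,ppid,args --forest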
You've confirmed my suspicion — no one seems to care for Slurm's standard
output formats :-) At UD we did a Python curses wrapper around the parseable
output to turn the terminal window into a navigable spreadsheet of output:
https://gitlab.com/udel-itrci/slurm-output-wrappers
> On Aug 25,
> I understand that there is no output file to write an error message to, but
> it might be good to check the `--output` path during the scheduling, just
> like `--account` is checked.
>
> Does anybody know a workaround to be warned about the error?
I would make a feature request of SchedMD to add such a check.
Did those four jobs
6577272_21 scavenger PD 0:00 1 (Priority)
6577272_22 scavenger PD 0:00 1 (Priority)
6577272_23 scavenger PD 0:00 1 (Priority)
6577272_28 scavenger PD 0:00 1 (Priority)
run before and get requeued? Seems
is to publish the
packages to a unique repository: those who want the pre-built packages
explicitly configure their YUM to pull from that repository, while those who
have EPEL configured (which is a LOT of us) don't get overlapping Slurm
packages interfering with their local builds.
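The opt-in would look roughly like this (repo id, URL, and GPG key path are all
placeholders):

# /etc/yum.repos.d/slurm-prebuilt.repo -- hypothetical example
cat > /etc/yum.repos.d/slurm-prebuilt.repo <<'EOF'
[slurm-prebuilt]
name=Pre-built Slurm packages
baseurl=https://example.org/slurm/el9/$basearch/
enabled=1
gpgcheck=1
gpgkey=https://example.org/slurm/RPM-GPG-KEY-slurm
EOF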
:::::
Requested node
configuration is not available
My syntax agrees with the 20.11.1 documentation (online and man pages) so it
seems correct — and it works fine in 17.11.8. Any ideas?
::::::::::
Jeffrey T. Frey, Ph.D.
Systems Programmer V / Cluster Man
It's in the github commits:
https://github.com/SchedMD/slurm/commit/8e84db0f01ecd4c977c12581615d74d59b3ff995
The primary issue is that any state the client program established on the
connection after first making it (e.g. opening a transaction, creating temp
tables) won't be present if MySQL drops the connection and it has to be
re-established.
From the NVIDIA docs re: MPS:
On systems with a mix of Volta / pre-Volta GPUs, if the MPS server is set to
enumerate any Volta GPU, it will discard all pre-Volta GPUs. In other words,
the MPS server will either operate only on the Volta GPUs and expose Volta
capabilities, or operate only on the pre-Volta GPUs.
Making the certificate globally-available on the host may not always be
permissible. If I were you, I'd write/suggest a modification to the plugin to
make the CA path (CURLOPT_CAPATH) and verification itself
(CURLOPT_SSL_VERIFYPEER) configurable in Slurm. They are both straightforward
libcurl options to expose.
On our HPC systems we have a lot of users attempting to organize job arrays for
varying purposes -- parameter scans, SSMD (Single-Script, Multiple Datasets).
We eventually wrote an abstract utility to try to help them with the process:
https://github.com/jtfrey/job-templating-tool
May be of use to you.
If you check the source up on Github, that's more of a warning produced when
you didn't specify a CPU count and it's going to calculate from the
socket-core-thread numbers (src/common/read_config.c):
/* Node boards are factored into sockets */
if ((n->cpus != n-
Is the time on that node too far out-of-sync w.r.t. the slurmctld server?
> On Jun 11, 2020, at 09:01 , navin srivastava wrote:
>
> I tried by executing the debug mode but there also it is not writing anything.
>
> i waited for about 5-10 minutes
>
> deda1x1452:/etc/sysconfig # /usr/sbin/slu
; Durai
>
> On Mon, Jun 8, 2020 at 5:55 PM Jeffrey T Frey wrote:
> User home directories are on a shared (NFS) filesystem that's mounted on
> every node. Thus, they have the same id_rsa key and authorized_keys file
> present on all nodes.
>
>
>
>
> > On Jun
An MPI library with tight integration with Slurm (e.g. Intel MPI, Open MPI) can
use "srun" to start the remote workers. In some cases "srun" can be used
directly for MPI startup (e.g. "srun" instead of "mpirun").
Other/older MPI libraries that start remote processes using "ssh" would,
naturally, need that passwordless-ssh setup in order to reach the other nodes
of the job.
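For example, in a batch script with a tightly-integrated MPI (treat the
--mpi=pmi2 choice and the program name as placeholders that depend on how your
MPI and Slurm were built):

# let srun start the ranks directly
srun --mpi=pmi2 ./my_mpi_program

# ...or let mpirun do it; it still discovers the node list from the allocation
mpirun -np $SLURM_NTASKS ./my_mpi_program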
Use netstat to list listening ports on the box (netstat -ln) and see if it
shows up as tcp6 or tcp. On our (older) 17.11.8 server:
$ netstat -ln | grep :6817
tcp        0      0 0.0.0.0:6817            0.0.0.0:*               LISTEN
$ nc -6 :: 6817
Ncat: Connection refused.
$ nc -4 localhost 6817
You could also choose to propagate the signal to the child process of
test.slurm yourself:
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --ntasks-per-node=1
#SBATCH --nodes=1
#SBATCH --time=00:03:00
#SBATCH --signal=B:SIGINT@30
# This example works, but I need it to work without "B:" in --signal
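A minimal sketch of doing that forwarding yourself while keeping the B: form
(the payload command is just a stand-in):

# launch the real work in the background so the batch shell can catch signals
srun ./my_long_running_task &
child=$!

# --signal=B:SIGINT@30 delivers SIGINT to this shell; pass it along to the child
trap 'kill -INT "$child"' INT

# the first wait returns when the trap fires; wait again to collect the child
wait "$child"
wait "$child"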
hat you're expecting :-)
> On Apr 10, 2020, at 12:59 , Jeffrey T Frey wrote:
>
> Are you certain your PATH addition is correct? The "-np" flag is still
> present in a build of Open MPI 4.
Are you certain your PATH addition is correct? The "-np" flag is still
present in a build of Open MPI 4.0.3 I just made, in fact:
$ 4.0.3/bin/mpirun
--
mpirun could not find anything to do.
It is possible that you forgot to specify how many processes to run
via the "-np" argument.
Did you reuse the 20.02 select/cons_res/Makefile.{in,am} in your plugin's
source? You probably will have to re-model your plugin after the
select/cray_aries plugin if you need to override those two functions (it also
defines its own select_p_job_begin() and doesn't link against
libcons_common).
> So the answer then is to either kludge the keys by making symlinks to the
> cluster and cluster.pub files warewulf makes (I tried this already and I know
> it works), or to update to the v19.x release and the new style x11 forwarding.
Our answer was to create RSA keys for all users in their ~/.ssh directories.
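Roughly like this, run once per user against the shared home filesystem (the
key paths match what the plugin expects; the rest is just one way to do it):

# create a passphrase-less RSA key if the user doesn't already have one
[ -f ~/.ssh/id_rsa ] || ssh-keygen -q -t rsa -N "" -f ~/.ssh/id_rsa
# authorize that key so it is accepted on every node sharing the home directory
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys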
The Slurm-native X11 plugin demands you use ~/.ssh/id_rsa{,.pub} keys. It's
hard-coded into the plugin:
/*
* Ideally these would be selected at run time. Unfortunately,
* only ssh-rsa and ssh-dss are supported by libssh2 at this time,
* and ssh-dss is deprecated.
*/
static char *hostkey_pri
Does your Slurm cgroup or node OS cgroup configuration limit the virtual
address space of processes? The "Error memory mapping" is thrown by blast when
trying to create a virtual address space that exposes the contents of a file on
disk (see "man mmap") so the file can be accessed via pointers
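A quick way to check from inside a job on one of those nodes (the srun line is
just one way to get an interactive shell there):

srun --pty /bin/bash
ulimit -v    # per-process virtual address space cap in KB, "unlimited" if none
ulimit -a    # the full rlimit picture
# also worth reviewing VSizeFactor in slurm.conf and the cgroup memory settings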
Open MPI matches available hardware in node(s) against its compiled-in
capabilities. Those capabilities are expressed as modular shared libraries
(see e.g. $PREFIX/lib64/openmpi). You can use environment variables or
command-line flags to influence which modules get used for specific purposes.
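For example, to restrict the point-to-point transports (the btl list here is
only illustrative; pick components that match your hardware):

# command-line form
mpirun --mca btl self,vader,tcp ./my_app

# environment-variable form with the same effect
export OMPI_MCA_btl=self,vader,tcp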
If you check the applicable code in src/slurmd/slurmstepd/task.c, TMPDIR is set
to "/tmp" if it's not already set in the job environment and then TMPDIR is
created if permissible. It's your responsibility to set TMPDIR -- e.g. we have
a plugin we wrote (autotmp) to set TMPDIR to per-job and per-step directories.
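If you don't want to write a plugin, a TaskProlog script can do a simplified
version of the same thing (the directory layout here is made up):

#!/bin/bash
# TaskProlog: lines of the form "export VAR=value" printed to stdout are
# injected into the task's environment.
dir="/tmp/job_${SLURM_JOB_ID}"
mkdir -p "$dir"
echo "export TMPDIR=$dir"
# cleanup of $dir would belong in an Epilog script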
The identifier after the base numeric job id -- e.g. "batch" -- is the job
step. The "batch" step is where your job script executes. Each time your job
script calls "srun" a new numerical step is created, e.g. "82.0," "82.1," et
al. Job accounting captures information for the entire job (JobID with no step
suffix) as well as for each individual step.
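e.g. for the job id used above (82 is just that example's id):

# the job, its batch step, and each srun-launched step appear as separate rows
sacct -j 82 --format=JobID,JobName,AllocCPUS,Elapsed,MaxRSS,State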
https://github.com/SchedMD/slurm/blob/master/src/slurmctld/read_config.c
Line 2511 -- if the node has been scheduled exclusively, this field is set to
the uid of the user whose job(s) occupy the node.
> On Aug 8, 2018, at 18:14 , Ryan Novosielski wrote:
>
> Does anyone have any idea or a pointer?
See:
https://github.com/SchedMD/slurm/blob/master/src/slurmd/slurmstepd/mgr.c
Circa line 1072 the comment explains:
/*
* Need to exec() something for proctrack/linuxproc to
* work, it will not keep a process name