You'd have to do this within e.g. the system's bashrc infrastructure. The
simplest idea would be to add a file like /etc/profile.d/zzz-slurmstats.sh and
have some canned commands/scripts run from it. That does introduce load on the
system and Slurm on every login, though, and slows the startup of logins.
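A minimal sketch of what such a snippet could look like (the squeue call and
its format string are purely illustrative, not anything we actually run):

# /etc/profile.d/zzz-slurmstats.sh -- hypothetical example
# Only bother for interactive shells, and fail quietly if Slurm is unreachable.
if [ -n "$PS1" ] && command -v squeue >/dev/null 2>&1; then
    echo "Your current jobs:"
    squeue -u "$USER" -o "%.12i %.9P %.20j %.2t %.10M" 2>/dev/null | head -n 10
fi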
I can confirm on a freshly-installed RockyLinux 9.4 system, the dbus-devel
package was not installed by default. The Development Tools group doesn't pull
it in, either:
# dnf repoquery --groupmember dbus-devel
Last metadata expiration check: 2:04:16 ago on Tue 16 Jul 2024 12:02:50 PM EDT.
dbus-devel-1:1.12.20-8.el9.i686
dbus-devel-1:1.12.20-8.el9.x86_64
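So you have to ask for it explicitly, e.g.:

# install the D-Bus development headers (dnf picks the arch-appropriate build)
dnf install dbus-devel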
> AFAIK, the fs.file-max limit is a node-wide limit, whereas "ulimit -n"
> is per user.
The ulimit is a frontend to rusage limits, which are per-process restrictions
(not per-user).
The fs.file-max is the kernel's limit on how many file descriptors can be open
in aggregate across the whole node. You'd have to edit it via sysctl (e.g. in
/etc/sysctl.conf or a drop-in under /etc/sysctl.d) to change it.
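For example, to compare the two limits and raise the node-wide one (the number
here is purely illustrative):

# per-process soft limit on open file descriptors (RLIMIT_NOFILE)
ulimit -n
# node-wide aggregate limit on open file descriptors
sysctl fs.file-max
# raise the node-wide limit for the running kernel
sysctl -w fs.file-max=2097152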
https://github.com/dun/munge/issues/94
The NEWS file claims this was fixed in 0.5.15. Since your log doesn't show the
additional strerror() output you're definitely running an older version,
correct?
If you go on one of the affected nodes and do an `lsof -p ` I'm
betting you'll find a long list of open file descriptors.
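Something along these lines should show it (assuming munged is the daemon in
question on those nodes):

# count the open descriptors held by the munged process
lsof -p "$(pidof munged)" | wc -l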
to get this information, but this seems a bit unclean. Anyway, if I
> find some time I will try it out.
> Best,
> Tim
> On 2/6/24 16:30, Jeffrey T Frey wrote:
>> Most of my ideas have revolved around creating file systems on-the-fly as
>> part of the job prolog and destroyi
Most of my ideas have revolved around creating file systems on-the-fly as part
of the job prolog and destroying them in the epilog. The issue with that
mechanism is that formatting a file system (e.g. mkfs.<fstype>) can be
time-consuming. E.g. formatting your local scratch SSD as an LVM PV+VG and
then carving out and formatting a per-job LV on it all takes time.
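As a rough illustration of that prolog/epilog idea (the VG name, size, and
mount point are made up, and all of the error handling you'd really want is
omitted):

# --- job prolog (runs as root on the node) ---
lvcreate -L 100G -n "job_${SLURM_JOB_ID}" scratchvg      # carve an LV from a pre-built VG
mkfs.xfs -q "/dev/scratchvg/job_${SLURM_JOB_ID}"         # the time-consuming part
mkdir -p "/scratch/job_${SLURM_JOB_ID}"
mount "/dev/scratchvg/job_${SLURM_JOB_ID}" "/scratch/job_${SLURM_JOB_ID}"

# --- job epilog ---
umount "/scratch/job_${SLURM_JOB_ID}"
lvremove -f "/dev/scratchvg/job_${SLURM_JOB_ID}"
rmdir "/scratch/job_${SLURM_JOB_ID}"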
> On the automation part, it would be pretty easy to do regularly(daily?) stats
> of jobs for that period of time and dump them into an sql database.
> Then a select statement where cpu_efficiency is less than desired value and
> get the list of not so nice users on which you can apply whatever
In case you're developing the plugin in C and not LUA, behind the scenes the
LUA mechanism is concatenating all log_user() strings into a single variable
(user_msg). When the LUA code completes, the C code sets the *err_msg argument
of the job_submit()/job_modify() function to that string, and that message is
then returned to the submitting client and displayed to the user.
> I get that these correspond
>
> --exclusive=user    export SBATCH_EXCLUSIVE=user
> --exclusive=mcs export SBATCH_EXCLUSIVE=mcs
> But --exclusive has a default behavior if I don't assign it a value. What do
> I set SBATCH_EXCLUSIVE to, to get the same default behavior?
Try setting
ll
> " and get the apptainer prompt. If I prefix that command with "srun",
> then it just hangs and I never get the prompt. So that seems to be the
> sticking point. I'll have to do some experiments running singularity with
> srun.
>
> From: slurm-users
> The remaining issue then is how to put them into an allocation that is
> actually running a singularity container. I don't get how what I'm doing now
> is resulting in an allocation where I'm in a container on the submit node
> still!
Try prefixing the singularity command with "srun" e.g.
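(the image path here is just a placeholder):

# from inside the allocation, run the container shell as a job step on a node
srun --pty singularity shell /path/to/image.sif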
If you examine the process hierarchy, that "sleep 1" process is
probably the child of a "slurmstepd: [.extern]" process. This is a
housekeeping step launched for the job by slurmd -- in older Slurm releases it
would handle the X11 forwarding, for example. It should have no impact on the
job itself.
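Something like this on the compute node will show that parentage:

# full process tree; the sleep should appear under "slurmstepd: [<jobid>.extern]"
ps -eo pid,ppid,args --forest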
You've confirmed my suspicion — no one seems to care for Slurm's standard
output formats :-) At UD we did a Python curses wrapper around the parseable
output to turn the terminal window into a navigable spreadsheet of output:
https://gitlab.com/udel-itrci/slurm-output-wrappers
> On Aug 25,
> I understand that there is no output file to write an error message to, but
> it might be good to check the `--output` path during the scheduling, just
> like `--account` is checked.
>
> Does anybody know a workaround to be warned about the error?
I would make a feature request of SchedMD to add such a check.
Did those four jobs
6577272_21 scavenger PD 0:00 1 (Priority)
6577272_22 scavenger PD 0:00 1 (Priority)
6577272_23 scavenger PD 0:00 1 (Priority)
6577272_28 scavenger PD 0:00 1 (Priority)
run before and get requeued? Seems
is to publish the
packages to a unique repository: those who want the pre-built packages
explicitly configure their YUM to pull from that repository, while those who
have EPEL configured (which is a LOT of us) don't get overlapping Slurm
packages interfering with their local builds.
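The opt-in would look roughly like this (repo id, URL, and GPG key path are all
placeholders):

# /etc/yum.repos.d/slurm-prebuilt.repo -- hypothetical example
cat > /etc/yum.repos.d/slurm-prebuilt.repo <<'EOF'
[slurm-prebuilt]
name=Pre-built Slurm packages
baseurl=https://example.org/slurm/el9/$basearch/
enabled=1
gpgcheck=1
gpgkey=https://example.org/slurm/RPM-GPG-KEY-slurm
EOF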
:::::
Requested node
configuration is not available
My syntax agrees with the 20.11.1 documentation (online and man pages) so it
seems correct — and it works fine in 17.11.8. Any ideas?
::::::::::
Jeffrey T. Frey, Ph.D.
Systems Programmer V / Cluster Man
It's in the github commits:
https://github.com/SchedMD/slurm/commit/8e84db0f01ecd4c977c12581615d74d59b3ff995
The primary issue is that any state the client program established on the
connection after first making it (e.g. opening a transaction, creating temp
tables) won't be present if MySQL drops the connection and it has to be
re-established.
From the NVIDIA docs re: MPS:
On systems with a mix of Volta / pre-Volta GPUs, if the MPS server is set to
enumerate any Volta GPU, it will discard all pre-Volta GPUs. In other words,
the MPS server will either operate only on the Volta GPUs and expose Volta
capabilities, or operate only on the pre-Volta GPUs.
Making the certificate globally-available on the host may not always be
permissible. If I were you, I'd write/suggest a modification to the plugin to
make the CA path (CURLOPT_CAPATH) and verification itself
(CURLOPT_SSL_VERIFYPEER) configurable in Slurm. They are both straightforward
libcurl options to expose.
On our HPC systems we have a lot of users attempting to organize job arrays for
varying purposes -- parameter scans, SSMD (Single-Script, Multiple Datasets).
We eventually wrote an abstract utility to try to help them with the process:
https://github.com/jtfrey/job-templating-tool
May be of use to you.
If you check the source up on Github, that's more of a warning produced when
you didn't specify a CPU count and it's going to calculate from the
socket-core-thread numbers (src/common/read_config.c):
/* Node boards are factored into sockets */
if ((n->cpus != n-
Is the time on that node too far out-of-sync w.r.t. the slurmctld server?
> On Jun 11, 2020, at 09:01 , navin srivastava wrote:
>
> I tried by executing the debug mode but there also it is not writing anything.
>
> i waited for about 5-10 minutes
>
> deda1x1452:/etc/sysconfig # /usr/sbin/slu
; Durai
>
> On Mon, Jun 8, 2020 at 5:55 PM Jeffrey T Frey wrote:
> User home directories are on a shared (NFS) filesystem that's mounted on
> every node. Thus, they have the same id_rsa key and authorized_keys file
> present on all nodes.
>
>
>
>
> > On Jun
An MPI library with tight integration with Slurm (e.g. Intel MPI, Open MPI) can
use "srun" to start the remote workers. In some cases "srun" can be used
directly for MPI startup (e.g. "srun" instead of "mpirun").
Other/older MPI libraries that start remote processes using "ssh" would,
naturally, need that passwordless-ssh setup in order to reach the other nodes
of the job.
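For example, in a batch script with a tightly-integrated MPI (treat the
--mpi=pmi2 choice and the program name as placeholders that depend on how your
MPI and Slurm were built):

# let srun start the ranks directly
srun --mpi=pmi2 ./my_mpi_program

# ...or let mpirun do it; it still discovers the node list from the allocation
mpirun -np $SLURM_NTASKS ./my_mpi_program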
Use netstat to list listening ports on the box (netstat -ln) and see if it
shows up as tcp6 or tcp. On our (older) 17.11.8 server:
$ netstat -ln | grep :6817
tcp        0      0 0.0.0.0:6817            0.0.0.0:*               LISTEN
$ nc -6 :: 6817
Ncat: Connection refused.
$ nc -4 localhost 6817
You could also choose to propagate the signal to the child process of
test.slurm yourself:
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --ntasks-per-node=1
#SBATCH --nodes=1
#SBATCH --time=00:03:00
#SBATCH --signal=B:SIGINT@30
# This example works, but I need it to work without "B:" in --signal
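A minimal sketch of doing that forwarding yourself while keeping the B: form
(the payload command is just a stand-in):

# launch the real work in the background so the batch shell can catch signals
srun ./my_long_running_task &
child=$!

# --signal=B:SIGINT@30 delivers SIGINT to this shell; pass it along to the child
trap 'kill -INT "$child"' INT

# the first wait returns when the trap fires; wait again to collect the child
wait "$child"
wait "$child"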
hat you're expecting :-)
> On Apr 10, 2020, at 12:59 , Jeffrey T Frey wrote:
>
> Are you certain your PATH addition is correct? The "-np" flag is still
> present in a build of Open MPI 4.
Are you certain your PATH addition is correct? The "-np" flag is still
present in a build of Open MPI 4.0.3 I just made, in fact:
$ 4.0.3/bin/mpirun
--
mpirun could not find anything to do.
It is possible that you forgot to specify how many processes to run
via the "-np" argument.
Did you reuse the 20.02 select/cons_res/Makefile.{in,am} in your plugin's
source? You probably will have to re-model your plugin after the
select/cray_aries plugin if you need to override those two functions (it also
defines its own select_p_job_begin() and doesn't link against
libcons_common).
> So the answer then is to either kludge the keys by making symlinks to the
> cluster and cluster.pub files warewulf makes (I tried this already and I know
> it works), or to update to the v19.x release and the new style x11 forwarding.
Our answer was to create RSA keys for all users in their ~/.ssh directories.
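Roughly like this, run once per user against the shared home filesystem (the
key paths match what the plugin expects; the rest is just one way to do it):

# create a passphrase-less RSA key if the user doesn't already have one
[ -f ~/.ssh/id_rsa ] || ssh-keygen -q -t rsa -N "" -f ~/.ssh/id_rsa
# authorize that key so it is accepted on every node sharing the home directory
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys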
The Slurm-native X11 plugin demands you use ~/.ssh/id_rsa{,.pub} keys. It's
hard-coded into the plugin:
/*
* Ideally these would be selected at run time. Unfortunately,
* only ssh-rsa and ssh-dss are supported by libssh2 at this time,
* and ssh-dss is deprecated.
*/
static char *hostkey_pri
Does your Slurm cgroup or node OS cgroup configuration limit the virtual
address space of processes? The "Error memory mapping" is thrown by blast when
trying to create a virtual address space that exposes the contents of a file on
disk (see "man mmap") so the file can be accessed via pointers
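A quick way to check from inside a job on one of those nodes (the srun line is
just one way to get an interactive shell there):

srun --pty /bin/bash
ulimit -v    # per-process virtual address space cap in KB, "unlimited" if none
ulimit -a    # the full rlimit picture
# also worth reviewing VSizeFactor in slurm.conf and the cgroup memory settings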
Open MPI matches available hardware in node(s) against its compiled-in
capabilities. Those capabilities are expressed as modular shared libraries
(see e.g. $PREFIX/lib64/openmpi). You can use environment variables or
command-line flags to influence which modules get used for specific purposes.
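For example, to restrict the point-to-point transports (the btl list here is
only illustrative; pick components that match your hardware):

# command-line form
mpirun --mca btl self,vader,tcp ./my_app

# environment-variable form with the same effect
export OMPI_MCA_btl=self,vader,tcp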
If you check the applicable code in src/slurmd/slurmstepd/task.c, TMPDIR is set
to "/tmp" if it's not already set in the job environment and then TMPDIR is
created if permissible. It's your responsibility to set TMPDIR -- e.g. we have
a plugin we wrote (autotmp) to set TMPDIR to per-job and per-step directories.
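If you don't want to write a plugin, a TaskProlog script can do a simplified
version of the same thing (the directory layout here is made up):

#!/bin/bash
# TaskProlog: lines of the form "export VAR=value" printed to stdout are
# injected into the task's environment.
dir="/tmp/job_${SLURM_JOB_ID}"
mkdir -p "$dir"
echo "export TMPDIR=$dir"
# cleanup of $dir would belong in an Epilog script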
The identifier after the base numeric job id -- e.g. "batch" -- is the job
step. The "batch" step is where your job script executes. Each time your job
script calls "srun" a new numerical step is created, e.g. "82.0," "82.1," et
al. Job accounting captures information for the entire job (JobID with no step
suffix) as well as for each individual step.
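e.g. for the job id used above (82 is just that example's id):

# the job, its batch step, and each srun-launched step appear as separate rows
sacct -j 82 --format=JobID,JobName,AllocCPUS,Elapsed,MaxRSS,State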
https://github.com/SchedMD/slurm/blob/master/src/slurmctld/read_config.c
Line 2511 -- if the node has been scheduled exclusively, this field is set to
the uid of the user whose job(s) occupy the node.
> On Aug 8, 2018, at 18:14 , Ryan Novosielski wrote:
>
> Does anyone have any idea or a pointer?
See:
https://github.com/SchedMD/slurm/blob/master/src/slurmd/slurmstepd/mgr.c
Circa line 1072 the comment explains:
/*
* Need to exec() something for proctrack/linuxproc to
* work, it will not keep a process name