/sys/fs/cgroup/system.slice/cgroup.subtree_control
/usr/sbin/slurmstepd infinity &
*From:*Josef Dvoracek via slurm-users
*Sent:* Thursday, April 11, 2024 11:14 AM
*To:* slurm-users@lists.schedmd.com
*Subject:* [slurm-users] Re: Slurmd enabled crash with CgroupV2
I observe the same behavior on slurm 23.11.5, Rocky Linux 8.9.
> [root@compute ~]# cat /sys/fs/cgroup/cgroup.subtree_control
> memory pids
> [root@compute ~]# systemctl disable slurmd
> Removed /etc/systemd/system/multi-user.target.wants/slurmd.service.
> [root@compute ~]# cat /sys/fs/cgroup/cgroup.su
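(The quick checks I'd do on such a node - not from this thread, just my own
habit - to compare what the kernel offers vs. what is delegated, and to enable
missing controllers by hand:

cat /sys/fs/cgroup/cgroup.controllers                          # controllers the kernel supports
cat /sys/fs/cgroup/cgroup.subtree_control                      # controllers delegated to children
echo "+cpu +cpuset" > /sys/fs/cgroup/cgroup.subtree_control    # enable more, if missing

)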
Is anybody here having a nice visualization of JobComp and JobacctGather
data in Grafana?
I save JobComp data in Elasticsearch and JobacctGather data in InfluxDB,
and I'm thinking about how to provide meaningful insights to $users.
Things I'd like to show: especially memory & CPU utilization, job r
I use telegraf (which supports "exporter" output format as well) to
capture cgroupsv2 job data:
https://github.com/jose-d/telegraf-configs/tree/master/slurm-cgroupsv2
I had to rework it when changing from cgroupsv1 to cgroupsv2, as the
format/structure of the text files changed a bit.
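For a rough idea, these are the kind of per-job cgroup v2 files such a
collector ends up reading (paths here are my assumption of a typical slurm
23.x layout, and the job id is a placeholder):

cat /sys/fs/cgroup/system.slice/slurmstepd.scope/job_12345/memory.current   # current memory usage of the job
cat /sys/fs/cgroup/system.slice/slurmstepd.scope/job_12345/cpu.stat         # cumulative cpu usage of the job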
cheers
jos
I think you need to set a reasonable "DefMemPerCPU" - otherwise jobs will
take all the memory by default, and there is no memory remaining for the
second job.
We calculated DefMemPerCPU in such a way that the default allocated
memory of a full node is slightly under half of the total node memory. So
there i
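(A made-up example of that arithmetic, numbers are not from my cluster: on a
hypothetical 64-core node with 256 GB RAM, half is 128 GB, so in slurm.conf

DefMemPerCPU=1900    # 64 cores * 1900 MB ~= 119 GB, slightly under half of 256 GB

keeps a default full-node allocation just under that half.)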
> I'm running slurm 22.05.11 which is available with OpenHCP 3.x
> Do you think an upgrade is needed?
I feel that a lot of slurm operators tend not to use 3rd party sources of
slurm binaries, as you do not have the build environment fully in your
hands.
But before making such a complex decision
I think installing/upgrading the "slurm" rpm will replace this shared lib.
Indeed, as always, test it first on a not-so-critical system, and use VM
snapshots to be able to travel back in time ... as once you upgrade the
DB schema (if that is part of the upgrade), AFAIK you cannot go back.
josef
On 28. 02. 24 15:51
I see this question unanswered so far, so I'll give you my 2 cents:
a quick check reveals that the mentioned symbol is in libslurmfull.so:
[root@slurmserver2 ~]# nm -gD /usr/lib64/slurm/libslurmfull.so | grep
"slurm_conf$"
000d2c06 T free_slurm_conf
000d3345 T init_slurm_conf
0
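(If you want to confirm which package actually owns that library - a generic
check, not something from the original mail:

rpm -qf /usr/lib64/slurm/libslurmfull.so    # prints the owning rpm, e.g. slurm-<version>

)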
Hi Dietmar;
I tried this on ${my cluster}, as I switched to cgroupsv2 quite recently..
I must say that on my setup it looks like it works as expected, see the
grepped stdout from your reproducer below.
I use recent slurm 23.11.4.
Wild guess.. Does your build machine have the bpf and dbus devel packages in
For some unclear reason "--wrap" was not part of my /repertoire/ so far.
thanks
On 26. 02. 24 9:47, Ward Poelmans via slurm-users wrote:
sbatch --wrap 'screen -D -m'
srun --jobid --pty screen -rd
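(For completeness, the way I read that recipe, with a placeholder job id:

sbatch --wrap 'screen -D -m'           # batch job whose only task is a detached screen on the compute node
srun --jobid 12345 --pty screen -rd    # later, from a login node: join the allocation and reattach that screen

)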
What is the recommended way to run longer interactive jobs at your systems?
Our how-to includes starting screen at a front-end node and running srun
with bash/zsh inside,
but that indeed creates a dependency between the login node (with screen) and
the compute node job.
On systems with multiple front-e
> Just looking for some feedback, please. Is this OK? Is there a better way?
> I'm tempted to spec all new HPCs with only a high speed (200Gbps) IB network,
Well you need Ethernet for OOB management (bmc/ipmi/ilo/whatever)
anyway.. or?
cheers
josef
On 25. 02. 24 21:12, Dan Healy via slur
Isn't your /softs.. filesystem e.g. some cluster network filesystem mount?
It happened to me multiple times that I was attempting to build some
scientific software, and because of building on top of BeeGFS (I think
hardlinks are not fully supported) or NFS (caching), I was getting
_interesti
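(A generic workaround sketch, not from the original mail: build on node-local
disk and only install back to the shared filesystem afterwards; paths and
package name are placeholders:

builddir=$(mktemp -d /tmp/build.XXXXXX)
cp -a ~/src/mysoftware "$builddir" && cd "$builddir/mysoftware"
./configure --prefix=/softs/mysoftware && make && make install

)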
My impression is that there are multiple challenges which make it hard to
create a good-for-all recent slurm RPM (see the build sketch after this list):
- NVML dependency - different sites use different NVML lib versions with
varying update cycles
- pmi* deps - some sites (like mine) are using only one reasonably recent
openpmix, I know
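For reference, the rebuild-on-site pattern itself is simple; what varies is
which devel packages (cuda-nvml-devel, pmix devel, etc.) are present on the
build host when configure autodetects them (a sketch, version is a placeholder):

rpmbuild -ta slurm-23.11.5.tar.bz2    # NVML/pmix support depends on devel packages installed at build time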
To protect against HW failure, and to have more freedom when upgrading the
underlying OS, we use virtualization with "live migration"/HA and run the
MariaDB server as a VM.
A VM is easy to back up, restore as a snapshot, clone for possible tests, etc.
In the past, I deployed (customer requirement) one site u
couldn't it be that the "cuda-nvml-devel" library was not installed when you
were building slurm?
cheers
josef
On 30. 11. 23 15:06, Ravi Konila wrote:
Hello,
My gres.conf has AutoDetect=nvml
when I restart slurmd service I do get
*fatal: We were configured to autodetect nvml functionality, but we
w
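(A quick way to check whether the slurmd you run was actually built with NVML
support - my own habit, not from this thread; the plugin path may differ on
your install:

ls /usr/lib64/slurm/gpu_nvml.so    # present only when slurm was built against NVML

)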
> can you please advice me on the monitoring tools, I
I'm _somewhat_ satisfied with:
Prometheus Slurm exporter - (
https://github.com/vpenso/prometheus-slurm-exporter),
being grabbed by Telegraf - (
https://www.influxdata.com/time-series-platform/telegraf )
sending metrics to InfluxDB.
Visual
I'm writing an ansible module to interact with my clusters, so I'm currently
diving into the --yaml output of `scontrol show node`..
What is the meaning of the "next_state_after_reboot" attribute of a node?
E.g. for one of my nodes, it is:
"next_state_after_reboot": [
"INVALID",
Just remove the given node from the partition.
Already running jobs will continue without interruption..
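(On the fly that can be done roughly like this - partition name and node list
are placeholders; the permanent change still belongs in slurm.conf:

scontrol update PartitionName=debug Nodes=node[001-009]    # new node list without the removed node

)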
HTH
josef
On 04. 08. 23 16:40, Pacey, Mike wrote:
..
My users found the beauty of job arrays, and they tend to use them every
now and then.
Sometimes the human factor steps in, something is wrong in the job array
specification, and the cluster "works" on one failed array job after another.
Isn't there any way to automatically stop/scancel/? a job array
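(One generic approach I can think of - my sketch, not something from this
thread - is to let a failing task cancel the rest of its own array:

./my_workload || scancel "${SLURM_ARRAY_JOB_ID}"    # inside the array job script; cancels remaining tasks on failure

)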
> But I'd be interested to see what other places do.
we installed this: https://github.com/vpenso/prometheus-slurm-exporter
and scrape this exporter with the "inputs.prometheus" Telegraf input, and
it's sent to InfluxDB (and shown by Grafana)
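(the telegraf side is roughly this; URL/port are placeholders for wherever
the exporter listens:

[[inputs.prometheus]]
  urls = ["http://localhost:8080/metrics"]

)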
--
josef
On 12. 06. 23 1:43, Andrew Elwell wrote:
...
Hi all slurm ops,
I'd like to improve my new-user workflow.
When a new (e.g. authenticated against external LDAP) user tries to submit a
job at my facilities, they see this:
[test@login1 ~]$ sbatch sbatch_sleep.sh
sbatch: error: Batch job submission failed: Invalid account or
account/partition com
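(The manual fix behind the scenes is of course something like the following -
account name is a placeholder - but I'd like it to happen automatically:

sacctmgr add user test account=myaccount

)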
Hello @list!
Anyone who was dealing with the following scenario?
* we have a limited number of Matlab network licenses (and various
features have various numbers of available seats, e.g. machine learning: N
licenses, Image_Toolbox: M licenses)
* licenses are being used by slurm jobs and by individual
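(On the slurm side, one obvious starting point - a sketch only, it does not by
itself account for the non-slurm usage - is to model each feature as a slurm
license; names and counts are placeholders:

Licenses=matlab_Image_Toolbox:10,matlab_Machine_Learning:5    # slurm.conf
sbatch -L matlab_Image_Toolbox:1 job.sh                       # job requesting one seat

)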
> I had config the right slurm and munge inside the container.
This is the reason.
Whoever has access to munge.key can effectively become root on the slurm cluster.
You should not disclose munge.key to containers.
cheers
josef
On 18. 05. 22 9:13, GHui wrote:
...I had config the right slurm and mung
...you're using a filesystem that is not secure by default..
On 15. 12. 21 10:29, Hermann Schwärzler wrote:
...
--
Josef Dvoracek
Institute of Physics | Czech Academy of Sciences | office 230A
cell+signal: +420 608 563 558
Is kerberized NFS at compute nodes under slurm a not-so-common scenario?
Thanks for any thoughts.
josef
--
Josef Dvoracek
Institute of Physics | Czech Academy of Sciences | office 230A
cell+signal: +420 608 563 558
...disable the powersaving mechanism for a particular node/node range?
I'm aware that there is the SuspendExcNodes configuration parameter, but
AFAIK it cannot be applied/changed without a slurmctld restart.
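(For reference, the parameter itself looks like this in slurm.conf; node names
are placeholders:

SuspendExcNodes=node[001-004]    # these nodes are never put into power-save

)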
cheers
josef
--
Josef Dvoracek
Institute of Physics | Czech Academy of Sciences
cell: +420 608 563 5
in
disk size is picked up - at least once with the cluster running.)
It is absolutely safe to restart slurmctld (and slurmdbd) with jobs
running on the cluster, that really is something that at least I do
all the time.
Tina
On 24/06/2021 10:16, Josef Dvoracek wrote:
hi,
just set t
...draining all partitions and then restart the server. That is slurmctld,
slurmdbd and mariadb? Or will restarting the slurm VM have no effect on
running/pending jobs?
Sincerely
Amjad
--
Josef Dvoracek
Institute of Physics | Czech Academy of Sciences
cell: +420 608 563 558 | https://telegram
ControlMachine=slurmserver2.DOMAIN
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=slurmserver2.DOMAIN
AccountingStoragePort=7031
SlurmctldParameters=enable_configless
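(On the node side, the counterpart is to point slurmd at that server, roughly:

slurmd --conf-server slurmserver2.DOMAIN:6817    # 6817 = default slurmctld port

or via a DNS SRV record, per the configless docs.)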
--
Josef Dvoracek
Institute of Physics | Czech Academy of Sciences
cell: +420 608 563 558 | https://t
--
josef
On 05. 05. 20 2:24, Lisa Kay Weihl wrote:
..
--
Josef Dvoracek
Institute of Physics | Czech Academy of Sciences
cell: +420 608 563 558 | office: +420 266 052 669 | fzu phone nr. : 2669
Some time ago I wrote this small collector:
https://github.com/jose-d/influxdb-collectors/tree/master/slurm_metric_writer
Until you write or find a better one, feel free to use it, send PRs with
improvements, etc. :)
cheers.
josef
On 26. 09. 19 17:15, Marcus Boden wrote:
Hey everyone,
I am
0 127.0.0.11:57504 0.0.0.0:* -
[root@slurmctld_container ~]#
cheers
josef
--
Josef Dvoracek
Institute of Physics @ Czech Academy of Sciences
cell: +420 608 563 558 | office: +420 266 052 669 | fzu phone nr. : 2669