Hello,
per the documentation, it is possible to run Slurm on a non-systemd system with
IgnoreSystemd=yes in cgroup.conf.
However, I had an error with slurmd:
error: common_file_write_content: unable to open
'/sys/fs/cgroup/system.slice/cgroup.subtree_control' for writing: No such file
or directory
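For reference, a minimal cgroup.conf for this kind of setup might look
something like the following (the CgroupPlugin and Constrain* lines here are
only illustrative; IgnoreSystemd=yes is the documented setting in question):

    CgroupPlugin=cgroup/v2
    IgnoreSystemd=yes
    ConstrainCores=yes
    ConstrainRAMSpace=yes
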
We did something like this in the past, but from C. However, modifying
the features was painful if the user used any interesting syntax.
What we are doing now is using --extra for that purpose. The nodes boot
up with SLURMD_OPTIONS="--extra {\\\"os\\\":\\\"rhel9\\\"}" or similar.
Users can re…
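In a sysconfig-style file that might look something like this (the file path,
the single-quote form, and the scontrol check below are assumptions; the
--extra string itself comes from the message above):

    # /etc/sysconfig/slurmd  (path assumed; adjust for your distro)
    SLURMD_OPTIONS='--extra {"os":"rhel9"}'

    # once the node registers, the value should show up as the node's Extra field:
    #   scontrol show node <nodename> | grep -i extra
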
I wrote a job_submit.lua also. It would append "&centos79" to the feature
string unless the features already contained "el9", or, if empty, set the
features string to "centos79" without the ampersand. I didn't hear from any
users doing anything fancy enough with their feature string for the ampersand…
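For anyone wanting to copy this, a minimal job_submit.lua sketch of that logic
could look like the following (the feature names and the "el9" check come from
the description above; the plain-string matching and the empty modify hook are
assumptions):

    function slurm_job_submit(job_desc, part_list, submit_uid)
        -- no features requested: default the job onto the old-OS nodes
        if job_desc.features == nil or job_desc.features == "" then
            job_desc.features = "centos79"
        -- otherwise append "&centos79" unless an el9 node was already requested
        elseif not string.find(job_desc.features, "el9", 1, true) then
            job_desc.features = job_desc.features .. "&centos79"
        end
        return slurm.SUCCESS
    end

    function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
        return slurm.SUCCESS
    end
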
We've done this, though, with job_submit.lua, mostly for OS updates. We
add a feature to everything, then proceed, telling users that adding the
feature gets them onto the "new" nodes.
I can send you the snippet if you're using the job_submit.lua script.
Bill
On 6/14/24 2:18 PM, David Magda via slurm-users wrote:
Hello,
What I’m looking for is a way for a node to continue to be in the same
partition, and have the same QoS(es), but only be chosen if a particular
capability is being asked for. This is because we are rolling something (OS
upgrade) out slowly to a small batch of nodes at first, and then more…
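In other words, the upgraded nodes would carry a node feature, and users would
only land on them if they explicitly ask for it with a constraint, along the
lines of (the feature name here is just an example):

    sbatch --constraint=el9 job.sh
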
I have confirmed that the issue is Ubuntu 20.04. I used the tmate GitHub
Action to get access to the Ubuntu 20.04 GitHub ARM runner and tried the steps
manually one by one. It did indeed fail, almost immediately, in the "debuild -b
-uc -us" step. Given that the same experiment done on an Ubuntu…
On 14.06.2024 17:51, Rafał Lalik via slurm-users wrote:
Hello,
I have encountered issues with running slurmctld.
From logs, I see these errors:
[2024-06-14T17:37:57.587] slurmctld version 24.05.0 started on cluster laura
[2024-06-14T17:37:57.587] error: plugin_load_from_file:
dlopen(/usr/lib64/slurm/jobacct_gather_cgroup.so):…
Hello,
I have encountered issues with running slurmctld.
From logs, I see these errors:
[2024-06-14T17:37:57.587] slurmctld version 24.05.0 started on cluster laura
[2024-06-14T17:37:57.587] error: plugin_load_from_file: dlopen(/usr/lib64/slurm/jobacct_gather_cgroup.so):
/usr/lib64/slurm/jobac…
The commands were grouped like that because they are part of a RUN in a
Dockerfile. The build was happening on a GitHub Actions runner, so it was not so
easy to just run them interactively one at a time. But I'm pretty confident that
it was the "debuild -b -uc -us" that failed.
I have since gathered…
I believe I have solved this. I changed the configuration to replace:
TaskPlugin=task/affinity
with:
TaskPlugin=task/none
In my case, the login node, the head node, and all of the compute nodes are
running in their own containers, and Docker Compose is used to run all of
those containers to…
Hi,
Because of my real scenario (in my first post I explained my testing scenario),
with several different users of different types (researchers, university
students and/or teachers, etc.), I have distributed my GPUs across 3 different
partitions:
* PartitionName=cuda-staff.q Nodes=gpu-[1-4]