Hello,
I am writing to report an issue with the slurmctld process on our RHEL 9
(Rocky Linux).
Twice in the past 5 days, the slurmctld process has encountered an error
that resulted in the service stopping. The error message displayed was
"double free or corruption (out)". This error has cau
On 7/15/24 10:43, William VINCENT via slurm-users wrote:
I am writing to report an issue with the slurmctld process on our RHEL 9
(Rocky Linux).
Twice in the past 5 days, the slurmctld process has encountered an error
that resulted in the service stopping. The error message displayed was
"do
Thank you for your response, I hadn't considered that version 22 could be the
problem.
I am aware that we are not up to date, but we use the EPEL repo for our RPM
packages. Originally, we did not want to install .rpm files directly because our
policy is to apply security updates every night via the r
On 7/15/24 11:35, William V via slurm-users wrote:
Thank you for your response, I hadn't considered that version 22 could be the
problem.
I am aware that we are not up to date, but we use the EPEL repo for our RPM
packages. Originally, we did not want to install .rpm files directly because our
poli
Wow, thank you so much for all this information and the installation wiki.
I have a lot of work to do to change the infrastructure, I hope it will go
smoothly.
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
Hi João,
did you get this problem solved? I have the exact same problem and would be
very interested.
Help would be greatly appreciated!
Thank you and best regards,
Andi
Hi Daniel,
Thanks for picking up this query. Let me try to briefly describe my problem.
As you rightly guessed, we have some hardware on the backend which would be
used for our jobs to run. The app which manages the hardware has its own set
of resource placement/remapping rules to place a job.
So, fo
Hi all,
I am hoping someone can help with our problem. Every hour after restarting
slurmctld, the controller becomes unresponsive to commands for 1 sec, reporting
errors such as:
[2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934767]]
slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO