[slurm-users] Slurmctld process error 'double free or corruption' on RHEL 9 (Rocky Linux)

2024-07-15 Thread William VINCENT via slurm-users

Hello

I am writing to report an issue with the Slurmctld process on our RHEL 9 
(Rocky Linux) cluster.


Twice in the past 5 days, the Slurmctld process has encountered an error 
that resulted in the service stopping. The error message displayed was 
"double free or corruption (out)". This error has caused significant 
disruption to our jobs, and we are concerned about its recurrence.


We have tried troubleshooting the issue, but we have not been able to 
identify the root cause of the problem. We would appreciate any 
assistance or guidance you can provide to help us resolve this issue.


Please let us know if you need any additional information or if there 
are any specific steps we should take to diagnose the problem further.


Thank you for your attention to this matter.

Best regards,

_

Jul 09 22:12:01 admin slurmctld[711010]: double free or corruption 
(fasttop)
Jul 09 22:12:01 admin systemd[1]: slurmctld.service: Main process 
exited, code=killed, status=6/ABRT
Jul 09 22:12:01 admin systemd[1]: slurmctld.service: Failed with result 
'signal'.
Jul 09 22:12:01 admin systemd[1]: slurmctld.service: Consumed 11min 
26.451s CPU time.


.

Jul 14 10:15:01 admin slurmctld[1633720]: double free or corruption (out)
Jul 14 10:15:02 admin systemd[1]: slurmctld.service: Main process 
exited, code=killed, status=6/ABRT
Jul 14 10:15:02 admin systemd[1]: slurmctld.service: Failed with result 
'signal'.
Jul 14 10:15:02 admin systemd[1]: slurmctld.service: Consumed 7min 
27.596s CPU time.


_

slurmctld -V
slurm 22.05.9



cat /etc/slurm/slurm.conf |grep -v '#'


ClusterName=xxx
SlurmctldHost=admin
SlurmctldParameters=enable_configless
SlurmUser=slurm
AuthType=auth/munge
CryptoType=crypto/munge


SlurmctldPort=6817
StateSaveLocation=/var/spool/slurmctld
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmctldDebug=verbose
DebugFlags=NO_CONF_HASH


SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmdDebug=verbose

SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core,CR_LLN
DefMemPerCPU=1024
MaxMemPerCPU=4096
GresTypes=gpu


ProctrackType=proctrack/cgroup
JobAcctGatherType=jobacct_gather/cgroup
JobAcctGatherFrequency=15
JobCompType=jobcomp/none

TaskPlugin=task/cgroup
LaunchParameters=use_interactive_step

AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=admin
AccountingStoragePort=6819
AccountingStorageEnforce=associations
AccountingStorageTRES=gres/gpu



MailProg=/usr/bin/mailx
EnforcePartLimits=YES
MaxArraySize=20
MaxJobCount=50
MpiDefault=none
ReturnToService=2
SwitchType=switch/none
TmpFS=/tmpslurm/
UsePAM=1



InactiveLimit=0
KillWait=30
MessageTimeout=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0



PriorityType=priority/multifactor
PriorityFlags=FAIR_TREE,MAX_TRES
PriorityDecayHalfLife=1-0
PriorityWeightFairshare=1




NodeName=xxx  NodeHostname=xxx  CPUs=4 Sockets=4 RealMemory=3500 
TmpDisk=1 CoresPerSocket=1 ThreadsPerCore=1 State=DRAIN
NodeName=xxx  NodeHostname=xxx  CPUs=2 Sockets=2 RealMemory=1700 
TmpDisk=1 CoresPerSocket=1 ThreadsPerCore=1 State=DRAIN
NodeName=xxx  NodeHostname=xxx  CPUs=4 Sockets=4 RealMemory=1700 
TmpDisk=1 CoresPerSocket=1 ThreadsPerCore=1 State=DRAIN
NodeName=xxx  NodeHostname=xxx  CPUs=4 Sockets=4 RealMemory=3500 
TmpDisk=1 CoresPerSocket=1 ThreadsPerCore=1 State=DRAIN



NodeName=r9nc-24-[1-12] NodeHostname=r9nc-24-[1-12] Sockets=2 
CoresPerSocket=12 ThreadsPerCore=1 CPUs=24 RealMemory=18 State=UNKNOWN
NodeName=r9nc-48-[1-4]  NodeHostname=r9nc-48-[1-4] Sockets=2 
CoresPerSocket=24 ThreadsPerCore=1 CPUs=48 RealMemory=48 State=UNKNOWN
NodeName=r9ng-1080-[1-7]   NodeHostname=r9ng-1080-[1-7] Sockets=2 
CoresPerSocket=10 ThreadsPerCore=1 CPUs=20 RealMemory=18 
State=UNKNOWN Gres=gpu:1080ti:4
NodeName=r9ng-1080-8   NodeHostname=r9ng-1080-8 Sockets=2 
CoresPerSocket=10 ThreadsPerCore=1 CPUs=20 RealMemory=176687 
State=UNKNOWN Gres=gpu:1080ti:1


PartitionName=24CPUNodes  Nodes=r9nc-24-[1-12]    State=UP 
MaxTime=UNLIMITED OverSubscribe=NO MaxMemPerCPU=7500 DefMemPerCPU=7500 
TRESBillingWeights="CPU=1.0,Mem=0.125G" Default=YES
PartitionName=48CPUNodes  Nodes=r9nc-48-[1-4] State=UP 
MaxTime=UNLIMITED OverSubscribe=NO MaxMemPerCPU=1 DefMemPerCPU=8000 
TRESBillingWeights="CPU=1.0,Mem=0.125G"
PartitionName=GPUNodes   Nodes=r9ng-1080-[1-7]    State=UP 
MaxTime=UNLIMITED OverSubscribe=NO MaxMemPerCPU=9000 DefMemPerCPU=9000
PartitionName=GPUNodes1080-dev   Nodes=r9ng-1080-8    State=UP 
MaxTime=UNLIMITED OverSubscribe=NO MaxMemPerCPU=9000 DefMemPerCPU=9000 
Hidden=Yes


_

sinfo
PARTITION    AVAIL  TIMELIMIT  NODES  STATE NODELIST
24CPUNodes* up   infinite 12   idle r9nc-24-[1-12]
48CPUNodes  up   infinite  2   idle r9nc-48-[1-2]
GPUNodes    up   infinite  4   idle r9ng-1080-[4-7]
GPU

[slurm-users] Re: Slurmctld process error 'double free or corruption' on RHEL 9 (Rocky Linux)

2024-07-15 Thread Ole Holm Nielsen via slurm-users

On 7/15/24 10:43, William VINCENT via slurm-users wrote:
> I am writing to report an issue with the Slurmctld process on our RHEL 9
> (Rocky Linux) cluster.
>
> Twice in the past 5 days, the Slurmctld process has encountered an error
> that resulted in the service stopping. The error message displayed was
> "double free or corruption (out)". This error has caused significant
> disruption to our jobs, and we are concerned about its recurrence.
>
> We have tried troubleshooting the issue, but we have not been able to
> identify the root cause of the problem. We would appreciate any assistance
> or guidance you can provide to help us resolve this issue.
>
> Please let us know if you need any additional information or if there are
> any specific steps we should take to diagnose the problem further.


You're running Slurm 22.05.9 on RockyLinux 9 (is that Rocky 9.4 or what?). 
Such an old Slurm version probably hasn't been tested much on EL9 systems.


For security reasons you ought to upgrade to a recent Slurm version; just 
search for "CVE" in https://github.com/SchedMD/slurm/blob/master/NEWS to 
find out about security holes in older versions.
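
For example, a rough sketch of how to check this locally (assuming you have 
git and network access on some machine):

git clone https://github.com/SchedMD/slurm.git
grep -n "CVE" slurm/NEWS | head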


You can upgrade by 2 major releases in a single step, so you can go to 
23.11.8.  Upgrading Slurm is fairly easy, and I've collected various 
pieces of advice in the Wiki page 
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#upgrading-slurm


Hopefully a newer Slurm version is going to solve your issue.

I hope this helps,
Ole

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Slurmctld process error 'double free or corruption' on RHEL 9 (Rocky Linux)

2024-07-15 Thread William V via slurm-users
Thank you for your response; I hadn't considered that version 22 could be the 
problem.

I am aware that we are not up to date, but we use the EPEL repo for our RPM 
packages. Originally, we did not want to install .rpm files manually because our 
policy is to apply security updates every night via the repositories; 
unfortunately, in this case, that does not work. I think it is because only one 
person is responsible for maintaining the packages for RHEL.

I have already reported the security issue, but at the moment it does not seem 
possible to update: https://bugzilla.redhat.com/show_bug.cgi?id=2280545

It appears from another ticket that the compilation fails for version 24: 
https://bugzilla.redhat.com/show_bug.cgi?id=2259935

If the compilation fails, will the RPM package work on RHEL 9?

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Slurmctld process error 'double free or corruption' on RHEL 9 (Rocky Linux)

2024-07-15 Thread Ole Holm Nielsen via slurm-users

On 7/15/24 11:35, William V via slurm-users wrote:

> Thank you for your response; I hadn't considered that version 22 could be
> the problem.
>
> I am aware that we are not up to date, but we use the EPEL repo for our RPM
> packages. Originally, we did not want to install .rpm files manually because
> our policy is to apply security updates every night via the repositories;
> unfortunately, in this case, that does not work. I think it is because only
> one person is responsible for maintaining the packages for RHEL.


You should *NOT* use Slurm packages from the EPEL repository!!  The Slurm 
documentation recommends excluding those packages; see

https://slurm.schedmd.com/upgrades.html#epel_repository
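
As a sketch of what that looks like in practice (the exact repo section name 
depends on your setup), you can add an exclude line to 
/etc/yum.repos.d/epel.repo or set it with dnf:

# in /etc/yum.repos.d/epel.repo, under the [epel] section:
exclude=slurm*

# or, with dnf-plugins-core installed:
dnf config-manager --save --setopt=epel.exclude="slurm*"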


> I have already reported the security issue, but at the moment it does not
> seem possible to update: https://bugzilla.redhat.com/show_bug.cgi?id=2280545


RedHat doesn't provide support for Slurm, and if necessary you should 
contact SchedMD to obtain Slurm support.



> It appears from another ticket that the compilation fails for version 24:
> https://bugzilla.redhat.com/show_bug.cgi?id=2259935


I think this ticket only reports problems regarding older Slurm releases?


> If the compilation fails, will the RPM package work on RHEL 9?


You should build your own Slurm RPM packages, and compilation failure 
would indicate a bug somewhere!


Just as a test, I've now built RPM packages of the currently supported 
Slurm releases 23.11.8 and 24.05.1 on a RockyLinux 9.4 system.  The RPMs 
built without any issues or compilation errors at all!  I haven't tested 
these RPMs on our production cluster, which runs EL8 :-)


I recommend that you consult the Slurm documentation page [1] and my Wiki 
page for Slurm installation: 
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/
Remember to install all prerequisite packages before building Slurm, as 
explained in the Wiki!
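
For reference, a minimal sketch of such a build (assuming all prerequisite 
packages from the Wiki are already installed, and picking 23.11.8 as the 
example release):

wget https://download.schedmd.com/slurm/slurm-23.11.8.tar.bz2
rpmbuild -ta slurm-23.11.8.tar.bz2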


Best regards,
Ole

[1] https://slurm.schedmd.com/documentation.html


--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Slurmctld process error 'double free or corruption' on RHEL 9 (Rocky Linux)

2024-07-15 Thread William V via slurm-users
Wow, thank you so much for all this information and the installation wiki. 
I have a lot of work to do to change the infrastructure; I hope it will go 
smoothly.

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: _refresh_assoc_mgr_qos_list: no new list given back keeping cached one

2024-07-15 Thread andreas.wiedholz--- via slurm-users
Hi João,

did you get this problem solved? I have the exact same problem and would be 
very interested.

Help would be greatly appreciated!

Thank you and best regards,
Andi

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Custom Plugin Integration

2024-07-15 Thread jubhaskar--- via slurm-users
Hi Daniel,
Thanks for picking up this query. Let me try to briefly describe my problem.

As you rightly guessed, we have some hardware on the backend on which our 
jobs would run. The app which manages the h/w has its own set of resource 
placement/remapping rules for placing a job.
For example, if only 3 hosts h1, h2, h3 (2 cores available each) are available 
at some point for a 4-core job, then only a few combinations of cores from 
these hosts can be allowed for the job. There is also a preference order among 
the placements, decided by our app.

It is in this respect that we want our backend app to provide the placement 
for the job. Slurm would then dispatch the job accordingly, honoring the exact 
resource distribution that was asked for. If preemption is needed, our backend 
would likewise decide the placement, which in turn determines which preemptable 
job candidates to preempt.

So, how should we proceed?
We may not have the whole site/cluster to ourselves. There may be other jobs 
which we don't care about, and those should go through the usual route of the 
existing select plugin (linear, cons_tres, etc.).

Is there scope for a separate partition which encompasses only our resources 
and triggers our plugin only for our jobs? (The kind of definition I have in 
mind is sketched below.)
How do options a>, b> and c> (as described in my 1st message) stand now that I 
have described our requirement?
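
For reference, the kind of slurm.conf definition I had in mind above (all 
node and group names are hypothetical):

NodeName=h[1-3] CPUs=2 Sockets=2 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN
PartitionName=backend Nodes=h[1-3] AllowGroups=backendusers State=UP MaxTime=UNLIMITED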

A 4th option which comes to my mind is whether there is some API interface 
from Slurm which could inform a separate process P (say) about resource 
availability on a real-time basis.
P would talk to our backend app, obtain a placement, and then ask Slurm to 
place our job. A rough sketch of such polling follows below.
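
A rough sketch of what process P could poll with plain CLI calls (assuming 
CLI polling is acceptable; the fields are chosen only for illustration):

sinfo -h -N -o "%N %C %m"            # per node: name, CPUs (alloc/idle/other/total), memory
squeue -h -t RUNNING -o "%i %N %C"   # running jobs: id, allocated nodes, CPU count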

Your concern about ever-changing resources (being allocated before our backend 
comes up) does not apply here, as the hosts are segregated as far as our system 
is concerned. Our hosts will run only our jobs, and other Slurm jobs would run 
on different hosts.

I hope that makes things a little clearer! Any help would be appreciated.

(Note: We already have a working solution with LSF! LSF does provide an option 
for custom scheduler plugins that lets one hook into the decision-making loop 
during scheduling. This led us to believe Slurm would also offer some 
possibilities.)

Regards,
Bhaskar.

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] slurmctld hourly: Unexpected missing socket error

2024-07-15 Thread Jason Ellul via slurm-users
Hi all,

I am hoping someone can help with our problem. Every hour after restarting 
slurmctld, the controller becomes unresponsive to commands for about 1 second, 
reporting errors such as:

[2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934767]] 
slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket 
error
[2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934760]] 
slurm_bufs_sendto(msg_type=RESPONSE_SLURM_RC) failed: Unexpected missing socket 
error
[2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934875]] 
slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket 
error
[2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934906]] 
slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket 
error
[2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[939016]] 
slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket 
error

It occurs consistently at around the hour mark, but generally not at other 
times, unless we run a reconfigure or restart the controller. We don't see any 
issues in slurmdbd.log, and the errors are always msg type RESPONSE. We have 
tried building a new server on different infrastructure, but the problem has 
persisted. Yesterday we even tried updating Slurm to v24.05.1 in the hope that 
it might provide a fix. During our troubleshooting we have set:

  * SchedulerParameters = max_rpc_cnt=400,sched_min_interval=5,sched_max_job_start=300,batch_sched_delay=20,bf_resolution=600,bf_min_prio_reserve=2000,bf_min_age_reserve=600
  * SlurmctldPort = 6808-6817
Although the stats in sdiag have improved, we still see the errors.
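
(A rough sketch of how the scheduler diagnostics can be polled around the hour 
mark, in case it helps anyone reproduce this; interval and fields are arbitrary:)

while true; do date; sdiag | head -40; sleep 60; done >> sdiag_hourly.log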

In our monitoring software we also see a drop in network and disk activity 
during this 1 second, always approximately 1 hour after restarting the 
controller.

Many Thanks in advance

Jason

Jason Ellul
Head - Research Computing Facility
Office of Cancer Research
Peter MacCallum Cancer Centre

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com