Hello everyone,
I’ve recently run into an issue where some nodes in our cluster go into
the drain state, seemingly at random, typically after completing
long-running jobs. Below is the output of the |sinfo -R| command, showing
the reason *“Prolog error”*:
|root@controller-node:~# sinfo -R
REASON          USER    TIMESTAMP             NODELIST
Prolog error    slurm   2024-09-24T21:18:05   node[24,31]
|
When checking the |slurmd.log| files on the nodes, I noticed the
following errors:
|[2024-09-24T17:18:22.386] [217703.extern] error: _handle_add_extern_pid_internal: Job 217703 can't add pid 3311892 to jobacct_gather plugin in the extern_step. **(repeated 90 times)**
[2024-09-24T17:18:22.917] [217703.extern] error: _handle_add_extern_pid_internal: Job 217703 can't add pid 3313158 to jobacct_gather plugin in the extern_step.
...
[2024-09-24T21:17:45.162] launch task StepId=217703.0 request from UID:54059 GID:1600 HOST:<SLURMCTLD_IP> PORT:53514
[2024-09-24T21:18:05.166] error: Waiting for JobId=217703 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec
[2024-09-24T21:18:05.166] error: slurm_send_node_msg: [(null)] slurm_bufs_sendto(msg_type=RESPONSE_SLURM_RC_MSG) failed: Unexpected missing socket error
[2024-09-24T21:18:05.166] error: _rpc_launch_tasks: unable to send return code to address:port=<SLURMCTLD_IP>:53514 msg_type=6001: No such file or directory
|
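In case it helps anyone compare against their own nodes, here is a minimal Python sketch of how these error lines could be grouped per job, to see when a job's errors start and end. It is not a Slurm tool: the function name `summarize_errors` and the two regular expressions are my own assumptions, based only on the line format shown above, and may need adjusting for other Slurm versions or log formats.

```python
import re

# Assumed line shape, taken from the excerpt above:
#   [timestamp] [optional step id] error: message
ERROR_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?:\[(?P<step>[^\]]+)\]\s+)?error:\s+(?P<msg>.*)$"
)
# Job IDs appear either as "JobId=217703" or as "Job 217703" in the messages.
JOB_ID = re.compile(r"(?:JobId=|Job )(\d+)")


def summarize_errors(lines):
    """Group slurmd error lines by job ID.

    Returns a dict mapping job ID (or "unknown") to the number of error
    lines and the first and last timestamps seen for that job.
    """
    summary = {}
    for line in lines:
        match = ERROR_LINE.match(line.strip())
        if not match:
            continue  # not an error line (e.g. "launch task ...")
        job = JOB_ID.search(match.group("msg"))
        if job:
            job_id = job.group(1)
        elif match.group("step"):
            # Fall back to the step prefix, e.g. "217703.extern" -> "217703"
            job_id = match.group("step").split(".")[0]
        else:
            job_id = "unknown"
        ts = match.group("ts")
        entry = summary.setdefault(job_id, {"count": 0, "first": ts, "last": ts})
        entry["count"] += 1
        entry["last"] = ts
    return summary


# Hypothetical usage; the log path depends on SlurmdLogFile in your setup:
#   with open("/path/to/slurmd.log") as fh:
#       for job_id, info in summarize_errors(fh).items():
#           print(job_id, info["count"], info["first"], info["last"])
```

Note that lines collapsed by the logger into a "repeated N times" message are counted once here, so the counts are a lower bound.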
If you know how to solve these errors, please let me know. I would
greatly appreciate any guidance or suggestions for further troubleshooting.
Thank you in advance for your assistance.
Best regards,
--
Télécom Paris <https://www.telecom-paris.fr>
*Nacereddine LADDAOUI*
Research and Development Engineer
19 place Marguerite Perey
CS 20031
91123 Palaiseau Cedex