OK, so I am a step further (I hope), but I am still stuck with a non-working cluster.

I managed to solve both problems from my previous message (the invalid MailProg and the PMIx library that could not be loaded) by installing two Debian packages (sudo apt install mailutils libpmix-dev) on both the head and compute nodes. I no longer see errors in either log file, but the node is still drained. How do I get around this, please?

On Tue, 7 Nov 2023 at 17:43, JP Ebejer <jean.p.ebe...@um.edu.mt> wrote:

> On Tue, 7 Nov 2023 at 11:34, Diego Zuccato <diego.zucc...@unibo.it> wrote:
>
>> On 07/11/2023 11:15, JP Ebejer wrote:
>> > but on running sinfo
>> > right after, the node is still "drained".
>>
>> That's not normal :(
>> Look at the slurmd log on the node for a reason. Probably the node
>> detects an error and sets itself to drained. Another possibility is that
>> slurmctld detects a mismatch between the node and its config: in this
>> case you'll find the reason in slurmctld.log.
>>
>
> Ok great. So I clear the slurmd.log on the compute-0 node. I restart the
> service (after changing the logging from debug3 to verbose).
>
> [2023-11-07T16:34:17.575] topology/none: init: topology NONE plugin loaded
> [2023-11-07T16:34:17.575] route/default: init: route default plugin loaded
> [2023-11-07T16:34:17.577] task/affinity: init: task affinity plugin loaded
> with CPU mask 0xffffffff
> [2023-11-07T16:34:17.578] cred/munge: init: Munge credential signature
> plugin loaded
> [2023-11-07T16:34:17.578] slurmd version 22.05.8 started
> [2023-11-07T16:34:17.579] error: mpi/pmix_v4: init: (null) [0]:
> mpi_pmix.c:195: pmi/pmix: can not load PMIx library
> [2023-11-07T16:34:17.579] error: Couldn't load specified plugin name for
> mpi/pmix: Plugin init() callback failed
> [2023-11-07T16:34:17.579] error: MPI: Cannot create context for mpi/pmix
> [2023-11-07T16:34:17.580] error: mpi/pmix_v4: init: (null) [0]:
> mpi_pmix.c:195: pmi/pmix: can not load PMIx library
> [2023-11-07T16:34:17.580] error: Couldn't load specified plugin name for
> mpi/pmix_v4: Plugin init() callback failed
> [2023-11-07T16:34:17.580] error: MPI: Cannot create context for
> mpi/pmix_v4
> [2023-11-07T16:34:17.580] slurmd started on Tue, 07 Nov 2023 16:34:17 +0000
> [2023-11-07T16:34:17.580] CPUs=32 Boards=1 Sockets=2 Cores=8 Threads=2
> Memory=64171 TmpDisk=1031475 Uptime=87818 CPUSpecList=(null)
> FeaturesAvail=(null) FeaturesActive=(null)
>
> I am not sure I understand this, and my MPI setting is none (so
> MpiDefault=none). The jobs I intend to run do not use MPI.
>
> Could this be the cause, and how do I fix this (on Debian 12)?
>
> Also if I stop, truncate the log file, and start the slurmctld service I
> see similar errors. Below:
>
> [2023-11-07T16:40:22.888] error: Configured MailProg is invalid
> [2023-11-07T16:40:22.889] slurmctld version 22.05.8 started on cluster
> mycluster
> [2023-11-07T16:40:22.890] cred/munge: init: Munge credential signature
> plugin loaded
> [2023-11-07T16:40:22.892] select/cons_res: common_init: select/cons_res
> loaded
> [2023-11-07T16:40:22.892] select/cons_tres: common_init: select/cons_tres
> loaded
> [2023-11-07T16:40:22.892] select/cray_aries: init: Cray/Aries node
> selection plugin loaded
> [2023-11-07T16:40:22.893] preempt/none: init: preempt/none loaded
> [2023-11-07T16:40:22.894] ext_sensors/none: init: ExtSensors NONE plugin
> loaded
> [2023-11-07T16:40:22.895] error: mpi/pmix_v4: init: (null) [0]:
> mpi_pmix.c:195: pmi/pmix: can not load PMIx library
> [2023-11-07T16:40:22.895] error: Couldn't load specified plugin name for
> mpi/pmix_v4: Plugin init() callback failed
> [2023-11-07T16:40:22.895] error: MPI: Cannot create context for mpi/pmix_v4
> [2023-11-07T16:40:22.899] accounting_storage/none: init: Accounting
> storage NOT INVOKED plugin loaded
> [2023-11-07T16:40:22.901] No memory enforcing mechanism configured.
> [2023-11-07T16:40:22.902] topology/none: init: topology NONE plugin loaded
> [2023-11-07T16:40:22.904] sched: Backfill scheduler plugin loaded
> [2023-11-07T16:40:22.904] route/default: init: route default plugin loaded
> [2023-11-07T16:40:22.905] Recovered state of 1 nodes
> [2023-11-07T16:40:22.905] Recovered JobId=8 Assoc=0
> [2023-11-07T16:40:22.905] Recovered JobId=9 Assoc=0
> [2023-11-07T16:40:22.905] Recovered JobId=10 Assoc=0
> [2023-11-07T16:40:22.905] Recovered JobId=11 Assoc=0
> [2023-11-07T16:40:22.905] Recovered information about 4 jobs
> [2023-11-07T16:40:22.906] select/cons_tres: select_p_node_init:
> select/cons_tres SelectTypeParameters not specified, using default value:
> CR_Core_Memory
> [2023-11-07T16:40:22.906] select/cons_tres: part_data_create_array:
> select/cons_tres: preparing for 1 partitions
> [2023-11-07T16:40:22.906] Recovered state of 0 reservations
> [2023-11-07T16:40:22.906] State of 0 triggers recovered
> [2023-11-07T16:40:22.906] read_slurm_conf: backup_controller not specified
> [2023-11-07T16:40:22.906] select/cons_tres: select_p_reconfigure:
> select/cons_tres: reconfigure
> [2023-11-07T16:40:22.906] select/cons_tres: part_data_create_array:
> select/cons_tres: preparing for 1 partitions
> [2023-11-07T16:40:22.906] Running as primary controller
> [2023-11-07T16:40:22.907] No parameter for mcs plugin, default values set
> [2023-11-07T16:40:22.907] mcs: MCSParameters = (null). ondemand set.
>
>
> Is this a step closer to resolution?
>

--
<https://www.um.edu.mt/>
Prof. Jean-Paul Ebejer | Associate Professor
BSc (Hons) (Melita), MSc (Imperial), DPhil (Oxon.)

*Centre for Molecular Medicine and Biobanking*
Office 320, Biomedical Sciences Building,
University of Malta, Msida, MSD 2080. MALTA.
T: (00356) 2340 3263

*Department of Artificial Intelligence*
Associate Member

Join the *Bioinformatics@UM* <https://groups.google.com/a/um.edu.mt/g/mailinglist-bioinformatics.research> mailing list!