On Tue, 7 Nov 2023 at 11:34, Diego Zuccato <diego.zucc...@unibo.it> wrote:
> On 07/11/2023 11:15, JP Ebejer wrote:
> > but on running sinfo right after, the node is still "drained".
>
> That's not normal :(
> Look at the slurmd log on the node for a reason. Probably the node
> detects an error and sets itself to drained. Another possibility is that
> slurmctld detects a mismatch between the node and its config: in this
> case you'll find the reason in slurmctld.log.

OK, great. I cleared slurmd.log on the compute-0 node and restarted the service (after changing the logging from debug3 to verbose):

[2023-11-07T16:34:17.575] topology/none: init: topology NONE plugin loaded
[2023-11-07T16:34:17.575] route/default: init: route default plugin loaded
[2023-11-07T16:34:17.577] task/affinity: init: task affinity plugin loaded with CPU mask 0xffffffff
[2023-11-07T16:34:17.578] cred/munge: init: Munge credential signature plugin loaded
[2023-11-07T16:34:17.578] slurmd version 22.05.8 started
[2023-11-07T16:34:17.579] error: mpi/pmix_v4: init: (null) [0]: mpi_pmix.c:195: pmi/pmix: can not load PMIx library
[2023-11-07T16:34:17.579] error: Couldn't load specified plugin name for mpi/pmix: Plugin init() callback failed
[2023-11-07T16:34:17.579] error: MPI: Cannot create context for mpi/pmix
[2023-11-07T16:34:17.580] error: mpi/pmix_v4: init: (null) [0]: mpi_pmix.c:195: pmi/pmix: can not load PMIx library
[2023-11-07T16:34:17.580] error: Couldn't load specified plugin name for mpi/pmix_v4: Plugin init() callback failed
[2023-11-07T16:34:17.580] error: MPI: Cannot create context for mpi/pmix_v4
[2023-11-07T16:34:17.580] slurmd started on Tue, 07 Nov 2023 16:34:17 +0000
[2023-11-07T16:34:17.580] CPUs=32 Boards=1 Sockets=2 Cores=8 Threads=2 Memory=64171 TmpDisk=1031475 Uptime=87818 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)

I am not sure I understand this. My MPI setting is none (MpiDefault=none), and the jobs I intend to run do not use MPI. Could this be the cause, and how do I fix it (on Debian 12)?

Also, if I stop the slurmctld service, truncate its log file, and start it again, I see similar errors:

[2023-11-07T16:40:22.888] error: Configured MailProg is invalid
[2023-11-07T16:40:22.889] slurmctld version 22.05.8 started on cluster mycluster
[2023-11-07T16:40:22.890] cred/munge: init: Munge credential signature plugin loaded
[2023-11-07T16:40:22.892] select/cons_res: common_init: select/cons_res loaded
[2023-11-07T16:40:22.892] select/cons_tres: common_init: select/cons_tres loaded
[2023-11-07T16:40:22.892] select/cray_aries: init: Cray/Aries node selection plugin loaded
[2023-11-07T16:40:22.893] preempt/none: init: preempt/none loaded
[2023-11-07T16:40:22.894] ext_sensors/none: init: ExtSensors NONE plugin loaded
[2023-11-07T16:40:22.895] error: mpi/pmix_v4: init: (null) [0]: mpi_pmix.c:195: pmi/pmix: can not load PMIx library
[2023-11-07T16:40:22.895] error: Couldn't load specified plugin name for mpi/pmix_v4: Plugin init() callback failed
[2023-11-07T16:40:22.895] error: MPI: Cannot create context for mpi/pmix_v4
[2023-11-07T16:40:22.899] accounting_storage/none: init: Accounting storage NOT INVOKED plugin loaded
[2023-11-07T16:40:22.901] No memory enforcing mechanism configured.
[2023-11-07T16:40:22.902] topology/none: init: topology NONE plugin loaded
[2023-11-07T16:40:22.904] sched: Backfill scheduler plugin loaded
[2023-11-07T16:40:22.904] route/default: init: route default plugin loaded
[2023-11-07T16:40:22.905] Recovered state of 1 nodes
[2023-11-07T16:40:22.905] Recovered JobId=8 Assoc=0
[2023-11-07T16:40:22.905] Recovered JobId=9 Assoc=0
[2023-11-07T16:40:22.905] Recovered JobId=10 Assoc=0
[2023-11-07T16:40:22.905] Recovered JobId=11 Assoc=0
[2023-11-07T16:40:22.905] Recovered information about 4 jobs
[2023-11-07T16:40:22.906] select/cons_tres: select_p_node_init: select/cons_tres SelectTypeParameters not specified, using default value: CR_Core_Memory
[2023-11-07T16:40:22.906] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2023-11-07T16:40:22.906] Recovered state of 0 reservations
[2023-11-07T16:40:22.906] State of 0 triggers recovered
[2023-11-07T16:40:22.906] read_slurm_conf: backup_controller not specified
[2023-11-07T16:40:22.906] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2023-11-07T16:40:22.906] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2023-11-07T16:40:22.906] Running as primary controller
[2023-11-07T16:40:22.907] No parameter for mcs plugin, default values set
[2023-11-07T16:40:22.907] mcs: MCSParameters = (null). ondemand set.

Is this a step closer to resolution?
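
When comparing slurmd.log and slurmctld.log like this, it can help to reduce each file to its distinct error messages. The following is only a minimal sketch, assuming every log line has the "[timestamp] error: message" layout seen in the excerpts in this message. The function name `summarize_errors` and the `sample` list are hypothetical and are not part of Slurm.

```python
import re
from collections import Counter

# Assumed layout, taken from the log excerpts in this message:
#   [2023-11-07T16:34:17.579] error: MPI: Cannot create context for mpi/pmix
ERROR_LINE = re.compile(r"^\[(?P<ts>[^\]]+)\]\s+error:\s+(?P<msg>.*)$")


def summarize_errors(lines):
    """Count how often each distinct error message appears (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = ERROR_LINE.match(line.strip())
        if match:
            counts[match.group("msg")] += 1
    return counts


# A few lines copied from the logs in this message, used as sample input.
sample = [
    "[2023-11-07T16:34:17.578] slurmd version 22.05.8 started",
    "[2023-11-07T16:34:17.579] error: MPI: Cannot create context for mpi/pmix",
    "[2023-11-07T16:34:17.580] error: MPI: Cannot create context for mpi/pmix_v4",
    "[2023-11-07T16:40:22.895] error: MPI: Cannot create context for mpi/pmix_v4",
]

for message, count in summarize_errors(sample).most_common():
    print(f"{count}x {message}")
```

To run it against a real file, pass an open file object instead of `sample`, for example `summarize_errors(open(path))`, where `path` is wherever the log is stored on that system.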