I enabled "debug3" logging and saw this in the node log:

error: mpi_conf_send_stepd: unable to resolve MPI plugin offset from plugin_id=106. This error usually results from a job being submitted against an MPI plugin which was not compiled into slurmd but was for job submission command.
error: _send_slurmstepd_init: mpi_conf_send_stepd(9, 106) failed: No error
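Since the message points at an MPI plugin mismatch between the submitting side and slurmd, one thing worth checking is whether slurm.conf sets MpiDefault on each machine. Below is a minimal sketch, not an official Slurm tool: mpi_default is a hypothetical helper name, the file path passed to it is an assumption (the location of slurm.conf varies by install), and the parsing assumes simple Key=Value lines with "#" comments.

```shell
#!/bin/sh
# Hypothetical helper: print the active MpiDefault value from a
# slurm.conf-style file, or "unset" if no uncommented setting exists.
# Assumes plain "Key=Value" lines and "#" for comments.
mpi_default() {
    conf="$1"
    # -i: match the key regardless of case.
    # tail -n 1: if the key appears more than once, report the last one.
    # sed: drop any trailing comment, then everything up to the first "=".
    # tr: strip surrounding whitespace from the value.
    value=$(grep -i '^[[:space:]]*MpiDefault[[:space:]]*=' "$conf" \
        | tail -n 1 \
        | sed -e 's/#.*//' -e 's/^[^=]*=//' \
        | tr -d '[:space:]')
    echo "${value:-unset}"
}
```

For example, running mpi_default /etc/slurm/slurm.conf (path assumed) on both the controller and the node shows whether the two files agree. On a running cluster, "scontrol show config | grep -i MpiDefault" and "srun --mpi=list" should report the effective setting and the MPI plugin types the local srun knows about.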
I removed the "MpiDefault" option from slurm.conf, and now "srun -N2 -l hostname" returns the hostnames of all machines.

On Tue, Jun 11, 2024 at 11:05 AM Arnuld <arn...@aganitha.ai> wrote:

> I have two machines. When I run "srun hostname" on one machine (it's both
> a controller and a node), I get the hostname fine, but I get a "socket
> timed out" error in these two situations:
>
> 1) "srun hostname" on the 2nd machine (it's a node)
> 2) "srun -N 2 hostname" on the controller
>
> "scontrol show node" shows both mach2 and mach4. "sinfo" shows both nodes
> too. Also, the job gets stuck forever in the CG state after the error.
> Here is the output:
>
> $ srun -N 2 hostname
> mach2
> srun: error: slurm_receive_msgs: [[mach4]:6818] failed: Socket timed out on send/recv operation
> srun: error: Task launch for StepId=2222.0 failed on node hpc4: Socket timed out on send/recv operation
> srun: error: Application launch failed: Socket timed out on send/recv operation
> srun: Job step aborted
>
> Output from "squeue", 3 seconds apart:
>
> Tue Jun 11 05:09:56 2024
> JOBID PARTITION NAME     USER   ST TIME NODES NODELIST(REASON)
> 2222  poxo      hostname arnuld R  0:19 2     mach4,mach2
>
> Tue Jun 11 05:09:59 2024
> JOBID PARTITION NAME     USER   ST TIME NODES NODELIST(REASON)
> 2222  poxo      hostname arnuld CG 0:20 1     mach4
-- slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com