I enabled "debug3" logging and saw this in the node log:

error: mpi_conf_send_stepd: unable to resolve MPI plugin offset from
plugin_id=106. This error usually results from a job being submitted
against an MPI plugin which was not compiled into slurmd but was for job
submission command.
error: _send_slurmstepd_init: mpi_conf_send_stepd(9, 106) failed: No error

I removed the "MpiDefault" option from slurm.conf, and now "srun -N2 -l
hostname" returns the hostnames of all machines.
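
For anyone hitting the same error, here is a rough sketch of how one might check what "MpiDefault" is set to before removing or changing it. The helper name mpi_default and the default path /etc/slurm/slurm.conf are my own assumptions, not anything Slurm ships; adjust the path for your install.

```shell
#!/bin/sh
# Sketch: print the MpiDefault value from a slurm.conf file, if one is set.
# "mpi_default" is a hypothetical helper name and the default path below is
# an assumption; neither is part of Slurm itself.
mpi_default() {
    conf="${1:-/etc/slurm/slurm.conf}"
    # Match an uncommented "MpiDefault=value" line (any letter case),
    # drop any trailing comment, and print only the value.
    # If the key appears more than once, report the last occurrence.
    sed -n 's/^[[:space:]]*[Mm][Pp][Ii][Dd][Ee][Ff][Aa][Uu][Ll][Tt][[:space:]]*=[[:space:]]*\([^[:space:]#]*\).*/\1/p' "$conf" | tail -n 1
}
```

Running "mpi_default /etc/slurm/slurm.conf" on each machine prints the configured plugin name, or nothing if the option is unset. Comparing that value with the output of "srun --mpi=list" on the same machine may show whether the configured plugin is actually available there.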



On Tue, Jun 11, 2024 at 11:05 AM Arnuld <arn...@aganitha.ai> wrote:

> I have two machines. When I run "srun hostname" on one machine (it is both
> the controller and a node), I get the hostname fine, but I get a "Socket
> timed out" error in these two situations:
>
> 1) "srun hostname" on 2nd machine (it's a node)
> 2) "srun -N 2 hostname" on controller
>
> "scontrol show node" shows both mach2 and mach4, and "sinfo" shows both
> nodes too. After the error, the job also gets stuck forever in the CG
> state. Here is the output:
>
> $ srun -N 2 hostname
> mach2
> srun: error: slurm_receive_msgs: [[mach4]:6818] failed: Socket timed out
> on send/recv operation
> srun: error: Task launch for StepId=2222.0 failed on node hpc4: Socket
> timed out on send/recv operation
> srun: error: Application launch failed: Socket timed out on send/recv
> operation
> srun: Job step aborted
>
>
> Output from "squeue", 3 seconds apart:
>
> Tue Jun 11 05:09:56 2024
>              JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>               2222     poxo hostname   arnuld  R       0:19      2 mach4,mach2
>
> Tue Jun 11 05:09:59 2024
>              JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>               2222     poxo hostname   arnuld CG       0:20      1 mach4
>
>
-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
