Hi there Slurm experts! I am having trouble running a python-mpi program across more than one node. The python-mpi program is very simple: it only lists the number of ranks available in its environment. I have a munge daemon running before the Slurm service starts, and the program works on a single node (so I suppose munge is working). In addition, I have tested a simple sbatch script in which each available node (four nodes) prints its hostname and returns. Since Slurm authenticates via munge, do I still need passwordless SSH between the controller (slurmctld) and the nodes? (I found a guide, probably outdated, stating that passwordless SSH is a necessity for Slurm: HTTP://admin-magazine.com/HPC/Articles/Resource-Management-with-Slurm).
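To separate the Slurm launch from the MPI layer, a check along the following lines could be run with the same srun options. This is only a minimal sketch, not my actual python-mpi.py: it uses no MPI library and simply reads the SLURM_PROCID, SLURM_NTASKS and SLURMD_NODENAME variables that srun sets for each task. The function name describe_rank is made up for illustration.

```python
import os
import socket


def describe_rank(env=None):
    """Return a one-line summary of this task's rank as seen in Slurm's environment.

    describe_rank is a hypothetical helper, not part of Slurm or any MPI library.
    """
    env = os.environ if env is None else env
    # srun sets SLURM_PROCID (task rank) and SLURM_NTASKS (total tasks) per task;
    # the defaults cover running the script outside of Slurm.
    rank = int(env.get("SLURM_PROCID", 0))
    size = int(env.get("SLURM_NTASKS", 1))
    # SLURMD_NODENAME is the node name Slurm assigns to the task's host.
    host = env.get("SLURMD_NODENAME", socket.gethostname())
    return f"rank {rank} of {size} on {host}"


if __name__ == "__main__":
    print(describe_rank())
```

If this prints one line per task from both nodes, the Slurm launch itself is presumably fine and the problem is more likely in the MPI/PMIx wire-up.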
I run the python-mpi program via an sbatch script that invokes srun. Each node has 8 CPUs. When testing on two nodes, the srun command is: "srun -N2 -n8 python3 python-mpi.py". It works fine on a single node (with "-N1" instead of "-N2"), but it is aborted or stopped when running on two nodes. Should I use "-n16" when running on two nodes, in order to allocate all the CPUs available on both nodes? Slurm is configured and built with PMIx. I am running Slurm 19.05 on Ubuntu 18.04 on the server, and the nodes run the same Slurm version on Ubuntu 18.10.

Best regards,
Palle