Hello Open MPI folks,

We use Open MPI 1.5.3 on our fairly new InfiniBand cluster of 1800+ nodes, and we see some strange hangs when starting Open MPI processes.
The nodes are named linuxbsc001, linuxbsc002, ... (with some gaps due to offline nodes). Each node is reachable from every other node over SSH (without a password), and we have also checked that MPI programs run between any two nodes.
So far, I have tried to start a larger number of processes, one process per node:
$ mpiexec -np NN --host linuxbsc001,linuxbsc002,... MPI_FastTest.exe

Now the problem: there are some combinations of names in the host list on which mpiexec reproducibly hangs forever; and more surprisingly, another *permutation* of the *same* node names may run without any errors!
Example: the command in laueft.txt runs OK, while the command in haengt.txt hangs. Note: the only difference is that the node linuxbsc025 is moved to the end of the host list. Surprising, isn't it?
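To double-check that the two host lists really contain the same nodes and differ only in the position of one name, a small comparison like the following can help. This is only a sketch: the helper name `moved_hosts` and the variable names are made up for illustration, and the node numbers are copied from the two mpiexec commands quoted in this mail.

```python
# Node numbers taken from the --host lists of the two mpiexec commands.
ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 14, 16, 17, 18,
       20, 21, 22, 23, 24, 25, 27, 28, 29, 30, 31, 32]

# Host list with linuxbsc025 in its numerical position.
in_order = [f"linuxbsc{n:03d}" for n in ids]

# Same hosts, but linuxbsc025 moved to the end of the list.
moved_to_end = [h for h in in_order if h != "linuxbsc025"] + ["linuxbsc025"]


def moved_hosts(a, b):
    """Return the hosts whose removal from both lists leaves identical orders."""
    return [h for h in a
            if [x for x in a if x != h] == [x for x in b if x != h]]


# Both lists name exactly the same 27 nodes ...
same_nodes = sorted(in_order) == sorted(moved_to_end)
# ... and differ only in where a single host is placed.
moved = moved_hosts(in_order, moved_to_end)
print(same_nodes, moved)
```

With these two lists the comparison reports identical node sets and linuxbsc025 as the only relocated host.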
Looking at the individual nodes while the above mpiexec hangs, we found the orted daemons started on *each* node and the binary on all but one node (orted.txt, MPI_FastTest.txt). Even more surprising: the node with no user process started (which, I believe, leaves all other processes stuck in MPI_Init and thus causes the overall hang) was always the same one, linuxbsc005, which is NOT the permuted node linuxbsc025...
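The node without a user process can be found mechanically by comparing the two process listings. The following is a rough sketch, assuming the "node: STDOUT: ..." line format of the attached listings; the function names are hypothetical, and the sample text is a three-node excerpt with the orted arguments cut off after --daemonize.

```python
import re


def nodes_in_listing(text):
    """Collect the node names that appear as 'linuxbscNNN: STDOUT:' prefixes."""
    return set(re.findall(r"(linuxbsc\d+): STDOUT:", text))


def nodes_missing_app(orted_text, app_text):
    """Nodes that run an orted daemon but show no application process."""
    return sorted(nodes_in_listing(orted_text) - nodes_in_listing(app_text))


# Excerpt of the orted listing (remaining orted arguments omitted here).
orted_excerpt = """\
linuxbsc004: STDOUT: 58898 ? Ss 0:00 /opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize
linuxbsc005: STDOUT: 65642 ? Ss 0:00 /opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize
linuxbsc006: STDOUT: 68254 ? Ss 0:00 /opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize
"""

# Excerpt of the application listing for the same nodes.
app_excerpt = """\
linuxbsc004: STDOUT: 58899 ? SLl 0:00 MPI_FastTest.exe
linuxbsc006: STDOUT: 68255 ? SLl 0:00 MPI_FastTest.exe
"""

missing = nodes_missing_app(orted_excerpt, app_excerpt)
print(missing)
```

Fed with the complete contents of orted.txt and MPI_FastTest.txt instead of the excerpts, the same comparison should single out the node on which only the daemon is running.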
This behaviour is reproducible. The hang occurs only if the started application is an MPI application ("hostname" does not hang).
Any idea what is going on?

Best,
Paul Kapinos

P.S.: no alias names are used; all names are real ones.

--
Dipl.-Inform. Paul Kapinos - High Performance Computing,
RWTH Aachen University, Center for Computing and Communication
Seffenter Weg 23, D 52074 Aachen (Germany)
Tel: +49 241/80-24915
linuxbsc001: STDOUT: 24323 ? SLl 0:00 MPI_FastTest.exe
linuxbsc002: STDOUT: 2142 ? SLl 0:00 MPI_FastTest.exe
linuxbsc003: STDOUT: 69266 ? SLl 0:00 MPI_FastTest.exe
linuxbsc004: STDOUT: 58899 ? SLl 0:00 MPI_FastTest.exe
linuxbsc006: STDOUT: 68255 ? SLl 0:00 MPI_FastTest.exe
linuxbsc007: STDOUT: 62026 ? SLl 0:00 MPI_FastTest.exe
linuxbsc008: STDOUT: 54221 ? SLl 0:00 MPI_FastTest.exe
linuxbsc009: STDOUT: 55482 ? SLl 0:00 MPI_FastTest.exe
linuxbsc010: STDOUT: 59380 ? SLl 0:00 MPI_FastTest.exe
linuxbsc011: STDOUT: 58312 ? SLl 0:00 MPI_FastTest.exe
linuxbsc014: STDOUT: 56013 ? SLl 0:00 MPI_FastTest.exe
linuxbsc016: STDOUT: 58563 ? SLl 0:00 MPI_FastTest.exe
linuxbsc017: STDOUT: 54693 ? SLl 0:00 MPI_FastTest.exe
linuxbsc018: STDOUT: 54187 ? SLl 0:00 MPI_FastTest.exe
linuxbsc020: STDOUT: 55811 ? SLl 0:00 MPI_FastTest.exe
linuxbsc021: STDOUT: 54982 ? SLl 0:00 MPI_FastTest.exe
linuxbsc022: STDOUT: 50032 ? SLl 0:00 MPI_FastTest.exe
linuxbsc023: STDOUT: 54044 ? SLl 0:00 MPI_FastTest.exe
linuxbsc024: STDOUT: 51247 ? SLl 0:00 MPI_FastTest.exe
linuxbsc025: STDOUT: 18575 ? SLl 0:00 MPI_FastTest.exe
linuxbsc027: STDOUT: 48969 ? SLl 0:00 MPI_FastTest.exe
linuxbsc028: STDOUT: 52397 ? SLl 0:00 MPI_FastTest.exe
linuxbsc029: STDOUT: 52780 ? SLl 0:00 MPI_FastTest.exe
linuxbsc030: STDOUT: 47537 ? SLl 0:00 MPI_FastTest.exe
linuxbsc031: STDOUT: 54609 ? SLl 0:00 MPI_FastTest.exe
linuxbsc032: STDOUT: 52833 ? SLl 0:00 MPI_FastTest.exe
$ timex /opt/MPI/openmpi-1.5.3/linux/intel/bin/mpiexec -np 27 --host linuxbsc001,linuxbsc002,linuxbsc003,linuxbsc004,linuxbsc005,linuxbsc006,linuxbsc007,linuxbsc008,linuxbsc009,linuxbsc010,linuxbsc011,linuxbsc014,linuxbsc016,linuxbsc017,linuxbsc018,linuxbsc020,linuxbsc021,linuxbsc022,linuxbsc023,linuxbsc024,linuxbsc025,linuxbsc027,linuxbsc028,linuxbsc029,linuxbsc030,linuxbsc031,linuxbsc032 MPI_FastTest.exe
$ timex /opt/MPI/openmpi-1.5.3/linux/intel/bin/mpiexec -np 27 --host linuxbsc001,linuxbsc002,linuxbsc003,linuxbsc004,linuxbsc005,linuxbsc006,linuxbsc007,linuxbsc008,linuxbsc009,linuxbsc010,linuxbsc011,linuxbsc014,linuxbsc016,linuxbsc017,linuxbsc018,linuxbsc020,linuxbsc021,linuxbsc022,linuxbsc023,linuxbsc024,linuxbsc027,linuxbsc028,linuxbsc029,linuxbsc030,linuxbsc031,linuxbsc032,linuxbsc025 MPI_FastTest.exe
linuxbsc001: STDOUT: 24322 ? Ss 0:00 /opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca orte_ess_jobid 751435776 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 28 --hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc002: STDOUT: 2141 ? Ss 0:00 /opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca orte_ess_jobid 751435776 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 28 --hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc003: STDOUT: 69265 ? Ss 0:00 /opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca orte_ess_jobid 751435776 -mca orte_ess_vpid 3 -mca orte_ess_num_procs 28 --hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc004: STDOUT: 58898 ? Ss 0:00 /opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca orte_ess_jobid 751435776 -mca orte_ess_vpid 4 -mca orte_ess_num_procs 28 --hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc005: STDOUT: 65642 ? Ss 0:00 /opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca orte_ess_jobid 751435776 -mca orte_ess_vpid 5 -mca orte_ess_num_procs 28 --hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc006: STDOUT: 68254 ? Ss 0:00 /opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca orte_ess_jobid 751435776 -mca orte_ess_vpid 6 -mca orte_ess_num_procs 28 --hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc007: STDOUT: 62025 ? Ss 0:00 /opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca orte_ess_jobid 751435776 -mca orte_ess_vpid 7 -mca orte_ess_num_procs 28 --hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc008: STDOUT: 54220 ? Ss 0:00 /opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca orte_ess_jobid 751435776 -mca orte_ess_vpid 8 -mca orte_ess_num_procs 28 --hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc009: STDOUT: 55481 ? Ss 0:00 /opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca orte_ess_jobid 751435776 -mca orte_ess_vpid 9 -mca orte_ess_num_procs 28 --hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc010: STDOUT: 59379 ? Ss 0:00 /opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca orte_ess_jobid 751435776 -mca orte_ess_vpid 10 -mca orte_ess_num_procs 28 --hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc011: STDOUT: 58311 ? Ss 0:00 /opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca orte_ess_jobid 751435776 -mca orte_ess_vpid 11 -mca orte_ess_num_procs 28 --hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc014: STDOUT: 56012 ? Ss 0:00 /opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca orte_ess_jobid 751435776 -mca orte_ess_vpid 12 -mca orte_ess_num_procs 28 --hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc016: STDOUT: 58562 ? Ss 0:00 /opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca orte_ess_jobid 751435776 -mca orte_ess_vpid 13 -mca orte_ess_num_procs 28 --hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc017: STDOUT: 54692 ? Ss 0:00 /opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca orte_ess_jobid 751435776 -mca orte_ess_vpid 14 -mca orte_ess_num_procs 28 --hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc018: STDOUT: 54186 ? Ss 0:00 /opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca orte_ess_jobid 751435776 -mca orte_ess_vpid 15 -mca orte_ess_num_procs 28 --hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc020: STDOUT: 55810 ? Ss 0:00 /opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca orte_ess_jobid 751435776 -mca orte_ess_vpid 16 -mca orte_ess_num_procs 28 --hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc021: STDOUT: 54981 ? Ss 0:00 /opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca orte_ess_jobid 751435776 -mca orte_ess_vpid 17 -mca orte_ess_num_procs 28 --hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc022: STDOUT: 50031 ? Ss 0:00 /opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca orte_ess_jobid 751435776 -mca orte_ess_vpid 18 -mca orte_ess_num_procs 28 --hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc023: STDOUT: 54043 ? Ss 0:00 /opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca orte_ess_jobid 751435776 -mca orte_ess_vpid 19 -mca orte_ess_num_procs 28 --hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc024: STDOUT: 51246 ? Ss 0:00 /opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca orte_ess_jobid 751435776 -mca orte_ess_vpid 20 -mca orte_ess_num_procs 28 --hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc025: STDOUT: 18574 ? Ss 0:00 /opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca orte_ess_jobid 751435776 -mca orte_ess_vpid 21 -mca orte_ess_num_procs 28 --hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc027: STDOUT: 48968 ? Ss 0:00 /opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca orte_ess_jobid 751435776 -mca orte_ess_vpid 22 -mca orte_ess_num_procs 28 --hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc028: STDOUT: 52396 ? Ss 0:00 /opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca orte_ess_jobid 751435776 -mca orte_ess_vpid 23 -mca orte_ess_num_procs 28 --hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc029: STDOUT: 52779 ? Ss 0:00 /opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca orte_ess_jobid 751435776 -mca orte_ess_vpid 24 -mca orte_ess_num_procs 28 --hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc030: STDOUT: 47536 ? Ss 0:00 /opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca orte_ess_jobid 751435776 -mca orte_ess_vpid 25 -mca orte_ess_num_procs 28 --hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc031: STDOUT: 54608 ? Ss 0:00 /opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca orte_ess_jobid 751435776 -mca orte_ess_vpid 26 -mca orte_ess_num_procs 28 --hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc032: STDOUT: 52832 ? Ss 0:00 /opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca orte_ess_jobid 751435776 -mca orte_ess_vpid 27 -mca orte_ess_num_procs 28 --hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh