Hello Open MPI folks,

We use Open MPI 1.5.3 on our fairly new InfiniBand cluster of 1800+ nodes, and we see some strange hangs when starting Open MPI processes.

The nodes are named linuxbsc001, linuxbsc002, ... (with some gaps due to offline nodes). Every node can reach every other node over SSH without a password, and we have verified that MPI programs run between any two nodes.


So far, I have tried to start a larger number of processes, one process per node:
$ mpiexec -np NN  --host linuxbsc001,linuxbsc002,... MPI_FastTest.exe

Now the problem: for some orderings of the names in the host list, mpiexec reproducibly hangs forever. Even more surprising: another *permutation* of the *same* node names may run without any errors!

Example: the command in laueft.txt runs OK, the command in haengt.txt hangs. Note: the only difference is that the node linuxbsc025 is moved to the end of the host list. Amazed, too?
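
To narrow down which orderings hang, the test could be automated. The following is only a sketch, assuming Python 3 is available on the launch node. It rebuilds the mpiexec command line shown above for a given host order, runs it under a timeout, and tries each variant in which a single host is moved to the end of the list (as in the linuxbsc025 example). The helper names (build_command, move_to_end, run_with_timeout, scan_single_moves) and the 60 second timeout are my own assumptions and not part of Open MPI.

```python
import subprocess


def build_command(hosts, binary="MPI_FastTest.exe", launcher="mpiexec"):
    """Return the command from the post: one process per listed host."""
    return [launcher, "-np", str(len(hosts)), "--host", ",".join(hosts), binary]


def move_to_end(hosts, name):
    """Return a copy of hosts with `name` moved to the last position."""
    return [h for h in hosts if h != name] + [name]


def run_with_timeout(cmd, timeout_s):
    """Run cmd; report 'ok', 'failed' (non-zero exit) or 'hang' (timeout)."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the local child on timeout; daemons left
        # on remote nodes may still need to be cleaned up separately.
        return "hang"
    return "ok" if result.returncode == 0 else "failed"


def scan_single_moves(hosts, timeout_s=60, run=run_with_timeout):
    """Try the original order and every 'one host moved to the end' variant."""
    results = {"original": run(build_command(hosts), timeout_s)}
    for name in hosts:
        variant = move_to_end(hosts, name)
        results[name] = run(build_command(variant), timeout_s)
    return results
```

Calling scan_single_moves(hosts) with the 27 host names would return a dictionary that maps each moved host to 'ok', 'failed' or 'hang', which makes it easier to see whether the hang depends on one particular position in the list.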

Looking at the individual nodes while the above mpiexec hangs, we found the orted daemons started on *each* node and the binary on all but one node (orted.txt, MPI_FastTest.txt). Again amazing: the node with no user process started (which, I believe, leaves all other processes waiting in MPI_Init and thus causes the hang) was always the same, linuxbsc005, which is NOT the permuted item linuxbsc025...
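
The comparison between the two listings can be done mechanically. A minimal sketch, assuming both files use the "host: STDOUT: pid ..." prefix seen in the attached orted.txt and MPI_FastTest.txt (the function names are made up for illustration): it collects the host names from each listing and reports the hosts that have an orted daemon but no application process.

```python
import re

# Matches the first line of each entry, e.g.
# "linuxbsc005: STDOUT: 65642 ?        Ss     0:00"
ENTRY_RE = re.compile(r"^([^\s:]+): STDOUT:\s+\d+\s")


def hosts_in_listing(text):
    """Return the set of host names that have an entry in a ps listing."""
    hosts = set()
    for line in text.splitlines():
        match = ENTRY_RE.match(line)
        if match:  # wrapped continuation lines do not match and are skipped
            hosts.add(match.group(1))
    return hosts


def hosts_without_app(orted_text, app_text):
    """Return hosts that run an orted daemon but no application process."""
    return sorted(hosts_in_listing(orted_text) - hosts_in_listing(app_text))
```

Applied to the contents of orted.txt and MPI_FastTest.txt, hosts_without_app should list exactly the nodes on which the daemon started but the MPI binary did not.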

This behaviour is reproducible. The hang only occurs if the started application is an MPI application ("hostname" does not hang).


Any idea what is going on?


Best,

Paul Kapinos


P.S.: no alias names are used; all names are real ones.







--
Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
RWTH Aachen University, Center for Computing and Communication
Seffenter Weg 23,  D 52074  Aachen (Germany)
Tel: +49 241/80-24915
linuxbsc001: STDOUT: 24323 ?        SLl    0:00 MPI_FastTest.exe
linuxbsc002: STDOUT:  2142 ?        SLl    0:00 MPI_FastTest.exe
linuxbsc003: STDOUT: 69266 ?        SLl    0:00 MPI_FastTest.exe
linuxbsc004: STDOUT: 58899 ?        SLl    0:00 MPI_FastTest.exe
linuxbsc006: STDOUT: 68255 ?        SLl    0:00 MPI_FastTest.exe
linuxbsc007: STDOUT: 62026 ?        SLl    0:00 MPI_FastTest.exe
linuxbsc008: STDOUT: 54221 ?        SLl    0:00 MPI_FastTest.exe
linuxbsc009: STDOUT: 55482 ?        SLl    0:00 MPI_FastTest.exe
linuxbsc010: STDOUT: 59380 ?        SLl    0:00 MPI_FastTest.exe
linuxbsc011: STDOUT: 58312 ?        SLl    0:00 MPI_FastTest.exe
linuxbsc014: STDOUT: 56013 ?        SLl    0:00 MPI_FastTest.exe
linuxbsc016: STDOUT: 58563 ?        SLl    0:00 MPI_FastTest.exe
linuxbsc017: STDOUT: 54693 ?        SLl    0:00 MPI_FastTest.exe
linuxbsc018: STDOUT: 54187 ?        SLl    0:00 MPI_FastTest.exe
linuxbsc020: STDOUT: 55811 ?        SLl    0:00 MPI_FastTest.exe
linuxbsc021: STDOUT: 54982 ?        SLl    0:00 MPI_FastTest.exe
linuxbsc022: STDOUT: 50032 ?        SLl    0:00 MPI_FastTest.exe
linuxbsc023: STDOUT: 54044 ?        SLl    0:00 MPI_FastTest.exe
linuxbsc024: STDOUT: 51247 ?        SLl    0:00 MPI_FastTest.exe
linuxbsc025: STDOUT: 18575 ?        SLl    0:00 MPI_FastTest.exe
linuxbsc027: STDOUT: 48969 ?        SLl    0:00 MPI_FastTest.exe
linuxbsc028: STDOUT: 52397 ?        SLl    0:00 MPI_FastTest.exe
linuxbsc029: STDOUT: 52780 ?        SLl    0:00 MPI_FastTest.exe
linuxbsc030: STDOUT: 47537 ?        SLl    0:00 MPI_FastTest.exe
linuxbsc031: STDOUT: 54609 ?        SLl    0:00 MPI_FastTest.exe
linuxbsc032: STDOUT: 52833 ?        SLl    0:00 MPI_FastTest.exe
$ timex /opt/MPI/openmpi-1.5.3/linux/intel/bin/mpiexec -np 27 --host linuxbsc001,linuxbsc002,linuxbsc003,linuxbsc004,linuxbsc005,linuxbsc006,linuxbsc007,linuxbsc008,linuxbsc009,linuxbsc010,linuxbsc011,linuxbsc014,linuxbsc016,linuxbsc017,linuxbsc018,linuxbsc020,linuxbsc021,linuxbsc022,linuxbsc023,linuxbsc024,linuxbsc025,linuxbsc027,linuxbsc028,linuxbsc029,linuxbsc030,linuxbsc031,linuxbsc032 MPI_FastTest.exe
$ timex /opt/MPI/openmpi-1.5.3/linux/intel/bin/mpiexec -np 27 --host linuxbsc001,linuxbsc002,linuxbsc003,linuxbsc004,linuxbsc005,linuxbsc006,linuxbsc007,linuxbsc008,linuxbsc009,linuxbsc010,linuxbsc011,linuxbsc014,linuxbsc016,linuxbsc017,linuxbsc018,linuxbsc020,linuxbsc021,linuxbsc022,linuxbsc023,linuxbsc024,linuxbsc027,linuxbsc028,linuxbsc029,linuxbsc030,linuxbsc031,linuxbsc032,linuxbsc025 MPI_FastTest.exe
linuxbsc001: STDOUT: 24322 ?        Ss     0:00 
/opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca 
orte_ess_jobid 751435776 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 28 
--hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc002: STDOUT:  2141 ?        Ss     0:00 
/opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca 
orte_ess_jobid 751435776 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 28 
--hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc003: STDOUT: 69265 ?        Ss     0:00 
/opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca 
orte_ess_jobid 751435776 -mca orte_ess_vpid 3 -mca orte_ess_num_procs 28 
--hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc004: STDOUT: 58898 ?        Ss     0:00 
/opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca 
orte_ess_jobid 751435776 -mca orte_ess_vpid 4 -mca orte_ess_num_procs 28 
--hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc005: STDOUT: 65642 ?        Ss     0:00 
/opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca 
orte_ess_jobid 751435776 -mca orte_ess_vpid 5 -mca orte_ess_num_procs 28 
--hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc006: STDOUT: 68254 ?        Ss     0:00 
/opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca 
orte_ess_jobid 751435776 -mca orte_ess_vpid 6 -mca orte_ess_num_procs 28 
--hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc007: STDOUT: 62025 ?        Ss     0:00 
/opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca 
orte_ess_jobid 751435776 -mca orte_ess_vpid 7 -mca orte_ess_num_procs 28 
--hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc008: STDOUT: 54220 ?        Ss     0:00 
/opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca 
orte_ess_jobid 751435776 -mca orte_ess_vpid 8 -mca orte_ess_num_procs 28 
--hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc009: STDOUT: 55481 ?        Ss     0:00 
/opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca 
orte_ess_jobid 751435776 -mca orte_ess_vpid 9 -mca orte_ess_num_procs 28 
--hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc010: STDOUT: 59379 ?        Ss     0:00 
/opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca 
orte_ess_jobid 751435776 -mca orte_ess_vpid 10 -mca orte_ess_num_procs 28 
--hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc011: STDOUT: 58311 ?        Ss     0:00 
/opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca 
orte_ess_jobid 751435776 -mca orte_ess_vpid 11 -mca orte_ess_num_procs 28 
--hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc014: STDOUT: 56012 ?        Ss     0:00 
/opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca 
orte_ess_jobid 751435776 -mca orte_ess_vpid 12 -mca orte_ess_num_procs 28 
--hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc016: STDOUT: 58562 ?        Ss     0:00 
/opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca 
orte_ess_jobid 751435776 -mca orte_ess_vpid 13 -mca orte_ess_num_procs 28 
--hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc017: STDOUT: 54692 ?        Ss     0:00 
/opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca 
orte_ess_jobid 751435776 -mca orte_ess_vpid 14 -mca orte_ess_num_procs 28 
--hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc018: STDOUT: 54186 ?        Ss     0:00 
/opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca 
orte_ess_jobid 751435776 -mca orte_ess_vpid 15 -mca orte_ess_num_procs 28 
--hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc020: STDOUT: 55810 ?        Ss     0:00 
/opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca 
orte_ess_jobid 751435776 -mca orte_ess_vpid 16 -mca orte_ess_num_procs 28 
--hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc021: STDOUT: 54981 ?        Ss     0:00 
/opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca 
orte_ess_jobid 751435776 -mca orte_ess_vpid 17 -mca orte_ess_num_procs 28 
--hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc022: STDOUT: 50031 ?        Ss     0:00 
/opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca 
orte_ess_jobid 751435776 -mca orte_ess_vpid 18 -mca orte_ess_num_procs 28 
--hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc023: STDOUT: 54043 ?        Ss     0:00 
/opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca 
orte_ess_jobid 751435776 -mca orte_ess_vpid 19 -mca orte_ess_num_procs 28 
--hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc024: STDOUT: 51246 ?        Ss     0:00 
/opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca 
orte_ess_jobid 751435776 -mca orte_ess_vpid 20 -mca orte_ess_num_procs 28 
--hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc025: STDOUT: 18574 ?        Ss     0:00 
/opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca 
orte_ess_jobid 751435776 -mca orte_ess_vpid 21 -mca orte_ess_num_procs 28 
--hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc027: STDOUT: 48968 ?        Ss     0:00 
/opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca 
orte_ess_jobid 751435776 -mca orte_ess_vpid 22 -mca orte_ess_num_procs 28 
--hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc028: STDOUT: 52396 ?        Ss     0:00 
/opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca 
orte_ess_jobid 751435776 -mca orte_ess_vpid 23 -mca orte_ess_num_procs 28 
--hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc029: STDOUT: 52779 ?        Ss     0:00 
/opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca 
orte_ess_jobid 751435776 -mca orte_ess_vpid 24 -mca orte_ess_num_procs 28 
--hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc030: STDOUT: 47536 ?        Ss     0:00 
/opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca 
orte_ess_jobid 751435776 -mca orte_ess_vpid 25 -mca orte_ess_num_procs 28 
--hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc031: STDOUT: 54608 ?        Ss     0:00 
/opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca 
orte_ess_jobid 751435776 -mca orte_ess_vpid 26 -mca orte_ess_num_procs 28 
--hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
linuxbsc032: STDOUT: 52832 ?        Ss     0:00 
/opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca 
orte_ess_jobid 751435776 -mca orte_ess_vpid 27 -mca orte_ess_num_procs 28 
--hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh
