Hi Siegmar,

My bad, there was a typo in my reply. I really meant

> > what if you run
> > mpiexec --host loki:2,exin:1 -np 3 hello_1_mpi
but you also tried that and it did not help.

I could not find anything in your logs that suggests mpiexec tries to start
5 MPI tasks. Did I miss something? I will try to reproduce the issue myself.

Cheers,

Gilles

----- Original Message -----
> Hi Gilles,
>
> > what if you run
> > mpiexec --host loki:1,exin:1 -np 3 hello_1_mpi
>
> I need as many slots as processes, so I use "-np 2".
> "mpiexec --host loki,exin -np 2 hello_1_mpi" works as well. The command
> breaks if I use at least "-np 3" and distribute the processes across at
> least two machines.
>
> loki hello_1 118 mpiexec --host loki:1,exin:1 -np 2 hello_1_mpi
> Process 0 of 2 running on loki
> Process 1 of 2 running on exin
> Now 1 slave tasks are sending greetings.
> Greetings from task 1:
>   message type:     3
>   msg length:       131 characters
>   message:
>     hostname:         exin
>     operating system: Linux
>     release:          4.4.49-92.11-default
>     processor:        x86_64
> loki hello_1 119
>
> > are loki and exin different? (os, sockets, cores)
>
> Yes, loki is a real machine and exin is a virtual one. "exin" uses a newer
> kernel.
>
> loki fd1026 108 uname -a
> Linux loki 4.4.38-93-default #1 SMP Wed Dec 14 12:59:43 UTC 2016 (2d3e9d4)
> x86_64 x86_64 x86_64 GNU/Linux
>
> loki fd1026 109 ssh exin uname -a
> Linux exin 4.4.49-92.11-default #1 SMP Fri Feb 17 08:29:30 UTC 2017 (8f9478a)
> x86_64 x86_64 x86_64 GNU/Linux
> loki fd1026 110
>
> The number of sockets and cores is identical, but the processor types are
> different, as you can see at the end of my previous email. "loki" uses two
> "Intel(R) Xeon(R) CPU E5-2620 v3" processors and "exin" two "Intel Core
> Processor (Haswell, no TSX)" from QEMU. I can provide a PDF file with both
> topologies (89 K) if you are interested in the output from lstopo. I've
> added some runs. Most interesting in my opinion are the last two,
> "mpiexec --host exin:2,loki:3 -np 3 hello_1_mpi" and
> "mpiexec -np 3 --host exin:2,loki:3 hello_1_mpi".
> Why does mpiexec create five processes although I've asked for only three
> processes? Why do I have to break the program with <Ctrl-c> for the first
> of the above commands?
>
> loki hello_1 110 mpiexec --host loki:2,exin:1 -np 3 hello_1_mpi
> --------------------------------------------------------------------------
> There are not enough slots available in the system to satisfy the 3 slots
> that were requested by the application:
>   hello_1_mpi
>
> Either request fewer slots for your application, or make more slots
> available for use.
> --------------------------------------------------------------------------
>
> loki hello_1 111 mpiexec --host exin:3 -np 3 hello_1_mpi
> Process 0 of 3 running on exin
> Process 1 of 3 running on exin
> Process 2 of 3 running on exin
> ...
>
> loki hello_1 115 mpiexec --host exin:2,loki:3 -np 3 hello_1_mpi
> Process 1 of 3 running on loki
> Process 0 of 3 running on loki
> Process 2 of 3 running on loki
> ...
>
> Process 0 of 3 running on exin
> Process 1 of 3 running on exin
> [exin][[52173,1],1][../../../../../openmpi-v3.x-201705250239-d5200ea/opal/mca/btl/tcp/btl_tcp_endpoint.c:794:mca_btl_tcp_endpoint_complete_connect]
> connect() to 193.xxx.xxx.xxx failed: Connection refused (111)
>
> ^Cloki hello_1 116
>
> loki hello_1 116 mpiexec -np 3 --host exin:2,loki:3 hello_1_mpi
> Process 0 of 3 running on loki
> Process 2 of 3 running on loki
> Process 1 of 3 running on loki
> ...
> Process 1 of 3 running on exin
> Process 0 of 3 running on exin
> [exin][[51638,1],1][../../../../../openmpi-v3.x-201705250239-d5200ea/opal/mca/btl/tcp/btl_tcp_endpoint.c:590:mca_btl_tcp_endpoint_recv_blocking]
> recv(16, 0/8) failed: Connection reset by peer (104)
> [exin:31909]
> ../../../../../openmpi-v3.x-201705250239-d5200ea/ompi/mca/pml/ob1/pml_ob1_sendreq.c:191
> FATAL
> loki hello_1 117
>
> Do you need anything else?
>
> Kind regards and thank you very much for your help
>
> Siegmar
>
> > Cheers,
> >
> > Gilles
> >
> > ----- Original Message -----
> >> Hi,
> >>
> >> I have installed openmpi-v3.x-201705250239-d5200ea on my "SUSE Linux
> >> Enterprise Server 12.2 (x86_64)" with Sun C 5.14 and gcc-7.1.0.
> >> Depending on the machine that I use to start my processes, I have
> >> a problem with "--host" for versions "v3.x" and "master", while
> >> everything works as expected with earlier versions.
> >>
> >> loki hello_1 111 mpiexec -np 3 --host loki:2,exin hello_1_mpi
> >> --------------------------------------------------------------------------
> >> There are not enough slots available in the system to satisfy the 3 slots
> >> that were requested by the application:
> >>   hello_1_mpi
> >>
> >> Either request fewer slots for your application, or make more slots
> >> available for use.
> >> --------------------------------------------------------------------------
> >>
> >> Everything is ok if I use the same command on "exin".
> >>
> >> exin fd1026 107 mpiexec -np 3 --host loki:2,exin hello_1_mpi
> >> Process 0 of 3 running on loki
> >> Process 1 of 3 running on loki
> >> Process 2 of 3 running on exin
> >> ...
> >>
> >> Everything is also ok if I use openmpi-v2.x-201705260340-58c6b3c on "loki".
> >>
> >> loki hello_1 114 which mpiexec
> >> /usr/local/openmpi-2.1.2_64_cc/bin/mpiexec
> >> loki hello_1 115 mpiexec -np 3 --host loki:2,exin hello_1_mpi
> >> Process 0 of 3 running on loki
> >> Process 1 of 3 running on loki
> >> Process 2 of 3 running on exin
> >> ...
> >>
> >> "exin" is a virtual machine on QEMU, so it uses a slightly different
> >> processor architecture, e.g., it has no L3 cache but larger L2 caches.
> >>
> >> loki fd1026 117 cat /proc/cpuinfo | grep -e "model name" -e "physical id" -e "cpu cores" -e "cache size" | sort | uniq
> >> cache size   : 15360 KB
> >> cpu cores    : 6
> >> model name   : Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
> >> physical id  : 0
> >> physical id  : 1
> >>
> >> loki fd1026 118 ssh exin cat /proc/cpuinfo | grep -e "model name" -e "physical id" -e "cpu cores" -e "cache size" | sort | uniq
> >> cache size   : 4096 KB
> >> cpu cores    : 6
> >> model name   : Intel Core Processor (Haswell, no TSX)
> >> physical id  : 0
> >> physical id  : 1
> >>
> >> Any ideas what's different in the newer versions of Open MPI? Is the new
> >> behavior intended? I would be grateful if somebody could fix the problem,
> >> so that "mpiexec -np 3 --host loki:2,exin hello_1_mpi" prints my messages
> >> in versions "3.x" and "master" as well, no matter which machine the
> >> programs are started on. Do you need anything else? Thank you very much
> >> for any help in advance.
> >>
> >> Kind regards
> >>
> >> Siegmar

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
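For reference, here is a minimal sketch of an MPI hello-world in the spirit of
the hello_1_mpi used throughout this thread. The actual source is not part of
the thread, so this is only an assumption of its structure: it reproduces the
"Process X of N running on <host>" lines seen above, but not the greeting
messages exchanged between the ranks.

/*
 * hello_1_mpi.c -- minimal sketch, NOT Siegmar's actual program.
 * Each rank reports its rank, the communicator size, and the host
 * it was placed on, which is enough to see where mpiexec put the
 * processes.
 *
 * Build: mpicc -o hello_1_mpi hello_1_mpi.c
 * Run:   mpiexec --host loki:2,exin:1 -np 3 ./hello_1_mpi
 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int  rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process' rank      */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of ranks   */
    MPI_Get_processor_name(host, &len);     /* host this rank runs on  */

    printf("Process %d of %d running on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}

With Open MPI 3.x, "--host loki:2,exin:1" is meant to declare 2 + 1 = 3 slots,
so "-np 3" should fit; the "not enough slots" error and the five started tasks
shown above are what make the reported behavior look like a regression
relative to the v2.x launcher.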