Just looking at this output, it would appear that Windows is configured in a way that prevents the procs from connecting to each other via TCP while on the same node, and the shared-memory (sm) BTL is disqualifying itself - which leaves no way for two procs on the same node to communicate.
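Purely as a sketch of what I would poke at first (the 192.168.1.0/24 subnet below is only a placeholder - substitute whatever ipconfig actually reports on the Windows box):

  mpirun -np 1 --mca btl self,sm,tcp --mca btl_tcp_if_include 192.168.1.0/24 mpi_scheduler.exe

and, to list the TCP BTL parameters the 1.6.2 install recognizes along with their current values:

  ompi_info --param btl tcp

No promises that either changes the outcome, but it would at least tell us whether the TCP BTL can be pointed at a usable interface for the on-node connections.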
> On Jun 7, 2016, at 12:16 PM, Roth, Christopher <cr...@aer.com> wrote:
>
> I have developed a set of C++ MPI programs for performing a series of
> scientific calculations. The master 'scheduler' program spawns off sets of
> parallelized 'executor' programs using the MPI_Comm_spawn routine; these
> executors communicate back and forth with the scheduler (only small amounts
> of information) via MPI_Bcast, MPI_Recv and MPI_Send routines (the 'C'
> library versions).
>
> This software was originally developed on a multi-core Linux machine using
> Open MPI v1.5.2, and works extremely well; now I'm attempting to port it to a
> multi-core Windows 7 machine, using Visual Studio 2012 and the precompiled
> Open MPI v1.6.2 Windows release. It all compiles and links properly under
> VS2012.
> When attempting to run this software on the Windows machine, the scheduler
> program is able to spawn off the executor programs as intended, but
> everything chokes when the scheduler sends its initial broadcast. There is
> slightly different behavior when launching the scheduler via 'mpirun', or
> just by itself, as shown in the logs below:
> (the warning about the 127.0.0.1 address is benign - there is no loopback on
> Windows)
>
> C:\Users\cjr\Desktop\mpi_demo>mpirun -np 1 mpi_scheduler.exe
> scheduler: MPI_Init
> --------------------------------------------------------------------------
> WARNING: An invalid value was given for btl_tcp_if_exclude. This
> value will be ignored.
>
> Local host: sweet1
> Value: 127.0.0.1/8
> Message: Did not find interface matching this subnet
> --------------------------------------------------------------------------
> -->MPI_COMM_WORLD size = 1
> parent: MPI_UNIVERSE_SIZE = 1
> scheduler: calling MPI_Comm_spawn for 4 instances of 'mpi_executor.exe'
> executor: MPI_Init
> executor: MPI_Init
> executor: MPI_Init
> executor: MPI_Init
> [sweet1][[20141,1],0][..\..\..\openmpi-1.6.2\ompi\mca\btl\tcp\btl_tcp_proc.c:128:..\..\..\openmpi-1.6.2\ompi\mca\btl\tcp\btl_tcp_proc.c] mca_base_modex_recv: failed with return value=-13
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications. This means that no Open MPI device has indicated
> that it can be used to communicate between these processes. This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other. This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
> Process 1 ([[20141,1],0]) is on host: sweet1
> Process 2 ([[20141,2],0]) is on host: sweet1
> BTLs attempted: tcp sm self
>
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
> subtask rank = 1 out of 4
> subtask rank = 2 out of 4
> subtask rank = 0 out of 4
> subtask rank = 3 out of 4
>
> scheduler: MPI_Comm_spawn completed
> scheduler broadcasting function string length = 4
> child: MPI_UNIVERSE_SIZE = 4
> child: MPI_UNIVERSE_SIZE = 4
> child: MPI_UNIVERSE_SIZE = 4
> child: MPI_UNIVERSE_SIZE = 4
> Proc0 wait for first broadcast
> Proc1 wait for first broadcast
> Proc2 wait for first broadcast
> Proc3 wait for first broadcast
> [sweet1:6800] *** An error occurred in MPI_Bcast
> [sweet1:6800] *** on communicator
> [sweet1:6800] *** MPI_ERR_INTERN: internal error
> [sweet1:6800] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
> [sweet1:02412] [[20141,0],0]-[[20141,1],0] mca_oob_tcp_msg_recv: readv failed: Unknown error (108)
> [sweet1:02412] 4 more processes have sent help message help-mpi-btl-tcp.txt / invalid if_inexclude
> [sweet1:02412] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
> --------------------------------------------------------------------------
> WARNING: A process refused to die!
>
> Host: sweet1
> PID: 524
>
> This process may still be running and/or consuming resources.
>
> --------------------------------------------------------------------------
> [sweet1:02412] [[20141,0],0]-[[20141,2],1] mca_oob_tcp_msg_recv: readv failed: Unknown error (108)
> [sweet1:02412] [[20141,0],0]-[[20141,2],0] mca_oob_tcp_msg_recv: readv failed: Unknown error (108)
> [sweet1:02412] [[20141,0],0]-[[20141,2],2] mca_oob_tcp_msg_recv: readv failed: Unknown error (108)
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 0 with PID 488 on
> node sweet1 exiting improperly. There are two reasons this could occur:
>
> 1. this process did not call "init" before exiting, but others in
> the job did. This can cause a job to hang indefinitely while it waits
> for all processes to call "init". By rule, if one process calls "init",
> then ALL processes must call "init" prior to termination.
>
> 2. this process called "init", but exited without calling "finalize".
> By rule, all processes that call "init" MUST call "finalize" prior to
> exiting or it will be considered an "abnormal termination"
>
> This may have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
> [sweet1:02412] 3 more processes have sent help message help-odls-default.txt / odls-default:could-not-kill
>
> C:\Users\cjr\Desktop\mpi_demo>
>
> ====================================================
>
> C:\Users\cjr\Desktop\mpi_demo>mpi_scheduler.exe
> scheduler: MPI_Init
> --------------------------------------------------------------------------
> WARNING: An invalid value was given for btl_tcp_if_exclude. This
> value will be ignored.
>
> Local host: sweet1
> Value: 127.0.0.1/8
> Message: Did not find interface matching this subnet
> --------------------------------------------------------------------------
> -->MPI_COMM_WORLD size = 1
> parent: MPI_UNIVERSE_SIZE = 1
> scheduler: calling MPI_Comm_spawn for 4 instances of 'mpi_executor.exe'
> executor: MPI_Init
> executor: MPI_Init
> executor: MPI_Init
> executor: MPI_Init
> [sweet1:04400] 1 more process has sent help message help-mpi-btl-tcp.txt / invalid if_inexclude
> [sweet1:04400] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
> subtask rank = 2 out of 4
> subtask rank = 1 out of 4
> subtask rank = 0 out of 4
> subtask rank = 3 out of 4
>
> scheduler: MPI_Comm_spawn completed
> scheduler broadcasting function string length = 4
>
> child: MPI_UNIVERSE_SIZE = 4
> child: MPI_UNIVERSE_SIZE = 4
> child: MPI_UNIVERSE_SIZE = 4
> child: MPI_UNIVERSE_SIZE = 4
> Proc0 wait for first broadcast
> Proc1 wait for first broadcast
> Proc2 wait for first broadcast
> Proc3 wait for first broadcast
>
> [sweet1:04400] 3 more processes have sent help message help-mpi-btl-tcp.txt / invalid if_inexclude
>
> <<<<***mpi_executor.exe processes are running, but 'hung' while waiting for first broadcast***>>>>
> <<<<***manually killed one of the 'mpi_executor.exe' processes; others subsequently exited***>>>>
>
> [sweet1:04400] [[22257,0],0]-[[22257,2],3] mca_oob_tcp_msg_recv: readv failed: Unknown error (108)
> --------------------------------------------------------------------------
> WARNING: A process refused to die!
>
> Host: sweet1
> PID: 568
>
> This process may still be running and/or consuming resources.
>
> --------------------------------------------------------------------------
> [sweet1:04400] [[22257,0],0]-[[22257,2],0] mca_oob_tcp_msg_recv: readv failed: Unknown error (108)
> [sweet1:04400] [[22257,0],0]-[[22257,2],1] mca_oob_tcp_msg_recv: readv failed: Unknown error (108)
> [sweet1:04400] 2 more processes have sent help message help-odls-default.txt / odls-default:could-not-kill
>
> C:\Users\cjr\Desktop\mpi_demo>
>
> ================================================
>
> The addition of the mpirun option "-mca btl_tcp_if_exclude none" eliminates
> the benign 127.0.0.1 warning; the option "-mca btl_base_verbose 100" adds
> output that verifies that the tcp, sm and self btl modules are successfully
> initialized - nothing else seems to be amiss!
> I have also tested this with the firewall completely disabled on the Windows
> machine, with no change in behavior.
>
> I have been unable to determine what the error codes indicate, and am puzzled
> why the behavior is different when using the 'mpirun' launcher.
> I have attached the prototype scheduler and executor program source code
> files, as well as files containing the Windows installation ompi information.
>
> Please help me figure out what is needed to enable this MPI communication.
>
> Thanks,
> CJ Roth
> <mpi_scheduler.cpp><mpi_executor.cpp><ompi_info-all.txt><ompi_btl_info.txt>
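For anyone trying to reproduce this without the attachments: the spawn/broadcast pattern described above boils down to roughly the following. This is a hypothetical minimal sketch, not the attached mpi_scheduler.cpp / mpi_executor.cpp, but it exercises the same MPI_Comm_spawn plus intercommunicator MPI_Bcast path.

// scheduler_sketch.cpp - hypothetical parent; spawns 4 executors and
// broadcasts a single int to them over the spawn intercommunicator
#include <mpi.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    char cmd[] = "mpi_executor.exe";
    MPI_Comm children;
    MPI_Comm_spawn(cmd, MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &children, MPI_ERRCODES_IGNORE);

    // On an intercommunicator the sending root passes MPI_ROOT.
    int len = 4;
    MPI_Bcast(&len, 1, MPI_INT, MPI_ROOT, children);

    MPI_Comm_disconnect(&children);
    MPI_Finalize();
    return 0;
}

// executor_sketch.cpp - hypothetical child; obtains the parent
// intercommunicator and receives the broadcast from the scheduler
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    MPI_Comm parent;
    MPI_Comm_get_parent(&parent);

    int len = 0;
    MPI_Bcast(&len, 1, MPI_INT, 0, parent);   // root = rank 0 of the parent job
    std::printf("executor received length %d\n", len);

    MPI_Comm_disconnect(&parent);
    MPI_Finalize();
    return 0;
}

If this stripped-down pair also hangs or aborts in MPI_Bcast on the Windows box, that would point at the BTL/interface selection rather than anything in the application code.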