Just looking at this output, it appears that Windows is configured in a way 
that prevents the processes from connecting to each other via TCP while on the 
same node, and the shared memory BTL is disqualifying itself - which leaves no 
way for two processes on the same node to communicate.
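
One way to narrow down which transport is actually failing is to force the BTL 
selection explicitly and run the same test with each transport in turn. For 
example (these are the standard MCA options; I haven't verified them against 
the 1.6.2 Windows build specifically):

    mpirun -np 1 -mca btl self,sm -mca btl_base_verbose 100 mpi_scheduler.exe
    mpirun -np 1 -mca btl self,tcp -mca btl_base_verbose 100 mpi_scheduler.exe

Whichever run still fails between the spawned processes points at the 
transport that needs attention.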


> On Jun 7, 2016, at 12:16 PM, Roth, Christopher <cr...@aer.com> wrote:
> 
> I have developed a set of C++ MPI programs for performing a series of 
> scientific calculations.  The master 'scheduler' program spawns off sets of 
> parallelized 'executor' programs using the MPI_Comm_spawn routine; these 
> executors communicate back and forth with the scheduler (only small amounts 
> of information) via MPI_Bcast, MPI_Recv and MPI_Send routines (the 'C' 
> library versions).
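
(Just to make sure I'm reading the setup right: the pattern you describe is 
roughly the one sketched below. This is only an illustration reconstructed 
from your description - not your attached source - and the names, counts, and 
message contents are placeholders.)

    // scheduler (parent) side - sketch only, not the attached mpi_scheduler.cpp
    #include <mpi.h>
    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        MPI_Comm children;                        // intercommunicator to the spawned executors
        MPI_Comm_spawn("mpi_executor.exe", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                       0, MPI_COMM_SELF, &children, MPI_ERRCODES_IGNORE);
        int len = 4;                              // e.g. length of the function string
        // In an intercommunicator broadcast the sending (parent) root passes
        // MPI_ROOT; any other parent ranks would pass MPI_PROC_NULL.
        MPI_Bcast(&len, 1, MPI_INT, MPI_ROOT, children);
        MPI_Finalize();
        return 0;
    }

    // executor (child) side - sketch only, not the attached mpi_executor.cpp
    #include <mpi.h>
    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        MPI_Comm parent;
        MPI_Comm_get_parent(&parent);             // intercomm back to the scheduler
        int len = 0;
        MPI_Bcast(&len, 1, MPI_INT, 0, parent);   // root is rank 0 of the parent group
        MPI_Finalize();
        return 0;
    }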
> 
> This software was originally developed on a multi-core Linux machine using 
> Open MPI v1.5.2, and works extremely well; now I'm attempting to port it to a 
> multi-core Windows 7 machine, using Visual Studio 2012 and the precompiled 
> Open MPI v1.6.2 Windows release.  It all compiles and links properly under 
> VS2012.
> When attempting to run this software on the Windows machine, the scheduler 
> program is able to spawn off the executor programs as intended, but 
> everything chokes when the scheduler sends its initial broadcast.  The 
> behavior differs slightly depending on whether the scheduler is launched via 
> 'mpirun' or run by itself, as shown in the logs below:
> (the warning about the 127.0.0.1 address is benign - there is no loopback 
> interface on Windows)
> 
> C:\Users\cjr\Desktop\mpi_demo>mpirun -np 1 mpi_scheduler.exe
>  scheduler: MPI_Init
> --------------------------------------------------------------------------
> WARNING: An invalid value was given for btl_tcp_if_exclude.  This
> value will be ignored.
> 
>   Local host: sweet1
>   Value:      127.0.0.1/8
>   Message:    Did not find interface matching this subnet
> --------------------------------------------------------------------------
> -->MPI_COMM_WORLD size = 1
> parent: MPI_UNIVERSE_SIZE = 1
> scheduler: calling MPI_Comm_spawn for 4 instances of 'mpi_executor.exe'
>  executor: MPI_Init
>  executor: MPI_Init
>  executor: MPI_Init
>  executor: MPI_Init
> [sweet1][[20141,1],0][..\..\..\openmpi-1.6.2\ompi\mca\btl\tcp\btl_tcp_proc.c:128:..\..\..\openmpi-1.6.2\ompi\mca\btl\tcp
> \btl_tcp_proc.c] mca_base_modex_recv: failed with return value=-13
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications.  This means that no Open MPI device has indicated
> that it can be used to communicate between these processes.  This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other.  This error can sometimes be the result of forgetting to
> specify the "self" BTL.
> 
>   Process 1 ([[20141,1],0]) is on host: sweet1
>   Process 2 ([[20141,2],0]) is on host: sweet1
>   BTLs attempted: tcp sm self
> 
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
>  subtask rank = 1 out of 4
>  subtask rank = 2 out of 4
>  subtask rank = 0 out of 4
>  subtask rank = 3 out of 4
> 
> scheduler: MPI_Comm_spawn completed
>  scheduler broadcasting function string length = 4
> child: MPI_UNIVERSE_SIZE = 4
> child: MPI_UNIVERSE_SIZE = 4
> child: MPI_UNIVERSE_SIZE = 4
> child: MPI_UNIVERSE_SIZE = 4
> Proc0 wait for first broadcast
> Proc1 wait for first broadcast
> Proc2 wait for first broadcast
> Proc3 wait for first broadcast
> [sweet1:6800] *** An error occurred in MPI_Bcast
> [sweet1:6800] *** on communicator
> [sweet1:6800] *** MPI_ERR_INTERN: internal error
> [sweet1:6800] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
> [sweet1:02412] [[20141,0],0]-[[20141,1],0] mca_oob_tcp_msg_recv: readv 
> failed: Unknown error (108)
> [sweet1:02412] 4 more processes have sent help message help-mpi-btl-tcp.txt / 
> invalid if_inexclude
> [sweet1:02412] Set MCA parameter "orte_base_help_aggregate" to 0 to see all 
> help / error messages
> --------------------------------------------------------------------------
> WARNING: A process refused to die!
> 
> Host: sweet1
> PID:  524
> 
> This process may still be running and/or consuming resources.
> 
> --------------------------------------------------------------------------
> [sweet1:02412] [[20141,0],0]-[[20141,2],1] mca_oob_tcp_msg_recv: readv 
> failed: Unknown error (108)
> [sweet1:02412] [[20141,0],0]-[[20141,2],0] mca_oob_tcp_msg_recv: readv 
> failed: Unknown error (108)
> [sweet1:02412] [[20141,0],0]-[[20141,2],2] mca_oob_tcp_msg_recv: readv 
> failed: Unknown error (108)
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 0 with PID 488 on
> node sweet1 exiting improperly. There are two reasons this could occur:
> 
> 1. this process did not call "init" before exiting, but others in
> the job did. This can cause a job to hang indefinitely while it waits
> for all processes to call "init". By rule, if one process calls "init",
> then ALL processes must call "init" prior to termination.
> 
> 2. this process called "init", but exited without calling "finalize".
> By rule, all processes that call "init" MUST call "finalize" prior to
> exiting or it will be considered an "abnormal termination"
> 
> This may have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
> [sweet1:02412] 3 more processes have sent help message help-odls-default.txt 
> / odls-default:could-not-kill
> 
> C:\Users\cjr\Desktop\mpi_demo>
> 
> ====================================================
> 
> C:\Users\cjr\Desktop\mpi_demo>mpi_scheduler.exe
>  scheduler: MPI_Init
> --------------------------------------------------------------------------
> WARNING: An invalid value was given for btl_tcp_if_exclude.  This
> value will be ignored.
> 
>   Local host: sweet1
>   Value:      127.0.0.1/8
>   Message:    Did not find interface matching this subnet
> --------------------------------------------------------------------------
> -->MPI_COMM_WORLD size = 1
> parent: MPI_UNIVERSE_SIZE = 1
> scheduler: calling MPI_Comm_spawn for 4 instances of 'mpi_executor.exe'
>  executor: MPI_Init
>  executor: MPI_Init
>  executor: MPI_Init
>  executor: MPI_Init
> [sweet1:04400] 1 more process has sent help message help-mpi-btl-tcp.txt / 
> invalid if_inexclude
> [sweet1:04400] Set MCA parameter "orte_base_help_aggregate" to 0 to see all 
> help / error messages
>  subtask rank = 2 out of 4
>  subtask rank = 1 out of 4
>  subtask rank = 0 out of 4
>  subtask rank = 3 out of 4
> 
> scheduler: MPI_Comm_spawn completed
>  scheduler broadcasting function string length = 4
> 
> child: MPI_UNIVERSE_SIZE = 4
> child: MPI_UNIVERSE_SIZE = 4
> child: MPI_UNIVERSE_SIZE = 4
> child: MPI_UNIVERSE_SIZE = 4
> Proc0 wait for first broadcast
> Proc1 wait for first broadcast
> Proc2 wait for first broadcast
> Proc3 wait for first broadcast
> 
> [sweet1:04400] 3 more processes have sent help message help-mpi-btl-tcp.txt / 
> invalid if_inexclude
> 
> <<<<***mpi_executor.exe processes are running, but 'hung' while waiting for 
> first broadcast***>>>>
> <<<<***manually killed one of the 'mpi_executor.exe' processes; others 
> subsequently exited***>>>>
> 
> [sweet1:04400] [[22257,0],0]-[[22257,2],3] mca_oob_tcp_msg_recv: readv 
> failed: Unknown error (108)
> --------------------------------------------------------------------------
> WARNING: A process refused to die!
> 
> Host: sweet1
> PID:  568
> 
> This process may still be running and/or consuming resources.
> 
> --------------------------------------------------------------------------
> [sweet1:04400] [[22257,0],0]-[[22257,2],0] mca_oob_tcp_msg_recv: readv 
> failed: Unknown error (108)
> [sweet1:04400] [[22257,0],0]-[[22257,2],1] mca_oob_tcp_msg_recv: readv 
> failed: Unknown error (108)
> [sweet1:04400] 2 more processes have sent help message help-odls-default.txt 
> / odls-default:could-not-kill
> 
> C:\Users\cjr\Desktop\mpi_demo>
> 
> ================================================
> 
> The addition of the mpirun option "-mca btl_tcp_if_exclude none" eliminates 
> the benign 127.0.0.1 warning; adding "-mca btl_base_verbose 100" produces 
> output verifying that the tcp, sm, and self BTL modules all initialize 
> successfully - nothing else seems to be amiss!
> I have also tested this with the firewall completely disabled on the Windows 
> machine, with no change in behavior.
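
(Since the scheduler is also launched directly, without mpirun, it's worth 
noting that MCA parameters can be supplied through OMPI_MCA_* environment 
variables - the usual way to apply them to a singleton launch. For example, 
from the command prompt, with the particular values here only as an 
illustration:

    set OMPI_MCA_btl=self,sm,tcp
    set OMPI_MCA_btl_base_verbose=100
    mpi_scheduler.exe
)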
> 
> I have been unable to determine what the error codes indicate, and am puzzled 
> why the behavior is different when using the 'mpirun' launcher.
> I have attached the prototype scheduler and executor program source code 
> files, as well as files containing the ompi_info output from the Windows 
> installation.
> 
> Please help me figure out what is needed to enable this MPI communication.
> 
> Thanks,
> CJ Roth
> 
> 
> <mpi_scheduler.cpp><mpi_executor.cpp><ompi_info-all.txt><ompi_btl_info.txt>
