> On Jun 8, 2016, at 4:30 AM, Roth, Christopher <cr...@aer.com> wrote:
> 
> What part of this output indicates this non-communicative configuration?

--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[20141,1],0]) is on host: sweet1
  Process 2 ([[20141,2],0]) is on host: sweet1
  BTLs attempted: tcp sm self

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------

Both procs are on the same host. Since they cannot communicate, that means 
(a) the shared memory component (sm) could not be used, and (b) the TCP 
subsystem did not provide a usable address for the two procs to reach each 
other. The former could mean that there wasn’t enough room in the tmp 
directory (where sm places its shared backing file), and the latter indicates 
that the TCP subsystem isn’t configured to allow connections from its own 
local IP address.
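
One way to narrow down which of the two it is (a generic Open MPI suggestion, 
untested on Windows) would be to restrict the BTL list explicitly and keep the 
verbose output turned on; "btl" and "orte_tmpdir_base" below are the standard 
MCA parameter names, and the tmp path is just a placeholder:

  mpirun -mca btl self,sm -mca btl_base_verbose 100 -np 1 mpi_scheduler.exe
  mpirun -mca btl self,sm -mca orte_tmpdir_base C:\some\writable\dir -np 1 mpi_scheduler.exe

If the first command still aborts but the second works, the session directory 
under tmp was the problem; if neither works, sm is being ruled out for some 
other reason and TCP remains the only candidate between the procs.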

I’m afraid I don’t know anything about Windows configuration.


> Please recall, this is using the precompiled OpenMpi Windows installation
> 
> When the 'verbose' option is added, I see this sequence of output for the 
> scheduler and each of the executor processes:
> ------
> [sweet1:06412] mca: base: components_open: Looking for btl components
> [sweet1:06412] mca: base: components_open: opening btl components
> [sweet1:06412] mca: base: components_open: found loaded component tcp
> [sweet1:06412] mca: base: components_open: component tcp register function 
> successful
> [sweet1:06412] mca: base: components_open: component tcp open function 
> successful
> [sweet1:06412] mca: base: components_open: found loaded component sm
> [sweet1:06412] mca: base: components_open: component sm has no register 
> function
> [sweet1:06412] mca: base: components_open: component sm open function 
> successful
> [sweet1:06412] mca: base: components_open: found loaded component self
> [sweet1:06412] mca: base: components_open: component self has no register 
> function
> [sweet1:06412] mca: base: components_open: component self open function 
> successful
> [sweet1:06412] select: initializing btl component tcp
> [sweet1:06412] select: init of component tcp returned success
> [sweet1:06412] select: initializing btl component sm
> [sweet1:06412] select: init of component sm returned success
> [sweet1:06412] select: initializing btl component self
> [sweet1:06412] select: init of component self returned success
> -------
> 
> This output appears to show that the tcp, sm and self BTL components are all 
> available, but this is contradicted by the error message shown in the initial 
> post ("At least one pair of MPI processes are unable to reach each other for 
> MPI communications....")
> 
> If there is some sort of misconfiguration present, do you have a suggestion 
> for correcting the situation?
> 
> From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain 
> [r...@open-mpi.org]
> Sent: Tuesday, June 07, 2016 3:39 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] Processes unable to communicate when using 
> MPI_Comm_spawn on Windows
> 
> Just looking at this output, it would appear that Windows is configured in a 
> way that prevents the procs from connecting to each other via TCP while on 
> the same node, and shared memory is disqualifying itself - which leaves no 
> way for two procs on the same node to communicate.
> 
> 
>> On Jun 7, 2016, at 12:16 PM, Roth, Christopher <cr...@aer.com 
>> <mailto:cr...@aer.com>> wrote:
>> 
>> I have developed a set of C++ MPI programs for performing a series of 
>> scientific calculations.  The master 'scheduler' program spawns off sets of 
>> parallelized 'executor' programs using the MPI_Comm_spawn routine; these 
>> executors communicate back and forth with the scheduler (only small amounts 
>> of information) via MPI_Bcast, MPI_Recv and MPI_Send routines (the 'C' 
>> library versions).
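>> 
>> In outline, the spawn/broadcast pattern is roughly the following (a 
>> trimmed-down sketch, not the attached source; the broadcast payload here is 
>> just an int for illustration):
>> 
>>   #include <mpi.h>
>> 
>>   int main(int argc, char **argv)
>>   {
>>       MPI_Init(&argc, &argv);
>> 
>>       MPI_Comm parent;
>>       MPI_Comm_get_parent(&parent);
>> 
>>       if (parent == MPI_COMM_NULL) {
>>           /* scheduler: spawn the executors, then broadcast to them over the intercomm */
>>           MPI_Comm children;
>>           int errcodes[4];
>>           MPI_Comm_spawn("mpi_executor.exe", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
>>                          0, MPI_COMM_SELF, &children, errcodes);
>>           int len = 4;
>>           MPI_Bcast(&len, 1, MPI_INT, MPI_ROOT, children);  /* root side of the intercomm bcast */
>>       } else {
>>           /* executor: receive the scheduler's broadcast from the parent intercomm */
>>           int len = 0;
>>           MPI_Bcast(&len, 1, MPI_INT, 0, parent);
>>       }
>> 
>>       MPI_Finalize();
>>       return 0;
>>   }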
>> 
>> This software was originally developed on a multi-core Linux machine using 
>> OpenMpi v1.5.2, and works extremely well; now I'm attempting to port it to a 
>> multi-core Windows 7 machine, using Visual Studio 2012 and the precompiled 
>> OpenMpi v1.6.2 Windows release.  It all compiles and links properly under 
>> VS2012.
>> When attempting to run this software on the Windows machine, the scheduler 
>> program is able to spawn off the executor programs as intended, but 
>> everything chokes when the scheduler sends its initial broadcast.  There is 
>> slightly different behavior when launching the scheduler via 'mpirun' versus 
>> running it by itself, as shown in the logs below:
>> (the warning about the 127.0.0.1 address is benign - there is no loopback on 
>> Windows)
>> 
>> C:\Users\cjr\Desktop\mpi_demo>mpirun -np 1 mpi_scheduler.exe
>>  scheduler: MPI_Init
>> --------------------------------------------------------------------------
>> WARNING: An invalid value was given for btl_tcp_if_exclude.  This
>> value will be ignored.
>> 
>>   Local host: sweet1
>>   Value:      127.0.0.1/8
>>   Message:    Did not find interface matching this subnet
>> --------------------------------------------------------------------------
>> -->MPI_COMM_WORLD size = 1
>> parent: MPI_UNIVERSE_SIZE = 1
>> scheduler: calling MPI_Comm_spawn for 4 instances of 'mpi_executor.exe'
>>  executor: MPI_Init
>>  executor: MPI_Init
>>  executor: MPI_Init
>>  executor: MPI_Init
>> [sweet1][[20141,1],0][..\..\..\openmpi-1.6.2\ompi\mca\btl\tcp\btl_tcp_proc.c:128:..\..\..\openmpi-1.6.2\ompi\mca\btl\tcp
>> \btl_tcp_proc.c] mca_base_modex_recv: failed with return value=-13
>> --------------------------------------------------------------------------
>> At least one pair of MPI processes are unable to reach each other for
>> MPI communications.  This means that no Open MPI device has indicated
>> that it can be used to communicate between these processes.  This is
>> an error; Open MPI requires that all MPI processes be able to reach
>> each other.  This error can sometimes be the result of forgetting to
>> specify the "self" BTL.
>> 
>>   Process 1 ([[20141,1],0]) is on host: sweet1
>>   Process 2 ([[20141,2],0]) is on host: sweet1
>>   BTLs attempted: tcp sm self
>> 
>> Your MPI job is now going to abort; sorry.
>> --------------------------------------------------------------------------
>>  subtask rank = 1 out of 4
>>  subtask rank = 2 out of 4
>>  subtask rank = 0 out of 4
>>  subtask rank = 3 out of 4
>> 
>> scheduler: MPI_Comm_spawn completed
>>  scheduler broadcasting function string length = 4
>> child: MPI_UNIVERSE_SIZE = 4
>> child: MPI_UNIVERSE_SIZE = 4
>> child: MPI_UNIVERSE_SIZE = 4
>> child: MPI_UNIVERSE_SIZE = 4
>> Proc0 wait for first broadcast
>> Proc1 wait for first broadcast
>> Proc2 wait for first broadcast
>> Proc3 wait for first broadcast
>> [sweet1:6800] *** An error occurred in MPI_Bcast
>> [sweet1:6800] *** on communicator
>> [sweet1:6800] *** MPI_ERR_INTERN: internal error
>> [sweet1:6800] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
>> [sweet1:02412] [[20141,0],0]-[[20141,1],0] mca_oob_tcp_msg_recv: readv 
>> failed: Unknown error (108)
>> [sweet1:02412] 4 more processes have sent help message help-mpi-btl-tcp.txt 
>> / invalid if_inexclude
>> [sweet1:02412] Set MCA parameter "orte_base_help_aggregate" to 0 to see all 
>> help / error messages
>> --------------------------------------------------------------------------
>> WARNING: A process refused to die!
>> 
>> Host: sweet1
>> PID:  524
>> 
>> This process may still be running and/or consuming resources.
>> 
>> --------------------------------------------------------------------------
>> [sweet1:02412] [[20141,0],0]-[[20141,2],1] mca_oob_tcp_msg_recv: readv 
>> failed: Unknown error (108)
>> [sweet1:02412] [[20141,0],0]-[[20141,2],0] mca_oob_tcp_msg_recv: readv 
>> failed: Unknown error (108)
>> [sweet1:02412] [[20141,0],0]-[[20141,2],2] mca_oob_tcp_msg_recv: readv 
>> failed: Unknown error (108)
>> --------------------------------------------------------------------------
>> mpirun has exited due to process rank 0 with PID 488 on
>> node sweet1 exiting improperly. There are two reasons this could occur:
>> 
>> 1. this process did not call "init" before exiting, but others in
>> the job did. This can cause a job to hang indefinitely while it waits
>> for all processes to call "init". By rule, if one process calls "init",
>> then ALL processes must call "init" prior to termination.
>> 
>> 2. this process called "init", but exited without calling "finalize".
>> By rule, all processes that call "init" MUST call "finalize" prior to
>> exiting or it will be considered an "abnormal termination"
>> 
>> This may have caused other processes in the application to be
>> terminated by signals sent by mpirun (as reported here).
>> --------------------------------------------------------------------------
>> [sweet1:02412] 3 more processes have sent help message help-odls-default.txt 
>> / odls-default:could-not-kill
>> 
>> C:\Users\cjr\Desktop\mpi_demo>
>> 
>> ====================================================
>> 
>> C:\Users\cjr\Desktop\mpi_demo>mpi_scheduler.exe
>>  scheduler: MPI_Init
>> --------------------------------------------------------------------------
>> WARNING: An invalid value was given for btl_tcp_if_exclude.  This
>> value will be ignored.
>> 
>>   Local host: sweet1
>>   Value:      127.0.0.1/8
>>   Message:    Did not find interface matching this subnet
>> --------------------------------------------------------------------------
>> -->MPI_COMM_WORLD size = 1
>> parent: MPI_UNIVERSE_SIZE = 1
>> scheduler: calling MPI_Comm_spawn for 4 instances of 'mpi_executor.exe'
>>  executor: MPI_Init
>>  executor: MPI_Init
>>  executor: MPI_Init
>>  executor: MPI_Init
>> [sweet1:04400] 1 more process has sent help message help-mpi-btl-tcp.txt / 
>> invalid if_inexclude
>> [sweet1:04400] Set MCA parameter "orte_base_help_aggregate" to 0 to see all 
>> help / error messages
>>  subtask rank = 2 out of 4
>>  subtask rank = 1 out of 4
>>  subtask rank = 0 out of 4
>>  subtask rank = 3 out of 4
>> 
>> scheduler: MPI_Comm_spawn completed
>>  scheduler broadcasting function string length = 4
>> 
>> child: MPI_UNIVERSE_SIZE = 4
>> child: MPI_UNIVERSE_SIZE = 4
>> child: MPI_UNIVERSE_SIZE = 4
>> child: MPI_UNIVERSE_SIZE = 4
>> Proc0 wait for first broadcast
>> Proc1 wait for first broadcast
>> Proc2 wait for first broadcast
>> Proc3 wait for first broadcast
>> 
>> [sweet1:04400] 3 more processes have sent help message help-mpi-btl-tcp.txt 
>> / invalid if_inexclude
>> 
>> <<<<***mpi_executor.exe processes are running, but 'hung' while waiting for 
>> first broadcast***>>>>
>> <<<<***manually killed one of the 'mpi_executor.exe' processes; others 
>> subsequently exited***>>>>
>> 
>> [sweet1:04400] [[22257,0],0]-[[22257,2],3] mca_oob_tcp_msg_recv: readv 
>> failed: Unknown error (108)
>> --------------------------------------------------------------------------
>> WARNING: A process refused to die!
>> 
>> Host: sweet1
>> PID:  568
>> 
>> This process may still be running and/or consuming resources.
>> 
>> --------------------------------------------------------------------------
>> [sweet1:04400] [[22257,0],0]-[[22257,2],0] mca_oob_tcp_msg_recv: readv 
>> failed: Unknown error (108)
>> [sweet1:04400] [[22257,0],0]-[[22257,2],1] mca_oob_tcp_msg_recv: readv 
>> failed: Unknown error (108)
>> [sweet1:04400] 2 more processes have sent help message help-odls-default.txt 
>> / odls-default:could-not-kill
>> 
>> C:\Users\cjr\Desktop\mpi_demo>
>> 
>> ================================================
>> 
>> The addition of the mpirun option "-mca btl_tcp_if_exclude none" eliminates 
>> the benign 127.0.0.1 warning; the option "-mca btl_base_verbose 100" adds 
>> output verifying that the tcp, sm and self BTL modules are successfully 
>> initialized (full command shown below) - nothing else seems to be amiss!
>> I have also tested this with the firewall completely disabled on the Windows 
>> machine, with no change in behavior.
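>> 
>> For completeness, the invocation with those options is along these lines 
>> (same paths as in the logs above; nothing here beyond the options already 
>> mentioned):
>> 
>>   C:\Users\cjr\Desktop\mpi_demo>mpirun -mca btl_tcp_if_exclude none -mca btl_base_verbose 100 -np 1 mpi_scheduler.exe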
>> 
>> I have been unable to determine what the error codes indicate, and am 
>> puzzled why the behavior is different when using the 'mpirun' launcher.
>> I have attached the prototype scheduler and executor program source code 
>> files, as well as files containing the ompi_info output for the Windows 
>> installation.
>> 
>> Please help me figure out what is needed to enable this MPI communication.
>> 
>> Thanks,
>> CJ Roth
>> 
>> 
>> <mpi_scheduler.cpp><mpi_executor.cpp><ompi_info-all.txt><ompi_btl_info.txt>
