Christopher, just to be clear, MPI_Comm_spawn is *not* basic functionality. Also, it might work on older versions of Windows (XP, for example).
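If you do ever decide to hack on it, the change I suggest in my other mail below (print the raw WSAGetLastError() value, then add the missing mapping in opal/include/opal/opal_socket_errno.h) would look roughly like this sketch; the commented case is only a placeholder until the printed value identifies which code is actually unmapped:

static __inline int opal_get_socket_errno(void) {
    int ret = WSAGetLastError();
    switch (ret) {
    case WSAEINTR: return EINTR;
    /* ... existing mappings unchanged ... */

    /* once the printf below reveals the unmapped value, add a case for it
       here, e.g. "case WSAE<xxx>: return E<xxx>;" (placeholder, not a real name) */

    default:
        printf("Feature not implemented: WSA error %d (%d %s)\n",
               ret, __LINE__, __FILE__);
        return OPAL_ERROR;
    };
}
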
You might want to report this issue to whoever provided this Open MPI pre-compiled library. Another option is to use Cygwin; it provides a fairly recent Open MPI and the maintainer is active. Other options include Linux (you can even run it in a virtual machine) or OS X.

Cheers,

Gilles

On Thursday, June 9, 2016, Roth, Christopher <cr...@aer.com> wrote:

> Thanks for the info, Gilles.
> Being relatively new to MPI, I was not aware that 'sm' did not work with intercommunicators - I had assumed it was an option if the others were not available.
>
> I am running as an admin on this test machine. When adding the option '-mca btl_tcp_port_min_v4 2000', a higher port number is used, but that does not alter the program behavior at all.
>
> Given that the initial development was on Linux using OpenMpi v1.5, I would like to assume the Windows implementation would have mostly equivalent features, and would then have been improved in v1.6. Apparently that isn't true...
> It is rather disappointing that a seemingly basic MPI communication functionality is broken like this under Windows, even if it is an older version.
> Hacking on the Windows OpenMPI code is a rabbit hole that I do not want to go down, for numerous reasons.
>
> I have briefly explored alternate Windows MPI libraries: the Windows version of MPICH (from Microsoft) has not implemented MPI_Comm_spawn; Intel MPI has a licensing fee. Do you have any other alternatives to suggest?
>
> ------------------------------
> *From:* users [users-boun...@open-mpi.org] on behalf of Gilles Gouaillardet [gil...@rist.or.jp]
> *Sent:* Wednesday, June 08, 2016 7:58 PM
> *To:* Open MPI Users
> *Subject:* Re: [OMPI users] Processes unable to communicate when using MPI_Comm_spawn on Windows
>
> Christopher,
>
> The sm btl does not work with intercommunicators and hence disqualifies itself.
> I guess this is what you interpreted as 'partially working'.
>
> I am surprised you are using a privileged port (260 < 1024) - are you running as an admin?
>
> Open MPI is no longer supported on Windows, and the 1.6 series is pretty antique these days...
>
> Regardless of this, the source code points to
>
> static __inline int opal_get_socket_errno(void) {
>     int ret = WSAGetLastError();
>     switch (ret) {
>     case WSAEINTR: return EINTR;
>     ...
>     default: printf("Feature not implemented: %d %s\n", __LINE__, __FILE__); return OPAL_ERROR;
>     };
> }
>
> At first, it is worth printing (ret) when the feature is not implemented.
> Then you can hack this part and add the missing case.
> Recent Windows (7) might use a newer error code that was not available on older versions (XP).
>
> Cheers,
>
> Gilles
>
> On 6/9/2016 1:51 AM, Roth, Christopher wrote:
> Well, that obvious error message states the basic problem - I was hoping you had noticed a detail in the ompi_info output that would point to a reason for it.
>
> Further test runs with the options '-mca btl tcp,self' (excluding 'sm' from the mix) and '-mca btl_base_verbose 100' supply some more information:
> ------
> [sweet1:04556] btl: tcp: attempting to connect() to address 10.3.2.109 on port 260
> [sweet1:04556] btl: tcp: attempting to connect() to address 10.3.2.109 on port 260
> ------
> The IP address is the host machine's. The process ID corresponds to the first of the executor programs.
> The programs now hang at that point (within the scheduler's MPI_Comm_spawn call and the executors' MPI_Init calls), and have to be manually killed.
>
> Yet another test, adding '-mca mpi_preconnect_mpi 1' (along with the other two added arguments), gives more info:
> ------
> [sweet1:04976] btl: tcp: attempting to connect() to address 10.3.2.109 on port 260
> [sweet1:04516] btl: tcp: attempting to connect() to address 10.3.2.109 on port 260
> [sweet1:03824] btl: tcp: attempting to connect() to address 10.3.2.109 on port 260
>
> [sweet1][[17613,2],1][..\..\..\openmpi-1.6.2\ompi\mca\btl\tcp\btl_tcp_endpoint.c:486:..\..\..\openmpi-1.6.2\ompi\mca\btl\tcp\btl_tcp_endpoint.c] received unexpected process identifier [[17613,2],0]
>
> [sweet1][[17613,2],0][..\..\..\openmpi-1.6.2\ompi\mca\btl\tcp\btl_tcp_frag.c:215:..\..\..\openmpi-1.6.2\ompi\mca\btl\tcp\btl_tcp_frag.c] Feature not implemented: 130 D:/temp/OpenMPI/openmpi-1.6.2/opal/include\opal/opal_socket_errno.h
> Feature not implemented: 130 D:/temp/OpenMPI/openmpi-1.6.2/opal/include\opal/opal_socket_errno.h
> mca_btl_tcp_frag_recv: readv failed: Unknown error (-1)
> ------
> With the 'preconnect' option, it sets up the TCP links for all of the executor processes, but then runs into an actual error about some function not being implemented. This option is not required, but I had to give it a whirl.
>
> All of these test runs have the same behavior when performed with and without the firewall active.
>
> The fact that the executor programs don't get past the MPI_Init call when 'sm' is excluded from the btl set implies that 'sm' is at least partially working.
>
> ------------------------------
> *From:* users [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
> *Sent:* Wednesday, June 08, 2016 10:47 AM
> *To:* Open MPI Users
> *Subject:* Re: [OMPI users] Processes unable to communicate when using MPI_Comm_spawn on Windows
>
> On Jun 8, 2016, at 4:30 AM, Roth, Christopher <cr...@aer.com> wrote:
>
> What part of this output indicates this non-communicative configuration?
>
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications. This means that no Open MPI device has indicated
> that it can be used to communicate between these processes. This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other. This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
> Process 1 ([[20141,1],0]) is on host: sweet1
> Process 2 ([[20141,2],0]) is on host: sweet1
> BTLs attempted: tcp sm self
>
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
>
> Both procs are on the same host. Since they cannot communicate, it means that (a) the shared memory component (sm) was unable to be used, and (b) the TCP subsystem did not provide a usable address for the two procs to reach each other. The former could mean that there wasn't enough room in the tmp directory, and the latter indicates that the TCP subsystem isn't configured to allow connections from its own local IP address.
>
> I don't know anything about Windows configuration, I'm afraid.
>
> Please recall, this is using the precompiled OpenMpi Windows installation.
>
> When the 'verbose' option is added, I see this sequence of output for the scheduler and each of the executor processes:
> ------
> [sweet1:06412] mca: base: components_open: Looking for btl components
> [sweet1:06412] mca: base: components_open: opening btl components
> [sweet1:06412] mca: base: components_open: found loaded component tcp
> [sweet1:06412] mca: base: components_open: component tcp register function successful
> [sweet1:06412] mca: base: components_open: component tcp open function successful
> [sweet1:06412] mca: base: components_open: found loaded component sm
> [sweet1:06412] mca: base: components_open: component sm has no register function
> [sweet1:06412] mca: base: components_open: component sm open function successful
> [sweet1:06412] mca: base: components_open: found loaded component self
> [sweet1:06412] mca: base: components_open: component self has no register function
> [sweet1:06412] mca: base: components_open: component self open function successful
> [sweet1:06412] select: initializing btl component tcp
> [sweet1:06412] select: init of component tcp returned success
> [sweet1:06412] select: initializing btl component sm
> [sweet1:06412] select: init of component sm returned success
> [sweet1:06412] select: initializing btl component self
> [sweet1:06412] select: init of component self returned success
> -------
>
> This output appears to show that the btl components for TCP, SM and Self are all available, but this is contradicted by the error message shown in the initial post ("At least one pair of MPI processes are unable to reach each other for MPI communications....").
>
> If there is some sort of misconfiguration present, do you have a suggestion for correcting the situation?
>
> ------------------------------
> *From:* users [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
> *Sent:* Tuesday, June 07, 2016 3:39 PM
> *To:* Open MPI Users
> *Subject:* Re: [OMPI users] Processes unable to communicate when using MPI_Comm_spawn on Windows
>
> Just looking at this output, it would appear that Windows is configured in a way that prevents the procs from connecting to each other via TCP while on the same node, and shared memory is disqualifying itself - which leaves no way for two procs on the same node to communicate.
>
> On Jun 7, 2016, at 12:16 PM, Roth, Christopher <cr...@aer.com> wrote:
>
> I have developed a set of C++ MPI programs for performing a series of scientific calculations. The master 'scheduler' program spawns off sets of parallelized 'executor' programs using the MPI_Comm_spawn routine; these executors communicate back and forth with the scheduler (only small amounts of information) via the MPI_Bcast, MPI_Recv and MPI_Send routines (the 'C' library versions).
>
> This software was originally developed on a multi-core Linux machine using OpenMpi v1.5.2, and works extremely well; now I'm attempting to port it to a multi-core Windows 7 machine, using Visual Studio 2012 and the precompiled OpenMpi v1.6.2 Windows release. It all compiles and links properly under VS2012.
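>
> In rough outline, the scheduler/executor exchange described above boils down to something like the sketch below (the executable name, process count and payload are illustrative only; the real code is in the attached mpi_scheduler.cpp and mpi_executor.cpp):
>
> /* scheduler side (sketch) */
> #include <mpi.h>
> int main(int argc, char **argv)
> {
>     MPI_Comm kids;
>     int len = 4;
>     MPI_Init(&argc, &argv);
>     /* spawn the executor processes; the count here is illustrative */
>     MPI_Comm_spawn("mpi_executor.exe", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
>                    0, MPI_COMM_SELF, &kids, MPI_ERRCODES_IGNORE);
>     /* broadcast over the inter-communicator: the spawning root passes MPI_ROOT */
>     MPI_Bcast(&len, 1, MPI_INT, MPI_ROOT, kids);
>     MPI_Finalize();
>     return 0;
> }
>
> /* executor side (sketch, separate program) */
> #include <mpi.h>
> int main(int argc, char **argv)
> {
>     MPI_Comm parent;
>     int len = 0;
>     MPI_Init(&argc, &argv);
>     /* the inter-communicator to the scheduler that spawned us */
>     MPI_Comm_get_parent(&parent);
>     /* executors address the scheduler by its rank in the parent group, i.e. 0 */
>     MPI_Bcast(&len, 1, MPI_INT, 0, parent);
>     MPI_Finalize();
>     return 0;
> }
>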
> When attempting to run this software on the Windows machine, the scheduler program is able to spawn off the executor programs as intended, but everything chokes when the scheduler sends its initial broadcast. There is slightly different behavior when launching the scheduler via 'mpirun', or just by itself, as shown in the logs below:
> (the warning about the 127.0.0.1 address is benign - there is no loopback on Windows)
>
> C:\Users\cjr\Desktop\mpi_demo>mpirun -np 1 mpi_scheduler.exe
> scheduler: MPI_Init
> --------------------------------------------------------------------------
> WARNING: An invalid value was given for btl_tcp_if_exclude. This
> value will be ignored.
>
> Local host: sweet1
> Value: 127.0.0.1/8
> Message: Did not find interface matching this subnet
> --------------------------------------------------------------------------
> -->MPI_COMM_WORLD size = 1
> parent: MPI_UNIVERSE_SIZE = 1
> scheduler: calling MPI_Comm_spawn for 4 instances of 'mpi_executor.exe'
> executor: MPI_Init
> executor: MPI_Init
> executor: MPI_Init
> executor: MPI_Init
>
> [sweet1][[20141,1],0][..\..\..\openmpi-1.6.2\ompi\mca\btl\tcp\btl_tcp_proc.c:128:..\..\..\openmpi-1.6.2\ompi\mca\btl\tcp\btl_tcp_proc.c] mca_base_modex_recv: failed with return value=-13
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications. This means that no Open MPI device has indicated
> that it can be used to communicate between these processes. This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other. This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
> Process 1 ([[20141,1],0]) is on host: sweet1
> Process 2 ([[20141,2],0]) is on host: sweet1
> BTLs attempted: tcp sm self
>
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
> subtask rank = 1 out of 4
> subtask rank = 2 out of 4
> subtask rank = 0 out of 4
> subtask rank = 3 out of 4
>
> scheduler: MPI_Comm_spawn completed
> scheduler broadcasting function string length = 4
> child: MPI_UNIVERSE_SIZE = 4
> child: MPI_UNIVERSE_SIZE = 4
> child: MPI_UNIVERSE_SIZE = 4
> child: MPI_UNIVERSE_SIZE = 4
> Proc0 wait for first broadcast
> Proc1 wait for first broadcast
> Proc2 wait for first broadcast
> Proc3 wait for first broadcast
> [sweet1:6800] *** An error occurred in MPI_Bcast
> [sweet1:6800] *** on communicator
> [sweet1:6800] *** MPI_ERR_INTERN: internal error
> [sweet1:6800] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
> [sweet1:02412] [[20141,0],0]-[[20141,1],0] mca_oob_tcp_msg_recv: readv failed: Unknown error (108)
> [sweet1:02412] 4 more processes have sent help message help-mpi-btl-tcp.txt / invalid if_inexclude
> [sweet1:02412] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
> --------------------------------------------------------------------------
> WARNING: A process refused to die!
>
> Host: sweet1
> PID: 524
>
> This process may still be running and/or consuming resources.
> > -------------------------------------------------------------------------- > [sweet1:02412] [[20141,0],0]-[[20141,2],1] mca_oob_tcp_msg_recv: readv > failed: Unknown error (108) > [sweet1:02412] [[20141,0],0]-[[20141,2],0] mca_oob_tcp_msg_recv: readv > failed: Unknown error (108) > [sweet1:02412] [[20141,0],0]-[[20141,2],2] mca_oob_tcp_msg_recv: readv > failed: Unknown error (108) > -------------------------------------------------------------------------- > mpirun has exited due to process rank 0 with PID 488 on > node sweet1 exiting improperly. There are two reasons this could occur: > > 1. this process did not call "init" before exiting, but others in > the job did. This can cause a job to hang indefinitely while it waits > for all processes to call "init". By rule, if one process calls "init", > then ALL processes must call "init" prior to termination. > > 2. this process called "init", but exited without calling "finalize". > By rule, all processes that call "init" MUST call "finalize" prior to > exiting or it will be considered an "abnormal termination" > > This may have caused other processes in the application to be > terminated by signals sent by mpirun (as reported here). > -------------------------------------------------------------------------- > [sweet1:02412] 3 more processes have sent help message > help-odls-default.txt / odls-default:could-not-kill > > C:\Users\cjr\Desktop\mpi_demo> > > ==================================================== > > C:\Users\cjr\Desktop\mpi_demo>mpi_scheduler.exe > scheduler: MPI_Init > -------------------------------------------------------------------------- > WARNING: An invalid value was given for btl_tcp_if_exclude. This > value will be ignored. > > Local host: sweet1 > Value: 127.0.0.1/8 > Message: Did not find interface matching this subnet > -------------------------------------------------------------------------- > -->MPI_COMM_WORLD size = 1 > parent: MPI_UNIVERSE_SIZE = 1 > scheduler: calling MPI_Comm_spawn for 4 instances of 'mpi_executor.exe' > executor: MPI_Init > executor: MPI_Init > executor: MPI_Init > executor: MPI_Init > [sweet1:04400] 1 more process has sent help message help-mpi-btl-tcp.txt / > invalid if_inexclude > [sweet1:04400] Set MCA parameter "orte_base_help_aggregate" to 0 to see > all help / error messages > subtask rank = 2 out of 4 > subtask rank = 1 out of 4 > subtask rank = 0 out of 4 > subtask rank = 3 out of 4 > > scheduler: MPI_Comm_spawn completed > scheduler broadcasting function string length = 4 > > child: MPI_UNIVERSE_SIZE = 4 > child: MPI_UNIVERSE_SIZE = 4 > child: MPI_UNIVERSE_SIZE = 4 > child: MPI_UNIVERSE_SIZE = 4 > Proc0 wait for first broadcast > Proc1 wait for first broadcast > Proc2 wait for first broadcast > Proc3 wait for first broadcast > > [sweet1:04400] 3 more processes have sent help message > help-mpi-btl-tcp.txt / invalid if_inexclude > > <<<<***mpi_executor.exe processes are running, but 'hung' while wating for > first broadcast***>>>> > <<<<***manually killed one of the 'mpi_executor.exe' processes; others > subsequently exited***>>>> > > [sweet1:04400] [[22257,0],0]-[[22257,2],3] mca_oob_tcp_msg_recv: readv > failed: Unknown error (108) > -------------------------------------------------------------------------- > WARNING: A process refused to die! > > Host: sweet1 > PID: 568 > > This process may still be running and/or consuming resources. 
> > -------------------------------------------------------------------------- > [sweet1:04400] [[22257,0],0]-[[22257,2],0] mca_oob_tcp_msg_recv: readv > failed: Unknown error (108) > [sweet1:04400] [[22257,0],0]-[[22257,2],1] mca_oob_tcp_msg_recv: readv > failed: Unknown error (108) > [sweet1:04400] 2 more processes have sent help message > help-odls-default.txt / odls-default:could-not-kill > > C:\Users\cjr\Desktop\mpi_demo> > > ================================================ > > The addition of the mpirun option "-mca btl_tcp_if_exclude none" > eliminates the benign 127.0.0.1 warning; the option "-mca btl_base_verbose > 100" adds output that verifies that the tcp, sm and self btl modules are > successfully initialized - nothing else seems to be amiss! > I have also tested this with the firewall completely disabled on the > Windows machine, with no change in behavior. > > I have been unable to determine what the error codes indicate, and am > puzzled why the behavior is different when using the 'mpirun' launcher. > I have attached the prototype scheduler and executor program source code > files, as well as files containing the Windows installation ompi > information. > > Please help me figure out what is needed to enable this MPI communication. > > Thanks, > CJ Roth > > ------------------------------ > > This email is intended solely for the recipient. It may contain > privileged, proprietary or confidential information or material. If you are > not the intended recipient, please delete this email and any attachments > and notify the sender of the error. > <mpi_scheduler.cpp><mpi_executor.cpp><ompi_info-all.txt> > <ompi_btl_info.txt>_______________________________________________ > users mailing list > us...@open-mpi.org <javascript:_e(%7B%7D,'cvml','us...@open-mpi.org');> > Subscription: <https://www.open-mpi.org/mailman/listinfo.cgi/users> > https://www.open-mpi.org/mailman/listinfo.cgi/users > Link to this post: > <http://www.open-mpi.org/community/lists/users/2016/06/29395.php> > http://www.open-mpi.org/community/lists/users/2016/06/29395.php > > > _______________________________________________ > users mailing list > us...@open-mpi.org <javascript:_e(%7B%7D,'cvml','us...@open-mpi.org');> > Subscription: <https://www.open-mpi.org/mailman/listinfo.cgi/users> > https://www.open-mpi.org/mailman/listinfo.cgi/users > Link to this post: > <http://www.open-mpi.org/community/lists/users/2016/06/29408.php> > http://www.open-mpi.org/community/lists/users/2016/06/29408.php > > > > > _______________________________________________ > users mailing listus...@open-mpi.org > <javascript:_e(%7B%7D,'cvml','us...@open-mpi.org');> > Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users > Link to this post: > http://www.open-mpi.org/community/lists/users/2016/06/29412.php > > >