Christopher,

Just to be clear, MPI_Comm_spawn is *not* basic functionality.
Also, it might work on older Windows (XP, for example).

You might want to report this issue to whoever provided this pre-compiled
Open MPI library.
Another option is to use Cygwin; it provides a fairly recent Open MPI and
the maintainer is active.

Other options include Linux (you can even run it in a virtual machine) or
OS X.

Cheers,

Gilles

On Thursday, June 9, 2016, Roth, Christopher <cr...@aer.com> wrote:

> Thanks for the info, Gilles.
> Being relatively new to MPI, I was not aware 'sm' did not work with
> intercommunicators - I had assumed it was an option if the others were not
> available.
>
> I am running as an admin on this test machine.  When adding the option
> '-mca btl_tcp_port_min_v4 2000', a higher port number is used, but that
> does not alter the program behavior at all.
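For reference, the kind of invocation under discussion can be sketched as follows; the MCA parameter names (`btl`, `btl_tcp_port_min_v4`, `btl_base_verbose`) are standard Open MPI 1.6 options, and the executable name is taken from this thread:

```shell
# Sketch of the test invocation discussed above (Open MPI 1.6 syntax).
# btl_tcp_port_min_v4 2000 keeps the TCP BTL off privileged ports (< 1024);
# "btl tcp,self" leaves sm out of the mix; btl_base_verbose prints selection info.
mpirun -np 1 \
    -mca btl tcp,self \
    -mca btl_tcp_port_min_v4 2000 \
    -mca btl_base_verbose 100 \
    mpi_scheduler.exe
```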
>
> Given that the initial development was on Linux using Open MPI v1.5, I
> had assumed the Windows implementation would have mostly equivalent
> features, and then been improved in v1.6.  Apparently that
> isn't true...
> It is rather disappointing that a seemingly basic MPI communication
> functionality is broken like this under Windows, even if it is an older
> version.
> Hacking on the Windows Open MPI code is a rabbit hole that I do not want to
> go down for numerous reasons.
>
> I have briefly explored alternate Windows MPI libraries: the Windows
> version of MPICH (from Microsoft) has not implemented MPI_Comm_spawn; Intel
> MPI has a licensing fee.  Do you have any other alternatives to suggest?
>
> ------------------------------
> *From:* users [users-boun...@open-mpi.org] on behalf
> of Gilles Gouaillardet [gil...@rist.or.jp]
> *Sent:* Wednesday, June 08, 2016 7:58 PM
> *To:* Open MPI Users
> *Subject:* Re: [OMPI users] Processes unable to communicate when using
> MPI_Comm_spawn on Windows
>
> Christopher,
>
>
> The sm btl does not work with intercommunicators, and hence disqualifies
> itself.
>
> I guess this is what you interpreted as 'partially working'.
>
>
> I am surprised you are using a privileged port (260 < 1024); are you
> running as an admin?
>
>
> Open MPI is no longer supported on Windows, and the 1.6 series is pretty
> antique these days...
>
>
> Regardless, the source code points to
>
>
> static __inline int opal_get_socket_errno(void) {
>     int ret = WSAGetLastError();
>     switch (ret) {
>       case WSAEINTR: return EINTR;
> ...
>       default: printf("Feature not implemented: %d %s\n", __LINE__, __FILE__);
>                return OPAL_ERROR;
>     };
> }
>
>
> At first, it is worth printing (ret) when the feature is not implemented.
>
> Then you can hack this part and add the missing case.
>
> Recent Windows (7) might return a newer error code that was not available
> on older ones (XP).
>
>
> Cheers,
>
>
> Gilles
>
>
>
>
> On 6/9/2016 1:51 AM, Roth, Christopher wrote:
>
> Well, that obvious error message states the basic problem - I was hoping
> you had noticed a detail in the ompi_info output that would point to a
> reason for it.
>
> Further test runs with the option '-mca btl tcp,self' (excluding 'sm' from
> the mix) and '-mca btl_base_verbose 100', supply some more information:
> ------
> [sweet1:04556] btl: tcp: attempting to connect() to address 10.3.2.109 on
> port 260
> [sweet1:04556] btl: tcp: attempting to connect() to address 10.3.2.109 on
> port 260
> ------
> The IP address is the host machine's.  The process ID corresponds to the
> first of the executor programs.  The programs now hang at that point
> (within the scheduler's MPI_Comm_spawn call and the executors' MPI_Init
> calls), and have to be manually killed.
>
> Yet another test, adding the '-mca mpi_preconnect_mpi 1' (along with the
> other two added arguments), gives more info:
> ------
> [sweet1:04976] btl: tcp: attempting to connect() to address 10.3.2.109 on
> port 260
> [sweet1:04516] btl: tcp: attempting to connect() to address 10.3.2.109 on
> port 260
> [sweet1:03824] btl: tcp: attempting to connect() to address 10.3.2.109 on
> port 260
>
> [sweet1][[17613,2],1][..\..\..\openmpi-1.6.2\ompi\mca\btl\tcp\btl_tcp_endpoint.c:486:..\..\..\openmpi-1.6.2\ompi\mca\btl
> \tcp\btl_tcp_endpoint.c] received unexpected process identifier
> [[17613,2],0]
>
> [sweet1][[17613,2],0][..\..\..\openmpi-1.6.2\ompi\mca\btl\tcp\btl_tcp_frag.c:215:..\..\..\openmpi-1.6.2\ompi\mca\btl\tcp
> \btl_tcp_frag.c] Feature not implemented: 130
> D:/temp/OpenMPI/openmpi-1.6.2/opal/include\opal/opal_socket_errno.h
> Feature not implemented: 130
> D:/temp/OpenMPI/openmpi-1.6.2/opal/include\opal/opal_socket_errno.h
> mca_btl_tcp_frag_recv: readv failed: Unknown error (-1)
> ------
> With the 'preconnect' option, it sets up the TCP link for all of the
> executor processes, but then runs into an actual error about an
> unimplemented function.  This option is not required, but I had to give
> it a whirl.
>
> All of these test runs have the same behavior when performed with and
> without the firewall active.
>
> The fact that the executor programs don't get past the MPI_Init call when
> 'sm' is excluded from the btl set implies that 'sm' is at least
> partially working.
>
> ------------------------------
> *From:* users [users-boun...@open-mpi.org] on behalf
> of Ralph Castain [r...@open-mpi.org]
> *Sent:* Wednesday, June 08, 2016 10:47 AM
> *To:* Open MPI Users
> *Subject:* Re: [OMPI users] Processes unable to communicate when using
> MPI_Comm_spawn on Windows
>
>
> On Jun 8, 2016, at 4:30 AM, Roth, Christopher <cr...@aer.com> wrote:
>
> What part of this output indicates this non-communicative configuration?
>
>
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications.  This means that no Open MPI device has indicated
> that it can be used to communicate between these processes.  This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other.  This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
>   Process 1 ([[20141,1],0]) is on host: sweet1
>   Process 2 ([[20141,2],0]) is on host: sweet1
>   BTLs attempted: tcp sm self
>
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
>
> Both procs are on the same host. Since they cannot communicate, it means
> that (a) the shared memory component (sm) was unable to be used, and (b)
> the TCP subsystem did not provide a usable address for the two procs to
> reach each other. The former could mean that there wasn’t enough room in
> the tmp directory, and the latter indicates that the TCP subsystem isn’t
> configured to allow connections from its own local IP address.
>
> I don’t know anything about Windows configuration I’m afraid.
>
>
> Please recall, this is using the precompiled Open MPI Windows installation.
>
> When the 'verbose' option is added, I see this sequence of output for the
> scheduler and each of the executor processes:
> ------
> [sweet1:06412] mca: base: components_open: Looking for btl components
> [sweet1:06412] mca: base: components_open: opening btl components
> [sweet1:06412] mca: base: components_open: found loaded component tcp
> [sweet1:06412] mca: base: components_open: component tcp register function
> successful
> [sweet1:06412] mca: base: components_open: component tcp open function
> successful
> [sweet1:06412] mca: base: components_open: found loaded component sm
> [sweet1:06412] mca: base: components_open: component sm has no register
> function
> [sweet1:06412] mca: base: components_open: component sm open function
> successful
> [sweet1:06412] mca: base: components_open: found loaded component self
> [sweet1:06412] mca: base: components_open: component self has no register
> function
> [sweet1:06412] mca: base: components_open: component self open function
> successful
> [sweet1:06412] select: initializing btl component tcp
> [sweet1:06412] select: init of component tcp returned success
> [sweet1:06412] select: initializing btl component sm
> [sweet1:06412] select: init of component sm returned success
> [sweet1:06412] select: initializing btl component self
> [sweet1:06412] select: init of component self returned success
> -------
>
> This output appears to show the btl components for TCP, SM and Self are
> all available, but this is contradicted by the error message shown in the
> initial post ("At least one pair of MPI processes are unable to reach each
> other for MPI communications....")
>
> If there is some sort of misconfiguration present, do you have a
> suggestion for correcting the situation?
>
> ------------------------------
> *From:* users [users-boun...@open-mpi.org] on behalf
> of Ralph Castain [r...@open-mpi.org]
> *Sent:* Tuesday, June 07, 2016 3:39 PM
> *To:* Open MPI Users
> *Subject:* Re: [OMPI users] Processes unable to communicate when using
> MPI_Comm_spawn on Windows
>
> Just looking at this output, it would appear that Windows is configured in
> a way that prevents the procs from connecting to each other via TCP while
> on the same node, and shared memory is disqualifying itself - which leaves
> no way for two procs on the same node to communicate.
>
>
> On Jun 7, 2016, at 12:16 PM, Roth, Christopher <cr...@aer.com> wrote:
>
> I have developed a set of C++ MPI programs for performing a series of
> scientific calculations.  The master 'scheduler' program spawns off sets of
> parallelized 'executor' programs using the MPI_Comm_spawn routine; these
> executors communicate back and forth with the scheduler (only small amounts
> of information) via MPI_Bcast, MPI_Recv and MPI_Send routines (the 'C'
> library versions).
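The scheduler/executor pattern described above can be sketched with the standard MPI idiom below. This is a hedged sketch, not the attached source: program and variable names are illustrative, and running it requires an MPI installation and mpirun. Note that on an intercommunicator broadcast, the sending side passes MPI_ROOT while the spawned children pass the parent's rank (0).

```c
/* Minimal sketch of the spawn-and-broadcast pattern described above.
 * One binary serves as both scheduler and executor, branching on
 * whether a parent intercommunicator exists. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm parent, child;
    int len = 4;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        /* scheduler: spawn 4 executors, then broadcast over the intercomm */
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                       0, MPI_COMM_SELF, &child, MPI_ERRCODES_IGNORE);
        /* the root group's sender uses MPI_ROOT on an intercommunicator */
        MPI_Bcast(&len, 1, MPI_INT, MPI_ROOT, child);
    } else {
        /* executor: receive the broadcast from the parent's rank 0 */
        MPI_Bcast(&len, 1, MPI_INT, 0, parent);
        printf("executor got length %d\n", len);
    }

    MPI_Finalize();
    return 0;
}
```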
>
> This software was originally developed on a multi-core Linux machine using
> Open MPI v1.5.2, and works extremely well; now I'm attempting to port it to
> a multi-core Windows 7 machine, using Visual Studio 2012 and the precompiled
> Open MPI v1.6.2 Windows release.  It all compiles and links properly under
> VS2012.
> When attempting to run this software on the Windows machine, the scheduler
> program is able to spawn off the executor programs as intended, but
> everything chokes when the scheduler sends its initial broadcast.  There is
> slightly different behavior when launching the scheduler via 'mpirun', or
> just by itself, as shown in the logs below:
> (the warning about the 127.0.0.1 address is benign - there is no loopback
> on Windows)
>
> C:\Users\cjr\Desktop\mpi_demo>mpirun -np 1 mpi_scheduler.exe
>  scheduler: MPI_Init
> --------------------------------------------------------------------------
> WARNING: An invalid value was given for btl_tcp_if_exclude.  This
> value will be ignored.
>
>   Local host: sweet1
>   Value:      127.0.0.1/8
>   Message:    Did not find interface matching this subnet
> --------------------------------------------------------------------------
> -->MPI_COMM_WORLD size = 1
> parent: MPI_UNIVERSE_SIZE = 1
> scheduler: calling MPI_Comm_spawn for 4 instances of 'mpi_executor.exe'
>  executor: MPI_Init
>  executor: MPI_Init
>  executor: MPI_Init
>  executor: MPI_Init
>
> [sweet1][[20141,1],0][..\..\..\openmpi-1.6.2\ompi\mca\btl\tcp\btl_tcp_proc.c:128:..\..\..\openmpi-1.6.2\ompi\mca\btl\tcp
> \btl_tcp_proc.c] mca_base_modex_recv: failed with return value=-13
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications.  This means that no Open MPI device has indicated
> that it can be used to communicate between these processes.  This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other.  This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
>   Process 1 ([[20141,1],0]) is on host: sweet1
>   Process 2 ([[20141,2],0]) is on host: sweet1
>   BTLs attempted: tcp sm self
>
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
>  subtask rank = 1 out of 4
>  subtask rank = 2 out of 4
>  subtask rank = 0 out of 4
>  subtask rank = 3 out of 4
>
> scheduler: MPI_Comm_spawn completed
>  scheduler broadcasting function string length = 4
> child: MPI_UNIVERSE_SIZE = 4
> child: MPI_UNIVERSE_SIZE = 4
> child: MPI_UNIVERSE_SIZE = 4
> child: MPI_UNIVERSE_SIZE = 4
> Proc0 wait for first broadcast
> Proc1 wait for first broadcast
> Proc2 wait for first broadcast
> Proc3 wait for first broadcast
> [sweet1:6800] *** An error occurred in MPI_Bcast
> [sweet1:6800] *** on communicator
> [sweet1:6800] *** MPI_ERR_INTERN: internal error
> [sweet1:6800] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
> [sweet1:02412] [[20141,0],0]-[[20141,1],0] mca_oob_tcp_msg_recv: readv
> failed: Unknown error (108)
> [sweet1:02412] 4 more processes have sent help message
> help-mpi-btl-tcp.txt / invalid if_inexclude
> [sweet1:02412] Set MCA parameter "orte_base_help_aggregate" to 0 to see
> all help / error messages
> --------------------------------------------------------------------------
> WARNING: A process refused to die!
>
> Host: sweet1
> PID:  524
>
> This process may still be running and/or consuming resources.
>
> --------------------------------------------------------------------------
> [sweet1:02412] [[20141,0],0]-[[20141,2],1] mca_oob_tcp_msg_recv: readv
> failed: Unknown error (108)
> [sweet1:02412] [[20141,0],0]-[[20141,2],0] mca_oob_tcp_msg_recv: readv
> failed: Unknown error (108)
> [sweet1:02412] [[20141,0],0]-[[20141,2],2] mca_oob_tcp_msg_recv: readv
> failed: Unknown error (108)
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 0 with PID 488 on
> node sweet1 exiting improperly. There are two reasons this could occur:
>
> 1. this process did not call "init" before exiting, but others in
> the job did. This can cause a job to hang indefinitely while it waits
> for all processes to call "init". By rule, if one process calls "init",
> then ALL processes must call "init" prior to termination.
>
> 2. this process called "init", but exited without calling "finalize".
> By rule, all processes that call "init" MUST call "finalize" prior to
> exiting or it will be considered an "abnormal termination"
>
> This may have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
> [sweet1:02412] 3 more processes have sent help message
> help-odls-default.txt / odls-default:could-not-kill
>
> C:\Users\cjr\Desktop\mpi_demo>
>
> ====================================================
>
> C:\Users\cjr\Desktop\mpi_demo>mpi_scheduler.exe
>  scheduler: MPI_Init
> --------------------------------------------------------------------------
> WARNING: An invalid value was given for btl_tcp_if_exclude.  This
> value will be ignored.
>
>   Local host: sweet1
>   Value:      127.0.0.1/8
>   Message:    Did not find interface matching this subnet
> --------------------------------------------------------------------------
> -->MPI_COMM_WORLD size = 1
> parent: MPI_UNIVERSE_SIZE = 1
> scheduler: calling MPI_Comm_spawn for 4 instances of 'mpi_executor.exe'
>  executor: MPI_Init
>  executor: MPI_Init
>  executor: MPI_Init
>  executor: MPI_Init
> [sweet1:04400] 1 more process has sent help message help-mpi-btl-tcp.txt /
> invalid if_inexclude
> [sweet1:04400] Set MCA parameter "orte_base_help_aggregate" to 0 to see
> all help / error messages
>  subtask rank = 2 out of 4
>  subtask rank = 1 out of 4
>  subtask rank = 0 out of 4
>  subtask rank = 3 out of 4
>
> scheduler: MPI_Comm_spawn completed
>  scheduler broadcasting function string length = 4
>
> child: MPI_UNIVERSE_SIZE = 4
> child: MPI_UNIVERSE_SIZE = 4
> child: MPI_UNIVERSE_SIZE = 4
> child: MPI_UNIVERSE_SIZE = 4
> Proc0 wait for first broadcast
> Proc1 wait for first broadcast
> Proc2 wait for first broadcast
> Proc3 wait for first broadcast
>
> [sweet1:04400] 3 more processes have sent help message
> help-mpi-btl-tcp.txt / invalid if_inexclude
>
> <<<<***mpi_executor.exe processes are running, but 'hung' while waiting for
> first broadcast***>>>>
> <<<<***manually killed one of the 'mpi_executor.exe' processes; others
> subsequently exited***>>>>
>
> [sweet1:04400] [[22257,0],0]-[[22257,2],3] mca_oob_tcp_msg_recv: readv
> failed: Unknown error (108)
> --------------------------------------------------------------------------
> WARNING: A process refused to die!
>
> Host: sweet1
> PID:  568
>
> This process may still be running and/or consuming resources.
>
> --------------------------------------------------------------------------
> [sweet1:04400] [[22257,0],0]-[[22257,2],0] mca_oob_tcp_msg_recv: readv
> failed: Unknown error (108)
> [sweet1:04400] [[22257,0],0]-[[22257,2],1] mca_oob_tcp_msg_recv: readv
> failed: Unknown error (108)
> [sweet1:04400] 2 more processes have sent help message
> help-odls-default.txt / odls-default:could-not-kill
>
> C:\Users\cjr\Desktop\mpi_demo>
>
> ================================================
>
> The addition of the mpirun option "-mca btl_tcp_if_exclude none"
> eliminates the benign 127.0.0.1 warning; the option "-mca btl_base_verbose
> 100" adds output that verifies that the tcp, sm and self btl modules are
> successfully initialized - nothing else seems to be amiss!
> I have also tested this with the firewall completely disabled on the
> Windows machine, with no change in behavior.
>
> I have been unable to determine what the error codes indicate, and am
> puzzled why the behavior is different when using the 'mpirun' launcher.
> I have attached the prototype scheduler and executor program source code
> files, as well as files containing the Windows installation ompi
> information.
>
> Please help me figure out what is needed to enable this MPI communication.
>
> Thanks,
> CJ Roth
>
> ------------------------------
>
> <mpi_scheduler.cpp><mpi_executor.cpp><ompi_info-all.txt>
> <ompi_btl_info.txt>_______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/06/29395.php
>
>
>
>
>
>
