Jeff:

Thanks for the extremely informative reply!
I can appreciate the reasons for dropping Windows support.  The reason for 
wanting a Windows version of this scientific calculation software is to allow 
non-programmer end-users to run it 'out of the box' using precompiled Windows 
executables.
I do not know much about Cygwin; I'll look into that and see what it can do for 
my situation.
Good info to know about 'sm' and why it doesn't play well with MPI_Comm_spawn.

I had wondered about recoding to _not_ use MPI_Comm_spawn, but (a) this works 
absolutely fabulously under Linux (both multi-core systems and clusters); and 
(b) it would require significant changes to the architecture of the programs 
that depend on it.  Weighed against that, the licensing fee for the Intel MPI 
library may no longer be an obstacle.
A Linux VM on a Windows machine is a possible solution, except that some of 
the end-users have their Windows boxes locked down tight for security reasons.
I also thought the TCP connection issue was odd; I got the same results on both 
my work and home Windows computers.


________________________________________
From: users [users-boun...@open-mpi.org] on behalf of Jeff Squyres (jsquyres) 
[jsquy...@cisco.com]
Sent: Thursday, June 09, 2016 8:56 AM
To: Open MPI User's List
Subject: Re: [OMPI users] Processes unable to communicate when using 
MPI_Comm_spawn on Windows

I think there were a few not-entirely-correct data points in this thread. Let 
me clarify a few things:

1. Yes, Open MPI suspended native Windows support a while back. Native Windows 
support is simply not a popular use case, and therefore we couldn't justify 
spending the time on it (not to mention the fact that no one in the community 
had enough Windows development experience to keep a native port alive and 
well-maintained).

2. That being said, AFAIK, Open MPI still compiles and runs fine -- albeit with 
restrictions -- in a Cygwin environment on Windows. This was deemed "good 
enough" by the Open MPI community (especially given the points from #1). Recent 
binary versions of Open MPI are available courtesy of the Cygwin project: 
https://cygwin.com/cgi-bin2/package-grep.cgi?grep=openmpi.

3. "sm" works fine with intercommunicators. What it doesn't do is handle the 
expansion of its shared memory allocation when new MPI processes are added via 
the dynamic APIs (e.g., MPI_COMM_SPAWN). We've talked about removing this 
restriction in "vader" (the next-gen version of the "sm" BTL -- yes, I know, 
the name is not intuitive at all...), but I don't think that this has been an 
important enough feature for anyone to spend time on it. As always, patches are 
welcome. ;-)
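
To make the restriction concrete, here is a minimal self-spawning sketch of the 
MPI_COMM_SPAWN / intercommunicator pattern this thread is about (the file name 
and self-spawning layout are made up for illustration; this is not the original 
poster's code):

/* spawn_demo.c -- illustrative only.
 * Build: mpicc spawn_demo.c -o spawn_demo
 * Run:   mpirun -np 1 ./spawn_demo
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm parent, children;
    int len = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        /* "Scheduler" role: dynamically add 4 "executor" processes.      */
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                       0, MPI_COMM_SELF, &children, MPI_ERRCODES_IGNORE);
        /* Because sm cannot grow its shared-memory allocation to cover   */
        /* the dynamically added processes, this traffic falls back to    */
        /* the tcp (or self) BTL in the 1.6 series.                       */
        len = 4;
        /* Intercommunicator broadcast: the root passes MPI_ROOT.         */
        MPI_Bcast(&len, 1, MPI_INT, MPI_ROOT, children);
        MPI_Comm_disconnect(&children);
    } else {
        /* "Executor" role: pass the root's rank (0) in the parent group. */
        MPI_Bcast(&len, 1, MPI_INT, 0, parent);
        printf("executor received length %d\n", len);
        MPI_Comm_disconnect(&parent);
    }

    MPI_Finalize();
    return 0;
}

Since the scheduler-to-executor traffic cannot use sm here, it all lands on 
tcp, which is exactly where the Windows failure in this thread shows up.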

-----

As has been pointed out, Open MPI 1.6.5 is pretty ancient (it was released in 
June 2013). Per #2, you might want to try the latest stable release (e.g., 
via Cygwin binaries).

There are two other options that may not have been mentioned yet:

1. Re-code your application to not use the MPI dynamic APIs (e.g., 
MPI_COMM_SPAWN). I know this is not quite what you want to do, but given all 
the other restrictions and data points, it might be your least-sucky option (a 
rough sketch of what this could look like follows these two options).

2. Run a VM on your Windows machine with some flavor of Linux. That would give 
you access to a much greater set of Open MPI features (i.e., significantly 
fewer restrictions).
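
To illustrate option #1: a hedged sketch (names and layout are illustrative 
only, not your actual design) of launching everything up front and carving the 
scheduler/executor roles out of MPI_COMM_WORLD instead of spawning:

/* static_demo.c -- illustrative only.
 * Build: mpicc static_demo.c -o static_demo
 * Run:   mpirun -np 5 ./static_demo   (1 scheduler + 4 executors)
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int world_rank, len = 0;
    MPI_Comm role_comm;   /* executor-only (or scheduler-only) communicator */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* color 0 = scheduler (rank 0), color 1 = executors (everyone else) */
    MPI_Comm_split(MPI_COMM_WORLD, world_rank == 0 ? 0 : 1,
                   world_rank, &role_comm);

    if (world_rank == 0) {
        len = 4;   /* whatever the scheduler wants to distribute */
    }
    /* Scheduler-to-executor traffic is now an ordinary intracommunicator
     * collective on MPI_COMM_WORLD, which sm handles without the
     * dynamic-process restriction. */
    MPI_Bcast(&len, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (world_rank != 0) {
        printf("executor %d received length %d\n", world_rank, len);
        /* executors can talk among themselves on role_comm */
    }

    MPI_Comm_free(&role_comm);
    MPI_Finalize();
    return 0;
}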

As for why your machine cannot connect to itself using its own IP addresses, 
that's a bit odd. It suggests that you have some kind of blocking software in 
there somewhere. We probably don't have enough Windows experience here in the 
community to help with that.

Hope that helps!



> On Jun 9, 2016, at 5:06 AM, Roth, Christopher <cr...@aer.com> wrote:
>
> Thanks for the info, Gilles.
> Being relatively new to MPI, I was not aware 'sm' did not work with 
> intercommunicators - I had assumed it was an option if the others were not 
> available.
>
> I am running as an admin on this test machine. When adding the option '-mca 
> btl_tcp_port_min_v4 2000', a higher port number is used, but that does not 
> alter the program behavior at all.
>
> Given that the initial development was on Linux using Open MPI v1.5, I would 
> have assumed the Windows implementation had mostly equivalent features, and 
> that these were then improved in v1.6. Apparently that isn't true...
> It is rather disappointing that seemingly basic MPI communication 
> functionality is broken like this under Windows, even if it is an older 
> version.
> Hacking on the Windows Open MPI code is a rabbit hole that I do not want to 
> go down for numerous reasons.
>
> I have briefly explored alternate Windows MPI libraries: the Windows version 
> of MPICH (from Microsoft) has not implemented MPI_Comm_spawn; Intel MPI has a 
> licensing fee. Do you have any other alternatives to suggest?
>
> From: users [users-boun...@open-mpi.org] on behalf of Gilles Gouaillardet 
> [gil...@rist.or.jp]
> Sent: Wednesday, June 08, 2016 7:58 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] Processes unable to communicate when using 
> MPI_Comm_spawn on Windows
>
> Christopher,
>
> The sm btl does not work with intercommunicators and hence disqualifies 
> itself.
> I guess this is what you interpreted as 'partially working'.
>
> I am surprised you are using a privileged port (260 < 1024); are you running 
> as an admin?
>
> Open MPI is no longer supported on Windows, and the 1.6 series is pretty 
> antique these days...
>
> Regardless, the source code points to
>
> static __inline int opal_get_socket_errno(void) {
>     int ret = WSAGetLastError();
>     switch (ret) {
>     case WSAEINTR: return EINTR;
>     ...
>     default: printf("Feature not implemented: %d %s\n", __LINE__, __FILE__);
>              return OPAL_ERROR;
>     };
> }
>
> As a first step, it is worth printing (ret) when the "Feature not 
> implemented" message appears.
> Then you can hack this part and add the missing case; recent Windows (7) 
> might use a newer error code that was not available on older versions (XP).
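>
> For example (an untested sketch, just to show the kind of change I mean, and 
> assuming the corresponding errno values exist on your toolchain), the default 
> branch could print the raw Winsock code:
>
>     default: printf("Feature not implemented: %d %s (WSA error %d)\n",
>                     __LINE__, __FILE__, ret);
>              return OPAL_ERROR;
>
> and once you know which code comes back, a matching case can be added above 
> it, for example:
>
>     case WSAECONNRESET:   return ECONNRESET;
>     case WSAECONNABORTED: return ECONNABORTED;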
>
> Cheers,
>
> Gilles
>
>
>
> On 6/9/2016 1:51 AM, Roth, Christopher wrote:
>> Well, that obvious error message states the basic problem - I was hoping you 
>> had noticed a detail in the ompi_info output that would point to a reason 
>> for it.
>>
>> Further test runs with the option '-mca btl tcp,self' (excluding 'sm' from 
>> the mix) and '-mca btl_base_verbose 100', supply some more information:
>> ------
>> [sweet1:04556] btl: tcp: attempting to connect() to address 10.3.2.109 on 
>> port 260
>> [sweet1:04556] btl: tcp: attempting to connect() to address 10.3.2.109 on 
>> port 260
>> ------
>> The IP address is the host machine's. The process ID corresponds to the 
>> first of the executor programs. The programs now hang at that point (within 
>> the scheduler's MPI_Comm_spawn call and the executors' MPI_Init calls) and 
>> have to be manually killed.
>>
>> Yet another test, adding the '-mca mpi_preconnect_mpi 1' (along with the 
>> other two added arguments), gives more info:
>> ------
>> [sweet1:04976] btl: tcp: attempting to connect() to address 10.3.2.109 on 
>> port 260
>> [sweet1:04516] btl: tcp: attempting to connect() to address 10.3.2.109 on 
>> port 260
>> [sweet1:03824] btl: tcp: attempting to connect() to address 10.3.2.109 on 
>> port 260
>> [sweet1][[17613,2],1][..\..\..\openmpi-1.6.2\ompi\mca\btl\tcp\btl_tcp_endpoint.c:486:..\..\..\openmpi-1.6.2\ompi\mca\btl
>> \tcp\btl_tcp_endpoint.c] received unexpected process identifier [[17613,2],0]
>> [sweet1][[17613,2],0][..\..\..\openmpi-1.6.2\ompi\mca\btl\tcp\btl_tcp_frag.c:215:..\..\..\openmpi-1.6.2\ompi\mca\btl\tcp
>> \btl_tcp_frag.c] Feature not implemented: 130 
>> D:/temp/OpenMPI/openmpi-1.6.2/opal/include\opal/opal_socket_errno.h
>> Feature not implemented: 130 
>> D:/temp/OpenMPI/openmpi-1.6.2/opal/include\opal/opal_socket_errno.h
>> mca_btl_tcp_frag_recv: readv failed: Unknown error (-1)
>> ------
>> With the 'preconnect' option, it sets up the TCP link for all of the 
>> executor processes, but then runs into an actual error regarding some 
>> function not being implemented. This option is not required, but I had to 
>> give it a whirl.
>>
>> All of these test runs have the same behavior when performed with and 
>> without the firewall active.
>>
>> The fact that the executor programs don't get past the MPI_Init call when 
>> 'sm' is excluded from the btl set implies that 'sm' is at least partially 
>> working.
>>
>> From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain 
>> [r...@open-mpi.org]
>> Sent: Wednesday, June 08, 2016 10:47 AM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] Processes unable to communicate when using 
>> MPI_Comm_spawn on Windows
>>
>>
>>> On Jun 8, 2016, at 4:30 AM, Roth, Christopher <cr...@aer.com> wrote:
>>>
>>> What part of this output indicates this non-communicative configuration?
>>
>> --------------------------------------------------------------------------
>> At least one pair of MPI processes are unable to reach each other for
>> MPI communications. This means that no Open MPI device has indicated
>> that it can be used to communicate between these processes. This is
>> an error; Open MPI requires that all MPI processes be able to reach
>> each other. This error can sometimes be the result of forgetting to
>> specify the "self" BTL.
>>
>> Process 1 ([[20141,1],0]) is on host: sweet1
>> Process 2 ([[20141,2],0]) is on host: sweet1
>> BTLs attempted: tcp sm self
>>
>> Your MPI job is now going to abort; sorry.
>> --------------------------------------------------------------------------
>>
>> Both procs are on the same host. Since they cannot communicate, it means 
>> that (a) the shared memory component (sm) was unable to be used, and (b) the 
>> TCP subsystem did not provide a usable address for the two procs to reach 
>> each other. The former could mean that there wasn’t enough room in the tmp 
>> directory, and the latter indicates that the TCP subsystem isn’t configured 
>> to allow connections from its own local IP address.
>>
>> I don’t know anything about Windows configuration I’m afraid.
>>
>>
>>> Please recall, this is using the precompiled Open MPI Windows installation.
>>>
>>> When the 'verbose' option is added, I see this sequence of output for the 
>>> scheduler and each of the executor processes:
>>> ------
>>> [sweet1:06412] mca: base: components_open: Looking for btl components
>>> [sweet1:06412] mca: base: components_open: opening btl components
>>> [sweet1:06412] mca: base: components_open: found loaded component tcp
>>> [sweet1:06412] mca: base: components_open: component tcp register function 
>>> successful
>>> [sweet1:06412] mca: base: components_open: component tcp open function 
>>> successful
>>> [sweet1:06412] mca: base: components_open: found loaded component sm
>>> [sweet1:06412] mca: base: components_open: component sm has no register 
>>> function
>>> [sweet1:06412] mca: base: components_open: component sm open function 
>>> successful
>>> [sweet1:06412] mca: base: components_open: found loaded component self
>>> [sweet1:06412] mca: base: components_open: component self has no register 
>>> function
>>> [sweet1:06412] mca: base: components_open: component self open function 
>>> successful
>>> [sweet1:06412] select: initializing btl component tcp
>>> [sweet1:06412] select: init of component tcp returned success
>>> [sweet1:06412] select: initializing btl component sm
>>> [sweet1:06412] select: init of component sm returned success
>>> [sweet1:06412] select: initializing btl component self
>>> [sweet1:06412] select: init of component self returned success
>>> -------
>>>
>>> This output appears to show that the btl components for TCP, SM and Self 
>>> are all available, but this is contradicted by the error message shown in 
>>> the initial post ("At least one pair of MPI processes are unable to reach 
>>> each other for MPI communications....").
>>>
>>> If there is some sort of misconfiguration present, do you have a suggestion 
>>> for correcting the situation?
>>>
>>> From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain 
>>> [r...@open-mpi.org]
>>> Sent: Tuesday, June 07, 2016 3:39 PM
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] Processes unable to communicate when using 
>>> MPI_Comm_spawn on Windows
>>>
>>> Just looking at this output, it would appear that Windows is configured in 
>>> a way that prevents the procs from connecting to each other via TCP while 
>>> on the same node, and shared memory is disqualifying itself - which leaves 
>>> no way for two procs on the same node to communicate.
>>>
>>>
>>>> On Jun 7, 2016, at 12:16 PM, Roth, Christopher <cr...@aer.com> wrote:
>>>>
>>>> I have developed a set of C++ MPI programs for performing a series of 
>>>> scientific calculations. The master 'scheduler' program spawns off sets of 
>>>> parallelized 'executor' programs using the MPI_Comm_spawn routine; these 
>>>> executors communicate back and forth with the scheduler (only small 
>>>> amounts of information) via MPI_Bcast, MPI_Recv and MPI_Send routines (the 
>>>> 'C' library versions).
>>>>
>>>> This software was originally developed on a multi-core Linux machine using 
>>>> Open MPI v1.5.2, and works extremely well; now I'm attempting to port it to 
>>>> a multi-core Windows 7 machine, using Visual Studio 2012 and the 
>>>> precompiled Open MPI v1.6.2 Windows release. It all compiles and links 
>>>> properly under VS2012.
>>>> When attempting to run this software on the Windows machine, the scheduler 
>>>> program is able to spawn off the executor programs as intended, but 
>>>> everything chokes when the scheduler sends its initial broadcast. The 
>>>> behavior is slightly different when launching the scheduler via 'mpirun' 
>>>> versus running it by itself, as shown in the logs below:
>>>> (the warning about the 127.0.0.1 address is benign - there is no loopback 
>>>> on Windows)
>>>>
>>>> C:\Users\cjr\Desktop\mpi_demo>mpirun -np 1 mpi_scheduler.exe
>>>> scheduler: MPI_Init
>>>> --------------------------------------------------------------------------
>>>> WARNING: An invalid value was given for btl_tcp_if_exclude. This
>>>> value will be ignored.
>>>>
>>>> Local host: sweet1
>>>> Value: 127.0.0.1/8
>>>> Message: Did not find interface matching this subnet
>>>> --------------------------------------------------------------------------
>>>> -->MPI_COMM_WORLD size = 1
>>>> parent: MPI_UNIVERSE_SIZE = 1
>>>> scheduler: calling MPI_Comm_spawn for 4 instances of 'mpi_executor.exe'
>>>> executor: MPI_Init
>>>> executor: MPI_Init
>>>> executor: MPI_Init
>>>> executor: MPI_Init
>>>> [sweet1][[20141,1],0][..\..\..\openmpi-1.6.2\ompi\mca\btl\tcp\btl_tcp_proc.c:128:..\..\..\openmpi-1.6.2\ompi\mca\btl\tcp
>>>> \btl_tcp_proc.c] mca_base_modex_recv: failed with return value=-13
>>>> --------------------------------------------------------------------------
>>>> At least one pair of MPI processes are unable to reach each other for
>>>> MPI communications. This means that no Open MPI device has indicated
>>>> that it can be used to communicate between these processes. This is
>>>> an error; Open MPI requires that all MPI processes be able to reach
>>>> each other. This error can sometimes be the result of forgetting to
>>>> specify the "self" BTL.
>>>>
>>>> Process 1 ([[20141,1],0]) is on host: sweet1
>>>> Process 2 ([[20141,2],0]) is on host: sweet1
>>>> BTLs attempted: tcp sm self
>>>>
>>>> Your MPI job is now going to abort; sorry.
>>>> --------------------------------------------------------------------------
>>>> subtask rank = 1 out of 4
>>>> subtask rank = 2 out of 4
>>>> subtask rank = 0 out of 4
>>>> subtask rank = 3 out of 4
>>>>
>>>> scheduler: MPI_Comm_spawn completed
>>>> scheduler broadcasting function string length = 4
>>>> child: MPI_UNIVERSE_SIZE = 4
>>>> child: MPI_UNIVERSE_SIZE = 4
>>>> child: MPI_UNIVERSE_SIZE = 4
>>>> child: MPI_UNIVERSE_SIZE = 4
>>>> Proc0 wait for first broadcast
>>>> Proc1 wait for first broadcast
>>>> Proc2 wait for first broadcast
>>>> Proc3 wait for first broadcast
>>>> [sweet1:6800] *** An error occurred in MPI_Bcast
>>>> [sweet1:6800] *** on communicator
>>>> [sweet1:6800] *** MPI_ERR_INTERN: internal error
>>>> [sweet1:6800] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
>>>> [sweet1:02412] [[20141,0],0]-[[20141,1],0] mca_oob_tcp_msg_recv: readv 
>>>> failed: Unknown error (108)
>>>> [sweet1:02412] 4 more processes have sent help message 
>>>> help-mpi-btl-tcp.txt / invalid if_inexclude
>>>> [sweet1:02412] Set MCA parameter "orte_base_help_aggregate" to 0 to see 
>>>> all help / error messages
>>>> --------------------------------------------------------------------------
>>>> WARNING: A process refused to die!
>>>>
>>>> Host: sweet1
>>>> PID: 524
>>>>
>>>> This process may still be running and/or consuming resources.
>>>>
>>>> --------------------------------------------------------------------------
>>>> [sweet1:02412] [[20141,0],0]-[[20141,2],1] mca_oob_tcp_msg_recv: readv 
>>>> failed: Unknown error (108)
>>>> [sweet1:02412] [[20141,0],0]-[[20141,2],0] mca_oob_tcp_msg_recv: readv 
>>>> failed: Unknown error (108)
>>>> [sweet1:02412] [[20141,0],0]-[[20141,2],2] mca_oob_tcp_msg_recv: readv 
>>>> failed: Unknown error (108)
>>>> --------------------------------------------------------------------------
>>>> mpirun has exited due to process rank 0 with PID 488 on
>>>> node sweet1 exiting improperly. There are two reasons this could occur:
>>>>
>>>> 1. this process did not call "init" before exiting, but others in
>>>> the job did. This can cause a job to hang indefinitely while it waits
>>>> for all processes to call "init". By rule, if one process calls "init",
>>>> then ALL processes must call "init" prior to termination.
>>>>
>>>> 2. this process called "init", but exited without calling "finalize".
>>>> By rule, all processes that call "init" MUST call "finalize" prior to
>>>> exiting or it will be considered an "abnormal termination"
>>>>
>>>> This may have caused other processes in the application to be
>>>> terminated by signals sent by mpirun (as reported here).
>>>> --------------------------------------------------------------------------
>>>> [sweet1:02412] 3 more processes have sent help message 
>>>> help-odls-default.txt / odls-default:could-not-kill
>>>>
>>>> C:\Users\cjr\Desktop\mpi_demo>
>>>>
>>>> ====================================================
>>>>
>>>> C:\Users\cjr\Desktop\mpi_demo>mpi_scheduler.exe
>>>> scheduler: MPI_Init
>>>> --------------------------------------------------------------------------
>>>> WARNING: An invalid value was given for btl_tcp_if_exclude. This
>>>> value will be ignored.
>>>>
>>>> Local host: sweet1
>>>> Value: 127.0.0.1/8
>>>> Message: Did not find interface matching this subnet
>>>> --------------------------------------------------------------------------
>>>> -->MPI_COMM_WORLD size = 1
>>>> parent: MPI_UNIVERSE_SIZE = 1
>>>> scheduler: calling MPI_Comm_spawn for 4 instances of 'mpi_executor.exe'
>>>> executor: MPI_Init
>>>> executor: MPI_Init
>>>> executor: MPI_Init
>>>> executor: MPI_Init
>>>> [sweet1:04400] 1 more process has sent help message help-mpi-btl-tcp.txt / 
>>>> invalid if_inexclude
>>>> [sweet1:04400] Set MCA parameter "orte_base_help_aggregate" to 0 to see 
>>>> all help / error messages
>>>> subtask rank = 2 out of 4
>>>> subtask rank = 1 out of 4
>>>> subtask rank = 0 out of 4
>>>> subtask rank = 3 out of 4
>>>>
>>>> scheduler: MPI_Comm_spawn completed
>>>> scheduler broadcasting function string length = 4
>>>>
>>>> child: MPI_UNIVERSE_SIZE = 4
>>>> child: MPI_UNIVERSE_SIZE = 4
>>>> child: MPI_UNIVERSE_SIZE = 4
>>>> child: MPI_UNIVERSE_SIZE = 4
>>>> Proc0 wait for first broadcast
>>>> Proc1 wait for first broadcast
>>>> Proc2 wait for first broadcast
>>>> Proc3 wait for first broadcast
>>>>
>>>> [sweet1:04400] 3 more processes have sent help message 
>>>> help-mpi-btl-tcp.txt / invalid if_inexclude
>>>>
>>>> <<<<***mpi_executor.exe processes are running, but 'hung' while waiting for 
>>>> first broadcast***>>>>
>>>> <<<<***manually killed one of the 'mpi_executor.exe' processes; others 
>>>> subsequently exited***>>>>
>>>>
>>>> [sweet1:04400] [[22257,0],0]-[[22257,2],3] mca_oob_tcp_msg_recv: readv 
>>>> failed: Unknown error (108)
>>>> --------------------------------------------------------------------------
>>>> WARNING: A process refused to die!
>>>>
>>>> Host: sweet1
>>>> PID: 568
>>>>
>>>> This process may still be running and/or consuming resources.
>>>>
>>>> --------------------------------------------------------------------------
>>>> [sweet1:04400] [[22257,0],0]-[[22257,2],0] mca_oob_tcp_msg_recv: readv 
>>>> failed: Unknown error (108)
>>>> [sweet1:04400] [[22257,0],0]-[[22257,2],1] mca_oob_tcp_msg_recv: readv 
>>>> failed: Unknown error (108)
>>>> [sweet1:04400] 2 more processes have sent help message 
>>>> help-odls-default.txt / odls-default:could-not-kill
>>>>
>>>> C:\Users\cjr\Desktop\mpi_demo>
>>>>
>>>> ================================================
>>>>
>>>> The addition of the mpirun option "-mca btl_tcp_if_exclude none" 
>>>> eliminates the benign 127.0.0.1 warning; the option "-mca btl_base_verbose 
>>>> 100" adds output that verifies that the tcp, sm and self btl modules are 
>>>> successfully initialized - nothing else seems to be amiss!
>>>> I have also tested this with the firewall completely disabled on the 
>>>> Windows machine, with no change in behavior.
>>>>
>>>> I have been unable to determine what the error codes indicate, and am 
>>>> puzzled why the behavior is different when using the 'mpirun' launcher.
>>>> I have attached the prototype scheduler and executor program source code 
>>>> files, as well as files containing the Windows installation ompi 
>>>> information.
>>>>
>>>> Please help me figure out what is needed to enable this MPI communication.
>>>>
>>>> Thanks,
>>>> CJ Roth
>>>>
>>>>
>>>> <mpi_scheduler.cpp><mpi_executor.cpp><ompi_info-all.txt><ompi_btl_info.txt>
>>>


--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
