Christopher,
The sm btl does not work with intercommunicators and hence disqualifies
itself; I guess this is what you interpreted as 'partially working'.
I am surprised you are using a privileged port (260 < 1024); are you
running as an admin?
Open MPI is no longer supported on Windows, and the 1.6 series is pretty
antique these days...
Regardless, the source code points to
static __inline int opal_get_socket_errno(void) {
    int ret = WSAGetLastError();
    switch (ret) {
        case WSAEINTR: return EINTR;
        ...
        default:
            printf("Feature not implemented: %d %s\n", __LINE__, __FILE__);
            return OPAL_ERROR;
    };
}
First, it is worth printing ret when the feature is not implemented;
then you can patch this part and add the missing case.
Recent Windows versions (7) might return an error code that was not used
on older ones (XP).
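For example (just a sketch: the WSAECONNRESET mapping below is only a
guess, and the actual value of ret printed on your machine will tell you
which case is really missing):

    /* sketch of a local patch to opal/include/opal/opal_socket_errno.h */
    static __inline int opal_get_socket_errno(void) {
        int ret = WSAGetLastError();
        switch (ret) {
            case WSAEINTR:      return EINTR;
            /* ... existing mappings ... */
            case WSAECONNRESET: return ECONNRESET;  /* hypothetical missing case */
            default:
                /* also print the raw winsock error so the missing case is obvious */
                printf("Feature not implemented: WSA error %d at %d %s\n",
                       ret, __LINE__, __FILE__);
                return OPAL_ERROR;
        }
    }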
Cheers,
Gilles
On 6/9/2016 1:51 AM, Roth, Christopher wrote:
Well, that obvious error message states the basic problem - I was
hoping you had noticed a detail in the ompi_info output that would
point to a reason for it.
Further test runs with the options '-mca btl tcp,self' (excluding 'sm'
from the mix) and '-mca btl_base_verbose 100' supply some more
information:
------
[sweet1:04556] btl: tcp: attempting to connect() to address 10.3.2.109
on port 260
[sweet1:04556] btl: tcp: attempting to connect() to address 10.3.2.109
on port 260
------
The IP address is the host machine's. The process ID corresponds to
the first of the executor programs. The programs now hang at that
point (within the scheduler's MPI_Comm_spawn call and the executors'
MPI_Init calls) and have to be manually killed.
Yet another test, adding the '-mca mpi_preconnect_mpi 1' (along with
the other two added arguments), gives more info:
------
[sweet1:04976] btl: tcp: attempting to connect() to address 10.3.2.109
on port 260
[sweet1:04516] btl: tcp: attempting to connect() to address 10.3.2.109
on port 260
[sweet1:03824] btl: tcp: attempting to connect() to address 10.3.2.109
on port 260
[sweet1][[17613,2],1][..\..\..\openmpi-1.6.2\ompi\mca\btl\tcp\btl_tcp_endpoint.c:486:..\..\..\openmpi-1.6.2\ompi\mca\btl
\tcp\btl_tcp_endpoint.c] received unexpected process identifier
[[17613,2],0]
[sweet1][[17613,2],0][..\..\..\openmpi-1.6.2\ompi\mca\btl\tcp\btl_tcp_frag.c:215:..\..\..\openmpi-1.6.2\ompi\mca\btl\tcp
\btl_tcp_frag.c] Feature not implemented: 130
D:/temp/OpenMPI/openmpi-1.6.2/opal/include\opal/opal_socket_errno.h
Feature not implemented: 130
D:/temp/OpenMPI/openmpi-1.6.2/opal/include\opal/opal_socket_errno.h
mca_btl_tcp_frag_recv: readv failed: Unknown error (-1)
------
With the 'preconnect' option, the run sets up the TCP link for all of the
executor processes, but then hits an actual error about a function not
being implemented. This option is not required, but I had to give it a
whirl.
All of these test runs have the same behavior when performed with and
without the firewall active.
The fact that the executor programs don't get past the MPI_Init call
when 'sm' is excluded from the btl set implies that 'sm' is at least
partially working.
------------------------------------------------------------------------
*From:* users [users-boun...@open-mpi.org] on behalf of Ralph Castain
[r...@open-mpi.org]
*Sent:* Wednesday, June 08, 2016 10:47 AM
*To:* Open MPI Users
*Subject:* Re: [OMPI users] Processes unable to communicate when using
MPI_Comm_spawn on Windows
On Jun 8, 2016, at 4:30 AM, Roth, Christopher <cr...@aer.com
<mailto:cr...@aer.com>> wrote:
What part of this output indicates this non-communicative configuration?
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.
Process 1 ([[20141,1],0]) is on host: sweet1
Process 2 ([[20141,2],0]) is on host: sweet1
BTLs attempted: tcp sm self
Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
Both procs are on the same host. Since they cannot communicate, it
means that (a) the shared memory component (sm) was unable to be used,
and (b) the TCP subsystem did not provide a usable address for the two
procs to reach each other. The former could mean that there wasn’t
enough room in the tmp directory, and the latter indicates that the
TCP subsystem isn’t configured to allow connections from its own local
IP address.
I don’t know anything about Windows configuration I’m afraid.
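That said, if the TCP BTL really has no usable local address, one generic
thing worth trying (only a guess; btl_tcp_if_include is a standard Open MPI
MCA parameter, and <host-subnet> below is just a placeholder for the
machine's actual IP subnet) is to point the TCP BTL explicitly at the real
interface instead of excluding the loopback:

    mpirun -np 1 -mca btl tcp,self -mca btl_tcp_if_include <host-subnet>/24 mpi_scheduler.exe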
Please recall, this is using the precompiled Open MPI Windows installation.
When the 'verbose' option is added, I see this sequence of output for
the scheduler and each of the executor processes:
------
[sweet1:06412] mca: base: components_open: Looking for btl components
[sweet1:06412] mca: base: components_open: opening btl components
[sweet1:06412] mca: base: components_open: found loaded component tcp
[sweet1:06412] mca: base: components_open: component tcp register
function successful
[sweet1:06412] mca: base: components_open: component tcp open
function successful
[sweet1:06412] mca: base: components_open: found loaded component sm
[sweet1:06412] mca: base: components_open: component sm has no
register function
[sweet1:06412] mca: base: components_open: component sm open function
successful
[sweet1:06412] mca: base: components_open: found loaded component self
[sweet1:06412] mca: base: components_open: component self has no
register function
[sweet1:06412] mca: base: components_open: component self open
function successful
[sweet1:06412] select: initializing btl component tcp
[sweet1:06412] select: init of component tcp returned success
[sweet1:06412] select: initializing btl component sm
[sweet1:06412] select: init of component sm returned success
[sweet1:06412] select: initializing btl component self
[sweet1:06412] select: init of component self returned success
-------
This output appears to show that the btl components for TCP, SM and Self
are all available, but this is contradicted by the error message shown
in the initial post ("At least one pair of MPI processes are unable
to reach each other for MPI communications....").
If there is some sort of misconfiguration present, do you have a
suggestion for correcting the situation?
------------------------------------------------------------------------
*From:* users [users-boun...@open-mpi.org
<mailto:users-boun...@open-mpi.org>] on behalf of Ralph Castain
[r...@open-mpi.org <mailto:r...@open-mpi.org>]
*Sent:* Tuesday, June 07, 2016 3:39 PM
*To:* Open MPI Users
*Subject:* Re: [OMPI users] Processes unable to communicate when using
MPI_Comm_spawn on Windows
Just looking at this output, it would appear that Windows is
configured in a way that prevents the procs from connecting to each
other via TCP while on the same node, and shared memory is
disqualifying itself - which leaves no way for two procs on the same
node to communicate.
On Jun 7, 2016, at 12:16 PM, Roth, Christopher <cr...@aer.com
<mailto:cr...@aer.com>> wrote:
I have developed a set of C++ MPI programs for performing a series
of scientific calculations. The master 'scheduler' program spawns
off sets of parallelized 'executor' programs using the
MPI_Comm_spawn routine; these executors communicate back and forth
with the scheduler (only small amounts of information) via
MPI_Bcast, MPI_Recv and MPI_Send routines (the 'C' library versions).
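In rough outline, the scheduler side follows this pattern (a minimal
sketch, not the attached source; the string content and the use of
MPI_COMM_SELF are illustrative, while the executable name and counts
match the logs below):

    /* scheduler: spawn the executors, then broadcast over the intercommunicator */
    #include <mpi.h>
    #include <string.h>

    int main(int argc, char **argv) {
        MPI_Comm children;
        char func[] = "calc";              /* "function string" of length 4 */
        int  len    = (int)strlen(func);

        MPI_Init(&argc, &argv);
        MPI_Comm_spawn("mpi_executor.exe", MPI_ARGV_NULL, 4,
                       MPI_INFO_NULL, 0, MPI_COMM_SELF,
                       &children, MPI_ERRCODES_IGNORE);

        /* on the intercommunicator the (single) parent broadcasts as MPI_ROOT;  */
        /* each executor calls MPI_Comm_get_parent() and then                    */
        /* MPI_Bcast(..., 0, parent) to receive                                  */
        MPI_Bcast(&len, 1, MPI_INT, MPI_ROOT, children);
        MPI_Bcast(func, len, MPI_CHAR, MPI_ROOT, children);

        MPI_Finalize();
        return 0;
    }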
This software was originally developed on a multi-core Linux machine
using Open MPI v1.5.2, and works extremely well; now I'm attempting
to port it to a multi-core Windows 7 machine, using Visual Studio
2012 and the precompiled Open MPI v1.6.2 Windows release. It all
compiles and links properly under VS2012.
When attempting to run this software on the Windows machine, the
scheduler program is able to spawn off the executor programs as
intended, but everything chokes when the scheduler sends its initial
broadcast. There is slightly different behavior when launching the
scheduler via 'mpirun' or just by itself, as shown in the logs below:
(the warning about the 127.0.0.1 address is benign - there is no
loopback on Windows)
C:\Users\cjr\Desktop\mpi_demo>mpirun -np 1 mpi_scheduler.exe
scheduler: MPI_Init
--------------------------------------------------------------------------
WARNING: An invalid value was given for btl_tcp_if_exclude. This
value will be ignored.
Local host: sweet1
Value: 127.0.0.1/8
Message: Did not find interface matching this subnet
--------------------------------------------------------------------------
-->MPI_COMM_WORLD size = 1
parent: MPI_UNIVERSE_SIZE = 1
scheduler: calling MPI_Comm_spawn for 4 instances of 'mpi_executor.exe'
executor: MPI_Init
executor: MPI_Init
executor: MPI_Init
executor: MPI_Init
[sweet1][[20141,1],0][..\..\..\openmpi-1.6.2\ompi\mca\btl\tcp\btl_tcp_proc.c:128:..\..\..\openmpi-1.6.2\ompi\mca\btl\tcp
\btl_tcp_proc.c] mca_base_modex_recv: failed with return value=-13
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.
Process 1 ([[20141,1],0]) is on host: sweet1
Process 2 ([[20141,2],0]) is on host: sweet1
BTLs attempted: tcp sm self
Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
subtask rank = 1 out of 4
subtask rank = 2 out of 4
subtask rank = 0 out of 4
subtask rank = 3 out of 4
scheduler: MPI_Comm_spawn completed
scheduler broadcasting function string length = 4
child: MPI_UNIVERSE_SIZE = 4
child: MPI_UNIVERSE_SIZE = 4
child: MPI_UNIVERSE_SIZE = 4
child: MPI_UNIVERSE_SIZE = 4
Proc0 wait for first broadcast
Proc1 wait for first broadcast
Proc2 wait for first broadcast
Proc3 wait for first broadcast
[sweet1:6800] *** An error occurred in MPI_Bcast
[sweet1:6800] *** on communicator
[sweet1:6800] *** MPI_ERR_INTERN: internal error
[sweet1:6800] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
[sweet1:02412] [[20141,0],0]-[[20141,1],0] mca_oob_tcp_msg_recv:
readv failed: Unknown error (108)
[sweet1:02412] 4 more processes have sent help message
help-mpi-btl-tcp.txt / invalid if_inexclude
[sweet1:02412] Set MCA parameter "orte_base_help_aggregate" to 0 to
see all help / error messages
--------------------------------------------------------------------------
WARNING: A process refused to die!
Host: sweet1
PID: 524
This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
[sweet1:02412] [[20141,0],0]-[[20141,2],1] mca_oob_tcp_msg_recv:
readv failed: Unknown error (108)
[sweet1:02412] [[20141,0],0]-[[20141,2],0] mca_oob_tcp_msg_recv:
readv failed: Unknown error (108)
[sweet1:02412] [[20141,0],0]-[[20141,2],2] mca_oob_tcp_msg_recv:
readv failed: Unknown error (108)
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 488 on
node sweet1 exiting improperly. There are two reasons this could occur:
1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.
2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"
This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[sweet1:02412] 3 more processes have sent help message
help-odls-default.txt / odls-default:could-not-kill
C:\Users\cjr\Desktop\mpi_demo>
====================================================
C:\Users\cjr\Desktop\mpi_demo>mpi_scheduler.exe
scheduler: MPI_Init
--------------------------------------------------------------------------
WARNING: An invalid value was given for btl_tcp_if_exclude. This
value will be ignored.
Local host: sweet1
Value: 127.0.0.1/8
Message: Did not find interface matching this subnet
--------------------------------------------------------------------------
-->MPI_COMM_WORLD size = 1
parent: MPI_UNIVERSE_SIZE = 1
scheduler: calling MPI_Comm_spawn for 4 instances of 'mpi_executor.exe'
executor: MPI_Init
executor: MPI_Init
executor: MPI_Init
executor: MPI_Init
[sweet1:04400] 1 more process has sent help message
help-mpi-btl-tcp.txt / invalid if_inexclude
[sweet1:04400] Set MCA parameter "orte_base_help_aggregate" to 0 to
see all help / error messages
subtask rank = 2 out of 4
subtask rank = 1 out of 4
subtask rank = 0 out of 4
subtask rank = 3 out of 4
scheduler: MPI_Comm_spawn completed
scheduler broadcasting function string length = 4
child: MPI_UNIVERSE_SIZE = 4
child: MPI_UNIVERSE_SIZE = 4
child: MPI_UNIVERSE_SIZE = 4
child: MPI_UNIVERSE_SIZE = 4
Proc0 wait for first broadcast
Proc1 wait for first broadcast
Proc2 wait for first broadcast
Proc3 wait for first broadcast
[sweet1:04400] 3 more processes have sent help message
help-mpi-btl-tcp.txt / invalid if_inexclude
<<<<***mpi_executor.exe processes are running, but 'hung' while
waiting for first broadcast***>>>>
<<<<***manually killed one of the 'mpi_executor.exe' processes;
others subsequently exited***>>>>
[sweet1:04400] [[22257,0],0]-[[22257,2],3] mca_oob_tcp_msg_recv:
readv failed: Unknown error (108)
--------------------------------------------------------------------------
WARNING: A process refused to die!
Host: sweet1
PID: 568
This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
[sweet1:04400] [[22257,0],0]-[[22257,2],0] mca_oob_tcp_msg_recv:
readv failed: Unknown error (108)
[sweet1:04400] [[22257,0],0]-[[22257,2],1] mca_oob_tcp_msg_recv:
readv failed: Unknown error (108)
[sweet1:04400] 2 more processes have sent help message
help-odls-default.txt / odls-default:could-not-kill
C:\Users\cjr\Desktop\mpi_demo>
================================================
The addition of the mpirun option "-mca btl_tcp_if_exclude none"
eliminates the benign 127.0.0.1 warning; the option "-mca
btl_base_verbose 100" adds output that verifies that the tcp, sm and
self btl modules are successfully initialized - nothing else seems
to be amiss!
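For reference, the full invocation with these options looks roughly like
this (same executable and single-process launch as in the logs above):

    mpirun -np 1 -mca btl_tcp_if_exclude none -mca btl_base_verbose 100 mpi_scheduler.exe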
I have also tested this with the firewall completely disabled on the
Windows machine, with no change in behavior.
I have been unable to determine what the error codes indicate, and
am puzzled why the behavior is different when using the 'mpirun'
launcher.
I have attached the prototype scheduler and executor program source
code files, as well as files containing the Windows installation
ompi information.
Please help me figure out what is needed to enable this MPI
communication.
Thanks,
CJ Roth
<mpi_scheduler.cpp><mpi_executor.cpp><ompi_info-all.txt><ompi_btl_info.txt>