On Sep 30, 2014, at 10:49 AM, Lee-Ping Wang <leep...@stanford.edu> wrote:

> Hi Ralph,
> 
> Thank you.  I think your diagnosis is probably correct.  Are these sockets 
> the same kind of TCP/UDP ports (though with different numbers) that are 
> used by web servers, email, etc.?

Yes

>  If so, then I should be able to (1) locate where the port number is defined 
> in the code, and (2) randomize the port number every time it's called to work 
> around the issue.  What do you think?

That might work, depending on the code. I'm not sure what it is trying to 
connect to, or whether that code knows how to handle arbitrary connections.
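
If the socket code hard-codes its listening port, one common alternative to 
randomizing the number yourself is to bind to port 0 and let the OS pick a 
free ephemeral port. A minimal sketch of that idea (the function name and 
error handling here are mine, not from your code):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Bind to port 0 so the OS assigns any free ephemeral port, then
 * query which port was actually chosen. Illustrative sketch only. */
int open_ephemeral_server(int *port_out)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(0);   /* 0 = "any free port" */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 8) < 0) {
        close(fd);
        return -1;
    }

    struct sockaddr_in bound;          /* recover the assigned port */
    socklen_t len = sizeof(bound);
    if (getsockname(fd, (struct sockaddr *)&bound, &len) < 0) {
        close(fd);
        return -1;
    }
    *port_out = ntohs(bound.sin_port);
    return fd;
}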

You might also check on those warnings - it could be that QCLOCALSCR and 
QCREF need to be set for the code to work.
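
For example (the paths below are just placeholders - the Q-Chem docs should 
say what these variables are supposed to point at):

export QCLOCALSCR=/tmp/scratch
export QCREF=/path/to/qcref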

> 
> - Lee-Ping
> 
> On Sep 29, 2014, at 8:45 PM, Ralph Castain <r...@open-mpi.org> wrote:
> 
>> I don't know anything about your application or what the functions in your 
>> code are doing. I imagine it's possible that you are trying to open 
>> statically defined ports, which means that running the job again too soon 
>> could leave the OS thinking the socket is already busy. It takes a while 
>> for the OS to release a socket resource.
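>> 
>> If the port is fixed and can't easily be changed, another common workaround 
>> is to set SO_REUSEADDR on the listening socket before the bind, which lets 
>> you rebind a port that is still in TIME_WAIT from the previous run - 
>> something along these lines (untested sketch; fd is your listening socket):
>> 
>>     int on = 1;
>>     /* allow rebinding a port left in TIME_WAIT by an earlier run */
>>     if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)) < 0)
>>         perror("setsockopt(SO_REUSEADDR)");
>>     /* ... then bind() and listen() as before ... */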
>> 
>> 
>> On Sep 29, 2014, at 5:49 PM, Lee-Ping Wang <leep...@stanford.edu> wrote:
>> 
>>> Here's another data point that might be useful: the error message is much 
>>> rarer if I run my application on 4 cores instead of 8.
>>> 
>>> Thanks,
>>> 
>>> - Lee-Ping
>>> 
>>> On Sep 29, 2014, at 5:38 PM, Lee-Ping Wang <leep...@stanford.edu> wrote:
>>> 
>>>> Sorry for my last email - I think I spoke too quickly.  I realized after 
>>>> reading some more documentation that Open MPI always uses TCP sockets for 
>>>> out-of-band communication, so it doesn't make sense for me to set 
>>>> OMPI_MCA_oob=^tcp.  That said, I am still running into a strange problem 
>>>> in my application when running on a specific machine (a Blue Waters 
>>>> compute node); I don't see this problem on any other nodes.
>>>> 
>>>> When I run the same job (~5 seconds) in rapid succession, I see the 
>>>> following error message on the second execution:
>>>> 
>>>> /tmp/leeping/opt/qchem-4.2/bin/parallel.csh,  , qcopt_reactants.in, 8, 0, 
>>>> ./qchem24825/
>>>> MPIRUN in parallel.csh is 
>>>> /tmp/leeping/opt/qchem-4.2/ext-libs/openmpi/bin/mpirun
>>>> P4_RSHCOMMAND in parallel.csh is ssh
>>>> QCOUTFILE is stdout
>>>> Q-Chem machineFile is /tmp/leeping/opt/qchem-4.2/bin/mpi/machines
>>>> [nid15081:24859] Warning: could not find environment variable "QCLOCALSCR"
>>>> [nid15081:24859] Warning: could not find environment variable "QCREF"
>>>> initial socket setup ...start
>>>> -------------------------------------------------------
>>>> Primary job  terminated normally, but 1 process returned
>>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>>> -------------------------------------------------------
>>>> --------------------------------------------------------------------------
>>>> mpirun detected that one or more processes exited with non-zero status, 
>>>> thus causing
>>>> the job to be terminated. The first process to do so was:
>>>> 
>>>>   Process name: [[46773,1],0]
>>>>   Exit code:    255
>>>> --------------------------------------------------------------------------
>>>> 
>>>> And here's the section of source code where the program exits (before it 
>>>> prints "initial socket setup ...done"):
>>>> 
>>>> int GPICommSoc::init(MPI_Comm comm0) {
>>>> 
>>>>     /* setup basic MPI information */
>>>>     init_comm(comm0);
>>>> 
>>>>     MPI_Barrier(comm);
>>>>     /*-- start inisock and set serveradd[] array --*/
>>>>     if (me == 0) {
>>>>         fprintf(stdout,"initial socket setup ...start\n");
>>>>         fflush(stdout);
>>>>     }
>>>> 
>>>>     // create the initial server socket; port 0 asks the OS to
>>>>     // assign a free ephemeral port (retrieved below)
>>>>     inisock = new_server_socket(NULL,0);
>>>> 
>>>>     // fill and gather the serveraddr array
>>>>     int szsock = sizeof(SOCKADDR);
>>>>     memset(&serveraddr[0],0, szsock*nproc);
>>>>     int iniport=get_sockport(inisock);
>>>>     set_sockaddr_byhname(NULL, iniport, &serveraddr[me]);
>>>>     //printsockaddr( serveraddr[me] );
>>>> 
>>>>     SOCKADDR addrsend = serveraddr[me];
>>>>     MPI_Allgather(&addrsend,szsock,MPI_BYTE,
>>>>                   &serveraddr[0], szsock,MPI_BYTE, comm);
>>>>     if (me == 0) {
>>>>         fprintf(stdout,"initial socket setup ...done \n");
>>>>         fflush(stdout);
>>>>     }
>>>> 
>>>> I didn't write this part of the program, and I'm a novice at MPI - but it 
>>>> seems like the initial execution of the program isn't freeing up some 
>>>> system resource as it should.  Is there something that needs to be 
>>>> corrected in the code?
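>>>> 
>>>> One thing I notice is that the quoted section never closes inisock; if 
>>>> the rest of the code doesn't either, maybe an explicit cleanup would 
>>>> help.  Something like this is just my guess (the finalize() name is 
>>>> mine, and I'm assuming inisock is a plain file descriptor):
>>>> 
>>>>     /* hypothetical cleanup - release the listening socket at shutdown */
>>>>     void GPICommSoc::finalize() {
>>>>         if (inisock >= 0) {
>>>>             close(inisock);
>>>>             inisock = -1;
>>>>         }
>>>>     }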
>>>> 
>>>> Thanks,
>>>> 
>>>> - Lee-Ping
>>>> 
>>>> On Sep 29, 2014, at 5:12 PM, Lee-Ping Wang <leep...@stanford.edu> wrote:
>>>> 
>>>>> Hi there,
>>>>> 
>>>>> My application uses MPI to run parallel jobs on a single node, so I have 
>>>>> no need of any support for communication between nodes.  However, when I 
>>>>> use mpirun to launch my application I see strange errors such as:
>>>>> 
>>>>> --------------------------------------------------------------------------
>>>>> No network interfaces were found for out-of-band communications. We 
>>>>> require
>>>>> at least one available network for out-of-band messaging.
>>>>> --------------------------------------------------------------------------
>>>>> 
>>>>> [nid23206:10697] [[33772,1],0] ORTE_ERROR_LOG: Unable to open a TCP 
>>>>> socket for out-of-band communications in file oob_tcp_listener.c at line 
>>>>> 113
>>>>> [nid23206:10697] [[33772,1],0] ORTE_ERROR_LOG: Unable to open a TCP 
>>>>> socket for out-of-band communications in file oob_tcp_component.c at line 
>>>>> 584
>>>>> --------------------------------------------------------------------------
>>>>> It looks like orte_init failed for some reason; your parallel process is
>>>>> likely to abort.  There are many reasons that a parallel process can
>>>>> fail during orte_init; some of which are due to configuration or
>>>>> environment problems.  This failure appears to be an internal failure;
>>>>> here's some additional information (which may only be relevant to an
>>>>> Open MPI developer):
>>>>> 
>>>>>   orte_oob_base_select failed
>>>>>   --> Returned value (null) (-43) instead of ORTE_SUCCESS
>>>>> --------------------------------------------------------------------------
>>>>> 
>>>>> /home/leeping/opt/qchem-4.2/ext-libs/openmpi/lib/libmpi.so.1(+0xfeaa9)[0x2b77e9de5aa9]
>>>>> /home/leeping/opt/qchem-4.2/ext-libs/openmpi/lib/libmpi.so.1(ompi_btl_openib_connect_base_select_for_local_port+0xd0)[0x2b77e9de13a0]
>>>>> 
>>>>> It seems like in each case, Open MPI is trying to use some networking 
>>>>> feature and crashing as a result.  My workaround has been to deduce 
>>>>> which components are crashing and disable them via environment variables 
>>>>> like this:
>>>>> 
>>>>> export OMPI_MCA_btl=self,sm
>>>>> export OMPI_MCA_oob=^tcp
>>>>> 
>>>>> Is there a better way to do this - i.e., explicitly prohibit Open MPI 
>>>>> from using any network-related features and run only on the local node?
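>>>>> 
>>>>> For instance, I've been wondering whether restricting the out-of-band 
>>>>> TCP component to the loopback interface, rather than disabling it 
>>>>> outright, would be safer - though I haven't verified that this is the 
>>>>> right parameter:
>>>>> 
>>>>> export OMPI_MCA_btl=self,sm
>>>>> export OMPI_MCA_oob_tcp_if_include=lo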
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> - Lee-Ping