On Sep 30, 2014, at 11:19 AM, Lee-Ping Wang <leep...@stanford.edu> wrote:

> Hi Ralph,
> 
>>>  If so, then I should be able to (1) locate where the port number is 
>>> defined in the code, and (2) randomize the port number every time it's 
>>> called to work around the issue.  What do you think?
>> 
>> That might work, depending on the code. I'm not sure what it is trying to 
>> connect to, and if that code knows how to handle arbitrary connections
> 
> 
> The main reason Q-Chem uses MPI is to run parallel tasks on a single node.  
> Thus, I think it's just the MPI ranks attempting to connect with each other 
> on the same machine.  I could be off the mark here, since I'm still a novice 
> with MPI concepts, but I am sure everything runs on just one machine.

Your statement doesn't match what you sent us - the output shows that it was 
your connection code that failed, not ours. If our connections had failed, you 
wouldn't have gotten that far: the job would have died in MPI_Init. You are 
clearly well past that point, since you already passed an MPI_Barrier before 
reaching the code in question.

> 
>> You might check about those warnings - could be that QCLOCALSCR and QCREF 
>> need to be set for the code to work.
> 
> Thanks; I don't think these environment variables are the issue, but I will 
> check again.  The calculation runs without any problems on four different 
> clusters (where I don't set these environment variables either); it only 
> breaks on the Blue Waters compute node.  Also, the calculation runs without 
> any problems the first time it's executed on the BW compute node - it's only 
> subsequent executions that give the error messages.
> 
> Thanks,
> 
> - Lee-Ping
> 
> On Sep 30, 2014, at 11:05 AM, Ralph Castain <r...@open-mpi.org> wrote:
> 
>> 
>> On Sep 30, 2014, at 10:49 AM, Lee-Ping Wang <leep...@stanford.edu> wrote:
>> 
>>> Hi Ralph,
>>> 
>>> Thank you.  I think your diagnosis is probably correct.  Are these sockets 
>>> the same kind of TCP/UDP ports (just with different numbers) that are used 
>>> by web servers, email, etc.?
>> 
>> Yes
>> 
>>>  If so, then I should be able to (1) locate where the port number is 
>>> defined in the code, and (2) randomize the port number every time it's 
>>> called to work around the issue.  What do you think?
>> 
>> That might work, depending on the code. I'm not sure what it is trying to 
>> connect to, and if that code knows how to handle arbitrary connections
>> 
>> You might check about those warnings - could be that QCLOCALSCR and QCREF 
>> need to be set for the code to work.
>> 
>>> 
>>> - Lee-Ping
>>> 
>>> On Sep 29, 2014, at 8:45 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>> 
>>>> I don't know anything about your application, or what the functions in 
>>>> your code are doing. I imagine it's possible that you are trying to open 
>>>> statically defined ports, which means that running the job again too soon 
>>>> could leave the OS thinking the socket is still busy. It takes a while 
>>>> for the OS to release a socket resource.
>>>> 
>>>> 
>>>> On Sep 29, 2014, at 5:49 PM, Lee-Ping Wang <leep...@stanford.edu> wrote:
>>>> 
>>>>> Here's another data point that might be useful: the error message is much 
>>>>> rarer if I run my application on 4 cores instead of 8.
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> - Lee-Ping
>>>>> 
>>>>> On Sep 29, 2014, at 5:38 PM, Lee-Ping Wang <leep...@stanford.edu> wrote:
>>>>> 
>>>>>> Sorry for my last email - I think I spoke too quickly.  I realized after 
>>>>>> reading some more documentation that Open MPI always uses TCP sockets for 
>>>>>> out-of-band communication, so it doesn't make sense for me to set 
>>>>>> OMPI_MCA_oob=^tcp.  That said, I am still running into a strange problem 
>>>>>> in my application when running on a specific machine (a Blue Waters 
>>>>>> compute node); I don't see this problem on any other nodes.
>>>>>> 
>>>>>> When I run the same job (~5 seconds) in rapid succession, I see the 
>>>>>> following error message on the second execution:
>>>>>> 
>>>>>> /tmp/leeping/opt/qchem-4.2/bin/parallel.csh,  , qcopt_reactants.in, 8, 
>>>>>> 0, ./qchem24825/
>>>>>> MPIRUN in parallel.csh is 
>>>>>> /tmp/leeping/opt/qchem-4.2/ext-libs/openmpi/bin/mpirun
>>>>>> P4_RSHCOMMAND in parallel.csh is ssh
>>>>>> QCOUTFILE is stdout
>>>>>> Q-Chem machineFile is /tmp/leeping/opt/qchem-4.2/bin/mpi/machines
>>>>>> [nid15081:24859] Warning: could not find environment variable 
>>>>>> "QCLOCALSCR"
>>>>>> [nid15081:24859] Warning: could not find environment variable "QCREF"
>>>>>> initial socket setup ...start
>>>>>> -------------------------------------------------------
>>>>>> Primary job  terminated normally, but 1 process returned
>>>>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>>>>> -------------------------------------------------------
>>>>>> --------------------------------------------------------------------------
>>>>>> mpirun detected that one or more processes exited with non-zero status, 
>>>>>> thus causing
>>>>>> the job to be terminated. The first process to do so was:
>>>>>> 
>>>>>>   Process name: [[46773,1],0]
>>>>>>   Exit code:    255
>>>>>> --------------------------------------------------------------------------
>>>>>> 
>>>>>> And here's the source code where the program is exiting (before it prints 
>>>>>> "initial socket setup ...done"):
>>>>>> 
>>>>>> int GPICommSoc::init(MPI_Comm comm0) {
>>>>>> 
>>>>>>     /* setup basic MPI information */
>>>>>>     init_comm(comm0);
>>>>>> 
>>>>>>     MPI_Barrier(comm);
>>>>>>     /*-- start inisock and set serveradd[] array --*/
>>>>>>     if (me == 0) {
>>>>>>         fprintf(stdout,"initial socket setup ...start\n");
>>>>>>         fflush(stdout);
>>>>>>     }
>>>>>> 
>>>>>>     // create the initial socket 
>>>>>>     inisock = new_server_socket(NULL,0);
>>>>>> 
>>>>>>     // fill and gather the serveraddr array
>>>>>>     int szsock = sizeof(SOCKADDR);
>>>>>>     memset(&serveraddr[0],0, szsock*nproc);
>>>>>>     int iniport=get_sockport(inisock);
>>>>>>     set_sockaddr_byhname(NULL, iniport, &serveraddr[me]);
>>>>>>     //printsockaddr( serveraddr[me] );
>>>>>> 
>>>>>>     SOCKADDR addrsend = serveraddr[me];
>>>>>>     MPI_Allgather(&addrsend,szsock,MPI_BYTE,
>>>>>>                   &serveraddr[0], szsock,MPI_BYTE, comm);
>>>>>>     if (me == 0) {
>>>>>>         fprintf(stdout,"initial socket setup ...done \n");
>>>>>>         fflush(stdout);
>>>>>>     }
>>>>>> 
>>>>>> I didn't write this part of the program, and I'm really a novice with 
>>>>>> MPI - but it seems like the first execution of the program isn't freeing 
>>>>>> up some system resource the way it should.  Is there something in the 
>>>>>> code that needs to be corrected?
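>>>>>> 
>>>>>> In case it helps, here is my guess (just an assumption about the helper, 
>>>>>> not Q-Chem's actual source) at what get_sockport() does after 
>>>>>> new_server_socket(NULL, 0): if that call really binds to port 0, the 
>>>>>> kernel already assigns a fresh ephemeral port on every run, and 
>>>>>> get_sockport() would simply read it back with getsockname():
>>>>>> 
>>>>>> #include <sys/socket.h>
>>>>>> #include <netinet/in.h>
>>>>>> #include <arpa/inet.h>
>>>>>> 
>>>>>> // Hypothetical reading of get_sockport(): return the port the kernel
>>>>>> // actually assigned to an already-bound listening socket.
>>>>>> int get_listen_port(int fd) {
>>>>>>     sockaddr_in addr;
>>>>>>     socklen_t len = sizeof(addr);
>>>>>>     if (getsockname(fd, (sockaddr*)&addr, &len) < 0)
>>>>>>         return -1;                 // caller should check errno
>>>>>>     return ntohs(addr.sin_port);   // ephemeral port chosen at bind time
>>>>>> }
>>>>>> 
>>>>>> If that reading is right, a fixed-port collision seems less likely, and 
>>>>>> whatever the first run fails to release is probably something else.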
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>> - Lee-Ping
>>>>>> 
>>>>>> On Sep 29, 2014, at 5:12 PM, Lee-Ping Wang <leep...@stanford.edu> wrote:
>>>>>> 
>>>>>>> Hi there,
>>>>>>> 
>>>>>>> My application uses MPI to run parallel jobs on a single node, so I 
>>>>>>> have no need for any inter-node communication support.  However, when 
>>>>>>> I use mpirun to launch my application, I see strange errors such as:
>>>>>>> 
>>>>>>> --------------------------------------------------------------------------
>>>>>>> No network interfaces were found for out-of-band communications. We 
>>>>>>> require
>>>>>>> at least one available network for out-of-band messaging.
>>>>>>> --------------------------------------------------------------------------
>>>>>>> 
>>>>>>> [nid23206:10697] [[33772,1],0] ORTE_ERROR_LOG: Unable to open a TCP 
>>>>>>> socket for out-of-band communications in file oob_tcp_listener.c at 
>>>>>>> line 113
>>>>>>> [nid23206:10697] [[33772,1],0] ORTE_ERROR_LOG: Unable to open a TCP 
>>>>>>> socket for out-of-band communications in file oob_tcp_component.c at 
>>>>>>> line 584
>>>>>>> --------------------------------------------------------------------------
>>>>>>> It looks like orte_init failed for some reason; your parallel process is
>>>>>>> likely to abort.  There are many reasons that a parallel process can
>>>>>>> fail during orte_init; some of which are due to configuration or
>>>>>>> environment problems.  This failure appears to be an internal failure;
>>>>>>> here's some additional information (which may only be relevant to an
>>>>>>> Open MPI developer):
>>>>>>> 
>>>>>>>   orte_oob_base_select failed
>>>>>>>   --> Returned value (null) (-43) instead of ORTE_SUCCESS
>>>>>>> --------------------------------------------------------------------------
>>>>>>> 
>>>>>>> /home/leeping/opt/qchem-4.2/ext-libs/openmpi/lib/libmpi.so.1(+0xfeaa9)[0x2b77e9de5aa9]
>>>>>>> /home/leeping/opt/qchem-4.2/ext-libs/openmpi/lib/libmpi.so.1(ompi_btl_openib_connect_base_select_for_local_port+0xd0)[0x2b77e9de13a0]
>>>>>>> 
>>>>>>> It seems like in each case, Open MPI is trying to use some 
>>>>>>> networking-related feature and crashing as a result.  My workaround is 
>>>>>>> to deduce which components are crashing and disable them via 
>>>>>>> environment variables like this:
>>>>>>> 
>>>>>>> export OMPI_MCA_btl=self,sm
>>>>>>> export OMPI_MCA_oob=^tcp
>>>>>>> 
>>>>>>> Is there a better way to do this - i.e., explicitly prohibit Open MPI 
>>>>>>> from using any network-related features and run only on the local node?
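>>>>>>> 
>>>>>>> (For what it's worth, one variant I've wondered about - untested, and 
>>>>>>> based only on my reading of the MCA parameter list - would be to keep 
>>>>>>> the TCP out-of-band component but pin it to the loopback interface 
>>>>>>> instead of disabling it outright:
>>>>>>> 
>>>>>>> export OMPI_MCA_btl=self,sm
>>>>>>> export OMPI_MCA_oob_tcp_if_include=lo
>>>>>>> 
>>>>>>> but I don't know whether that is the right way to keep everything on 
>>>>>>> the local node.)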
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> - Lee-Ping
>>>>>>> 
