Hi Ralph,

Thanks.  I'll add some print statements to the code and try to figure out 
precisely where the failure is happening.
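
For reference, the kind of check I have in mind looks roughly like this.  It 
is only a sketch built from standard POSIX calls (the real code goes through 
Q-Chem's new_server_socket(), whose internals I haven't dug into yet, and 
open_server_socket_verbose is just an illustrative name), but it shows the 
errno reporting I want around each socket call:

#include <cerrno>
#include <cstdio>
#include <cstring>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Open a listening TCP socket on an OS-assigned (ephemeral) port,
// printing the errno string from any step that fails.
static int open_server_socket_verbose(int rank) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        fprintf(stderr, "rank %d: socket() failed: %s\n", rank, strerror(errno));
        return -1;
    }
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(0);   // port 0: let the OS pick a free port
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        fprintf(stderr, "rank %d: bind() failed: %s\n", rank, strerror(errno));
        close(fd);
        return -1;
    }
    if (listen(fd, 16) < 0) {
        fprintf(stderr, "rank %d: listen() failed: %s\n", rank, strerror(errno));
        close(fd);
        return -1;
    }
    return fd;
}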

- Lee-Ping

On Sep 30, 2014, at 12:06 PM, Ralph Castain <r...@open-mpi.org> wrote:

> 
> On Sep 30, 2014, at 11:19 AM, Lee-Ping Wang <leep...@stanford.edu> wrote:
> 
>> Hi Ralph,
>> 
>>>>  If so, then I should be able to (1) locate where the port number is 
>>>> defined in the code, and (2) randomize the port number every time it's 
>>>> called to work around the issue.  What do you think?
>>> 
>>> That might work, depending on the code. I'm not sure what it is trying to 
>>> connect to, and if that code knows how to handle arbitrary connections
>> 
>> 
>> The main reason Q-Chem uses MPI is to execute parallel tasks on a single 
>> node.  Thus, I think it's just the MPI ranks attempting to connect with 
>> each other on the same machine.  I could be off the mark here, since I'm 
>> still a novice with MPI concepts - but I am sure it is just one machine.
> 
> Your statement doesn't match what you sent us - you showed that it was your 
> connection code that was failing, not ours. You wouldn't have gotten that far 
> if our connections had failed, as you would have failed in MPI_Init. You are 
> clearly much further along than that, since you already passed an MPI_Barrier 
> before reaching the code in question.
> 
>> 
>>> You might check about those warnings - could be that QCLOCALSCR and QCREF 
>>> need to be set for the code to work.
>> 
>> Thanks; I don't think these environment variables are the issue, but I will 
>> check again.  The calculation runs without any problems on four different 
>> clusters (where I don't set these environment variables either); it's only 
>> broken on the Blue Waters compute node.  Also, the calculation runs without 
>> any problems the first time it's executed on the BW compute node - it's only 
>> subsequent executions that give the error messages.
>> 
>> Thanks,
>> 
>> - Lee-Ping
>> 
>> On Sep 30, 2014, at 11:05 AM, Ralph Castain <r...@open-mpi.org> wrote:
>> 
>>> 
>>> On Sep 30, 2014, at 10:49 AM, Lee-Ping Wang <leep...@stanford.edu> wrote:
>>> 
>>>> Hi Ralph,
>>>> 
>>>> Thank you.  I think your diagnosis is probably correct.  Are these sockets 
>>>> using the same kind of TCP/UDP ports (though with different numbers) as 
>>>> web servers, email, etc.?
>>> 
>>> Yes
>>> 
>>>>  If so, then I should be able to (1) locate where the port number is 
>>>> defined in the code, and (2) randomize the port number every time it's 
>>>> called to work around the issue.  What do you think?
>>> 
>>> That might work, depending on the code. I'm not sure what it is trying to 
>>> connect to, and if that code knows how to handle arbitrary connections
>>> 
>>> You might check about those warnings - could be that QCLOCALSCR and QCREF 
>>> need to be set for the code to work.
>>> 
>>>> 
>>>> - Lee-Ping
>>>> 
>>>> On Sep 29, 2014, at 8:45 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>> 
>>>>> I don't know anything about your application, or what the functions in 
>>>>> your code are doing. I imagine it's possible that you are trying to open 
>>>>> statically defined ports, which means that running the job again too soon 
>>>>> could leave the OS thinking the socket is already busy. It takes a while 
>>>>> for the OS to release a socket resource.
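>>>>> 
>>>>> If that's what is happening, the usual way around it - a generic sketch 
>>>>> only, not Open MPI's code and not yours - is either to bind to port 0 so 
>>>>> the OS assigns a free ephemeral port, or to set SO_REUSEADDR before 
>>>>> bind() so a new run can reuse an address still sitting in TIME_WAIT:
>>>>> 
>>>>> #include <cerrno>
>>>>> #include <cstdio>
>>>>> #include <cstring>
>>>>> #include <arpa/inet.h>
>>>>> #include <netinet/in.h>
>>>>> #include <sys/socket.h>
>>>>> #include <unistd.h>
>>>>> 
>>>>> // Bind a TCP socket, tolerating an address left behind by a prior run.
>>>>> static int bind_reusable(unsigned short port /* 0 = OS-assigned */) {
>>>>>     int fd = socket(AF_INET, SOCK_STREAM, 0);
>>>>>     if (fd < 0) { perror("socket"); return -1; }
>>>>> 
>>>>>     int yes = 1;
>>>>>     if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof(yes)) < 0)
>>>>>         perror("setsockopt(SO_REUSEADDR)");   // report, but not fatal
>>>>> 
>>>>>     struct sockaddr_in addr;
>>>>>     memset(&addr, 0, sizeof(addr));
>>>>>     addr.sin_family = AF_INET;
>>>>>     addr.sin_addr.s_addr = htonl(INADDR_ANY);
>>>>>     addr.sin_port = htons(port);
>>>>> 
>>>>>     if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
>>>>>         perror("bind");   // EADDRINUSE means the port is still in use
>>>>>         close(fd);
>>>>>         return -1;
>>>>>     }
>>>>>     return fd;
>>>>> }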
>>>>> 
>>>>> 
>>>>> On Sep 29, 2014, at 5:49 PM, Lee-Ping Wang <leep...@stanford.edu> wrote:
>>>>> 
>>>>>> Here's another data point that might be useful: The error message is 
>>>>>> much more rare if I run my application on 4 cores instead of 8.
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>> - Lee-Ping
>>>>>> 
>>>>>> On Sep 29, 2014, at 5:38 PM, Lee-Ping Wang <leep...@stanford.edu> wrote:
>>>>>> 
>>>>>>> Sorry for my last email - I think I spoke too soon.  I realized after 
>>>>>>> reading some more documentation that Open MPI always uses TCP sockets 
>>>>>>> for out-of-band communication, so it doesn't make sense for me to set 
>>>>>>> OMPI_MCA_oob=^tcp.  That said, I am still running into a strange 
>>>>>>> problem in my application when running on a specific machine (a Blue 
>>>>>>> Waters compute node); I don't see this problem on any other nodes.
>>>>>>> 
>>>>>>> When I run the same job (~5 seconds) in rapid succession, I see the 
>>>>>>> following error message on the second execution:
>>>>>>> 
>>>>>>> /tmp/leeping/opt/qchem-4.2/bin/parallel.csh,  , qcopt_reactants.in, 8, 
>>>>>>> 0, ./qchem24825/
>>>>>>> MPIRUN in parallel.csh is 
>>>>>>> /tmp/leeping/opt/qchem-4.2/ext-libs/openmpi/bin/mpirun
>>>>>>> P4_RSHCOMMAND in parallel.csh is ssh
>>>>>>> QCOUTFILE is stdout
>>>>>>> Q-Chem machineFile is /tmp/leeping/opt/qchem-4.2/bin/mpi/machines
>>>>>>> [nid15081:24859] Warning: could not find environment variable 
>>>>>>> "QCLOCALSCR"
>>>>>>> [nid15081:24859] Warning: could not find environment variable "QCREF"
>>>>>>> initial socket setup ...start
>>>>>>> -------------------------------------------------------
>>>>>>> Primary job  terminated normally, but 1 process returned
>>>>>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>>>>>> -------------------------------------------------------
>>>>>>> --------------------------------------------------------------------------
>>>>>>> mpirun detected that one or more processes exited with non-zero status, 
>>>>>>> thus causing
>>>>>>> the job to be terminated. The first process to do so was:
>>>>>>> 
>>>>>>>   Process name: [[46773,1],0]
>>>>>>>   Exit code:    255
>>>>>>> --------------------------------------------------------------------------
>>>>>>> 
>>>>>>> And here's the source code where the program is exiting (before 
>>>>>>> "initial socket setup ...done")
>>>>>>> 
>>>>>>> int GPICommSoc::init(MPI_Comm comm0) {
>>>>>>> 
>>>>>>>     /* setup basic MPI information */
>>>>>>>     init_comm(comm0);
>>>>>>> 
>>>>>>>     MPI_Barrier(comm);
>>>>>>>     /*-- start inisock and set serveradd[] array --*/
>>>>>>>     if (me == 0) {
>>>>>>>         fprintf(stdout,"initial socket setup ...start\n");
>>>>>>>         fflush(stdout);
>>>>>>>     }
>>>>>>> 
>>>>>>>     // create the initial socket 
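>>>>>>>     // (NULL host + port 0 presumably means any local interface and an
>>>>>>>     //  OS-assigned ephemeral port; see get_sockport() below)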
>>>>>>>     inisock = new_server_socket(NULL,0);
>>>>>>> 
>>>>>>>     // fill and gather the serveraddr array
>>>>>>>     int szsock = sizeof(SOCKADDR);
>>>>>>>     memset(&serveraddr[0],0, szsock*nproc);
>>>>>>>     int iniport=get_sockport(inisock);
>>>>>>>     set_sockaddr_byhname(NULL, iniport, &serveraddr[me]);
>>>>>>>     //printsockaddr( serveraddr[me] );
>>>>>>> 
>>>>>>>     SOCKADDR addrsend = serveraddr[me];
>>>>>>>     MPI_Allgather(&addrsend,szsock,MPI_BYTE,
>>>>>>>                   &serveraddr[0], szsock,MPI_BYTE, comm);
>>>>>>>     if (me == 0) {
>>>>>>>         fprintf(stdout, "initial socket setup ...done \n");
>>>>>>>         fflush(stdout);
>>>>>>>     }
>>>>>>> 
>>>>>>> I didn't write this part of the program, and I'm really a novice at MPI 
>>>>>>> - but it seems like the first execution of the program isn't freeing 
>>>>>>> some system resource that it should.  Is there something that needs to 
>>>>>>> be corrected in the code?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> - Lee-Ping
>>>>>>> 
>>>>>>> On Sep 29, 2014, at 5:12 PM, Lee-Ping Wang <leep...@stanford.edu> wrote:
>>>>>>> 
>>>>>>>> Hi there,
>>>>>>>> 
>>>>>>>> My application uses MPI to run parallel jobs on a single node, so I 
>>>>>>>> have no need of any support for communication between nodes.  However, 
>>>>>>>> when I use mpirun to launch my application I see strange errors such 
>>>>>>>> as:
>>>>>>>> 
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> No network interfaces were found for out-of-band communications. We 
>>>>>>>> require
>>>>>>>> at least one available network for out-of-band messaging.
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> 
>>>>>>>> [nid23206:10697] [[33772,1],0] ORTE_ERROR_LOG: Unable to open a TCP 
>>>>>>>> socket for out-of-band communications in file oob_tcp_listener.c at 
>>>>>>>> line 113
>>>>>>>> [nid23206:10697] [[33772,1],0] ORTE_ERROR_LOG: Unable to open a TCP 
>>>>>>>> socket for out-of-band communications in file oob_tcp_component.c at 
>>>>>>>> line 584
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> It looks like orte_init failed for some reason; your parallel process 
>>>>>>>> is
>>>>>>>> likely to abort.  There are many reasons that a parallel process can
>>>>>>>> fail during orte_init; some of which are due to configuration or
>>>>>>>> environment problems.  This failure appears to be an internal failure;
>>>>>>>> here's some additional information (which may only be relevant to an
>>>>>>>> Open MPI developer):
>>>>>>>> 
>>>>>>>>   orte_oob_base_select failed
>>>>>>>>   --> Returned value (null) (-43) instead of ORTE_SUCCESS
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> 
>>>>>>>> /home/leeping/opt/qchem-4.2/ext-libs/openmpi/lib/libmpi.so.1(+0xfeaa9)[0x2b77e9de5aa9]
>>>>>>>> /home/leeping/opt/qchem-4.2/ext-libs/openmpi/lib/libmpi.so.1(ompi_btl_openib_connect_base_select_for_local_port+0xd0)[0x2b77e9de13a0]
>>>>>>>> 
>>>>>>>> It seems like in each case, Open MPI is trying to use some 
>>>>>>>> networking-related feature and crashing as a result.  My workaround has 
>>>>>>>> been to deduce which components are crashing and disable them via 
>>>>>>>> environment variables like this:
>>>>>>>> 
>>>>>>>> export OMPI_MCA_btl=self,sm
>>>>>>>> export OMPI_MCA_oob=^tcp
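>>>>>>>> 
>>>>>>>> (For reference, each OMPI_MCA_<name> variable corresponds to an mpirun 
>>>>>>>> command-line flag, so the same restriction could also be written as 
>>>>>>>> follows; "my_app" is just a placeholder for the real executable:)
>>>>>>>> 
>>>>>>>> mpirun --mca btl self,sm --mca oob "^tcp" -np 8 ./my_app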
>>>>>>>> 
>>>>>>>> Is there a better way to do this - i.e., explicitly prohibit Open MPI 
>>>>>>>> from using any network-related features and run only on the local node?
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> 
>>>>>>>> - Lee-Ping
>>>>>>>> 
