Here's another data point that might be useful: the error message is much 
rarer if I run my application on 4 cores instead of 8.

Thanks,

- Lee-Ping

On Sep 29, 2014, at 5:38 PM, Lee-Ping Wang <leep...@stanford.edu> wrote:

> Sorry for my last email - I think I spoke too quickly.  After reading some 
> more documentation, I realized that OpenMPI always uses TCP sockets for 
> out-of-band communication, so it doesn't make sense for me to set 
> OMPI_MCA_oob=^tcp.  That said, I am still running into a strange problem in 
> my application when running on one specific machine (a Blue Waters compute 
> node); I don't see this problem on any other nodes.
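> 
> In case it's relevant, what I'm trying now (just a guess on my part, based 
> on the FAQ's description of the oob_tcp_if_include MCA parameter) is to 
> restrict out-of-band TCP to the loopback interface rather than disabling 
> it outright:
> 
>   export OMPI_MCA_btl=self,sm
>   export OMPI_MCA_oob_tcp_if_include=lo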
> 
> When I run the same job (~5 seconds) in rapid succession, I see the following 
> error message on the second execution:
> 
> /tmp/leeping/opt/qchem-4.2/bin/parallel.csh,  , qcopt_reactants.in, 8, 0, 
> ./qchem24825/
> MPIRUN in parallel.csh is 
> /tmp/leeping/opt/qchem-4.2/ext-libs/openmpi/bin/mpirun
> P4_RSHCOMMAND in parallel.csh is ssh
> QCOUTFILE is stdout
> Q-Chem machineFile is /tmp/leeping/opt/qchem-4.2/bin/mpi/machines
> [nid15081:24859] Warning: could not find environment variable "QCLOCALSCR"
> [nid15081:24859] Warning: could not find environment variable "QCREF"
> initial socket setup ...start
> -------------------------------------------------------
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status, thus 
> causing
> the job to be terminated. The first process to do so was:
> 
>   Process name: [[46773,1],0]
>   Exit code:    255
> --------------------------------------------------------------------------
> 
> And here's the source code where the program is exiting (before it prints 
> "initial socket setup ...done"):
> 
> int GPICommSoc::init(MPI_Comm comm0) {
> 
>     /* setup basic MPI information */
>     init_comm(comm0);
> 
>     MPI_Barrier(comm);
>     /*-- start inisock and set serveradd[] array --*/
>     if (me == 0) {
>         fprintf(stdout,"initial socket setup ...start\n");
>         fflush(stdout);
>     }
> 
>     // create the initial socket 
>     inisock = new_server_socket(NULL,0);
> 
>     // fill and gather the serveraddr array
>     int szsock = sizeof(SOCKADDR);
>     memset(&serveraddr[0],0, szsock*nproc);
>     int iniport=get_sockport(inisock);
>     set_sockaddr_byhname(NULL, iniport, &serveraddr[me]);
>     //printsockaddr( serveraddr[me] );
> 
>     SOCKADDR addrsend = serveraddr[me];
>     MPI_Allgather(&addrsend,szsock,MPI_BYTE,
>                   &serveraddr[0], szsock,MPI_BYTE, comm);
>     if (me == 0) {
>         fprintf(stdout,"initial socket setup ...done \n");
>         fflush(stdout);
>     }
>     // ... (rest of the function omitted)
> 
> I didn't write this part of the program, and I'm really a novice at MPI - but 
> it seems like the first execution of the program isn't freeing up some 
> system resource as it should.  Is there something that needs to be corrected 
> in the code?
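> 
> For what it's worth, here is the kind of thing I'm wondering about.  This is 
> only a sketch I wrote myself - I haven't seen the body of new_server_socket, 
> so the details are hypothetical - but if it doesn't set SO_REUSEADDR before 
> bind(), a listening socket left in TIME_WAIT by the previous run could make 
> the next bind() fail:
> 
>     /* Hypothetical sketch of a new_server_socket()-style helper, assuming
>      * it wraps socket()/bind()/listen().  SO_REUSEADDR lets bind() succeed
>      * even if an old socket on the same port is still in TIME_WAIT. */
>     #include <netinet/in.h>
>     #include <string.h>
>     #include <sys/socket.h>
> 
>     int new_server_socket_sketch(int port) {
>         int fd = socket(AF_INET, SOCK_STREAM, 0);
>         if (fd < 0) return -1;
> 
>         int yes = 1;   /* allow reuse of addresses stuck in TIME_WAIT */
>         setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof(yes));
> 
>         struct sockaddr_in addr;
>         memset(&addr, 0, sizeof(addr));
>         addr.sin_family      = AF_INET;
>         addr.sin_addr.s_addr = htonl(INADDR_ANY);
>         addr.sin_port        = htons(port);  /* 0 => kernel picks a free port */
> 
>         if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) return -1;
>         if (listen(fd, SOMAXCONN) < 0) return -1;
>         return fd;
>     }
> 
> Then again, since new_server_socket is called here with port 0, the kernel 
> should be handing out a fresh ephemeral port each time, so this may not be 
> the whole story.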
> 
> Thanks,
> 
> - Lee-Ping
> 
> On Sep 29, 2014, at 5:12 PM, Lee-Ping Wang <leep...@stanford.edu> wrote:
> 
>> Hi there,
>> 
>> My application uses MPI to run parallel jobs on a single node, so I don't 
>> need any support for communication between nodes.  However, when I use 
>> mpirun to launch my application, I see strange errors such as:
>> 
>> --------------------------------------------------------------------------
>> No network interfaces were found for out-of-band communications. We require
>> at least one available network for out-of-band messaging.
>> --------------------------------------------------------------------------
>> 
>> [nid23206:10697] [[33772,1],0] ORTE_ERROR_LOG: Unable to open a TCP socket 
>> for out-of-band communications in file oob_tcp_listener.c at line 113
>> [nid23206:10697] [[33772,1],0] ORTE_ERROR_LOG: Unable to open a TCP socket 
>> for out-of-band communications in file oob_tcp_component.c at line 584
>> --------------------------------------------------------------------------
>> It looks like orte_init failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during orte_init; some of which are due to configuration or
>> environment problems.  This failure appears to be an internal failure;
>> here's some additional information (which may only be relevant to an
>> Open MPI developer):
>> 
>>   orte_oob_base_select failed
>>   --> Returned value (null) (-43) instead of ORTE_SUCCESS
>> --------------------------------------------------------------------------
>> 
>> /home/leeping/opt/qchem-4.2/ext-libs/openmpi/lib/libmpi.so.1(+0xfeaa9)[0x2b77e9de5aa9]
>> /home/leeping/opt/qchem-4.2/ext-libs/openmpi/lib/libmpi.so.1(ompi_btl_openib_connect_base_select_for_local_port+0xd0)[0x2b77e9de13a0]
>> 
>> It seems like in each case, OpenMPI is trying to use some networking-related 
>> feature and crashing as a result.  My workaround is to deduce which 
>> components are crashing and disable them via environment variables like this:
>> 
>> export OMPI_MCA_btl=self,sm
>> export OMPI_MCA_oob=^tcp
>> 
>> Is there a better way to do this - i.e., explicitly prohibit OpenMPI from 
>> using any network-related features and run only on the local node?
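>> 
>> For reference, I believe the command-line equivalent of those exports would 
>> be something like this (with ./myprog standing in for my actual executable):
>> 
>>   mpirun -np 8 -mca btl self,sm -mca oob ^tcp ./myprog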
>> 
>> Thanks,
>> 
>> - Lee-Ping
>> 
> 
