Sorry for my last email - I think I spoke too quickly.  After reading some more 
documentation, I realized that Open MPI always uses TCP sockets for out-of-band 
communication, so it doesn't make sense for me to set OMPI_MCA_oob=^tcp.  That 
said, I am still running into a strange problem in my application when it runs 
on one specific machine (a Blue Waters compute node); I don't see this problem 
on any other nodes.

When I run the same short job (it takes ~5 seconds) twice in rapid succession, 
I see the following error message on the second execution:

/tmp/leeping/opt/qchem-4.2/bin/parallel.csh,  , qcopt_reactants.in, 8, 0, 
./qchem24825/
MPIRUN in parallel.csh is /tmp/leeping/opt/qchem-4.2/ext-libs/openmpi/bin/mpirun
P4_RSHCOMMAND in parallel.csh is ssh
QCOUTFILE is stdout
Q-Chem machineFile is /tmp/leeping/opt/qchem-4.2/bin/mpi/machines
[nid15081:24859] Warning: could not find environment variable "QCLOCALSCR"
[nid15081:24859] Warning: could not find environment variable "QCREF"
initial socket setup ...start
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[46773,1],0]
  Exit code:    255
--------------------------------------------------------------------------

And here's the source code where the program is exiting (it never reaches the 
"initial socket setup ...done" message):

int GPICommSoc::init(MPI_Comm comm0) {

    /* setup basic MPI information */
    init_comm(comm0);

    MPI_Barrier(comm);
    /*-- start inisock and set serveradd[] array --*/
    if (me == 0) {
        fprintf(stdout,"initial socket setup ...start\n");
        fflush(stdout);
    }

    // create the initial socket 
    inisock = new_server_socket(NULL,0);

    // fill and gather the serveraddr array
    int szsock = sizeof(SOCKADDR);
    memset(&serveraddr[0],0, szsock*nproc);
    int iniport=get_sockport(inisock);
    set_sockaddr_byhname(NULL, iniport, &serveraddr[me]);
    //printsockaddr( serveraddr[me] );

    SOCKADDR addrsend = serveraddr[me];
    MPI_Allgather(&addrsend,szsock,MPI_BYTE,
                  &serveraddr[0], szsock,MPI_BYTE, comm);
    if (me == 0) {
        fprintf(stdout, "initial socket setup ...done \n");
        fflush(stdout);
    }
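
If it helps, here is my guess at what new_server_socket(NULL,0) is doing.  This 
is my own reconstruction (the names new_server_socket and get_sockport come from 
the snippet above, but the body below is hypothetical, not the actual Q-Chem 
code): it appears to create a TCP listening socket bound to port 0, so the 
kernel picks a free ephemeral port, which get_sockport() then reads back.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical sketch of new_server_socket(NULL, 0): create a TCP listening
 * socket on an ephemeral port chosen by the kernel.  Not the real code. */
static int new_server_socket_guess(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;

    int on = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(0);   /* port 0: kernel assigns a free port */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, SOMAXCONN) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

If that reading is roughly right, then binding to port 0 shouldn't collide with 
whatever port the previous run used, which is part of why I'm confused about 
what resource the first run could be holding on to.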

I didn't write this part of the program and I'm a novice at MPI, but it seems 
as though the first execution isn't releasing some system resource that the 
second execution then needs.  Is there something in the code that needs to be 
corrected?
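
In case it's useful, here is the kind of error checking I was planning to add 
around the two socket calls to see which one actually fails on the second run.  
These are my own additions, not part of the existing code, and I'm assuming 
(without having checked) that new_server_socket() and get_sockport() return a 
negative value on failure:

    /* Diagnostic additions (hypothetical); needs <errno.h>, <stdio.h>,
     * and <string.h> included at the top of the file. */
    inisock = new_server_socket(NULL, 0);
    if (inisock < 0) {
        fprintf(stderr, "[rank %d] new_server_socket failed: %s\n",
                me, strerror(errno));
        MPI_Abort(comm, 1);
    }

    int iniport = get_sockport(inisock);
    if (iniport < 0) {
        fprintf(stderr, "[rank %d] get_sockport failed: %s\n",
                me, strerror(errno));
        MPI_Abort(comm, 1);
    }

That should at least tell me whether it's the socket creation or the port 
lookup that fails on the rapid re-run.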

Thanks,

- Lee-Ping

On Sep 29, 2014, at 5:12 PM, Lee-Ping Wang <leep...@stanford.edu> wrote:

> Hi there,
> 
> My application uses MPI to run parallel jobs on a single node, so I have no 
> need of any support for communication between nodes.  However, when I use 
> mpirun to launch my application I see strange errors such as:
> 
> --------------------------------------------------------------------------
> No network interfaces were found for out-of-band communications. We require
> at least one available network for out-of-band messaging.
> --------------------------------------------------------------------------
> 
> [nid23206:10697] [[33772,1],0] ORTE_ERROR_LOG: Unable to open a TCP socket 
> for out-of-band communications in file oob_tcp_listener.c at line 113
> [nid23206:10697] [[33772,1],0] ORTE_ERROR_LOG: Unable to open a TCP socket 
> for out-of-band communications in file oob_tcp_component.c at line 584
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
> 
>   orte_oob_base_select failed
>   --> Returned value (null) (-43) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> 
> /home/leeping/opt/qchem-4.2/ext-libs/openmpi/lib/libmpi.so.1(+0xfeaa9)[0x2b77e9de5aa9]
> /home/leeping/opt/qchem-4.2/ext-libs/openmpi/lib/libmpi.so.1(ompi_btl_openib_connect_base_select_for_local_port+0xd0)[0x2b77e9de13a0]
> 
> It seems like in each case, Open MPI is trying to use some networking-related 
> feature and crashing as a result.  My workaround is to deduce which components 
> are crashing and disable them via environment variables like this:
> 
> export OMPI_MCA_btl=self,sm
> export OMPI_MCA_oob=^tcp
> 
> Is there a better way to do this - i.e., explicitly prohibit Open MPI from 
> using any network-related features and run only on the local node?
> 
> Thanks,
> 
> - Lee-Ping
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/09/25410.php
