Here's another data point that might be useful: the error message is much rarer if I run my application on 4 cores instead of 8.
Thanks,

- Lee-Ping

On Sep 29, 2014, at 5:38 PM, Lee-Ping Wang <leep...@stanford.edu> wrote:

> Sorry for my last email - I think I spoke too quickly. I realized after
> reading some more documentation that OpenMPI always uses TCP sockets for
> out-of-band communication, so it doesn't make sense for me to set
> OMPI_MCA_oob=^tcp. That said, I am still running into a strange problem in
> my application when running on a specific machine (a Blue Waters compute
> node); I don't see this problem on any other nodes.
>
> When I run the same job (~5 seconds) in rapid succession, I see the
> following error message on the second execution:
>
> /tmp/leeping/opt/qchem-4.2/bin/parallel.csh, , qcopt_reactants.in, 8, 0, ./qchem24825/
> MPIRUN in parallel.csh is /tmp/leeping/opt/qchem-4.2/ext-libs/openmpi/bin/mpirun
> P4_RSHCOMMAND in parallel.csh is ssh
> QCOUTFILE is stdout
> Q-Chem machineFile is /tmp/leeping/opt/qchem-4.2/bin/mpi/machines
> [nid15081:24859] Warning: could not find environment variable "QCLOCALSCR"
> [nid15081:24859] Warning: could not find environment variable "QCREF"
> initial socket setup ...start
> -------------------------------------------------------
> Primary job terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status,
> thus causing the job to be terminated. The first process to do so was:
>
> Process name: [[46773,1],0]
> Exit code: 255
> --------------------------------------------------------------------------
>
> And here's the source code where the program is exiting (before "initial
> socket setup ...done"):
>
> int GPICommSoc::init(MPI_Comm comm0) {
>
>     /* set up basic MPI information */
>     init_comm(comm0);
>
>     MPI_Barrier(comm);
>
>     /*-- start inisock and set the serveraddr[] array --*/
>     if (me == 0) {
>         fprintf(stdout, "initial socket setup ...start\n");
>         fflush(stdout);
>     }
>
>     // create the initial socket
>     inisock = new_server_socket(NULL, 0);
>
>     // fill and gather the serveraddr array
>     int szsock = sizeof(SOCKADDR);
>     memset(&serveraddr[0], 0, szsock*nproc);
>     int iniport = get_sockport(inisock);
>     set_sockaddr_byhname(NULL, iniport, &serveraddr[me]);
>     //printsockaddr( serveraddr[me] );
>
>     SOCKADDR addrsend = serveraddr[me];
>     MPI_Allgather(&addrsend, szsock, MPI_BYTE,
>                   &serveraddr[0], szsock, MPI_BYTE, comm);
>
>     if (me == 0) {
>         fprintf(stdout, "initial socket setup ...done \n");
>         fflush(stdout);
>     }
>
> I didn't write this part of the program and I'm really a novice to MPI -
> but it seems like the initial execution of the program isn't freeing up
> some system resource as it should. Is there something that needs to be
> corrected in the code?
>
> Thanks,
>
> - Lee-Ping
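For context on the code quoted above: the pattern it appears to follow is simply "open a TCP listening socket on an ephemeral port, recover the kernel-assigned port number, gather everyone's address over MPI, and close the socket once it is no longer needed." The Q-Chem helpers new_server_socket() and get_sockport() aren't shown anywhere in the excerpt, so the standalone sketch below uses plain BSD sockets and is only an illustration of that presumed pattern, not the actual implementation:

/* Minimal sketch (not Q-Chem code): an ephemeral listening socket is
 * created, queried for its kernel-assigned port, and then closed.
 * new_server_socket() and get_sockport() presumably wrap something
 * like this. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    /* create a TCP socket and bind it to an ephemeral port (port 0) */
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = 0;                /* 0 = let the kernel choose */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 16) < 0) {
        perror("bind/listen");
        close(fd);
        return 1;
    }

    /* recover the port the kernel picked (what get_sockport() presumably does) */
    socklen_t len = sizeof(addr);
    if (getsockname(fd, (struct sockaddr *)&addr, &len) < 0) {
        perror("getsockname");
        close(fd);
        return 1;
    }
    printf("listening on ephemeral port %d\n", ntohs(addr.sin_port));

    /* ... address exchange and accept() would happen here ... */

    /* release the listening socket; without this it stays open until exit */
    close(fd);
    return 0;
}

Whether the real code ever closes inisock (or the sockets later accepted on it) isn't visible from the excerpt, so this sketch doesn't confirm the "not freeing up some system resource" suspicion; it only shows where that cleanup would normally happen.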
> On Sep 29, 2014, at 5:12 PM, Lee-Ping Wang <leep...@stanford.edu> wrote:
>
>> Hi there,
>>
>> My application uses MPI to run parallel jobs on a single node, so I have
>> no need of any support for communication between nodes. However, when I
>> use mpirun to launch my application I see strange errors such as:
>>
>> --------------------------------------------------------------------------
>> No network interfaces were found for out-of-band communications. We require
>> at least one available network for out-of-band messaging.
>> --------------------------------------------------------------------------
>>
>> [nid23206:10697] [[33772,1],0] ORTE_ERROR_LOG: Unable to open a TCP socket
>> for out-of-band communications in file oob_tcp_listener.c at line 113
>> [nid23206:10697] [[33772,1],0] ORTE_ERROR_LOG: Unable to open a TCP socket
>> for out-of-band communications in file oob_tcp_component.c at line 584
>> --------------------------------------------------------------------------
>> It looks like orte_init failed for some reason; your parallel process is
>> likely to abort. There are many reasons that a parallel process can
>> fail during orte_init; some of which are due to configuration or
>> environment problems. This failure appears to be an internal failure;
>> here's some additional information (which may only be relevant to an
>> Open MPI developer):
>>
>>   orte_oob_base_select failed
>>   --> Returned value (null) (-43) instead of ORTE_SUCCESS
>> --------------------------------------------------------------------------
>>
>> /home/leeping/opt/qchem-4.2/ext-libs/openmpi/lib/libmpi.so.1(+0xfeaa9)[0x2b77e9de5aa9]
>> /home/leeping/opt/qchem-4.2/ext-libs/openmpi/lib/libmpi.so.1(ompi_btl_openib_connect_base_select_for_local_port+0xd0)[0x2b77e9de13a0]
>>
>> It seems like in each case, OpenMPI is trying to use some feature related
>> to networking and crashing as a result. My workaround is to deduce the
>> components that are crashing and disable them in my environment variables
>> like this:
>>
>> export OMPI_MCA_btl=self,sm
>> export OMPI_MCA_oob=^tcp
>>
>> Is there a better way to do this - i.e. explicitly prohibit OpenMPI from
>> using any network-related feature and run only on the local node?
>>
>> Thanks,
>>
>> - Lee-Ping
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2014/09/25410.php
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/09/25411.php