On Sep 30, 2014, at 11:19 AM, Lee-Ping Wang <leep...@stanford.edu> wrote:
> Hi Ralph,
>
>>> If so, then I should be able to (1) locate where the port number is
>>> defined in the code, and (2) randomize the port number every time it's
>>> called to work around the issue. What do you think?
>>
>> That might work, depending on the code. I'm not sure what it is trying
>> to connect to, and if that code knows how to handle arbitrary
>> connections.
>
> The main reason why Q-Chem is using MPI is for executing parallel tasks
> on a single node. Thus, I think it's just the MPI ranks attempting to
> connect with each other on the same machine. This could be off the mark
> because I'm still a novice with respect to MPI concepts - but I am sure
> it is just one machine.

Your statement doesn't match what you sent us - you showed that it was
your connection code that was failing, not ours. You wouldn't have gotten
that far if our connections had failed, because you would have failed in
MPI_Init. You are clearly much further along than that, as you had
already passed an MPI_Barrier before reaching the code in question.

>> You might check about those warnings - could be that QCLOCALSCR and
>> QCREF need to be set for the code to work.
>
> Thanks; I don't think these environment variables are the issue, but I
> will check again. The calculation runs without any problems on four
> different clusters (where I don't set these environment variables
> either); it's only broken on the Blue Waters compute node. Also, the
> calculation runs without any problems the first time it's executed on
> the BW compute node - it's only subsequent executions that give the
> error messages.
>
> Thanks,
>
> - Lee-Ping
>
> On Sep 30, 2014, at 11:05 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> On Sep 30, 2014, at 10:49 AM, Lee-Ping Wang <leep...@stanford.edu> wrote:
>>
>>> Hi Ralph,
>>>
>>> Thank you. I think your diagnosis is probably correct. Are these
>>> sockets the same as the TCP/UDP ports (though with different numbers)
>>> that are used by web servers, email, etc.?
>>
>> Yes
>>
>>> If so, then I should be able to (1) locate where the port number is
>>> defined in the code, and (2) randomize the port number every time it's
>>> called to work around the issue. What do you think?
>>
>> That might work, depending on the code. I'm not sure what it is trying
>> to connect to, and if that code knows how to handle arbitrary
>> connections.
>>
>> You might check about those warnings - could be that QCLOCALSCR and
>> QCREF need to be set for the code to work.
>>
>>> - Lee-Ping
>>>
>>> On Sep 29, 2014, at 8:45 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>>> I don't know anything about your application, or what the functions
>>>> in your code are doing. I imagine it's possible that you are trying
>>>> to open statically defined ports, which means that running the job
>>>> again too soon could leave the OS thinking the socket is already
>>>> busy. It takes a while for the OS to release a socket resource.
>>>>
>>>> On Sep 29, 2014, at 5:49 PM, Lee-Ping Wang <leep...@stanford.edu> wrote:
>>>>
>>>>> Here's another data point that might be useful: the error message
>>>>> is much rarer if I run my application on 4 cores instead of 8.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> - Lee-Ping
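For context, the "socket resource" Ralph mentions is TCP's TIME_WAIT
state: after a connection closes, the kernel keeps the address/port pair
reserved, typically for a minute or more, so a second run that binds the
same fixed port can fail even though no process is listening there any
more. Below is a minimal sketch of the conventional remedy - a
hypothetical make_server_socket() helper, not the Q-Chem source, and
relevant only if the port really is statically defined - which sets
SO_REUSEADDR before bind():

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    /* Hypothetical helper: create a listening TCP socket on a fixed
     * port. SO_REUSEADDR lets bind() succeed even while a socket from
     * an earlier run of the job is still sitting in TIME_WAIT. */
    static int make_server_socket(unsigned short port)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) { perror("socket"); return -1; }

        int yes = 1;
        if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof(yes)) < 0) {
            perror("setsockopt");
            close(fd);
            return -1;
        }

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family      = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port        = htons(port); /* fixed port: can collide across runs */

        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
            listen(fd, SOMAXCONN) < 0) {
            perror("bind/listen");
            close(fd);
            return -1;
        }
        return fd;
    }

If this diagnosis were right, it would also fit the 4-core vs. 8-core
observation: fewer ranks means fewer leftover sockets competing for the
same ports between runs.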
>>>>> On Sep 29, 2014, at 5:38 PM, Lee-Ping Wang <leep...@stanford.edu> wrote:
>>>>>
>>>>>> Sorry for my last email - I think I spoke too quickly. I realized,
>>>>>> after reading some more documentation, that Open MPI always uses
>>>>>> TCP sockets for out-of-band communication, so it doesn't make sense
>>>>>> for me to set OMPI_MCA_oob=^tcp.
>>>>>>
>>>>>> That said, I am still running into a strange problem in my
>>>>>> application when running on a specific machine (a Blue Waters
>>>>>> compute node); I don't see this problem on any other nodes.
>>>>>>
>>>>>> When I run the same job (~5 seconds) in rapid succession, I see the
>>>>>> following error message on the second execution:
>>>>>>
>>>>>> /tmp/leeping/opt/qchem-4.2/bin/parallel.csh, , qcopt_reactants.in,
>>>>>> 8, 0, ./qchem24825/
>>>>>> MPIRUN in parallel.csh is
>>>>>> /tmp/leeping/opt/qchem-4.2/ext-libs/openmpi/bin/mpirun
>>>>>> P4_RSHCOMMAND in parallel.csh is ssh
>>>>>> QCOUTFILE is stdout
>>>>>> Q-Chem machineFile is /tmp/leeping/opt/qchem-4.2/bin/mpi/machines
>>>>>> [nid15081:24859] Warning: could not find environment variable
>>>>>> "QCLOCALSCR"
>>>>>> [nid15081:24859] Warning: could not find environment variable "QCREF"
>>>>>> initial socket setup ...start
>>>>>> -------------------------------------------------------
>>>>>> Primary job terminated normally, but 1 process returned
>>>>>> a non-zero exit code. Per user-direction, the job has been aborted.
>>>>>> -------------------------------------------------------
>>>>>> --------------------------------------------------------------------------
>>>>>> mpirun detected that one or more processes exited with non-zero
>>>>>> status, thus causing the job to be terminated. The first process
>>>>>> to do so was:
>>>>>>
>>>>>>   Process name: [[46773,1],0]
>>>>>>   Exit code:    255
>>>>>> --------------------------------------------------------------------------
>>>>>>
>>>>>> And here's the source code where the program exits (before printing
>>>>>> "initial socket setup ...done"):
>>>>>>
>>>>>> int GPICommSoc::init(MPI_Comm comm0)
>>>>>> {
>>>>>>     /* set up basic MPI information */
>>>>>>     init_comm(comm0);
>>>>>>
>>>>>>     MPI_Barrier(comm);
>>>>>>
>>>>>>     /*-- start inisock and set the serveraddr[] array --*/
>>>>>>     if (me == 0) {
>>>>>>         fprintf(stdout, "initial socket setup ...start\n");
>>>>>>         fflush(stdout);
>>>>>>     }
>>>>>>
>>>>>>     // create the initial listening socket
>>>>>>     inisock = new_server_socket(NULL, 0);
>>>>>>
>>>>>>     // fill this rank's entry in the serveraddr array
>>>>>>     int szsock = sizeof(SOCKADDR);
>>>>>>     memset(&serveraddr[0], 0, szsock * nproc);
>>>>>>     int iniport = get_sockport(inisock);
>>>>>>     set_sockaddr_byhname(NULL, iniport, &serveraddr[me]);
>>>>>>     // printsockaddr(serveraddr[me]);
>>>>>>
>>>>>>     // gather every rank's server address onto all ranks
>>>>>>     SOCKADDR addrsend = serveraddr[me];
>>>>>>     MPI_Allgather(&addrsend, szsock, MPI_BYTE,
>>>>>>                   &serveraddr[0], szsock, MPI_BYTE, comm);
>>>>>>
>>>>>>     if (me == 0) {
>>>>>>         fprintf(stdout, "initial socket setup ...done \n");
>>>>>>         fflush(stdout);
>>>>>>     }
>>>>>>
>>>>>> I didn't write this part of the program and I'm really a novice at
>>>>>> MPI - but it seems like the initial execution of the program isn't
>>>>>> freeing up some system resource as it should. Is there something
>>>>>> that needs to be corrected in the code?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> - Lee-Ping
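A note on the snippet above: if new_server_socket(NULL, 0) follows the
usual BSD-sockets convention, the 0 asks the kernel to assign an
ephemeral port, and get_sockport() would then recover the chosen number
with getsockname() - in other words, the listening port is already picked
fresh on every run rather than statically defined. A minimal sketch of
that retrieval step, using a hypothetical helper name rather than the
actual Q-Chem implementation:

    #include <stdio.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    /* Hypothetical stand-in for get_sockport(): after bind()ing to
     * port 0, ask the kernel which ephemeral port it actually assigned. */
    static int get_assigned_port(int fd)
    {
        struct sockaddr_in addr;
        socklen_t len = sizeof(addr);
        if (getsockname(fd, (struct sockaddr *)&addr, &len) < 0) {
            perror("getsockname");
            return -1;
        }
        return ntohs(addr.sin_port);
    }

If that is how new_server_socket() behaves, a stale statically defined
port is unlikely to be what breaks the second run, and the leftover
resource would have to be something else - but without the rest of the
source, that is only a guess.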
>>>>>> On Sep 29, 2014, at 5:12 PM, Lee-Ping Wang <leep...@stanford.edu> wrote:
>>>>>>
>>>>>>> Hi there,
>>>>>>>
>>>>>>> My application uses MPI to run parallel jobs on a single node, so
>>>>>>> I have no need of any support for communication between nodes.
>>>>>>> However, when I use mpirun to launch my application I see strange
>>>>>>> errors such as:
>>>>>>>
>>>>>>> --------------------------------------------------------------------------
>>>>>>> No network interfaces were found for out-of-band communications.
>>>>>>> We require at least one available network for out-of-band
>>>>>>> messaging.
>>>>>>> --------------------------------------------------------------------------
>>>>>>>
>>>>>>> [nid23206:10697] [[33772,1],0] ORTE_ERROR_LOG: Unable to open a
>>>>>>> TCP socket for out-of-band communications in file
>>>>>>> oob_tcp_listener.c at line 113
>>>>>>> [nid23206:10697] [[33772,1],0] ORTE_ERROR_LOG: Unable to open a
>>>>>>> TCP socket for out-of-band communications in file
>>>>>>> oob_tcp_component.c at line 584
>>>>>>> --------------------------------------------------------------------------
>>>>>>> It looks like orte_init failed for some reason; your parallel
>>>>>>> process is likely to abort. There are many reasons that a parallel
>>>>>>> process can fail during orte_init; some of which are due to
>>>>>>> configuration or environment problems. This failure appears to be
>>>>>>> an internal failure; here's some additional information (which may
>>>>>>> only be relevant to an Open MPI developer):
>>>>>>>
>>>>>>>   orte_oob_base_select failed
>>>>>>>   --> Returned value (null) (-43) instead of ORTE_SUCCESS
>>>>>>> --------------------------------------------------------------------------
>>>>>>>
>>>>>>> /home/leeping/opt/qchem-4.2/ext-libs/openmpi/lib/libmpi.so.1(+0xfeaa9)[0x2b77e9de5aa9]
>>>>>>> /home/leeping/opt/qchem-4.2/ext-libs/openmpi/lib/libmpi.so.1(ompi_btl_openib_connect_base_select_for_local_port+0xd0)[0x2b77e9de13a0]
>>>>>>>
>>>>>>> It seems like in each case, Open MPI is trying to use some feature
>>>>>>> related to networking and crashing as a result. My workaround is
>>>>>>> to deduce the components that are crashing and disable them with
>>>>>>> environment variables like this:
>>>>>>>
>>>>>>>   export OMPI_MCA_btl=self,sm
>>>>>>>   export OMPI_MCA_oob=^tcp
>>>>>>>
>>>>>>> Is there a better way to do this - i.e. explicitly prohibit Open
>>>>>>> MPI from using any network-related feature and run only on the
>>>>>>> local node?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> - Lee-Ping
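A note on those two settings: in Open MPI's MCA selection syntax, a plain
list such as self,sm is inclusive ("use only these components"), while a
leading "^" is exclusive ("use everything except these"). Since TCP is
the transport the oob framework relies on, as Lee-Ping notes above,
oob=^tcp would deselect the out-of-band channel entirely - hence his
later retraction. A sketch of a local-only configuration along the lines
he is asking for would keep the required TCP out-of-band channel but pin
it to the loopback interface; this assumes the oob_tcp_if_include MCA
parameter is available in the Open MPI release being used:

    # restrict MPI point-to-point traffic to on-node transports
    export OMPI_MCA_btl=self,sm
    # keep the (required) TCP out-of-band channel, but pin it to loopback
    # (assumes this release supports the oob_tcp_if_include parameter)
    export OMPI_MCA_oob_tcp_if_include=lo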