Hi Ralph,

Thanks. I'll add some print statements to the code and try to figure out precisely where the failure is happening.
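Concretely, I'm thinking of wrapping each call between the "initial socket setup ...start" and "...done" prints in something like the pattern below (adapted to each call's return convention), so the output shows which step fails and what errno says at that point. This is just a sketch of the technique; the STEP macro and the demo calls in main() are mine, not Q-Chem code:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Print the step being attempted, then "ok" or the errno string on failure. */
#define STEP(expr)                                              \
    do {                                                        \
        fprintf(stderr, "step: %s ...", #expr);                 \
        errno = 0;                                              \
        if ((expr) < 0) {                                       \
            fprintf(stderr, " FAILED: %s (%s:%d)\n",            \
                    strerror(errno), __FILE__, __LINE__);       \
            exit(1);                                            \
        }                                                       \
        fprintf(stderr, " ok\n");                               \
    } while (0)

int main(void)
{
    /* Demo only: the second call fails, showing what the failure output looks like. */
    STEP(open("/dev/null", O_RDONLY));
    STEP(open("/no/such/file", O_RDONLY));
    return 0;
}

If the failing step turns out to be one of the socket calls, the errno string (EADDRINUSE versus something else) should tell us whether this really is a stale-socket problem.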
- Lee-Ping

On Sep 30, 2014, at 12:06 PM, Ralph Castain <r...@open-mpi.org> wrote:

> On Sep 30, 2014, at 11:19 AM, Lee-Ping Wang <leep...@stanford.edu> wrote:
>
>> Hi Ralph,
>>
>>>> If so, then I should be able to (1) locate where the port number is
>>>> defined in the code, and (2) randomize the port number every time it's
>>>> called to work around the issue. What do you think?
>>>
>>> That might work, depending on the code. I'm not sure what it is trying to
>>> connect to, and if that code knows how to handle arbitrary connections.
>>
>> The main reason why Q-Chem is using MPI is for executing parallel tasks on a
>> single node. Thus, I think it's just the MPI ranks attempting to connect
>> with each other on the same machine. This could be off the mark because I'm
>> still a novice with respect to MPI concepts - but I am sure it is just one
>> machine.
>
> Your statement doesn't match what you sent us - you showed that it was your
> connection code that was failing, not ours. You wouldn't have gotten that far
> if our connections failed, as you would have failed in MPI_Init. You are
> clearly much further than that, as you already passed an MPI_Barrier before
> reaching the code in question.
>
>>> You might check about those warnings - could be that QCLOCALSCR and QCREF
>>> need to be set for the code to work.
>>
>> Thanks; I don't think these environment variables are the issue, but I will
>> check again. The calculation runs without any problems on four different
>> clusters (where I don't set these environment variables either); it's only
>> broken on the Blue Waters compute node. Also, the calculation runs without
>> any problems the first time it's executed on the BW compute node - it's only
>> subsequent executions that give the error messages.
>>
>> Thanks,
>>
>> - Lee-Ping
>>
>> On Sep 30, 2014, at 11:05 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>>> On Sep 30, 2014, at 10:49 AM, Lee-Ping Wang <leep...@stanford.edu> wrote:
>>>
>>>> Hi Ralph,
>>>>
>>>> Thank you. I think your diagnosis is probably correct. Are these sockets
>>>> the same as the TCP/UDP ports (though with different numbers) that are
>>>> used in web servers, email, etc.?
>>>
>>> Yes
>>>
>>>> If so, then I should be able to (1) locate where the port number is
>>>> defined in the code, and (2) randomize the port number every time it's
>>>> called to work around the issue. What do you think?
>>>
>>> That might work, depending on the code. I'm not sure what it is trying to
>>> connect to, and if that code knows how to handle arbitrary connections.
>>>
>>> You might check about those warnings - could be that QCLOCALSCR and QCREF
>>> need to be set for the code to work.
>>>
>>>> - Lee-Ping
>>>>
>>>> On Sep 29, 2014, at 8:45 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>
>>>>> I don't know anything about your application, or what the functions in
>>>>> your code are doing. I imagine it's possible that you are trying to open
>>>>> statically defined ports, which means that running the job again too soon
>>>>> could leave the OS thinking the socket is already busy. It takes a while
>>>>> for the OS to release a socket resource.
>>>>>
>>>>> On Sep 29, 2014, at 5:49 PM, Lee-Ping Wang <leep...@stanford.edu> wrote:
>>>>>
>>>>>> Here's another data point that might be useful: the error message is
>>>>>> much rarer if I run my application on 4 cores instead of 8.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> - Lee-Ping
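Coming back to the port-randomization idea further up: if it turns out the code really does bind fixed port numbers, I'd rather let the OS choose than randomize by hand. Below is a minimal sketch of what I mean - bind to port 0 so the kernel assigns an ephemeral port, and set SO_REUSEADDR so a socket left in TIME_WAIT by the previous run can't make bind() fail with EADDRINUSE. The helper name open_ephemeral_listener() and the structure are mine, not the actual Q-Chem code:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Open a listening TCP socket on an OS-assigned port; return the fd or -1. */
static int open_ephemeral_listener(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    int yes = 1;                      /* tolerate a previous run's TIME_WAIT socket */
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof(yes));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(0);  /* 0: the kernel picks a free ephemeral port */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 || listen(fd, 16) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

int main(void)
{
    int fd = open_ephemeral_listener();
    if (fd < 0) { perror("open_ephemeral_listener"); return 1; }

    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    getsockname(fd, (struct sockaddr *)&addr, &len);  /* report the assigned port */
    printf("listening on port %d\n", ntohs(addr.sin_port));

    close(fd);
    return 0;
}

Of course, if new_server_socket() already passes 0 for the port - which the code snippet quoted in my 5:38 PM message below seems to suggest - then the collision must be happening somewhere else.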
>>>>>> On Sep 29, 2014, at 5:38 PM, Lee-Ping Wang <leep...@stanford.edu> wrote:
>>>>>>
>>>>>>> Sorry for my last email - I think I spoke too quickly. I realized after
>>>>>>> reading some more documentation that OpenMPI always uses TCP sockets
>>>>>>> for out-of-band communication, so it doesn't make sense for me to set
>>>>>>> OMPI_MCA_oob=^tcp. That said, I am still running into a strange problem
>>>>>>> in my application when running on a specific machine (a Blue Waters
>>>>>>> compute node); I don't see this problem on any other nodes.
>>>>>>>
>>>>>>> When I run the same job (~5 seconds) in rapid succession, I see the
>>>>>>> following error message on the second execution:
>>>>>>>
>>>>>>> /tmp/leeping/opt/qchem-4.2/bin/parallel.csh, , qcopt_reactants.in, 8, 0, ./qchem24825/
>>>>>>> MPIRUN in parallel.csh is /tmp/leeping/opt/qchem-4.2/ext-libs/openmpi/bin/mpirun
>>>>>>> P4_RSHCOMMAND in parallel.csh is ssh
>>>>>>> QCOUTFILE is stdout
>>>>>>> Q-Chem machineFile is /tmp/leeping/opt/qchem-4.2/bin/mpi/machines
>>>>>>> [nid15081:24859] Warning: could not find environment variable "QCLOCALSCR"
>>>>>>> [nid15081:24859] Warning: could not find environment variable "QCREF"
>>>>>>> initial socket setup ...start
>>>>>>> -------------------------------------------------------
>>>>>>> Primary job terminated normally, but 1 process returned
>>>>>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>>>>>> -------------------------------------------------------
>>>>>>> --------------------------------------------------------------------------
>>>>>>> mpirun detected that one or more processes exited with non-zero status,
>>>>>>> thus causing the job to be terminated. The first process to do so was:
>>>>>>>
>>>>>>>   Process name: [[46773,1],0]
>>>>>>>   Exit code:    255
>>>>>>> --------------------------------------------------------------------------
>>>>>>>
>>>>>>> And here's the source code where the program is exiting (before
>>>>>>> "initial socket setup ...done"):
>>>>>>>
>>>>>>> int GPICommSoc::init(MPI_Comm comm0) {
>>>>>>>
>>>>>>>     /* setup basic MPI information */
>>>>>>>     init_comm(comm0);
>>>>>>>
>>>>>>>     MPI_Barrier(comm);
>>>>>>>
>>>>>>>     /*-- start inisock and set the serveraddr[] array --*/
>>>>>>>     if (me == 0) {
>>>>>>>         fprintf(stdout, "initial socket setup ...start\n");
>>>>>>>         fflush(stdout);
>>>>>>>     }
>>>>>>>
>>>>>>>     // create the initial socket
>>>>>>>     inisock = new_server_socket(NULL, 0);
>>>>>>>
>>>>>>>     // fill and gather the serveraddr array
>>>>>>>     int szsock = sizeof(SOCKADDR);
>>>>>>>     memset(&serveraddr[0], 0, szsock*nproc);
>>>>>>>     int iniport = get_sockport(inisock);
>>>>>>>     set_sockaddr_byhname(NULL, iniport, &serveraddr[me]);
>>>>>>>     // printsockaddr( serveraddr[me] );
>>>>>>>
>>>>>>>     SOCKADDR addrsend = serveraddr[me];
>>>>>>>     MPI_Allgather(&addrsend, szsock, MPI_BYTE,
>>>>>>>                   &serveraddr[0], szsock, MPI_BYTE, comm);
>>>>>>>
>>>>>>>     if (me == 0) {
>>>>>>>         fprintf(stdout, "initial socket setup ...done\n");
>>>>>>>         fflush(stdout);
>>>>>>>     }
>>>>>>>
>>>>>>> I didn't write this part of the program and I'm really a novice with
>>>>>>> MPI - but it seems like the initial execution of the program isn't
>>>>>>> freeing up some system resource as it should. Is there something that
>>>>>>> needs to be corrected in the code?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> - Lee-Ping
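A note on where I'll look first: since the failure happens somewhere between the "...start" and "...done" prints, the suspects are new_server_socket(), get_sockport(), set_sockaddr_byhname(), and the MPI_Allgather. If set_sockaddr_byhname() resolves the local hostname - I'm only guessing that from its name, I haven't read that code yet - then a standalone check like the one below, run on the compute node right after a failing job, would tell me whether name lookup itself is working there. Everything here is my own sketch, not Q-Chem code:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
    char host[256];
    if (gethostname(host, sizeof(host)) != 0) {
        perror("gethostname");
        return 1;
    }

    struct addrinfo hints, *res = NULL;
    memset(&hints, 0, sizeof(hints));
    hints.ai_family   = AF_INET;
    hints.ai_socktype = SOCK_STREAM;

    /* Does the node's own hostname resolve to an address? */
    int rc = getaddrinfo(host, NULL, &hints, &res);
    if (rc != 0) {
        fprintf(stderr, "getaddrinfo(%s): %s\n", host, gai_strerror(rc));
        return 1;
    }

    char ip[INET_ADDRSTRLEN];
    struct sockaddr_in *sin = (struct sockaddr_in *)res->ai_addr;
    inet_ntop(AF_INET, &sin->sin_addr, ip, sizeof(ip));
    printf("%s resolves to %s\n", host, ip);

    freeaddrinfo(res);
    return 0;
}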
>>>>>>> On Sep 29, 2014, at 5:12 PM, Lee-Ping Wang <leep...@stanford.edu> wrote:
>>>>>>>
>>>>>>>> Hi there,
>>>>>>>>
>>>>>>>> My application uses MPI to run parallel jobs on a single node, so I
>>>>>>>> have no need of any support for communication between nodes. However,
>>>>>>>> when I use mpirun to launch my application I see strange errors such as:
>>>>>>>>
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> No network interfaces were found for out-of-band communications. We require
>>>>>>>> at least one available network for out-of-band messaging.
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>
>>>>>>>> [nid23206:10697] [[33772,1],0] ORTE_ERROR_LOG: Unable to open a TCP
>>>>>>>> socket for out-of-band communications in file oob_tcp_listener.c at line 113
>>>>>>>> [nid23206:10697] [[33772,1],0] ORTE_ERROR_LOG: Unable to open a TCP
>>>>>>>> socket for out-of-band communications in file oob_tcp_component.c at line 584
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> It looks like orte_init failed for some reason; your parallel process is
>>>>>>>> likely to abort. There are many reasons that a parallel process can
>>>>>>>> fail during orte_init; some of which are due to configuration or
>>>>>>>> environment problems. This failure appears to be an internal failure;
>>>>>>>> here's some additional information (which may only be relevant to an
>>>>>>>> Open MPI developer):
>>>>>>>>
>>>>>>>>   orte_oob_base_select failed
>>>>>>>>   --> Returned value (null) (-43) instead of ORTE_SUCCESS
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>
>>>>>>>> /home/leeping/opt/qchem-4.2/ext-libs/openmpi/lib/libmpi.so.1(+0xfeaa9)[0x2b77e9de5aa9]
>>>>>>>> /home/leeping/opt/qchem-4.2/ext-libs/openmpi/lib/libmpi.so.1(ompi_btl_openib_connect_base_select_for_local_port+0xd0)[0x2b77e9de13a0]
>>>>>>>>
>>>>>>>> It seems like in each case, OpenMPI is trying to use some feature
>>>>>>>> related to networking and crashing as a result. My workaround is to
>>>>>>>> deduce the components that are crashing and disable them in my
>>>>>>>> environment variables like this:
>>>>>>>>
>>>>>>>> export OMPI_MCA_btl=self,sm
>>>>>>>> export OMPI_MCA_oob=^tcp
>>>>>>>>
>>>>>>>> Is there a better way to do this - i.e. explicitly prohibit OpenMPI
>>>>>>>> from using any network-related feature and run only on the local node?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> - Lee-Ping
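One follow-up note on my own question above, for anyone reading the archive: as I said in my 5:38 PM message, OMPI_MCA_oob=^tcp turned out not to make sense, since Open MPI needs TCP sockets for its out-of-band messaging; the btl restriction is the part that still seems useful. I believe the same restriction can also be pinned from inside the program before MPI_Init, since Open MPI picks up OMPI_MCA_* variables from the process environment. A rough sketch - and I'm not certain this reaches the runtime-level oob framework, so the shell-level exports above are probably the safer route:

#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    /* Restrict MPI point-to-point transports to shared memory and self,
       i.e. no network BTLs; equivalent to "export OMPI_MCA_btl=self,sm". */
    setenv("OMPI_MCA_btl", "self,sm", 1);

    MPI_Init(&argc, &argv);
    /* ... single-node parallel work goes here ... */
    MPI_Finalize();
    return 0;
}

Either way, the goal is the same: keep all MPI traffic on sm/self so nothing touches the network stack. I'll report back once the print statements tell me which call is actually failing.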