Yes, I am setting the config correctly. Our IB machines seem to run just fine so far using srun and Open MPI v1.5.
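
For reference, "the config" here means what the SLURM MPI guide (linked below) describes: a reserved port range in slurm.conf plus --resv-ports on the srun line. The port range, task counts, and ./a.out below are only placeholders, not our actual values:

    # slurm.conf (example range only)
    MpiParams=ports=12000-12999

    # direct launch, reserving ports for the job step
    $ srun --resv-ports -n 40 --ntasks-per-node=8 ./a.out

    # sanity check that the reservation actually reaches the tasks
    $ srun --resv-ports -N 2 env | grep SLURM_RESV_PORTS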
As another data point, we enabled MPI threads in Open MPI and that also seems to trigger the srun/TCP behavior, but on the IB fabric. Running the program within an salloc rather than a straight srun makes the problem go away.

On Tue, Jan 25, 2011 at 2:59 PM, Nathan Hjelm <hje...@lanl.gov> wrote:
> We are seeing a similar problem with our InfiniBand machines. After some investigation I discovered that we were not setting our SLURM environment correctly (ref: https://computing.llnl.gov/linux/slurm/mpi_guide.html#open_mpi). Are you setting the ports in your slurm.conf and executing srun with --resv-ports?
>
> I have yet to see if this fixes the problem for LANL. Waiting on a sysadmin to modify the slurm.conf.
>
> -Nathan
> HPC-3, LANL
>
> On Tue, 25 Jan 2011, Michael Di Domenico wrote:
>
>> Thanks. We're only seeing it on machines with Ethernet as the only interconnect. Fortunately for us that only equates to one small machine, but it's still annoying. Unfortunately, I don't have enough knowledge to dive into the code to help fix it, but I can certainly help test.
>>
>> On Mon, Jan 24, 2011 at 1:41 PM, Nathan Hjelm <hje...@lanl.gov> wrote:
>>>
>>> I am seeing similar issues on our SLURM clusters. We are looking into the issue.
>>>
>>> -Nathan
>>> HPC-3, LANL
>>>
>>> On Tue, 11 Jan 2011, Michael Di Domenico wrote:
>>>
>>>> Any ideas on what might be causing this one? Or at least what additional debug information someone might need?
>>>>
>>>> On Fri, Jan 7, 2011 at 4:03 PM, Michael Di Domenico <mdidomeni...@gmail.com> wrote:
>>>>>
>>>>> I'm still testing the SLURM integration, which seems to work fine so far. However, I just upgraded another cluster to Open MPI 1.5 and SLURM 2.1.15, but this machine has no InfiniBand.
>>>>>
>>>>> If I salloc the nodes and mpirun the command, it seems to run and complete fine. However, if I srun the command I get:
>>>>>
>>>>> [btl_tcp_endpoint:486] mca_btl_tcp_endpoint_recv_connect_ack received unexpected prcoess identifier
>>>>>
>>>>> The job does not seem to run, but exhibits two behaviors: running a single process per node, the job runs and does not present the error (srun -N40 --ntasks-per-node=1); running multiple processes per node, the job spits out the error but does not run (srun -n40 --ntasks-per-node=8).
>>>>>
>>>>> I copied the configs from the other machine, so (I think) everything should be configured correctly (but I can't rule it out).
>>>>>
>>>>> I saw (and reported) a similar error to the above with the 1.4-dev branch (see mailing list) and SLURM; I can't say whether they're related or not, though.
>>>>>
>>>>> On Mon, Jan 3, 2011 at 3:00 PM, Jeff Squyres <jsquy...@cisco.com> wrote:
>>>>>>
>>>>>> Yo Ralph --
>>>>>>
>>>>>> I see this was committed: https://svn.open-mpi.org/trac/ompi/changeset/24197. Do you want to add a blurb in README about it, and/or have this executable compiled as part of the PSM MTL and then installed into $bindir (maybe named ompi-psm-keygen)?
>>>>>>
>>>>>> Right now, it's only compiled as part of "make check" and not installed, right?
>>>>>>
>>>>>> On Dec 30, 2010, at 5:07 PM, Ralph Castain wrote:
>>>>>>
>>>>>>> Run the program only once - it can be in the prolog of the job if you like. The output value needs to be in the env of every rank.
>>>>>>>
>>>>>>> You can reuse the value as many times as you like - it doesn't have to be unique for each job. There is nothing magic about the value itself.
>>>>>>>
>>>>>>> On Dec 30, 2010, at 2:11 PM, Michael Di Domenico wrote:
>>>>>>>
>>>>>>>> How early does this need to run? Can I run it as part of a task prolog, or does it need to be in the shell env for each rank? And does it need to run on one node or all the nodes in the job?
>>>>>>>>
>>>>>>>> On Thu, Dec 30, 2010 at 8:54 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>
>>>>>>>>> Well, I couldn't do it as a patch - it proved too complicated, as the PSM system looks for the value early in the boot procedure.
>>>>>>>>>
>>>>>>>>> What I can do is give you the attached key generator program. It outputs the envar required to run your program. So if you run the attached program and then export the output into your environment, you should be okay. Looks like this:
>>>>>>>>>
>>>>>>>>> $ ./psm_keygen
>>>>>>>>> OMPI_MCA_orte_precondition_transports=0099b3eaa2c1547e-afb287789133a954
>>>>>>>>> $
>>>>>>>>>
>>>>>>>>> You compile the program with the usual mpicc.
>>>>>>>>>
>>>>>>>>> Let me know if this solves the problem (or not).
>>>>>>>>> Ralph
>>>>>>>>>
>>>>>>>>> On Dec 30, 2010, at 11:18 AM, Michael Di Domenico wrote:
>>>>>>>>>
>>>>>>>>>> Sure, I'll give it a go.
>>>>>>>>>>
>>>>>>>>>> On Thu, Dec 30, 2010 at 5:53 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Ah, yes - that is going to be a problem. The PSM key gets generated by mpirun as it is shared info - i.e., every proc has to get the same value.
>>>>>>>>>>>
>>>>>>>>>>> I can create a patch that will do this for the srun direct-launch scenario, if you want to try it. Would be later today, though.
>>>>>>>>>>>
>>>>>>>>>>> On Dec 30, 2010, at 10:31 AM, Michael Di Domenico wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Well, maybe not hooray yet. I might have jumped the gun a bit; it's looking like srun works in general, but perhaps not with PSM.
>>>>>>>>>>>>
>>>>>>>>>>>> With PSM I get this error (at least now I know what I changed):
>>>>>>>>>>>>
>>>>>>>>>>>> Error obtaining unique transport key from ORTE (orte_precondition_transports not present in the environment)
>>>>>>>>>>>> PML add procs failed
>>>>>>>>>>>> --> Returned "Error" (-1) instead of "Success" (0)
>>>>>>>>>>>>
>>>>>>>>>>>> Turn off PSM and srun works fine.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Dec 30, 2010 at 5:13 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hooray!
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Dec 30, 2010, at 9:57 AM, Michael Di Domenico wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think I take it all back. I just tried it again and it seems to work now. I'm not sure what I changed (between my first and this msg), but it does appear to work now.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Dec 30, 2010 at 4:31 PM, Michael Di Domenico <mdidomeni...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yes, that's true, error messages help. I was hoping there was some documentation to see what I've done wrong.
>>>>>>>>>>>>>>> I can't easily cut and paste errors from my cluster.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Here's a snippet (hand-typed) of the error message, but it does look like a rank communications error:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file rml_oob_send.c at line 145.
>>>>>>>>>>>>>>> *** MPI_INIT failure message (snipped) ***
>>>>>>>>>>>>>>> orte_grpcomm_modex failed
>>>>>>>>>>>>>>> --> Returned "A message is attempting to be sent to a process whose contact information is unknown" (-117) instead of "Success" (0)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This msg repeats for each rank and ultimately hangs the srun, which I have to Ctrl-C to terminate.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have the MPI ports defined in my slurm config, and running srun with --resv-ports does show the SLURM_RESV_PORTS environment variable getting passed to the shell.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Dec 23, 2010 at 8:09 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm not sure there is any documentation yet - not much clamor for it. :-/
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It would really help if you included the error message. Otherwise, all I can do is guess, which wastes both of our time :-(
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> My best guess is that the port reservation didn't get passed down to the MPI procs properly - but that's just a guess.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Dec 23, 2010, at 12:46 PM, Michael Di Domenico wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Can anyone point me towards the most recent documentation for using srun and Open MPI?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I followed what I found on the web with enabling the MpiPorts config in SLURM and using the --resv-ports switch, but I'm getting an error from Open MPI during setup.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'm using SLURM 2.1.15 and Open MPI 1.5 w/PSM.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'm sure I'm missing a step.
>>>>>>>>>>>>>>>>> Thanks
>>>>>>
>>>>>> --
>>>>>> Jeff Squyres
>>>>>> jsquy...@cisco.com
>>>>>> For corporate legal information go to:
>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
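
P.S. On the PSM transport-key piece discussed above: since Ralph notes the value only has to be identical across the ranks of a job and present in each rank's environment (and is otherwise not magic), a rough shell stand-in for his psm_keygen program would be to set the variable once in the allocation/batch script before srun. Untested sketch; openssl is my substitute for his keygen, the two-hex-words-joined-by-a-dash format is copied from his example output, and ./a.out and the task count are placeholders:

    # generate the key once, so every task launched below inherits the same value
    export OMPI_MCA_orte_precondition_transports=$(openssl rand -hex 8)-$(openssl rand -hex 8)

    # direct launch; srun exports the environment to all tasks
    srun --resv-ports -n 40 ./a.out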