I'm still testing the Slurm integration, which seems to work fine so far. However, I just upgraded another cluster to Open MPI 1.5 and Slurm 2.1.15, but this machine has no InfiniBand.

If I salloc the nodes and mpirun the command, it seems to run and complete fine. However, if I srun the command I get:

  [btl_tcp_endpoint:486] mca_btl_tcp_endpoint_recv_connect_ack received unexpected process identifier

and the job does not seem to run. It exhibits two behaviors:

- running a single process per node, the job runs and does not produce the error (srun -N40 --ntasks-per-node=1)
- running multiple processes per node, the job spits out the error but does not run (srun -n40 --ntasks-per-node=8)

I copied the configs from the other machine, so (I think) everything should be configured correctly, but I can't rule it out.

I saw (and reported) a similar error with the 1.4-dev branch and Slurm (see mailing list); I can't say whether they're related or not, though.
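To be concrete, the three invocations look roughly like this (my_mpi_app is just a stand-in for the real program):

  # allocate the nodes, then launch with mpirun inside the allocation:
  # runs and completes fine
  $ salloc -N40
  $ mpirun ./my_mpi_app

  # direct srun launch, one process per node: runs, no error
  $ srun -N40 --ntasks-per-node=1 ./my_mpi_app

  # direct srun launch, multiple processes per node: prints the
  # btl_tcp_endpoint error above and the job does not run
  $ srun -n40 --ntasks-per-node=8 ./my_mpi_app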
On Mon, Jan 3, 2011 at 3:00 PM, Jeff Squyres <jsquy...@cisco.com> wrote:
> Yo Ralph --
>
> I see this was committed https://svn.open-mpi.org/trac/ompi/changeset/24197.
> Do you want to add a blurb in README about it, and/or have this executable
> compiled as part of the PSM MTL and then installed into $bindir (maybe named
> ompi-psm-keygen)?
>
> Right now, it's only compiled as part of "make check" and not installed,
> right?
>
> On Dec 30, 2010, at 5:07 PM, Ralph Castain wrote:
>
>> Run the program only once - it can be in the prolog of the job if you like.
>> The output value needs to be in the env of every rank.
>>
>> You can reuse the value as many times as you like - it doesn't have to be
>> unique for each job. There is nothing magic about the value itself.
>>
>> On Dec 30, 2010, at 2:11 PM, Michael Di Domenico wrote:
>>
>>> How early does this need to run? Can I run it as part of a task
>>> prolog, or does it need to be the shell env for each rank? And does
>>> it need to run on one node or all the nodes in the job?
>>>
>>> On Thu, Dec 30, 2010 at 8:54 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>> Well, I couldn't do it as a patch - proved too complicated as the psm
>>>> system looks for the value early in the boot procedure.
>>>>
>>>> What I can do is give you the attached key generator program. It outputs
>>>> the envar required to run your program. So if you run the attached program
>>>> and then export the output into your environment, you should be okay.
>>>> Looks like this:
>>>>
>>>> $ ./psm_keygen
>>>> OMPI_MCA_orte_precondition_transports=0099b3eaa2c1547e-afb287789133a954
>>>> $
>>>>
>>>> You compile the program with the usual mpicc.
>>>>
>>>> Let me know if this solves the problem (or not).
>>>> Ralph
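[A rough, untested sketch of the prolog route Ralph mentions, for the archives. The key file and prolog paths are just examples, and it relies on Slurm's documented TaskProlog behavior of turning any "export NAME=value" line printed on stdout into an environment variable for each task:]

  # generate the key once, anywhere, and stash the output; the file then
  # holds a single OMPI_MCA_orte_precondition_transports=... line like the
  # one Ralph shows above
  $ ./psm_keygen > /etc/openmpi-psm-key

  # slurm.conf
  TaskProlog=/etc/slurm/task_prolog

  # /etc/slurm/task_prolog
  #!/bin/sh
  # prefix the stored line with "export" so slurm injects it into every rank
  sed 's/^/export /' /etc/openmpi-psm-key

[Or, without touching slurm.conf, export it in the launching shell - srun exports the caller's environment to the tasks by default:]

  $ export $(./psm_keygen)
  $ srun -n40 ./my_psm_app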
>>>> On Dec 30, 2010, at 11:18 AM, Michael Di Domenico wrote:
>>>>
>>>>> Sure, I'll give it a go
>>>>>
>>>>> On Thu, Dec 30, 2010 at 5:53 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>> Ah, yes - that is going to be a problem. The PSM key gets generated by
>>>>>> mpirun as it is shared info - i.e., every proc has to get the same value.
>>>>>>
>>>>>> I can create a patch that will do this for the srun direct-launch
>>>>>> scenario, if you want to try it. Would be later today, though.
>>>>>>
>>>>>> On Dec 30, 2010, at 10:31 AM, Michael Di Domenico wrote:
>>>>>>
>>>>>>> Well, maybe not hooray yet. I might have jumped the gun a bit; it's
>>>>>>> looking like srun works in general, but perhaps not with PSM.
>>>>>>>
>>>>>>> With PSM I get this error (at least now I know what I changed):
>>>>>>>
>>>>>>> Error obtaining unique transport key from ORTE
>>>>>>> (orte_precondition_transports not present in the environment)
>>>>>>> PML add procs failed
>>>>>>> --> Returned "Error" (-1) instead of "Success" (0)
>>>>>>>
>>>>>>> Turn off PSM and srun works fine.
>>>>>>>
>>>>>>> On Thu, Dec 30, 2010 at 5:13 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>> Hooray!
>>>>>>>>
>>>>>>>> On Dec 30, 2010, at 9:57 AM, Michael Di Domenico wrote:
>>>>>>>>
>>>>>>>>> I think I take it all back. I just tried it again and it seems to
>>>>>>>>> work now. I'm not sure what I changed (between my first and this
>>>>>>>>> msg), but it does appear to work now.
>>>>>>>>>
>>>>>>>>> On Thu, Dec 30, 2010 at 4:31 PM, Michael Di Domenico
>>>>>>>>> <mdidomeni...@gmail.com> wrote:
>>>>>>>>>> Yes, that's true, error messages help. I was hoping there was some
>>>>>>>>>> documentation to see what I've done wrong. I can't easily cut and
>>>>>>>>>> paste errors from my cluster.
>>>>>>>>>>
>>>>>>>>>> Here's a snippet (hand typed) of the error message, but it does look
>>>>>>>>>> like a rank communications error:
>>>>>>>>>>
>>>>>>>>>> ORTE_ERROR_LOG: A message is attempting to be sent to a process whose
>>>>>>>>>> contact information is unknown in file rml_oob_send.c at line 145.
>>>>>>>>>> *** MPI_INIT failure message (snipped) ***
>>>>>>>>>> orte_grpcomm_modex failed
>>>>>>>>>> --> Returned "A message is attempting to be sent to a process whose
>>>>>>>>>> contact information is unknown" (-117) instead of "Success" (0)
>>>>>>>>>>
>>>>>>>>>> This message repeats for each rank and ultimately hangs the srun,
>>>>>>>>>> which I have to Ctrl-C to terminate.
>>>>>>>>>>
>>>>>>>>>> I have mpiports defined in my slurm config, and running srun with
>>>>>>>>>> --resv-ports does show the SLURM_RESV_PORTS environment variable
>>>>>>>>>> getting passed to the shell.
>>>>>>>>>>
>>>>>>>>>> On Thu, Dec 23, 2010 at 8:09 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>>> I'm not sure there is any documentation yet - not much clamor for
>>>>>>>>>>> it. :-/
>>>>>>>>>>>
>>>>>>>>>>> It would really help if you included the error message. Otherwise,
>>>>>>>>>>> all I can do is guess, which wastes both of our time :-(
>>>>>>>>>>>
>>>>>>>>>>> My best guess is that the port reservation didn't get passed down
>>>>>>>>>>> to the MPI procs properly - but that's just a guess.
>>>>>>>>>>>
>>>>>>>>>>> On Dec 23, 2010, at 12:46 PM, Michael Di Domenico wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Can anyone point me towards the most recent documentation for using
>>>>>>>>>>>> srun and openmpi?
>>>>>>>>>>>>
>>>>>>>>>>>> I followed what I found on the web with enabling the MpiPorts config
>>>>>>>>>>>> in slurm and using the --resv-ports switch, but I'm getting an error
>>>>>>>>>>>> from openmpi during setup.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm using Slurm 2.1.15 and Open MPI 1.5 w/PSM
>>>>>>>>>>>>
>>>>>>>>>>>> I'm sure I'm missing a step.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
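[For the archives, the port-reservation pieces referred to above look roughly like this; the actual port range is site-specific and only an example here:]

  # slurm.conf: reserve a port range for Open MPI to use with srun
  MpiParams=ports=12000-12999

  # launch a step with reserved ports; SLURM_RESV_PORTS then shows up in
  # the step's environment
  $ srun --resv-ports -n40 ./my_mpi_app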
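[And a quick way to sanity-check what actually reaches each rank - plain env and grep, reusing the launch flags from earlier in the thread:]

  # does the reserved port range reach the tasks?
  $ srun --resv-ports -N40 --ntasks-per-node=1 env | grep SLURM_RESV_PORTS

  # after exporting the key (see the keygen note above), does it reach the tasks?
  $ srun -N40 --ntasks-per-node=1 env | grep OMPI_MCA_orte_precondition_transports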