Any ideas on what might be causing this one? Or at least what additional debug information someone might need?
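If it helps, here is roughly how I can capture more verbose TCP BTL output on my end before the next run (a sketch; the MCA parameter is set through the usual OMPI_MCA_ environment-variable convention, and the application and log file names below are just placeholders):

$ export OMPI_MCA_btl_base_verbose=100
$ srun -n40 --ntasks-per-node=8 ./my_mpi_app 2>&1 | tee srun-btl-tcp.log
$ ompi_info --param btl tcp > btl-tcp-params.txt

That should show what the TCP BTL is doing around the point where the "unexpected process identifier" message appears, plus the TCP BTL settings actually in effect on the nodes.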
On Fri, Jan 7, 2011 at 4:03 PM, Michael Di Domenico <mdidomeni...@gmail.com> wrote:
> I'm still testing the slurm integration, which seems to work fine so far.
> However, I just upgraded another cluster to openmpi-1.5 and slurm 2.1.15,
> but this machine has no infiniband.
>
> If I salloc the nodes and mpirun the command, it seems to run and complete
> fine. However, if I srun the command I get
>
> [btl_tcp_endpoint:486] mca_btl_tcp_endpoint_recv_connect_ack received
> unexpected process identifier
>
> The job does not seem to run, but exhibits two behaviors:
> running a single process per node, the job runs and does not present
> the error (srun -N40 --ntasks-per-node=1);
> running multiple processes per node, the job spits out the error but
> does not run (srun -n40 --ntasks-per-node=8).
>
> I copied the configs from the other machine, so (I think) everything
> should be configured correctly (but I can't rule it out).
>
> I saw (and reported) a similar error to the one above with the 1.4-dev
> branch and slurm (see the mailing list); I can't say whether they're
> related or not, though.
>
>
> On Mon, Jan 3, 2011 at 3:00 PM, Jeff Squyres <jsquy...@cisco.com> wrote:
>> Yo Ralph --
>>
>> I see this was committed: https://svn.open-mpi.org/trac/ompi/changeset/24197.
>> Do you want to add a blurb in README about it, and/or have this executable
>> compiled as part of the PSM MTL and then installed into $bindir (maybe named
>> ompi-psm-keygen)?
>>
>> Right now, it's only compiled as part of "make check" and not installed,
>> right?
>>
>>
>> On Dec 30, 2010, at 5:07 PM, Ralph Castain wrote:
>>
>>> Run the program only once - it can be in the prolog of the job if you like.
>>> The output value needs to be in the env of every rank.
>>>
>>> You can reuse the value as many times as you like - it doesn't have to be
>>> unique for each job. There is nothing magic about the value itself.
>>>
>>> On Dec 30, 2010, at 2:11 PM, Michael Di Domenico wrote:
>>>
>>>> How early does this need to run? Can I run it as part of a task
>>>> prolog, or does it need to be in the shell env for each rank? And does
>>>> it need to run on one node or all the nodes in the job?
>>>>
>>>> On Thu, Dec 30, 2010 at 8:54 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>> Well, I couldn't do it as a patch - it proved too complicated, as the psm
>>>>> system looks for the value early in the boot procedure.
>>>>>
>>>>> What I can do is give you the attached key generator program. It outputs
>>>>> the envar required to run your program. So if you run the attached
>>>>> program and then export the output into your environment, you should be
>>>>> okay. Looks like this:
>>>>>
>>>>> $ ./psm_keygen
>>>>> OMPI_MCA_orte_precondition_transports=0099b3eaa2c1547e-afb287789133a954
>>>>> $
>>>>>
>>>>> You compile the program with the usual mpicc.
>>>>>
>>>>> Let me know if this solves the problem (or not).
>>>>> Ralph
>>>>>
>>>>>
>>>>> On Dec 30, 2010, at 11:18 AM, Michael Di Domenico wrote:
>>>>>
>>>>>> Sure, I'll give it a go.
>>>>>>
>>>>>> On Thu, Dec 30, 2010 at 5:53 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>> Ah, yes - that is going to be a problem. The PSM key gets generated by
>>>>>>> mpirun as it is shared info - i.e., every proc has to get the same
>>>>>>> value.
>>>>>>>
>>>>>>> I can create a patch that will do this for the srun direct-launch
>>>>>>> scenario, if you want to try it. Would be later today, though.
>>>>>>>
>>>>>>>
>>>>>>> On Dec 30, 2010, at 10:31 AM, Michael Di Domenico wrote:
>>>>>>>
>>>>>>>> Well, maybe not hooray yet.
>>>>>>>> I might have jumped the gun a bit; it's
>>>>>>>> looking like srun works in general, but perhaps not with PSM.
>>>>>>>>
>>>>>>>> With PSM I get this error (at least now I know what I changed):
>>>>>>>>
>>>>>>>> Error obtaining unique transport key from ORTE
>>>>>>>> (orte_precondition_transports not present in the environment)
>>>>>>>> PML add procs failed
>>>>>>>> --> Returned "Error" (-1) instead of "Success" (0)
>>>>>>>>
>>>>>>>> Turn off PSM and srun works fine.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Dec 30, 2010 at 5:13 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>> Hooray!
>>>>>>>>>
>>>>>>>>> On Dec 30, 2010, at 9:57 AM, Michael Di Domenico wrote:
>>>>>>>>>
>>>>>>>>>> I think I take it all back. I just tried it again and it seems to
>>>>>>>>>> work now. I'm not sure what I changed (between my first and this
>>>>>>>>>> msg), but it does appear to work now.
>>>>>>>>>>
>>>>>>>>>> On Thu, Dec 30, 2010 at 4:31 PM, Michael Di Domenico
>>>>>>>>>> <mdidomeni...@gmail.com> wrote:
>>>>>>>>>>> Yes, that's true, error messages help. I was hoping there was some
>>>>>>>>>>> documentation to see what I've done wrong. I can't easily cut and
>>>>>>>>>>> paste errors from my cluster.
>>>>>>>>>>>
>>>>>>>>>>> Here's a snippet (hand typed) of the error message, but it does look
>>>>>>>>>>> like a rank communications error:
>>>>>>>>>>>
>>>>>>>>>>> ORTE_ERROR_LOG: A message is attempting to be sent to a process whose
>>>>>>>>>>> contact information is unknown in file rml_oob_send.c at line 145.
>>>>>>>>>>> *** MPI_INIT failure message (snipped) ***
>>>>>>>>>>> orte_grpcomm_modex failed
>>>>>>>>>>> --> Returned "A message is attempting to be sent to a process whose
>>>>>>>>>>> contact information is unknown" (-117) instead of "Success" (0)
>>>>>>>>>>>
>>>>>>>>>>> This msg repeats for each rank, and ultimately hangs the srun, which I
>>>>>>>>>>> have to Ctrl-C and terminate.
>>>>>>>>>>>
>>>>>>>>>>> I have mpiports defined in my slurm config, and running srun with
>>>>>>>>>>> --resv-ports does show the SLURM_RESV_PORTS environment variable
>>>>>>>>>>> getting passed to the shell.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Dec 23, 2010 at 8:09 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>>>> I'm not sure there is any documentation yet - not much clamor for it. :-/
>>>>>>>>>>>>
>>>>>>>>>>>> It would really help if you included the error message. Otherwise,
>>>>>>>>>>>> all I can do is guess, which wastes both of our time :-(
>>>>>>>>>>>>
>>>>>>>>>>>> My best guess is that the port reservation didn't get passed down
>>>>>>>>>>>> to the MPI procs properly - but that's just a guess.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Dec 23, 2010, at 12:46 PM, Michael Di Domenico wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Can anyone point me towards the most recent documentation for using
>>>>>>>>>>>>> srun and openmpi?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I followed what I found on the web with enabling the MpiPorts config
>>>>>>>>>>>>> in slurm and using the --resv-ports switch, but I'm getting an error
>>>>>>>>>>>>> from openmpi during setup.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm using Slurm 2.1.15 and Openmpi 1.5 w/PSM.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm sure I'm missing a step.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
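For the record, the PSM piece of the quoted thread boils down to generating the transport key once and exporting it before the launch, roughly like this (a sketch; the keygen name is taken from Ralph's example above, and the application name is just a placeholder):

$ export $(./psm_keygen)
$ srun -n40 --ntasks-per-node=8 ./my_mpi_app

Since every rank has to see the same OMPI_MCA_orte_precondition_transports value, the key should be generated once per launch (or reused across launches), not regenerated independently on each node.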