I'm still testing the SLURM integration, which seems to work fine so
far.  However, I just upgraded another cluster to Open MPI 1.5 and
SLURM 2.1.15, but this machine has no InfiniBand.

If I salloc the nodes and mpirun the command, it seems to run and complete
fine.  However, if I srun the command, I get:

[btl_tcp_endpoint:486] mca_btl_tcp_endpoint_recv_connect_ack received
unexpected process identifier

The job exhibits two behaviors:

- running a single process per node, the job runs and does not present
  the error (srun -N40 --ntasks-per-node=1)
- running multiple processes per node, the job spits out the error but
  does not run (srun -n40 --ntasks-per-node=8)

I copied the configs from the other machine, so (I think) everything
should be configured correctly, but I can't rule out a config error.

I saw (and reported) a similar error with the 1.4-dev branch and SLURM
(see the mailing list archives); I can't say whether the two issues are
related, though.


On Mon, Jan 3, 2011 at 3:00 PM, Jeff Squyres <jsquy...@cisco.com> wrote:
> Yo Ralph --
>
> I see this was committed https://svn.open-mpi.org/trac/ompi/changeset/24197.  
> Do you want to add a blurb in README about it, and/or have this executable 
> compiled as part of the PSM MTL and then installed into $bindir (maybe named 
> ompi-psm-keygen)?
>
> Right now, it's only compiled as part of "make check" and not installed, 
> right?
>
>
>
> On Dec 30, 2010, at 5:07 PM, Ralph Castain wrote:
>
>> Run the program only once - it can be in the prolog of the job if you like. 
>> The output value needs to be in the env of every rank.
>>
>> You can reuse the value as many times as you like - it doesn't have to be 
>> unique for each job. There is nothing magic about the value itself.
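>>
>> A minimal sketch of the prolog approach (the script is a placeholder, and
>> the key value is just the example output from the key generator quoted
>> further down):
>>
>>   #!/bin/sh
>>   # SLURM TaskProlog: stdout lines of the form "export NAME=value" are
>>   # added to each task's environment, so every rank sees the same key.
>>   echo "export OMPI_MCA_orte_precondition_transports=0099b3eaa2c1547e-afb287789133a954"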
>>
>> On Dec 30, 2010, at 2:11 PM, Michael Di Domenico wrote:
>>
>>> How early does this need to run?  Can I run it as part of a task
>>> prolog, or does it need to be in the shell env for each rank?  And does
>>> it need to run on one node or on all the nodes in the job?
>>>
>>> On Thu, Dec 30, 2010 at 8:54 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>> Well, I couldn't do it as a patch - it proved too complicated, as the PSM
>>>> system looks for the value early in the boot procedure.
>>>>
>>>> What I can do is give you the attached key generator program. It outputs 
>>>> the envar required to run your program. So if you run the attached program 
>>>> and then export the output into your environment, you should be okay. 
>>>> Looks like this:
>>>>
>>>> $ ./psm_keygen
>>>> OMPI_MCA_orte_precondition_transports=0099b3eaa2c1547e-afb287789133a954
>>>> $
>>>>
>>>> You compile the program with the usual mpicc.
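>>>>
>>>> For example, roughly (the source file and application names are guesses):
>>>>
>>>> $ mpicc -o psm_keygen psm_keygen.c
>>>> $ export $(./psm_keygen)
>>>> $ srun -n 40 ./your_mpi_app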
>>>>
>>>> Let me know if this solves the problem (or not).
>>>> Ralph
>>>>
>>>>
>>>>
>>>>
>>>> On Dec 30, 2010, at 11:18 AM, Michael Di Domenico wrote:
>>>>
>>>>> Sure, I'll give it a go.
>>>>>
>>>>> On Thu, Dec 30, 2010 at 5:53 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>> Ah, yes - that is going to be a problem. The PSM key gets generated by 
>>>>>> mpirun as it is shared info - i.e., every proc has to get the same value.
>>>>>>
>>>>>> I can create a patch that will do this for the srun direct-launch 
>>>>>> scenario, if you want to try it. Would be later today, though.
>>>>>>
>>>>>>
>>>>>> On Dec 30, 2010, at 10:31 AM, Michael Di Domenico wrote:
>>>>>>
>>>>>>> Well, maybe not hooray yet.  I might have jumped the gun a bit - it's
>>>>>>> looking like srun works in general, but perhaps not with PSM.
>>>>>>>
>>>>>>> With PSM I get this error (at least now I know what I changed):
>>>>>>>
>>>>>>> Error obtaining unique transport key from ORTE
>>>>>>> (orte_precondition_transports not present in the environment)
>>>>>>> PML add procs failed
>>>>>>> --> Returned "Error" (-1) instead of "Success" (0)
>>>>>>>
>>>>>>> Turn off PSM and srun works fine
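>>>>>>>
>>>>>>> For reference, a sketch of one way to turn it off for a run (the exact
>>>>>>> MCA settings here are illustrative, not definitive):
>>>>>>>
>>>>>>> export OMPI_MCA_pml=ob1     # force the ob1 PML (BTL path) instead of cm/PSM
>>>>>>> export OMPI_MCA_mtl=^psm    # and exclude the PSM MTL
>>>>>>> srun -n 40 ./your_mpi_app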
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Dec 30, 2010 at 5:13 PM, Ralph Castain <r...@open-mpi.org> 
>>>>>>> wrote:
>>>>>>>> Hooray!
>>>>>>>>
>>>>>>>> On Dec 30, 2010, at 9:57 AM, Michael Di Domenico wrote:
>>>>>>>>
>>>>>>>>> I think I take it all back.  I just tried it again and it seems to
>>>>>>>>> work now.  I'm not sure what I changed (between my first message and
>>>>>>>>> this one), but it does appear to work now.
>>>>>>>>>
>>>>>>>>> On Thu, Dec 30, 2010 at 4:31 PM, Michael Di Domenico
>>>>>>>>> <mdidomeni...@gmail.com> wrote:
>>>>>>>>>> Yes, that's true - error messages help.  I was hoping there was some
>>>>>>>>>> documentation that would show what I've done wrong.  I can't easily cut
>>>>>>>>>> and paste errors from my cluster.
>>>>>>>>>>
>>>>>>>>>> Here's a snippet (hand-typed) of the error message, but it does look
>>>>>>>>>> like a rank communications error:
>>>>>>>>>>
>>>>>>>>>> ORTE_ERROR_LOG: A message is attempting to be sent to a process whose
>>>>>>>>>> contact information is unknown in file rml_oob_send.c at line 145.
>>>>>>>>>> *** MPI_INIT failure message (snipped) ***
>>>>>>>>>> orte_grpcomm_modex failed
>>>>>>>>>> --> Returned "A message is attempting to be sent to a process whose
>>>>>>>>>> contact information is unknown" (-117) instead of "Success" (0)
>>>>>>>>>>
>>>>>>>>>> This message repeats for each rank and ultimately hangs the srun,
>>>>>>>>>> which I have to Ctrl-C to terminate.
>>>>>>>>>>
>>>>>>>>>> I have mpiports defined in my SLURM config, and running srun with
>>>>>>>>>> --resv-ports does show the SLURM_RESV_PORTS environment variable
>>>>>>>>>> getting passed to the shell.
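>>>>>>>>>>
>>>>>>>>>> For reference, the relevant pieces look roughly like this (the port
>>>>>>>>>> range is just an example):
>>>>>>>>>>
>>>>>>>>>> # slurm.conf
>>>>>>>>>> MpiParams=ports=12000-12999
>>>>>>>>>>
>>>>>>>>>> $ srun --resv-ports -n 40 ./your_mpi_app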
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Dec 23, 2010 at 8:09 PM, Ralph Castain <r...@open-mpi.org> 
>>>>>>>>>> wrote:
>>>>>>>>>>> I'm not sure there is any documentation yet - not much clamor for 
>>>>>>>>>>> it. :-/
>>>>>>>>>>>
>>>>>>>>>>> It would really help if you included the error message. Otherwise, 
>>>>>>>>>>> all I can do is guess, which wastes both of our time :-(
>>>>>>>>>>>
>>>>>>>>>>> My best guess is that the port reservation didn't get passed down 
>>>>>>>>>>> to the MPI procs properly - but that's just a guess.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Dec 23, 2010, at 12:46 PM, Michael Di Domenico wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Can anyone point me towards the most recent documentation for using
>>>>>>>>>>>> srun and Open MPI?
>>>>>>>>>>>>
>>>>>>>>>>>> I followed what I found on the web, enabling the MpiPorts config
>>>>>>>>>>>> in SLURM and using the --resv-ports switch, but I'm getting an error
>>>>>>>>>>>> from Open MPI during setup.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm using SLURM 2.1.15 and Open MPI 1.5 w/PSM.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm sure I'm missing a step.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
