Ah, yes - that is going to be a problem. The PSM key gets generated by mpirun 
as it is shared info - i.e., every proc has to get the same value.
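
In the meantime, a possible manual workaround for the srun direct-launch case is to generate the shared key yourself and export it before launching. This is only a sketch: the MCA environment variable name and the two-hex-words-joined-by-a-dash key format are assumptions inferred from the error text, so verify them against your Open MPI build.

```shell
# Sketch (assumptions noted above): create the shared PSM transport key
# ourselves, since srun direct launch bypasses mpirun, which normally
# generates it, and export it so every rank sees the same value.
key=$(od -An -N16 -tx8 /dev/urandom | awk '{print $1 "-" $2}')
export OMPI_MCA_orte_precondition_transports="$key"
echo "shared transport key: $OMPI_MCA_orte_precondition_transports"
```

Then launch as usual (e.g. `srun -n 4 ./my_mpi_app`) so the exported value reaches every rank's environment.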

I can create a patch that will do this for the srun direct-launch scenario, if 
you want to try it. Would be later today, though.


On Dec 30, 2010, at 10:31 AM, Michael Di Domenico wrote:

> Well, maybe not hooray yet.  I might have jumped the gun a bit; it's
> looking like srun works in general, but perhaps not with PSM.
> 
> With PSM I get this error (at least now I know what I changed):
> 
> Error obtaining unique transport key from ORTE
> (orte_precondition_transports not present in the environment)
> PML add procs failed
> --> Returned "Error" (-1) instead of "Success" (0)
> 
> Turn off PSM and srun works fine
> 
> 
> On Thu, Dec 30, 2010 at 5:13 PM, Ralph Castain <r...@open-mpi.org> wrote:
>> Hooray!
>> 
>> On Dec 30, 2010, at 9:57 AM, Michael Di Domenico wrote:
>> 
>>> I think I take it all back.  I just tried it again and it seems to
>>> work now.  I'm not sure what I changed (between my first message and
>>> this one), but it does appear to work now.
>>> 
>>> On Thu, Dec 30, 2010 at 4:31 PM, Michael Di Domenico
>>> <mdidomeni...@gmail.com> wrote:
>>>> Yes, that's true; error messages help.  I was hoping there was some
>>>> documentation to see what I've done wrong.  I can't easily cut and
>>>> paste errors from my cluster.
>>>> 
>>>> Here's a hand-typed snippet of the error message, but it does look
>>>> like a rank communications error:
>>>> 
>>>> ORTE_ERROR_LOG: A message is attempting to be sent to a process whose
>>>> contact information is unknown in file rml_oob_send.c at line 145.
>>>> *** MPI_INIT failure message (snipped) ***
>>>> orte_grpcomm_modex failed
>>>> --> Returned "A message is attempting to be sent to a process whose
>>>> contact information is unknown" (-117) instead of "Success" (0)
>>>> 
>>>> This msg repeats for each rank and ultimately hangs the srun, which
>>>> I have to Ctrl-C to terminate.
>>>> 
>>>> I have MpiPorts defined in my slurm config, and running srun with
>>>> --resv-ports does show the SLURM_RESV_PORTS environment variable
>>>> getting passed to the shell.
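
(For reference, the port-reservation setup being described might look like the sketch below; the port range and launch line are illustrative assumptions, not values taken from this thread.)

```shell
# Illustrative sketch of the SLURM port-reservation setup (assumed values):
#
# In slurm.conf, reserve a port range for MPI use:
#   MpiParams=ports=12000-12999
#
# Then launch with the reservation flag and inspect what each task sees:
srun --resv-ports -n 2 sh -c 'echo "reserved ports: $SLURM_RESV_PORTS"'
```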
>>>> 
>>>> 
>>>> On Thu, Dec 23, 2010 at 8:09 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>> I'm not sure there is any documentation yet - not much clamor for it. :-/
>>>>> 
>>>>> It would really help if you included the error message. Otherwise, all I 
>>>>> can do is guess, which wastes both of our time :-(
>>>>> 
>>>>> My best guess is that the port reservation didn't get passed down to the 
>>>>> MPI procs properly - but that's just a guess.
>>>>> 
>>>>> 
>>>>> On Dec 23, 2010, at 12:46 PM, Michael Di Domenico wrote:
>>>>> 
>>>>>> Can anyone point me towards the most recent documentation for using
>>>>>> srun with Open MPI?
>>>>>> 
>>>>>> I followed what i found on the web with enabling the MpiPorts config
>>>>>> in slurm and using the --resv-ports switch, but I'm getting an error
>>>>>> from openmpi during setup.
>>>>>> 
>>>>>> I'm using Slurm 2.1.15 and Open MPI 1.5 w/PSM.
>>>>>> 
>>>>>> I'm sure I'm missing a step.
>>>>>> 
>>>>>> Thanks
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> us...@open-mpi.org
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>> 
>>> 
>> 
>> 
> 

