Any ideas on what might be causing this one?  Or at least what
additional debug information someone might need?

On Fri, Jan 7, 2011 at 4:03 PM, Michael Di Domenico
<mdidomeni...@gmail.com> wrote:
> I'm still testing the Slurm integration, which seems to work fine so
> far.  However, I just upgraded another cluster to openmpi-1.5 and
> Slurm 2.1.15, but this machine has no InfiniBand.
>
> if I salloc the nodes and mpirun the command, it seems to run and
> complete fine; however, if I srun the command I get:
>
> [btl_tcp_endpoint:486] mca_btl_tcp_endpoint_recv_connect_ack received
> unexpected process identifier
>
> the job exhibits two behaviors:
> running a single process per node, the job runs and does not present
> the error (srun -N40 --ntasks-per-node=1)
> running multiple processes per node, the job spits out the error but
> does not run (srun -n40 --ntasks-per-node=8)
>
> I copied the configs from the other machine, so (I think) everything
> should be configured correctly (but I can't rule out a misconfiguration)
>
> I saw (and reported) a similar error to the one above with the 1.4-dev
> branch and Slurm (see the mailing list), but I can't say whether
> they're related.
>
>
> On Mon, Jan 3, 2011 at 3:00 PM, Jeff Squyres <jsquy...@cisco.com> wrote:
>> Yo Ralph --
>>
>> I see this was committed https://svn.open-mpi.org/trac/ompi/changeset/24197. 
>>  Do you want to add a blurb in README about it, and/or have this executable 
>> compiled as part of the PSM MTL and then installed into $bindir (maybe named 
>> ompi-psm-keygen)?
>>
>> Right now, it's only compiled as part of "make check" and not installed, 
>> right?
>>
>>
>>
>> On Dec 30, 2010, at 5:07 PM, Ralph Castain wrote:
>>
>>> Run the program only once - it can be in the prolog of the job if you like. 
>>> The output value needs to be in the env of every rank.
>>>
>>> You can reuse the value as many times as you like - it doesn't have to be 
>>> unique for each job. There is nothing magic about the value itself.
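>>>
>>> For example, something along these lines in a Slurm task prolog should
>>> work (just a sketch - Slurm's TaskProlog turns any "export NAME=VALUE"
>>> line it prints into an environment variable for each task; the key value
>>> here is the sample one from my earlier message):
>>>
>>> #!/bin/sh
>>> # hypothetical script pointed to by TaskProlog in slurm.conf
>>> # any line printed as "export NAME=VALUE" is added to each task's env
>>> echo "export OMPI_MCA_orte_precondition_transports=0099b3eaa2c1547e-afb287789133a954"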
>>>
>>> On Dec 30, 2010, at 2:11 PM, Michael Di Domenico wrote:
>>>
>>>> How early does this need to run? Can I run it as part of a task
>>>> prolog, or does it need to be in the shell env for each rank?  And does
>>>> it need to run on one node or all the nodes in the job?
>>>>
>>>> On Thu, Dec 30, 2010 at 8:54 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>> Well, I couldn't do it as a patch - it proved too complicated, as the PSM
>>>>> system looks for the value early in the boot procedure.
>>>>>
>>>>> What I can do is give you the attached key generator program. It outputs 
>>>>> the envar required to run your program. So if you run the attached 
>>>>> program and then export the output into your environment, you should be 
>>>>> okay. Looks like this:
>>>>>
>>>>> $ ./psm_keygen
>>>>> OMPI_MCA_orte_precondition_transports=0099b3eaa2c1547e-afb287789133a954
>>>>> $
>>>>>
>>>>> You compile the program with the usual mpicc.
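>>>>>
>>>>> End to end it would look something like this (a sketch - assuming the
>>>>> attachment is saved as psm_keygen.c and ./your_mpi_app stands in for
>>>>> your real binary):
>>>>>
>>>>> $ mpicc -o psm_keygen psm_keygen.c
>>>>> $ export $(./psm_keygen)
>>>>> $ srun -n 40 ./your_mpi_app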
>>>>>
>>>>> Let me know if this solves the problem (or not).
>>>>> Ralph
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Dec 30, 2010, at 11:18 AM, Michael Di Domenico wrote:
>>>>>
>>>>>> Sure, I'll give it a go
>>>>>>
>>>>>> On Thu, Dec 30, 2010 at 5:53 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>> Ah, yes - that is going to be a problem. The PSM key gets generated by 
>>>>>>> mpirun as it is shared info - i.e., every proc has to get the same 
>>>>>>> value.
>>>>>>>
>>>>>>> I can create a patch that will do this for the srun direct-launch 
>>>>>>> scenario, if you want to try it. Would be later today, though.
>>>>>>>
>>>>>>>
>>>>>>> On Dec 30, 2010, at 10:31 AM, Michael Di Domenico wrote:
>>>>>>>
>>>>>>>> Well, maybe not hooray yet.  I might have jumped the gun a bit; it's
>>>>>>>> looking like srun works in general, but perhaps not with PSM.
>>>>>>>>
>>>>>>>> With PSM I get this error (at least now I know what I changed):
>>>>>>>>
>>>>>>>> Error obtaining unique transport key from ORTE
>>>>>>>> (orte_precondition_transports not present in the environment)
>>>>>>>> PML add procs failed
>>>>>>>> --> Returned "Error" (-1) instead of "Success" (0)
>>>>>>>>
>>>>>>>> Turn off PSM and srun works fine
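>>>>>>>>
>>>>>>>> (For reference, one way to turn PSM off for a single run is just the
>>>>>>>> usual MCA environment knob - a sketch, with a placeholder app name:
>>>>>>>>
>>>>>>>> $ export OMPI_MCA_mtl=^psm
>>>>>>>> $ srun -n 40 ./your_mpi_app
>>>>>>>>
>>>>>>>> but however PSM is disabled, the srun itself is fine.)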
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Dec 30, 2010 at 5:13 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>> Hooray!
>>>>>>>>>
>>>>>>>>> On Dec 30, 2010, at 9:57 AM, Michael Di Domenico wrote:
>>>>>>>>>
>>>>>>>>>> I think I take it all back.  I just tried it again and it seems to
>>>>>>>>>> work now.  I'm not sure what I changed between my first message and
>>>>>>>>>> this one, but it does appear to work.
>>>>>>>>>>
>>>>>>>>>> On Thu, Dec 30, 2010 at 4:31 PM, Michael Di Domenico
>>>>>>>>>> <mdidomeni...@gmail.com> wrote:
>>>>>>>>>>> Yes, that's true - error messages help.  I was hoping there was some
>>>>>>>>>>> documentation so I could see what I've done wrong.  I can't easily cut
>>>>>>>>>>> and paste errors from my cluster.
>>>>>>>>>>>
>>>>>>>>>>> Here's a snippet (hand-typed) of the error message; it does look
>>>>>>>>>>> like a rank communications error:
>>>>>>>>>>>
>>>>>>>>>>> ORTE_ERROR_LOG: A message is attempting to be sent to a process whose
>>>>>>>>>>> contact information is unknown in file rml_oob_send.c at line 145.
>>>>>>>>>>> *** MPI_INIT failure message (snipped) ***
>>>>>>>>>>> orte_grpcomm_modex failed
>>>>>>>>>>> --> Returned "A message is attempting to be sent to a process whose
>>>>>>>>>>> contact information is unknown" (-117) instead of "Success" (0)
>>>>>>>>>>>
>>>>>>>>>>> This message repeats for each rank and ultimately hangs the srun,
>>>>>>>>>>> which I have to Ctrl-C to terminate.
>>>>>>>>>>>
>>>>>>>>>>> I have the MPI ports range defined in my slurm config, and running
>>>>>>>>>>> srun with --resv-ports does show the SLURM_RESV_PORTS environment
>>>>>>>>>>> variable getting passed to the shell.
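>>>>>>>>>>>
>>>>>>>>>>> For reference, the setup looks roughly like this (a sketch - the port
>>>>>>>>>>> range is an example, not my real one, and ./your_mpi_app is a
>>>>>>>>>>> placeholder):
>>>>>>>>>>>
>>>>>>>>>>> # slurm.conf
>>>>>>>>>>> MpiParams=ports=12000-12999
>>>>>>>>>>>
>>>>>>>>>>> # SLURM_RESV_PORTS shows up in the task environment
>>>>>>>>>>> $ srun -n 2 --resv-ports env | grep SLURM_RESV_PORTS
>>>>>>>>>>> $ srun -n 40 --resv-ports ./your_mpi_app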
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Dec 23, 2010 at 8:09 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>>>> I'm not sure there is any documentation yet - not much clamor for 
>>>>>>>>>>>> it. :-/
>>>>>>>>>>>>
>>>>>>>>>>>> It would really help if you included the error message. Otherwise, 
>>>>>>>>>>>> all I can do is guess, which wastes both of our time :-(
>>>>>>>>>>>>
>>>>>>>>>>>> My best guess is that the port reservation didn't get passed down 
>>>>>>>>>>>> to the MPI procs properly - but that's just a guess.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Dec 23, 2010, at 12:46 PM, Michael Di Domenico wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Can anyone point me towards the most recent documentation for
>>>>>>>>>>>>> using srun and Open MPI?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I followed what I found on the web about enabling the MpiPorts
>>>>>>>>>>>>> config in Slurm and using the --resv-ports switch, but I'm getting
>>>>>>>>>>>>> an error from Open MPI during setup.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm using Slurm 2.1.15 and Open MPI 1.5 w/PSM.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm sure I'm missing a step.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/