Certainly, I reached out to several contacts I have inside Qlogic (I
used to work there)...
On Fri, Apr 29, 2011 at 10:30 AM, Ralph Castain wrote:
> Hi Michael
>
> I'm told that the Qlogic contacts we used to have are no longer there. Since
> you obviously are a customer, can you ping them and a
Hi Michael
I'm told that the Qlogic contacts we used to have are no longer there. Since
you obviously are a customer, can you ping them and ask (a) what that error
message means, and (b) what's wrong with the values I computed?
You can also just send them my way, if that would help. We just nee
On Apr 29, 2011, at 8:05 AM, Michael Di Domenico wrote:
> On Fri, Apr 29, 2011 at 10:01 AM, Michael Di Domenico
> wrote:
>> On Fri, Apr 29, 2011 at 4:52 AM, Ralph Castain wrote:
>>> Hi Michael
>>>
>>> Please see the attached updated patch to try for 1.5.3. I mistakenly free'd
>>> the envar af
On Fri, Apr 29, 2011 at 10:01 AM, Michael Di Domenico
wrote:
> On Fri, Apr 29, 2011 at 4:52 AM, Ralph Castain wrote:
>> Hi Michael
>>
>> Please see the attached updated patch to try for 1.5.3. I mistakenly free'd
>> the envar after adding it to the environ :-/
>
> The patch works great, i can no
On Fri, Apr 29, 2011 at 4:52 AM, Ralph Castain wrote:
> Hi Michael
>
> Please see the attached updated patch to try for 1.5.3. I mistakenly free'd
> the envar after adding it to the environ :-/
The patch works great; I can now see the precondition environment
variable if I do
mpirun -n 2 -host
Hi Michael
Please see the attached updated patch to try for 1.5.3. I mistakenly free'd the
envar after adding it to the environ :-/
Thanks
Ralph
slurmd.diff
Description: Binary data
On Apr 28, 2011, at 2:31 PM, Michael Di Domenico wrote:
> On Thu, Apr 28, 2011 at 9:03 AM, Ralph Castain wro
On Thu, Apr 28, 2011 at 9:03 AM, Ralph Castain wrote:
>
> On Apr 28, 2011, at 6:49 AM, Michael Di Domenico wrote:
>
>> On Wed, Apr 27, 2011 at 11:47 PM, Ralph Castain wrote:
>>>
>>> On Apr 27, 2011, at 1:06 PM, Michael Di Domenico wrote:
>>>
On Wed, Apr 27, 2011 at 2:46 PM, Ralph Castain wr
Per earlier in the thread, it looks like you are using a 1.5 series release -
so here is a patch that -should- fix the PSM setup problem.
Please let me know if/how it works as I honestly have no way of testing it.
Ralph
slurmd.diff
Description: Binary data
On Apr 28, 2011, at 7:03 AM, Ralph
On Apr 28, 2011, at 6:49 AM, Michael Di Domenico wrote:
> On Wed, Apr 27, 2011 at 11:47 PM, Ralph Castain wrote:
>>
>> On Apr 27, 2011, at 1:06 PM, Michael Di Domenico wrote:
>>
>>> On Wed, Apr 27, 2011 at 2:46 PM, Ralph Castain wrote:
On Apr 27, 2011, at 12:38 PM, Michael Di Domen
On Wed, Apr 27, 2011 at 11:47 PM, Ralph Castain wrote:
>
> On Apr 27, 2011, at 1:06 PM, Michael Di Domenico wrote:
>
>> On Wed, Apr 27, 2011 at 2:46 PM, Ralph Castain wrote:
>>>
>>> On Apr 27, 2011, at 12:38 PM, Michael Di Domenico wrote:
>>>
On Wed, Apr 27, 2011 at 2:25 PM, Ralph Castain w
On Apr 27, 2011, at 1:06 PM, Michael Di Domenico wrote:
> On Wed, Apr 27, 2011 at 2:46 PM, Ralph Castain wrote:
>>
>> On Apr 27, 2011, at 12:38 PM, Michael Di Domenico wrote:
>>
>>> On Wed, Apr 27, 2011 at 2:25 PM, Ralph Castain wrote:
On Apr 27, 2011, at 10:09 AM, Michael Di Domen
On Apr 27, 2011, at 3:39 PM, Ralph Castain wrote:
> Nope, nope nope...in this mode of operation, we are using -static- ports.
Er.. right. Sorry -- my bad for not reading the full context here... ignore
what I said...
--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
ht
On Apr 27, 2011, at 1:27 PM, Jeff Squyres wrote:
> On Apr 27, 2011, at 2:46 PM, Ralph Castain wrote:
>
>> Actually, I understood you correctly. I'm just saying that I find no
>> evidence in the code that we try three times before giving up. What I see is
>> a single attempt to bind the port -
On Apr 27, 2011, at 2:46 PM, Ralph Castain wrote:
> Actually, I understood you correctly. I'm just saying that I find no evidence
> in the code that we try three times before giving up. What I see is a single
> attempt to bind the port - if it fails, then we abort. There is no parameter
> to co
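
For readers following the static-port discussion, here is a minimal sketch of the "single attempt, then abort" behavior Ralph describes. It is not Open MPI's code; the function name bind_static_port and the loopback default are assumptions made for illustration only.

```python
import socket


def bind_static_port(port: int, host: str = "127.0.0.1") -> socket.socket:
    """Bind a TCP socket to one fixed port, trying exactly once.

    Hypothetical helper: there is no retry loop, so a port that is
    already taken surfaces immediately as an OSError to the caller.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
    except OSError:
        # Single attempt only: release the socket and let the error propagate.
        sock.close()
        raise
    return sock
```

A caller that wants the "abort" behavior simply does not catch the OSError; a caller that wanted retries would have to add them itself, which is the point of the exchange above.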
On Wed, Apr 27, 2011 at 2:46 PM, Ralph Castain wrote:
>
> On Apr 27, 2011, at 12:38 PM, Michael Di Domenico wrote:
>
>> On Wed, Apr 27, 2011 at 2:25 PM, Ralph Castain wrote:
>>>
>>> On Apr 27, 2011, at 10:09 AM, Michael Di Domenico wrote:
>>>
Was this ever committed to the OMPI src as someth
On Apr 27, 2011, at 12:38 PM, Michael Di Domenico wrote:
> On Wed, Apr 27, 2011 at 2:25 PM, Ralph Castain wrote:
>>
>> On Apr 27, 2011, at 10:09 AM, Michael Di Domenico wrote:
>>
>>> Was this ever committed to the OMPI src as something not having to be
>>> run outside of OpenMPI, but as part o
On Wed, Apr 27, 2011 at 2:25 PM, Ralph Castain wrote:
>
> On Apr 27, 2011, at 10:09 AM, Michael Di Domenico wrote:
>
>> Was this ever committed to the OMPI src as something not having to be
>> run outside of OpenMPI, but as part of the PSM setup that OpenMPI
>> does?
>
> Not that I know of - I don
On Apr 27, 2011, at 10:09 AM, Michael Di Domenico wrote:
> Was this ever committed to the OMPI src as something not having to be
> run outside of OpenMPI, but as part of the PSM setup that OpenMPI
> does?
Not that I know of - I don't think the PSM developers ever looked at it.
>
> I'm having s
Was this ever committed to the OMPI src as something not having to be
run outside of OpenMPI, but as part of the PSM setup that OpenMPI
does?
I'm having some trouble getting Slurm/OpenMPI to play nice with the
setup of this key. Namely, with slurm you cannot export variables
from the --prolog of
Yes, I am setting the config correctly. Our IB machines seem to run
just fine so far using srun and openmpi v1.5.
As another data point, we enabled mpi-threads in Openmpi and that also
seems to trigger the Srun/TCP behavior, but on the IB fabric. Running
the program within an salloc rather the st
We are seeing a similar problem with our InfiniBand machines. After some
investigation I discovered that we were not setting our slurm environment
correctly (ref:
https://computing.llnl.gov/linux/slurm/mpi_guide.html#open_mpi). Are you
setting the ports in your slurm.conf and executing srun
Thanks. We're only seeing it on machines with Ethernet only as the
interconnect. Fortunately for us, that only equates to one small
machine, but it's still annoying. Unfortunately, I don't have enough
knowledge to dive into the code to help fix it, but I can certainly help
test
On Mon, Jan 24, 2011
I am seeing similar issues on our slurm clusters. We are looking into the issue.
-Nathan
HPC-3, LANL
On Tue, 11 Jan 2011, Michael Di Domenico wrote:
Any ideas on what might be causing this one? Or at least what
additional debug information someone might need?
On Fri, Jan 7, 2011 at 4:03 PM, M
Any ideas on what might be causing this one? Or at least what
additional debug information someone might need?
On Fri, Jan 7, 2011 at 4:03 PM, Michael Di Domenico
wrote:
> I'm still testing the slurm integration, which seems to work fine so
> far. However, i just upgraded another cluster to open
I'm still testing the slurm integration, which seems to work fine so
far. However, I just upgraded another cluster to openmpi-1.5 and
slurm 2.1.15, but this machine has no InfiniBand.
If I salloc the nodes and mpirun the command, it seems to run and complete fine.
However, if I srun the command I get
Yo Ralph --
I see this was committed https://svn.open-mpi.org/trac/ompi/changeset/24197.
Do you want to add a blurb in README about it, and/or have this executable
compiled as part of the PSM MTL and then installed into $bindir (maybe named
ompi-psm-keygen)?
Right now, it's only compiled as
Run the program only once - it can be in the prolog of the job if you like. The
output value needs to be in the env of every rank.
You can reuse the value as many times as you like - it doesn't have to be
unique for each job. There is nothing magic about the value itself.
On Dec 30, 2010, at 2:
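
As an illustration of the point above (generate the value once, then place the same value in the environment of every rank), here is a minimal sketch. It is not the key generator attached to the thread. The full variable name OMPI_MCA_orte_precondition_transports and the two-field hex format are assumptions; the thread itself only mentions "orte_precondition_transports". The helper names are hypothetical.

```python
import os
import secrets

# Assumed name: the thread only shows "orte_precondition_transports";
# the OMPI_MCA_ prefix is assumed from the usual MCA environment convention.
KEY_ENVAR = "OMPI_MCA_orte_precondition_transports"


def make_transport_key() -> str:
    """Return a random key as two 16-digit hex fields joined by '-'.

    The format is an assumption for illustration; per the thread there is
    nothing magic about the value, it only has to match across ranks.
    """
    high = secrets.randbits(64)
    low = secrets.randbits(64)
    return f"{high:016x}-{low:016x}"


def env_with_key(key: str, base_env=None) -> dict:
    """Copy an environment and add the shared key, leaving the input untouched."""
    env = dict(os.environ if base_env is None else base_env)
    env[KEY_ENVAR] = key
    return env


if __name__ == "__main__":
    # Generate once and print a line a job prolog could export for every rank.
    print(f"export {KEY_ENVAR}={make_transport_key()}")
```

Because the key is generated once and only copied afterwards, every process launched with the returned environment sees an identical value, which is the property the error message in this thread is complaining about when the variable is missing.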
How early does this need to run? Can I run it as part of a task
prolog, or does it need to be the shell env for each rank? And does
it need to run on one node or all the nodes in the job?
On Thu, Dec 30, 2010 at 8:54 PM, Ralph Castain wrote:
> Well, I couldn't do it as a patch - proved too compl
Should have also warned you: you'll need to configure OMPI --with-devel-headers
to get this program to build/run.
On Dec 30, 2010, at 1:54 PM, Ralph Castain wrote:
> Well, I couldn't do it as a patch - proved too complicated as the psm system
> looks for the value early in the boot procedure.
Well, I couldn't do it as a patch - proved too complicated as the psm system
looks for the value early in the boot procedure.
What I can do is give you the attached key generator program. It outputs the
envar required to run your program. So if you run the attached program and then
export the o
Sure, I'll give it a go
On Thu, Dec 30, 2010 at 5:53 PM, Ralph Castain wrote:
> Ah, yes - that is going to be a problem. The PSM key gets generated by mpirun
> as it is shared info - i.e., every proc has to get the same value.
>
> I can create a patch that will do this for the srun direct-launch
Ah, yes - that is going to be a problem. The PSM key gets generated by mpirun
as it is shared info - i.e., every proc has to get the same value.
I can create a patch that will do this for the srun direct-launch scenario, if
you want to try it. Would be later today, though.
On Dec 30, 2010, at
Well, maybe not hooray, yet. I might have jumped the gun a bit; it's
looking like srun works in general, but perhaps not with PSM.
With PSM I get this error (at least now I know what I changed):
Error obtaining unique transport key from ORTE
(orte_precondition_transports not present in the environ
Hooray!
On Dec 30, 2010, at 9:57 AM, Michael Di Domenico wrote:
> I think I take it all back. I just tried it again and it seems to
> work now. I'm not sure what I changed (between my first and this
> msg), but it does appear to work now.
>
> On Thu, Dec 30, 2010 at 4:31 PM, Michael Di Domenic
I think I take it all back. I just tried it again and it seems to
work now. I'm not sure what I changed (between my first and this
msg), but it does appear to work now.
On Thu, Dec 30, 2010 at 4:31 PM, Michael Di Domenico
wrote:
> Yes that's true, error messages help. I was hoping there was so
Yes, that's true, error messages help. I was hoping there was some
documentation to see what I've done wrong. I can't easily cut and
paste errors from my cluster.
Here's a snippet (hand typed) of the error message, but it does look
like a rank communications error
ORTE_ERROR_LOG: A message is at
I'm not sure there is any documentation yet - not much clamor for it. :-/
It would really help if you included the error message. Otherwise, all I can do
is guess, which wastes both of our time :-(
My best guess is that the port reservation didn't get passed down to the MPI
procs properly - but