Hi,
Any users of Open MPI on Blue Waters here?
And then I specifically mean in "native" mode, not inside CCM.
After configuring and building as I do on other Crays, mpirun gives me the
following:
[nid25263:31700] [[23896,0],0] ORTE_ERROR_LOG: Authentication failed in file
../../../../../orte/m
Mark,
How are you building ompi? Also, what happens if you use aprun? I work
with ompi on the NERSC Edison and Hopper daily. Typically I use aprun,
though.
You definitely don't need to use CCM,
and shouldn't.
On Mar 25, 2015 6:00 AM, "Mark Santcroos"
wrote:
> Hi,
>
> Any users of Open MPI on Bl
Hi Howard,
> On 25 Mar 2015, at 14:58 , Howard Pritchard wrote:
> How are you building ompi?
My configure is rather straightforward:
./configure --prefix=$OMPI_PREFIX --disable-getpwuid
Maybe I got spoiled on Hopper/Edison and I need more explicit configuration on
BW ...
> Also what happens
Turn off --disable-getpwuid.
On Mar 25, 2015 8:14 AM, "Mark Santcroos"
wrote:
> Hi Howard,
>
> > On 25 Mar 2015, at 14:58 , Howard Pritchard wrote:
> > How are you building ompi?
>
> My configure is rather straightforward:
> ./configure --prefix=$OMPI_PREFIX --disable-getpwuid
>
> Maybe I got
Hi Nathan,
thank you for the update, it works without problems so far (kernels:
3.19.2, 3.18.9, 3.11(openSUSE 13.1) )
Kernel 3.4 (openSUSE 12.2) needs some changes:
xpmem_attach.c:
VM_DONTDUMP -> VM_RESERVED
xpmem_pfn.c:
+#include
xpmem_misc.c:
+#include
Regards,
Tobias
On 03/18/2015 12:1
Hey Mark
Your original error flag indicates that you are picking up a connection from
some proc built against a different OMPI installation. It’s a very low-level
check that looks for matching version numbers. Not sure who is trying to
connect, but that is the problem.
Check your LD_LIBRARY_PAT
> On 25 Mar 2015, at 15:46 , Howard Pritchard wrote:
> turn off the disable getpwuid.
That doesn't seem to make a difference.
Have there been changes in this area? Last time I checked this a couple of
months ago on Edison I needed this flag not to get spammed.
Yeah, I removed the need for this flag on Cray. As I said in my other note,
this is a red herring - the issue is in the mismatched libraries.
> On Mar 25, 2015, at 8:36 AM, Mark Santcroos
> wrote:
>
>
>> On 25 Mar 2015, at 15:46 , Howard Pritchard wrote:
>> turn off the disable getpwuid.
>
Hi Ralph,
There is no OMPI in system space and PATH and LD_LIBRARY_PATH look good.
Any suggestion on how to get more relevant debugging info above the table?
Thanks
Mark
> On 25 Mar 2015, at 16:33 , Ralph Castain wrote:
>
> Hey Mark
>
> Your original error flag indicates that you are pickin
Hmmm…well, it will generate some output, so keep the system down to two nodes
if you can just to minimize the chatter. Add “-mca oob_base_verbose 100” to
your cmd line
> On Mar 25, 2015, at 8:45 AM, Mark Santcroos
> wrote:
>
> Hi Ralph,
>
> There is no OMPI in system space and PATH and LD_LI
marksant@nid25257:~> /u/sciteam/marksant/openmpi/installation/bin/mpirun -mca
oob_base_verbose 100 ./a.out
[nid25257:09350] mca: base: components_register: registering oob components
[nid25257:09350] mca: base: components_register: found loaded component usock
[nid25257:09350] mca: base: componen
Hmmm…okay, sorry to keep drilling down here, but let’s try adding “-mca
sec_base_verbose 100” now
> On Mar 25, 2015, at 8:51 AM, Mark Santcroos
> wrote:
>
> marksant@nid25257:~> /u/sciteam/marksant/openmpi/installation/bin/mpirun -mca
> oob_base_verbose 100 ./a.out
> [nid25257:09350] mca: ba
> On 25 Mar 2015, at 16:52 , Ralph Castain wrote:
>
> Hmmm…okay, sorry to keep drilling down here, but let’s try adding “-mca
> sec_base_verbose 100” now
> /u/sciteam/marksant/openmpi/installation/bin/mpirun -mca oob_base_verbose 100
> -mca sec_base_verbose 100 ./a.out
[nid25257:09727] mca:
OHO! You have munge running on the head node, but not on the backends!
Okay, all you have to do is set the MCA param “sec” to “basic” in your
environment, or add “-mca sec basic” on your cmd line
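
A minimal sketch of the two options just described, assuming a POSIX shell and Open MPI's convention of reading MCA parameters from environment variables named OMPI_MCA_<param>. The ./a.out binary is the one used earlier in this thread; the mpirun line is left as a comment so the snippet runs without a job allocation.

```shell
# Option 1: set the MCA parameter "sec" to "basic" through the environment.
# Assumption: Open MPI picks up parameters from OMPI_MCA_<param> variables.
export OMPI_MCA_sec=basic
echo "OMPI_MCA_sec=$OMPI_MCA_sec"

# Option 2 (not executed here): pass the same setting on the command line.
# mpirun -mca sec basic ./a.out
```

Either form should make mpirun and the backend daemons agree on one security mechanism; the environment variable only helps if it is also exported to wherever mpirun is launched from.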
> On Mar 25, 2015, at 8:53 AM, Mark Santcroos
> wrote:
>
> nid25257:09727] sec: munge validate_c
> On 25 Mar 2015, at 17:06 , Ralph Castain wrote:
> OHO! You have munge running on the head node, but not on the backends!
I'm all for munching, but what does that mean? ;-)
Is that something actively running or do you mean library available or such?
> Okay, all you have to do is set the MCA pa
Mark,
If you're wanting to use the orte-submit feature, you will need to get
mpirun working.
Could you rerun using the mpirun launch method but with
--mca oob_base_verbose 10 --mca ess_base_verbose 2
set?
Also, you may want to make sure you are using the ipogif0 eth device. This
can be contro
It’s working just fine, Howard - we found the problem.
> On Mar 25, 2015, at 9:12 AM, Howard Pritchard wrote:
>
> Mark,
>
> If you're wanting to use the orte-submit feature, you will need to get mpirun
> working.
>
> Could you rerun using the mpirun launch method but with
>
> --mca oob_base_
Kind of working fine. I don't like users having to add these kinds of
specialized --mca settings
just to get something to work. Sounds like time for yet another
Cray-specific component.
2015-03-25 10:14 GMT-06:00 Ralph Castain :
> It’s working just fine, Howard - we found the problem.
>
> On M
Oh come on, Howard - before you go dumping more components into the system,
let’s explore WHY he hit this problem.
Geez…
> On Mar 25, 2015, at 9:16 AM, Howard Pritchard wrote:
>
> kind of working fine. I don't like users having to add these kind of
> specialized --mca settings
> just to get
> On 25 Mar 2015, at 17:06 , Ralph Castain wrote:
>
> OHO! You have munge running on the head node, but not on the backends!
Ok, so I now know that munge is ... :)
It's running on the MOM node (not on the head node):
daemon 18800 0.0 0.0 118476 3212 ? Sl 01:27 0:00
/usr/sbin/
Yeah, what’s happening is that mpirun is picking one security mechanism for
authenticating connections, but the backend daemons are picking another, and
hence we get the conflict. The weird thing here is that you usually don’t see
this kind of mismatch for the very reason you are hitting - it be
Ok.
FYI:
> aprun munge -n
munge: Error: Unable to access "/var/run/munge/munge.socket.2": No such file or
directory
Application 23792792 exit codes: 6
Application 23792792 resources: utime ~0s, stime ~1s, Rss ~27304, inblocks ~35,
outblocks ~58
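
The error above says the munge socket is absent on the compute node. A rough sketch of how one might check for it directly, assuming a POSIX shell; munge_socket_status is a hypothetical helper, not part of munge or Open MPI, and its default path is simply the one from the error message.

```shell
# Hypothetical helper: report whether a munge socket exists at a given path.
# The default is the path named in the error message above.
munge_socket_status() {
    sock="${1:-/var/run/munge/munge.socket.2}"
    if [ -S "$sock" ]; then
        echo "present"
    else
        echo "missing"
    fi
}

# Check the local node; the result depends on where this is run.
munge_socket_status
```

Running the same check on the MOM node and, through a small wrapper script under aprun, on a compute node would show on which side the munge daemon is available.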
> On 25 Mar 2015, at 17:29 , Ralph Castain wrote
Not surprising - I’m surprised to find munge on the mom’s node anyway given
that you are using Torque.
I have to finish something else first, and it sounds like you aren’t blocked at
the moment. I’ll provide a patch for you to try later, if you’re willing.
> On Mar 25, 2015, at 9:32 AM, Mark Sa
> On 25 Mar 2015, at 17:39 , Ralph Castain wrote:
> Not surprising - I’m surprised to find munge on the mom’s node anyway given
> that you are using Torque.
>
> I have to finish something else first, and it sounds like you aren’t blocked
> at the moment. I’ll provide a patch for you to try lat
Much appreciated! Interesting problem/configuration :-)
> On Mar 25, 2015, at 9:42 AM, Mark Santcroos
> wrote:
>
>
>> On 25 Mar 2015, at 17:39 , Ralph Castain wrote:
>> Not surprising - I’m surprised to find munge on the mom’s node anyway given
>> that you are using Torque.
>>
>> I have to
Received from Ralph Castain on Wed, Mar 04, 2015 at 10:03:06AM EST:
> > On Mar 3, 2015, at 9:41 AM, Lev Givon wrote:
> >
> > Received from Ralph Castain on Sun, Mar 01, 2015 at 10:31:15AM EST:
> >>> On Feb 26, 2015, at 1:19 PM, Lev Givon wrote:
> >>>
> >>> Received from Ralph Castain on Thu, Fe
Thanks for confirming it!!
> On Mar 25, 2015, at 10:57 AM, Lev Givon wrote:
>
> Received from Ralph Castain on Wed, Mar 04, 2015 at 10:03:06AM EST:
>>> On Mar 3, 2015, at 9:41 AM, Lev Givon wrote:
>>>
>>> Received from Ralph Castain on Sun, Mar 01, 2015 at 10:31:15AM EST:
> On Feb 26, 201
I think I have this resolved, though I still suspect there is something
wrong on that system. You shouldn’t have some nodes running munge and others
not running it. I wonder if someone was experimenting and started munge on some
of the nodes, and forgot to turn it off afterwards??
Anyway,
Hi Ralph,
> On 25 Mar 2015, at 21:25 , Ralph Castain wrote:
> I think I have this resolved,
> though I still suspect there is something wrong on that system. You
> shouldn’t have some nodes running munge and others not running it.
For completeness, it's not "some" nodes, it's the MOM (servi
> On Mar 25, 2015, at 1:59 PM, Mark Santcroos
> wrote:
>
> Hi Ralph,
>
>> On 25 Mar 2015, at 21:25 , Ralph Castain wrote:
>> I think I have this resolved,
>> though I still suspect there is something wrong on that system. You
>> shouldn’t have some nodes running munge and others not run
Mark,
munge is an authentication mechanism based
on a secret key shared between hosts.
there are both a daemon part and a library/client part.
in its simplest form, you can run on node0:
echo "hello" | munge | ssh node1 unmunge
(see sample output below)
if everything is correctly set (e.g. sa
I’ve asked Mark to check with the sys admins as to the logic behind their
configuration. I would not immediately presume that they are doing something
wrong or that munge is not needed - could be used for other purposes.
I fully recognize that this change doesn’t resolve all problems, but it wil
Hi,
I just started to use Open MPI and am trying to run an MPI/GPU code. My code
compiles but when I run, I get this error:
The library attempted to open the following supporting CUDA libraries,
but each of them failed. CUDA-aware support is disabled.
/usr/lib/libcuda.so.1: wrong ELF class: ELFCLA