Mark,
thanks for the link.
i tried to read between the lines, and "found" that in the case of
torque+munge,
munge might be required only on admin nodes and submission hosts (which
could be restricted
to login nodes on most systems)
on the other hand, slurm does require munge on compute nodes, ev
> On 26 Mar 2015, at 16:01 , Ralph Castain wrote:
>
>>
>> On Mar 26, 2015, at 1:33 AM, Mark Santcroos
>> wrote:
>>
>> Hi guys,
>>
>> Thanks for the follow-up.
>>
>> It appears that you are ruling out that Munge is required because the system
>> runs TORQUE, but as far as I can see Munge i
> On Mar 25, 2015, at 9:38 PM, Gilles Gouaillardet
> wrote:
>
> On 2015/03/26 13:00, Ralph Castain wrote:
>> Well, I did some digging around, and this PR looks like the right solution.
> ok then :-)
>
> following stuff is not directly related to ompi, but you might want to
> comment on that an
> On Mar 26, 2015, at 1:33 AM, Mark Santcroos
> wrote:
>
> Hi guys,
>
> Thanks for the follow-up.
>
> It appears that you are ruling out that Munge is required because the system
> runs TORQUE, but as far as I can see Munge is/can be used by both SLURM and
> TORQUE.
> (http://docs.adaptivec
Hi Ralph,
> On 25 Mar 2015, at 21:59 , Mark Santcroos wrote:
>> Anyway, see if this fixes the problem.
>>
>> https://github.com/open-mpi/ompi/pull/497
Can confirm the fallback works now without setting explicitly to basic (with
the merged changes).
Thanks!
Mark
Hi guys,
Thanks for the follow-up.
It appears that you are ruling out that Munge is required because the system
runs TORQUE, but as far as I can see Munge is/can be used by both SLURM and
TORQUE.
(http://docs.adaptivecomputing.com/torque/4-0-2/Content/topics/1-installConfig/serverConfig.htm#usi
On 2015/03/26 13:00, Ralph Castain wrote:
> Well, I did some digging around, and this PR looks like the right solution.
ok then :-)
following stuff is not directly related to ompi, but you might want to
comment on that anyway ...
> Second, the running of munge on the IO nodes is not only okay but
Well, I did some digging around, and this PR looks like the right solution.
First, the security issue is fine so long as we use the highest level of
security that is available. If someone configures the system with munge, then
we default to it - if not, we use the next highest one available.
Se
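The "highest level of security that is available" rule described above can be sketched as a simple preference walk. This is only an illustration: the `PREFERENCE` list and `pick_mechanism` are hypothetical names, and the order is assumed from the thread (munge first, then basic), not taken from the Open MPI source.

```python
# Hypothetical preference order, strongest first. The names mirror the
# "munge" and "basic" values mentioned in the thread, not Open MPI internals.
PREFERENCE = ["munge", "basic"]


def pick_mechanism(available, preference=PREFERENCE):
    """Return the first mechanism in `preference` that is in `available`."""
    for name in preference:
        if name in available:
            return name
    raise RuntimeError("no usable security mechanism")


print(pick_mechanism({"munge", "basic"}))
print(pick_mechanism({"basic"}))
```

With munge configured the walk stops at the first entry; without it, the next entry in the list is used.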
I’ve asked Mark to check with the sys admins as to the logic behind their
configuration. I would not immediately presume that they are doing something
wrong or that munge is not needed - could be used for other purposes.
I fully recognize that this change doesn’t resolve all problems, but it wil
Mark,
munge is an authentication mechanism based
on a secret key shared between hosts.
there are both a daemon part and a library/client part.
in its simplest form, you can run on node0:
echo "hello" | munge | ssh node1 unmunge
(see sample output below)
if everything is correctly set (e.g. sa
> On Mar 25, 2015, at 1:59 PM, Mark Santcroos
> wrote:
>
> Hi Ralph,
>
>> On 25 Mar 2015, at 21:25 , Ralph Castain wrote:
>> I think I have this resolved,
>> though I still suspect there is something wrong on that system. You
>> shouldn’t have some nodes running munge and others not run
Hi Ralph,
> On 25 Mar 2015, at 21:25 , Ralph Castain wrote:
> I think I have this resolved,
> though I still suspect there is something wrong on that system. You
> shouldn’t have some nodes running munge and others not running it.
For completeness, it's not "some" nodes, it's the MOM (servi
I think I have this resolved, though I still suspect there is something
wrong on that system. You shouldn’t have some nodes running munge and others
not running it. I wonder if someone was experimenting and started munge on some
of the nodes, and forgot to turn it off afterwards??
Anyway,
Much appreciated! Interesting problem/configuration :-)
> On Mar 25, 2015, at 9:42 AM, Mark Santcroos
> wrote:
>
>
>> On 25 Mar 2015, at 17:39 , Ralph Castain wrote:
>> Not surprising - I’m surprised to find munge on the mom’s node anyway given
>> that you are using Torque.
>>
>> I have to
> On 25 Mar 2015, at 17:39 , Ralph Castain wrote:
> Not surprising - I’m surprised to find munge on the mom’s node anyway given
> that you are using Torque.
>
> I have to finish something else first, and it sounds like you aren’t blocked
> at the moment. I’ll provide a patch for you to try lat
Not surprising - I’m surprised to find munge on the mom’s node anyway given
that you are using Torque.
I have to finish something else first, and it sounds like you aren’t blocked at
the moment. I’ll provide a patch for you to try later, if you’re willing.
> On Mar 25, 2015, at 9:32 AM, Mark Sa
Ok.
FYI:
> aprun munge -n
munge: Error: Unable to access "/var/run/munge/munge.socket.2": No such file or
directory
Application 23792792 exit codes: 6
Application 23792792 resources: utime ~0s, stime ~1s, Rss ~27304, inblocks ~35,
outblocks ~58
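The error above says the munge client could not find the daemon's socket on the compute node. A rough way to check the same thing without invoking munge is sketched below. It assumes the socket path from the error message (other installations may use a different one), and `munge_socket_present` is a made-up helper name.

```python
import os
import stat

# Path taken from the error message above; other installations may differ.
DEFAULT_SOCKET = "/var/run/munge/munge.socket.2"


def munge_socket_present(path=DEFAULT_SOCKET):
    """Return True if `path` exists and is a Unix domain socket."""
    try:
        return stat.S_ISSOCK(os.stat(path).st_mode)
    except OSError:
        # Missing file, missing directory, or no permission to stat it.
        return False


print(munge_socket_present())
```

A socket file only shows that a daemon created it at some point, so this is a hint and not proof that munge is actually answering.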
> On 25 Mar 2015, at 17:29 , Ralph Castain wrote
Yeah, what’s happening is that mpirun is picking one security mechanism for
authenticating connections, but the backend daemons are picking another, and
hence we get the conflict. The weird thing here is that you usually don’t see
this kind of mismatch for the very reason you are hitting - it be
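To make the mismatch concrete: if every node picks its own strongest mechanism, a node with munge and a node without it end up with different answers. The sketch below shows that, plus one possible way to agree on a shared mechanism. The node names, the `PREFERENCE` order and all function names are assumptions for illustration; this is not how Open MPI resolves it.

```python
PREFERENCE = ["munge", "basic"]  # hypothetical order, strongest first


def choose(available):
    """Pick the strongest mechanism out of the given set."""
    return next(m for m in PREFERENCE if m in available)


def independent_choices(nodes):
    """What each node would pick if it only looked at itself."""
    return {node: choose(avail) for node, avail in nodes.items()}


def common_choice(nodes):
    """Pick the strongest mechanism that every node can use."""
    shared = set.intersection(*map(set, nodes.values()))
    return choose(shared)


# Assumed layout: munge on the MOM node only, as reported in the thread.
nodes = {"mom": {"munge", "basic"}, "compute": {"basic"}}
print(independent_choices(nodes))
print(common_choice(nodes))
```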
> On 25 Mar 2015, at 17:06 , Ralph Castain wrote:
>
> OHO! You have munge running on the head node, but not on the backends!
Ok, so I now know that munge is ... :)
It's running on the MOM node (not on the head node):
daemon 18800 0.0 0.0 118476 3212 ? Sl 01:27 0:00
/usr/sbin/
Oh come on, Howard - before you go dumping more components into the system,
let’s explore WHY he hit this problem.
Geez…
> On Mar 25, 2015, at 9:16 AM, Howard Pritchard wrote:
>
> kind of working fine. I don't like users having to add these kinds of
> specialized --mca settings
> just to get
kind of working fine. I don't like users having to add these kinds of
specialized --mca settings
just to get something to work. Sounds like time for yet another Cray-specific
component.
2015-03-25 10:14 GMT-06:00 Ralph Castain :
> It’s working just fine, Howard - we found the problem.
>
> On M
It’s working just fine, Howard - we found the problem.
> On Mar 25, 2015, at 9:12 AM, Howard Pritchard wrote:
>
> Mark,
>
> If you're wanting to use the orte-submit feature, you will need to get mpirun
> working.
>
> Could you rerun using the mpirun launch method but with
>
> --mca oob_base_
Mark,
If you're wanting to use the orte-submit feature, you will need to get
mpirun working.
Could you rerun using the mpirun launch method but with
--mca oob_base_verbose 10 --mca ess_base_verbose 2
set?
Also, you may want to make sure you are using the ipogif0 eth device. This
can be contro
> On 25 Mar 2015, at 17:06 , Ralph Castain wrote:
> OHO! You have munge running on the head node, but not on the backends!
I'm all for munching, but what does that mean? ;-)
Is that something actively running or do you mean library available or such?
> Okay, all you have to do is set the MCA pa
OHO! You have munge running on the head node, but not on the backends!
Okay, all you have to do is set the MCA param “sec” to “basic” in your
environment, or add “-mca sec basic” on your cmd line
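Both forms of the workaround can be set up from a launcher script, sketched here. It relies on Open MPI reading MCA parameters from environment variables named `OMPI_MCA_<param>`; the helper names `with_sec_basic` and `mpirun_cmd` are made up, and the sketch only builds the environment and the argument list without launching anything.

```python
import os


def with_sec_basic(base_env=None):
    """Return a copy of the environment with the `sec` MCA param set to basic."""
    env = dict(os.environ if base_env is None else base_env)
    # Open MPI picks up MCA parameters from variables named OMPI_MCA_<param>.
    env["OMPI_MCA_sec"] = "basic"
    return env


def mpirun_cmd(program, mpirun="mpirun"):
    """Build the equivalent command line using the -mca flag."""
    return [mpirun, "-mca", "sec", "basic", program]


print(with_sec_basic({}))
print(mpirun_cmd("./a.out"))
```

Either result can be handed to `subprocess.run`; one of the two is enough, there is no need to set both.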
> On Mar 25, 2015, at 8:53 AM, Mark Santcroos
> wrote:
>
> [nid25257:09727] sec: munge validate_c
> On 25 Mar 2015, at 16:52 , Ralph Castain wrote:
>
> Hmmm…okay, sorry to keep drilling down here, but let’s try adding “-mca
> sec_base_verbose 100” now
> /u/sciteam/marksant/openmpi/installation/bin/mpirun -mca oob_base_verbose 100
> -mca sec_base_verbose 100 ./a.out
[nid25257:09727] mca:
Hmmm…okay, sorry to keep drilling down here, but let’s try adding “-mca
sec_base_verbose 100” now
> On Mar 25, 2015, at 8:51 AM, Mark Santcroos
> wrote:
>
> marksant@nid25257:~> /u/sciteam/marksant/openmpi/installation/bin/mpirun -mca
> oob_base_verbose 100 ./a.out
> [nid25257:09350] mca: ba
marksant@nid25257:~> /u/sciteam/marksant/openmpi/installation/bin/mpirun -mca
oob_base_verbose 100 ./a.out
[nid25257:09350] mca: base: components_register: registering oob components
[nid25257:09350] mca: base: components_register: found loaded component usock
[nid25257:09350] mca: base: componen
Hmmm…well, it will generate some output, so keep the system down to two nodes
if you can just to minimize the chatter. Add “-mca oob_base_verbose 100” to
your cmd line
> On Mar 25, 2015, at 8:45 AM, Mark Santcroos
> wrote:
>
> Hi Ralph,
>
> There is no OMPI in system space and PATH and LD_LI
Hi Ralph,
There is no OMPI in system space and PATH and LD_LIBRARY_PATH look good.
Any suggestion on how to get more relevant debugging info above the table?
Thanks
Mark
> On 25 Mar 2015, at 16:33 , Ralph Castain wrote:
>
> Hey Mark
>
> Your original error flag indicates that you are pickin
Yeah, I removed the need for this flag on Cray. As I said in my other note,
this is a red herring - the issue is in the mismatched libraries.
> On Mar 25, 2015, at 8:36 AM, Mark Santcroos
> wrote:
>
>
>> On 25 Mar 2015, at 15:46 , Howard Pritchard wrote:
>> turn off the disable getpwuid.
>
> On 25 Mar 2015, at 15:46 , Howard Pritchard wrote:
> turn off the disable getpwuid.
That doesn't seem to make a difference.
Have there been changes in this area? Last time I checked this a couple of
months ago on Edison I needed this flag not to get spammed.
Hey Mark
Your original error flag indicates that you are picking up a connection from
some proc built against a different OMPI installation. It’s a very low-level
check that looks for matching version numbers. Not sure who is trying to
connect, but that is the problem.
Check your LD_LIBRARY_PAT
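One quick way to look for a second OMPI installation is to list every `mpirun` on PATH and every `libmpi.so*` on LD_LIBRARY_PATH. The helper below is a sketch with a made-up name, `find_ompi_candidates`. It only reports what is visible in the current environment, so it will not see what a remote daemon picks up.

```python
import glob
import os


def find_ompi_candidates(environ=None):
    """List mpirun binaries on PATH and libmpi libraries on LD_LIBRARY_PATH."""
    environ = os.environ if environ is None else environ
    hits = []
    for d in filter(None, environ.get("PATH", "").split(os.pathsep)):
        candidate = os.path.join(d, "mpirun")
        if os.path.isfile(candidate):
            hits.append(candidate)
    for d in filter(None, environ.get("LD_LIBRARY_PATH", "").split(os.pathsep)):
        # glob.escape keeps special characters in the directory name literal.
        pattern = os.path.join(glob.escape(d), "libmpi.so*")
        hits.extend(sorted(glob.glob(pattern)))
    return hits


for path in find_ompi_candidates():
    print(path)
```

More than one distinct installation prefix in the output would be worth a closer look.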
turn off the disable getpwuid.
On Mar 25, 2015 8:14 AM, "Mark Santcroos"
wrote:
> Hi Howard,
>
> > On 25 Mar 2015, at 14:58 , Howard Pritchard wrote:
> > How are you building ompi?
>
> My configure is rather straightforward:
> ./configure --prefix=$OMPI_PREFIX --disable-getpwuid
>
> Maybe I got
Hi Howard,
> On 25 Mar 2015, at 14:58 , Howard Pritchard wrote:
> How are you building ompi?
My configure is rather straightforward:
./configure --prefix=$OMPI_PREFIX --disable-getpwuid
Maybe I got spoiled on Hopper/Edison and I need more explicit configuration on
BW ...
> Also what happens
Mark,
How are you building ompi? Also, what happens if you use aprun? I work
with ompi on the NERSC Edison and Hopper daily. Typically I use aprun
though.
You definitely don't need to use ccm,
and shouldn't.
On Mar 25, 2015 6:00 AM, "Mark Santcroos"
wrote:
> Hi,
>
> Any users of Open MPI on Bl