FWIW, most LDAP installations I have seen have ended up doing the
same thing -- if you have a large enough cluster, you have MPI jobs
starting all the time, and rate control of a single job startup is
not sufficient to avoid overloading your LDAP server.
The solutions that I have seen typically have a job fired once a day
via cron that dumps the relevant information from LDAP into local
/etc/passwd, /etc/shadow, and /etc/group files; the cluster then
simply uses those files for authentication.
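For example, a rough sketch of such a nightly job (the use of getent,
the temporary file names, and the merge step are all assumptions -
adapt to your own setup):

$ getent passwd > /tmp/passwd.ldap
$ getent group > /tmp/group.ldap
$ getent shadow > /tmp/shadow.ldap

followed by merging those dumps into the local /etc/passwd, /etc/group
and /etc/shadow on each node (e.g. pushing the merged files out with
rsync or scp), rather than overwriting the real files blindly.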
Hope that helps.
On Mar 18, 2007, at 8:34 PM, David Bronke wrote:
That's great to hear! For now we'll just create local users for those
who need access to MPI on this system, but I'll keep an eye on the
list for when you do get a chance to finish that fix. Thanks again!
On 3/18/07, Ralph Castain <r...@lanl.gov> wrote:
Excellent! Yes, we use pipe in several places, including in the
run-time during various stages of launch, so that could be a problem.
Also, be aware that other users have reported problems on LDAP-based
systems when attempting to launch large jobs. The problem is that the
OpenMPI launch system has no rate control in it - the LDAP slapd
servers get overwhelmed when we ssh to a large number of nodes at
launch.
I promised another user to concoct a fix for this problem, but am
taking a break from the project for a few months, so it may be a
little while before a fix is available. When I do get it done, it may
or may not make it into an OpenMPI release for some time - I'm not
sure how they will decide to schedule the change (is it a "bug", or a
new "feature"?). So I may do an interim release as a patch on the
OpenRTE site (since that is the run-time underneath OpenMPI). I'll let
people know via this mailing list either way.
Ralph
On 3/18/07 2:06 PM, "David Bronke" <whitel...@gmail.com> wrote:
I just received an email from a friend who is helping me work on
resolving this; he was able to trace the problem back to a pipe() call
in OpenMPI, apparently:
The problem is with the pipe() system call (which is invoked by
MPI_Send() as far as I can tell) when run by an LDAP-authenticated
user. I'm still working out where exactly that goes wrong, but the
fact is that it isn't actually a permissions problem - the reason it
works as root is because root is a local user and goes through normal
/etc/passwd authentication.
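As a quick sanity check, pipe() itself can be exercised directly as
the LDAP-authenticated user with a tiny standalone program (a
hypothetical test, not part of Open MPI):

  #include <stdio.h>
  #include <string.h>
  #include <errno.h>
  #include <unistd.h>

  int main(void)
  {
      int fds[2];
      char buf[6] = {0};

      /* create a pipe and push a few bytes through it */
      if (pipe(fds) != 0) {
          fprintf(stderr, "pipe() failed: %s\n", strerror(errno));
          return 1;
      }
      if (write(fds[1], "hello", 5) != 5 || read(fds[0], buf, 5) != 5) {
          fprintf(stderr, "pipe read/write failed: %s\n", strerror(errno));
          return 1;
      }
      printf("pipe() works: read back \"%s\"\n", buf);
      close(fds[0]);
      close(fds[1]);
      return 0;
  }

If that succeeds for the LDAP user, the failure is more likely in how
the launcher's pipes are being used (e.g. a SIGPIPE when the other end
dies) than in pipe() itself.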
I had forgotten to mention that we use LDAP for authentication on this
machine; PAM and NSS are set up to use it, but I'm guessing that
either OpenMPI itself or the pipe() system call won't check with them
when needed... We have made some local users on the machine to get
things going, but I'll probably have to find an LDAP mailing list to
get this issue resolved.
Thanks for all the help so far!
On 3/16/07, Ralph Castain <r...@lanl.gov> wrote:
I'm afraid I have zero knowledge or experience with gentoo portage, so
I can't help you there. I always install our releases from the tarball
source as it is pretty trivial to do and avoids any issues.
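For reference, building from the tarball is just the usual autotools
sequence - the install prefix below is only an example:

$ ./configure --prefix=$HOME/openmpi
$ make all install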
I will have to defer to someone who knows that system to help you from
here. It sounds like an installation or configuration issue.
Ralph
On 3/16/07 3:15 PM, "David Bronke" <whitel...@gmail.com> wrote:
On 3/15/07, Ralph Castain <r...@lanl.gov> wrote:
Hmmm...well, a few thoughts to hopefully help with the debugging. One
initial comment, though - 1.1.2 is quite old. You might want to
upgrade to 1.2 (releasing momentarily - you can use the last release
candidate in the interim as it is identical).
Version 1.2 doesn't seem to be in gentoo portage yet, so I may end up
having to compile from source... I generally prefer to do everything
from portage if possible, because it makes upgrades and maintenance
much cleaner.
Meantime, looking at this output, there appear to be a couple of
common possibilities. First, I don't see any of the diagnostic output
from after we do a local fork (we do this prior to actually launching
the daemon). Is it possible your system doesn't allow you to fork
processes (some don't, though it's unusual)?
I don't see any problems with forking on this system... I'm able to
start a dbus daemon as a regular user without any problems.
Second, it could be that the "orted" program isn't being found in your
path. People often forget that the path in shells started up by
programs isn't necessarily the same as that in their login shell. You
might try executing a simple shellscript that outputs the results of
"which orted" to verify this is correct.
'which orted' from a shell script gives me '/usr/bin/orted', which
seems to be correct.
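The non-interactive case is the one that actually matters for the
launch, so it is worth running the same check over ssh as well (the
node name here is just a placeholder):

$ ssh othernode 'which orted'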
BTW, I should have asked as well: what are you running this on, and
how did you configure openmpi?
I'm running this on two identical machines, each with two dual-core
hyperthreaded Xeon (EM64T) processors. I simply installed OpenMPI
using portage, with the USE flags "debug fortran pbs -threads" (I've
also tried it with "-debug fortran pbs threads").
Ralph
On 3/15/07 5:33 PM, "David Bronke" <whitel...@gmail.com> wrote:
I'm using OpenMPI version 1.1.2. I installed it using gentoo portage,
so I think it has the right permissions... I tried doing 'equery f
openmpi | xargs ls -dl' and inspecting the permissions of each file,
and I don't see much out of the ordinary; it is all owned by
root:root, but every file has read permission for user, group, and
other (and execute for each as well when appropriate). From the debug
output, I can tell that mpirun is creating the session tree in /tmp,
and it does seem to be working fine... Here's the output when using
--debug-daemons:
$ mpirun -aborted 8 -v -d --debug-daemons -np 8
/workspace/bronke/mpi/hello
[trixie:25228] [0,0,0] setting up session dir with
[trixie:25228] universe default-universe
[trixie:25228] user bronke
[trixie:25228] host trixie
[trixie:25228] jobid 0
[trixie:25228] procid 0
[trixie:25228] procdir:
/tmp/openmpi-sessions-bronke@trixie_0/default-universe/0/0
[trixie:25228] jobdir:
/tmp/openmpi-sessions-bronke@trixie_0/default-universe/0
[trixie:25228] unidir:
/tmp/openmpi-sessions-bronke@trixie_0/default-universe
[trixie:25228] top: openmpi-sessions-bronke@trixie_0
[trixie:25228] tmp: /tmp
[trixie:25228] [0,0,0] contact_file /tmp/openmpi-sessions-bronke@trixie_0/default-universe/universe-setup.txt
[trixie:25228] [0,0,0] wrote setup file
[trixie:25228] pls:rsh: local csh: 0, local bash: 1
[trixie:25228] pls:rsh: assuming same remote shell as local
shell
[trixie:25228] pls:rsh: remote csh: 0, remote bash: 1
[trixie:25228] pls:rsh: final template argv:
[trixie:25228] pls:rsh: /usr/bin/ssh <template> orted --debug --debug-daemons --bootproxy 1 --name <template> --num_procs 2 --vpid_start 0 --nodename <template> --universe bronke@trixie:default-universe --nsreplica "0.0.0;tcp://141.238.31.33:43838" --gprreplica "0.0.0;tcp://141.238.31.33:43838" --mpi-call-yield 0
[trixie:25228] sess_dir_finalize: proc session dir not empty - leaving
[trixie:25228] spawn: in job_state_callback(jobid = 1, state = 0x100)
mpirun noticed that job rank 0 with PID 0 on node "localhost" exited on signal 13.
[trixie:25228] sess_dir_finalize: proc session dir not empty - leaving
[trixie:25228] sess_dir_finalize: proc session dir not empty - leaving
[trixie:25228] sess_dir_finalize: proc session dir not empty - leaving
[trixie:25228] sess_dir_finalize: proc session dir not empty - leaving
[trixie:25228] sess_dir_finalize: proc session dir not empty - leaving
[trixie:25228] sess_dir_finalize: proc session dir not empty - leaving
[trixie:25228] sess_dir_finalize: proc session dir not empty - leaving
[trixie:25228] spawn: in job_state_callback(jobid = 1, state = 0x80)
mpirun noticed that job rank 0 with PID 0 on node "localhost" exited on signal 13.
mpirun noticed that job rank 1 with PID 0 on node "localhost" exited on signal 13.
mpirun noticed that job rank 2 with PID 0 on node "localhost" exited on signal 13.
mpirun noticed that job rank 3 with PID 0 on node "localhost" exited on signal 13.
mpirun noticed that job rank 4 with PID 0 on node "localhost" exited on signal 13.
mpirun noticed that job rank 5 with PID 0 on node "localhost" exited on signal 13.
mpirun noticed that job rank 6 with PID 0 on node "localhost" exited on signal 13.
[trixie:25228] ERROR: A daemon on node localhost failed to start as expected.
[trixie:25228] ERROR: There may be more information available from
[trixie:25228] ERROR: the remote shell (see above).
[trixie:25228] The daemon received a signal 13.
1 additional process aborted (not shown)
[trixie:25228] sess_dir_finalize: found proc session dir empty - deleting
[trixie:25228] sess_dir_finalize: found job session dir empty - deleting
[trixie:25228] sess_dir_finalize: found univ session dir empty - deleting
[trixie:25228] sess_dir_finalize: found top session dir empty - deleting
On 3/15/07, Ralph H Castain <r...@lanl.gov> wrote:
It isn't a /dev issue. The problem is likely that the system lacks
sufficient permissions to either:
1. create the Open MPI session directory tree. We create a hierarchy
of subdirectories for temporary storage used for things like your
shared memory file - the location of the head of that tree can be
specified at run time, but it has a series of built-in defaults it can
search if you don't specify it (we look at your environment variables
- e.g., TMP or TMPDIR - as well as the typical Linux/Unix places). You
might check to see what your tmp directory is, and that you have write
permission into it. Alternatively, you can specify your own location
(where you know you have permissions!) by setting --tmpdir your-dir on
the mpirun command line (see the example commands after item 2 below).
2. execute or access the various binaries and/or libraries. This is
usually caused when someone installs OpenMPI as root, and then tries
to execute as a non-root user. Best thing here is to either run
through the installation directory and add the correct permissions
(assuming it is a system-level install), or reinstall as the non-root
user (if the install is solely for you anyway).
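For example, something along these lines (the alternate tmpdir path
below is just an illustration):

$ ls -ld /tmp                            # is your tmp directory writable?
$ mpirun --tmpdir /home/bronke/tmp -np 8 /workspace/bronke/mpi/hello
$ ls -l /usr/bin/mpirun /usr/bin/orted   # are the binaries readable and executable?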
You can also set --debug-daemons on the mpirun command line to get
more diagnostic output from the daemons and then send that along.
BTW: if possible, it helps us to advise you if we know which version
of OpenMPI you are using. ;-)
Hope that helps.
Ralph
On 3/15/07 1:51 PM, "David Bronke" <whitel...@gmail.com> wrote:
Ok, now that I've figured out what the signal means, I'm wondering
exactly what is running into permission problems... the program I'm
running doesn't use any functions except printf, sprintf, and MPI_*...
I was thinking that possibly changes to permissions on certain /dev
entries in newer distros might cause this, but I'm not even sure what
/dev entries would be used by MPI.
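For context, the program in question is presumably something like the
classic "greetings" example, which would match the successful output
quoted further down in the thread - this is only a sketch of what
/workspace/bronke/mpi/hello might look like, not the actual source:

  /* sketch of a minimal MPI "greetings" program; only printf, sprintf
     and MPI_* calls, as described above */
  #include <stdio.h>
  #include <string.h>
  #include <mpi.h>

  int main(int argc, char *argv[])
  {
      int my_rank, p, source;
      char message[100];
      MPI_Status status;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
      MPI_Comm_size(MPI_COMM_WORLD, &p);
      printf("p is %d, my_rank is %d\n", p, my_rank);

      if (my_rank != 0) {
          /* every non-root rank sends one greeting to rank 0 */
          sprintf(message, "Greetings from process %d!", my_rank);
          MPI_Send(message, strlen(message) + 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
      } else {
          /* rank 0 collects and prints the greetings */
          for (source = 1; source < p; source++) {
              MPI_Recv(message, 100, MPI_CHAR, source, 0,
                       MPI_COMM_WORLD, &status);
              printf("%s\n", message);
          }
      }
      MPI_Finalize();
      return 0;
  }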
On 3/15/07, McCalla, Mac <macmcca...@hess.com> wrote:
Hi,
If the perror command is available on your system, it will tell you
what message is associated with the signal value. On my system
(RHEL4 U3), it is "permission denied".
HTH,
mac mccalla
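A small aside for anyone following along: 13 as an errno is EACCES
("Permission denied"), which is presumably what perror was reporting,
while 13 as a signal number is SIGPIPE. With bash, the signal number
can be mapped to its name directly:

$ kill -l 13
PIPE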
-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
On Behalf Of David Bronke
Sent: Thursday, March 15, 2007 12:25 PM
To: us...@open-mpi.org
Subject: [OMPI users] Signal 13
I've been trying to get OpenMPI working on two of the computers at a
lab I help administer, and I'm running into a rather large issue. When
running anything using mpirun as a normal user, I get the following
output:
$ mpirun --no-daemonize --host localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost /workspace/bronke/mpi/hello
mpirun noticed that job rank 0 with PID 0 on node "localhost" exited on signal 13.
[trixie:18104] ERROR: A daemon on node localhost failed to start as expected.
[trixie:18104] ERROR: There may be more information available from
[trixie:18104] ERROR: the remote shell (see above).
[trixie:18104] The daemon received a signal 13.
8 additional processes aborted (not shown)
However, running the same exact command line as root works fine:
$ sudo mpirun --no-daemonize --host localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost /workspace/bronke/mpi/hello
Password:
p is 8, my_rank is 0
p is 8, my_rank is 1
p is 8, my_rank is 2
p is 8, my_rank is 3
p is 8, my_rank is 6
p is 8, my_rank is 7
Greetings from process 1!
Greetings from process 2!
Greetings from process 3!
p is 8, my_rank is 5
p is 8, my_rank is 4
Greetings from process 4!
Greetings from process 5!
Greetings from process 6!
Greetings from process 7!
I've looked up signal 13, and have found that it is apparently
SIGPIPE; I also found a thread on the LAM-MPI site:
http://www.lam-mpi.org/MailArchives/lam/2004/08/8486.php
However, this thread seems to indicate that the problem would be in
the application (/workspace/bronke/mpi/hello in this case), but there
are no pipes in use in this app, and the fact that it works as
expected as root doesn't seem to fit either. I have tried running
mpirun with --verbose and it doesn't show any more output than without
it, so I've run into a sort of dead-end on this issue. Does anyone
know of any way I can figure out what's going wrong or how I can fix
it?
Thanks!
--
David H. Bronke
Lead Programmer
G33X Nexus Entertainment
http://games.g33xnexus.com/precursors/
v3sw5/7Hhw5/6ln4pr6Ock3ma7u7+8Lw3/7Tm3l6+7Gi2e4t4Mb7Hen5g8+9ORPa22s6MSr7p6 hackerkey.com
Support Web Standards! http://www.webstandards.org/
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
--
Jeff Squyres
Cisco Systems