On 10.11.2009 at 23:51, Reuti wrote:
Hi Eloi,
On 10.11.2009 at 23:42, Eloi Gaudry wrote:
I followed your advice and switched to a local "tmpdir" instead of a shared one. This solved the session directory issue, thanks for your help!
What user/group is now listed for the generated temporary directories (i.e. $TMPDIR)?
...is now listed ...
-- Reuti
However, I cannot understand why the issue disappeared. Any input would be welcome, as I'd really like to understand how SGE/OpenMPI could fail when using such a configuration (i.e. with a shared "tmpdir").
Eloi
On 10/11/2009 19:17, Eloi Gaudry wrote:
Reuti,
The ACLs here were just added when I tried to force the /opt/sge/tmp subdirectories to be 777 (which I did when I first encountered the subdirectory-creation error within OpenMPI). I don't think the info I'll provide will be meaningful here:
moe:~# getfacl /opt/sge/tmp
getfacl: Removing leading '/' from absolute path names
# file: opt/sge/tmp
# owner: sgeadmin
# group: fft
user::rwx
group::rwx
mask::rwx
other::rwx
default:user::rwx
default:group::rwx
default:group:fft:rwx
default:mask::rwx
default:other::rwx
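(For reference, default ACLs like the ones above are usually managed with setfacl; a minimal sketch using the path and group from this thread, with invocations that are illustrative rather than taken from Eloi's actual setup:)

moe:~# setfacl -d -m g:fft:rwx /opt/sge/tmp   # set a default ACL granting group fft rwx on new entries
moe:~# setfacl -b /opt/sge/tmp                # strip all ACL entries again, back to plain mode bits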
I'll try to use a local directory instead of a shared one for "tmpdir". But as this issue seems somehow related to permissions, I don't know if this will eventually be the right solution.
Thanks for your help,
Eloi
Reuti wrote:
Hi,
On 10.11.2009 at 19:01, Eloi Gaudry wrote:
Reuti,
I'm using "tmpdir" as a shared directory that contains the
session directories created during job submission, not for
computing or local storage. Doesn't the session directory (i.e.
job_id.queue_name) need to be shared among all computing nodes
(at least the ones that would be used with orted during the
parallel computation)?
No, orted runs happily with a local $TMPDIR on each and every node. The $TMPDIRs are intended to be used for any temporary data of a job; they are created and removed automatically by SGE for every job, for the user's convenience.
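(To illustrate Reuti's point, a minimal job script sketch; the solver name and its arguments are hypothetical:)

#!/bin/sh
#$ -S /bin/sh
#$ -cwd
# $TMPDIR is created by SGE on the node before the job starts
# and removed again when the job ends
cd "$TMPDIR"
my_solver "$SGE_O_WORKDIR"/input.dat   # hypothetical application writing its scratch files here
cp result.dat "$SGE_O_WORKDIR"/        # copy results back before $TMPDIR is removed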
All sequential jobs run fine, as no write operation is performed in "tmpdir/session_directory".
All users are known on the computing nodes and the master node (we use LDAP authentication on all nodes).
As for the access checks:
moe:~# ls -alrtd /opt/sge/tmp
drwxrwxrwx+ 2 sgeadmin fft 4096 2009-11-10 18:28 /opt/sge/tmp
Aha, the + tells us that some ACLs are set:
getfacl /opt/sge/tmp
And for the parallel environment configuration:
moe:~# qconf -sp round_robin
pe_name round_robin
slots 32
user_lists NONE
xuser_lists NONE
start_proc_args /bin/true
stop_proc_args /bin/true
allocation_rule $round_robin
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary FALSE
Okay, fine.
-- Reuti
Thanks for your help,
Eloi
Reuti wrote:
On 10.11.2009 at 18:20, Eloi Gaudry wrote:
Thanks for your help Reuti,
I'm using an NFS-shared directory (/opt/sge/tmp), exported from the master node to all other computing nodes.
It's highly advisable to have the "tmpdir" local on each node. When you use "cd $TMPDIR" in your job script, everything is done locally on the node (when your application just creates its scratch files in the current working directory), which speeds up the computation and decreases network traffic. Computing in a shared /opt/sge/tmp is like computing in each user's home directory.
To prevent any user from removing someone else's files, the "t" flag is set, like for /tmp:
drwxrwxrwt 14 root root 4096 2009-11-10 18:35 /tmp/
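(A shared tmpdir would need the same treatment; a sketch, where the chmod invocation and the resulting listing are shown for illustration only:)

moe:~# chmod 1777 /opt/sge/tmp   # world-writable plus sticky bit, as on /tmp
moe:~# ls -ld /opt/sge/tmp
drwxrwxrwt 2 sgeadmin fft 4096 2009-11-10 18:28 /opt/sge/tmp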
Nevertheless:
/etc/exports on the server (named moe.fft):
/opt/sge 192.168.0.0/255.255.255.0(rw,sync,no_subtree_check)
/etc/fstab on the client:
moe.fft:/opt/sge /opt/sge nfs rw,bg,soft,timeo=14 0 0
Actually, the /opt/sge/tmp directory is 777 across all machines, thus all users should be able to create a directory inside.
All access checks will be applied (each can be verified as in the sketch after this list):
- on the server: what is "ls -d /opt/sge/tmp" showing?
- the one from the export (this seems to be fine)
- the one on the node (i.e., how it's mounted: cat /etc/fstab)
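(A sketch of checking the three layers directly, assuming standard Linux tools on both machines; "node" stands for any execution host:)

moe:~# ls -ld /opt/sge/tmp          # permissions on the server itself
moe:~# exportfs -v                  # options the directory is actually exported with
node:~# grep /opt/sge /etc/fstab    # how the node is told to mount it
node:~# mount | grep /opt/sge       # how it is actually mounted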
The issue seems somehow related to the session directory created inside /opt/sge/tmp, let's say /opt/sge/tmp/29.1.smp8.q for example, for job 29 on queue smp8.q. This subdirectory of /opt/sge/tmp is created with nobody:nogroup drwxr-xr-x permissions... which in turn forbids
Did you try to run some simple jobs before the parallel ones -
are these working? The daemons (qmaster and execd) were
started as root?
The user is known on the file server, i.e. the machine
hosting /opt/sge?
OpenMPI to create its subtree inside (as OpenMPI won't use
nobody:nogroup credentials).
In SGE, the master process (the one running the job script) will create /opt/sge/tmp/29.1.smp8.q, and so will each qrsh started inside SGE - all with the same name. What is the definition of the PE you use in SGE?
-- Reuti
As Ralph suggested, I checked the SGE configuration, but I haven't found anything related to a nobody:nogroup setting so far.
Eloi
Reuti wrote:
Hi,
On 10.11.2009 at 17:55, Eloi Gaudry wrote:
Thanks for your help Ralph, I'll double check that.
As for the error message received, there might be some inconsistency: "/opt/sge/tmp/25.1.smp8.q/openmpi-sessions-eg@charlie_0" is the
Often /opt/sge is shared across the nodes, while the /tmp
(sometimes implemented as /scratch in a partition on its
own) should be local on each node.
What is the setting of "tmpdir" in your queue definition?
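(That setting can be inspected and changed per queue; a sketch using the queue name from this thread, with output that is illustrative only:)

moe:~# qconf -sq smp8.q | grep tmpdir
tmpdir    /opt/sge/tmp
moe:~# qconf -mattr queue tmpdir /tmp smp8.q   # switch the queue to a node-local tmpdir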
If you want to share /opt/sge/tmp, everyone must be able to write into this location. As it's working fine for me (with a local /tmp), I assume the nobody/nogroup comes from a squash setting in the /etc/exports of your master node.
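(Such a squash setting can be checked on the server; a sketch, in which the root_squash/no_root_squash options shown are illustrative and not taken from Eloi's actual /etc/exports:)

moe:~# exportfs -v | grep sge
/opt/sge    192.168.0.0/24(rw,wdelay,root_squash,no_subtree_check)

With root_squash active, directories created by the root-started execd show up as nobody:nogroup on the clients; an export line with no_root_squash would avoid that:

/opt/sge 192.168.0.0/255.255.255.0(rw,sync,no_root_squash,no_subtree_check)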
-- Reuti
parent directory and "/opt/sge/tmp/25.1.smp8.q/openmpi-sessions-eg@charlie_0/53199/0/0" is the subdirectory... not the other way around.
Eloi
Ralph Castain wrote:
Creating a directory with such credentials sounds like a
bug in SGE to me...perhaps an SGE config issue?
The only thing you could do is tell OMPI to use some other directory as the root for its session dir tree - check "mpirun -h" or ompi_info for the required option.
But I would first check your SGE config as that just
doesn't sound right.
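(In Open MPI 1.3.x the knob Ralph refers to is the session-directory base; a sketch, where the MCA parameter name is the standard one but the rest of the command line is illustrative:)

# put the session directory tree under the node-local /tmp instead of $TMPDIR
mpirun --mca orte_tmpdir_base /tmp -np 8 ./a.out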
On Nov 10, 2009, at 9:40 AM, Eloi Gaudry wrote:
Hi there,
I'm experiencing some issues using GE6.2U4 and OpenMPI-1.3.3 (with the gridengine component).
During any job submission, SGE creates a session directory in $TMPDIR, named after the job id and the queue name. This session directory is created with nobody/nogroup credentials.
When using OpenMPI with tight integration, opal creates several subdirectories in this session directory. The issue I'm facing now is that OpenMPI fails to create these subdirectories:
[charlie:03882] opal_os_dirpath_create: Error: Unable to create the sub-directory (/opt/sge/tmp/25.1.smp8.q/openmpi-sessions-eg@charlie_0) of (/opt/sge/tmp/25.1.smp8.q/openmpi-sessions-eg@charlie_0
[charlie:03882] [[53199,0],0] ORTE_ERROR_LOG: Error in file ../../openmpi-1.3.3/orte/util/session_dir.c at line 101
[charlie:03882] [[53199,0],0] ORTE_ERROR_LOG: Error in file ../../openmpi-1.3.3/orte/util/session_dir.c at line 425
[charlie:03882] [[53199,0],0] ORTE_ERROR_LOG: Error in file ../../../../../openmpi-1.3.3/orte/mca/ess/hnp/ess_hnp_module.c at line 273
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
orte_session_dir failed
--> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[charlie:03882] [[53199,0],0] ORTE_ERROR_LOG: Error in file ../../openmpi-1.3.3/orte/runtime/orte_init.c at line 132
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
orte_ess_set_name failed
--> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[charlie:03882] [[53199,0],0] ORTE_ERROR_LOG: Error in file ../../../../openmpi-1.3.3/orte/tools/orterun/orterun.c at line 473
This seems very likely related to the permissions set on
$TMPDIR.
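(A trivial diagnostic job can confirm this; a sketch, assuming it is submitted with a plain qsub:)

#!/bin/sh
#$ -S /bin/sh
# report the job user and the ownership/permissions of the
# session directory SGE created for this job
id
echo "$TMPDIR"
ls -ld "$TMPDIR"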
I'd like to know if someone might have experienced the
same or a similar issue and if any solution was found.
Thanks for your help,
Eloi