On 10 Nov 2009, at 18:20, Eloi Gaudry wrote:
Thanks for your help Reuti,
I'm using an NFS-shared directory (/opt/sge/tmp), exported from the master node to all other computing nodes.
It's highly advisable to have the "tmpdir" local on each node. When you use "cd $TMPDIR" in your job script, everything is done locally on the node (when your application just creates its scratch files in the current working directory), which will speed up the computation and decrease the network traffic. Computing in a shared /opt/sge/tmp is like computing in each user's home directory.
To prevent any user from removing someone else's files, the "t" (sticky) flag is set, as for /tmp:

drwxrwxrwt 14 root root 4096 2009-11-10 18:35 /tmp/
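The effect of the sticky bit can be reproduced with a small shell sketch (using a throwaway directory rather than the real /tmp):

```shell
# Create a world-writable scratch directory with the sticky bit set,
# mirroring the 1777 mode of /tmp: anyone may create files inside,
# but only a file's owner (or root) may remove them.
dir=$(mktemp -d)
chmod 1777 "$dir"
ls -ld "$dir"   # mode column reads drwxrwxrwt - the trailing 't' is the sticky bit
rmdir "$dir"
```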
Nevertheless, with /etc/exports on the server (named moe.fft):

/opt/sge 192.168.0.0/255.255.255.0(rw,sync,no_subtree_check)

and /etc/fstab on the clients:

moe.fft:/opt/sge /opt/sge nfs rw,bg,soft,timeo=14 0 0

the /opt/sge/tmp directory is actually 777 across all machines, thus all users should be able to create a directory inside.
All access checks will be applied:
- the one on the server: what does "ls -ld /opt/sge/tmp" show?
- the one from the export (this seems to be fine)
- the one on the node (i.e., how it's mounted: cat /etc/fstab)
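These three checks could be scripted roughly as follows (the first two on the file server, the last two on a compute node; /opt/sge is the shared path from this thread):

```shell
# On the file server: ownership and mode of the shared tmpdir
ls -ld /opt/sge/tmp 2>/dev/null || echo "no /opt/sge/tmp on this host"

# On the file server: the export options in effect (squash settings matter here)
grep '^/opt/sge' /etc/exports 2>/dev/null

# On a compute node: how the directory is declared and actually mounted
grep '/opt/sge' /etc/fstab 2>/dev/null
mount | grep '/opt/sge'
```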
The issue seems somehow related to the session directory created inside /opt/sge/tmp, let's say /opt/sge/tmp/29.1.smp8.q, for example for job 29 on queue smp8.q. This subdirectory of /opt/sge/tmp is created with nobody:nogroup drwxr-xr-x permissions... which in turn forbids OpenMPI to create its subtree inside (as OpenMPI won't use nobody:nogroup credentials).

Did you try to run some simple jobs before the parallel ones - are these working? Were the daemons (qmaster and execd) started as root? Is the user known on the file server, i.e. the machine hosting /opt/sge?

In SGE the master process (the one running the job script) will create /opt/sge/tmp/29.1.smp8.q, and so will each qrsh started inside SGE - all with the same name. What is your definition of the PE in SGE which you use?
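For comparison, a tight-integration PE for Open MPI (output of a hypothetical "qconf -sp orte"; the PE name and slot count here are assumptions) typically has control_slaves TRUE and no custom start/stop procedures:

```
pe_name            orte
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
```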
-- Reuti
As Ralph suggested, I checked the SGE configuration, but I haven't found anything related to a nobody:nogroup setting so far.
Eloi
Reuti wrote:
Hi,
On 10 Nov 2009, at 17:55, Eloi Gaudry wrote:
Thanks for your help Ralph, I'll double check that.
As for the error message received, there might be some inconsistency: "/opt/sge/tmp/25.1.smp8.q/openmpi-sessions-eg@charlie_0" is the parent directory and "/opt/sge/tmp/25.1.smp8.q/openmpi-sessions-eg@charlie_0/53199/0/0" is the subdirectory... not the other way around.
Eloi

Often /opt/sge is shared across the nodes, while /tmp (sometimes implemented as /scratch in a partition of its own) should be local on each node. What is the setting of "tmpdir" in your queue definition?
If you want to share /opt/sge/tmp, everyone must be able to write into this location. As it's working fine for me (with a local /tmp), I assume the nobody/nogroup comes from a squash setting in the /etc/exports of your master node.
-- Reuti
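The squash behaviour Reuti refers to is controlled in /etc/exports. A sketch of the relevant option variants (the export line itself is the one from this thread):

```
# root_squash (the default): root on a client is mapped to nobody:nogroup on the server
/opt/sge 192.168.0.0/255.255.255.0(rw,sync,no_subtree_check,root_squash)

# all_squash: *every* remote user is mapped to nobody:nogroup
/opt/sge 192.168.0.0/255.255.255.0(rw,sync,no_subtree_check,all_squash)

# no_root_squash: root keeps its identity on the server (use with care)
/opt/sge 192.168.0.0/255.255.255.0(rw,sync,no_subtree_check,no_root_squash)
```

Since sge_execd runs as root, a session directory it creates on a root_squash export would show up exactly as nobody:nogroup.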
Ralph Castain wrote:
Creating a directory with such credentials sounds like a bug
in SGE to me...perhaps an SGE config issue?
Only thing you could do is tell OMPI to use some other
directory as the root for its session dir tree - check
"mpirun -h", or ompi_info for the required option.
But I would first check your SGE config as that just doesn't
sound right.
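In Open MPI 1.3 the knob Ralph means is the orte_tmpdir_base MCA parameter, exposed on the mpirun command line as --tmpdir (verify the exact spelling with "mpirun -h" or "ompi_info --param orte all"). A sketch, with a hypothetical scratch path and application:

```
# Relocate Open MPI's session directory tree away from SGE's $TMPDIR
mpirun --tmpdir /scratch/$USER -np 8 ./my_app

# Equivalent, via the MCA parameter
mpirun --mca orte_tmpdir_base /scratch/$USER -np 8 ./my_app
```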
On Nov 10, 2009, at 9:40 AM, Eloi Gaudry wrote:
Hi there,
I'm experiencing some issues using GE6.2U4 and OpenMPI-1.3.3 (with the gridengine component).
During any job submission, SGE creates a session directory
in $TMPDIR, named after the job id and the computing node
name. This session directory is created using nobody/nogroup
credentials.
When using OpenMPI with tight-integration, opal creates
different subdirectories in this session directory. The
issue I'm facing now is that OpenMPI fails to create these
subdirectories:
[charlie:03882] opal_os_dirpath_create: Error: Unable to create the sub-directory (/opt/sge/tmp/25.1.smp8.q/openmpi-sessions-eg@charlie_0) of (/opt/sge/tmp/25.1.smp8.q/openmpi-sessions-eg@charlie_0
[charlie:03882] [[53199,0],0] ORTE_ERROR_LOG: Error in file ../../openmpi-1.3.3/orte/util/session_dir.c at line 101
[charlie:03882] [[53199,0],0] ORTE_ERROR_LOG: Error in file ../../openmpi-1.3.3/orte/util/session_dir.c at line 425
[charlie:03882] [[53199,0],0] ORTE_ERROR_LOG: Error in file ../../../../../openmpi-1.3.3/orte/mca/ess/hnp/ess_hnp_module.c at line 273
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during orte_init; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):

  orte_session_dir failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[charlie:03882] [[53199,0],0] ORTE_ERROR_LOG: Error in file ../../openmpi-1.3.3/orte/runtime/orte_init.c at line 132
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during orte_init; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):

  orte_ess_set_name failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[charlie:03882] [[53199,0],0] ORTE_ERROR_LOG: Error in file ../../../../openmpi-1.3.3/orte/tools/orterun/orterun.c at line 473
This seems very likely related to the permissions set on
$TMPDIR.
I'd like to know if someone might have experienced the same
or a similar issue and if any solution was found.
Thanks for your help,
Eloi