Thanks for your help Reuti,

I'm using an NFS-shared directory (/opt/sge/tmp), exported from the master node to all the other compute nodes.

/etc/exports on the server (named moe.fft):
    /opt/sge 192.168.0.0/255.255.255.0(rw,sync,no_subtree_check)

/etc/fstab on the clients:
    moe.fft:/opt/sge /opt/sge nfs rw,bg,soft,timeo=14, 0 0

The /opt/sge/tmp directory is 777 on all machines, so every user should be able to create a directory inside it.

The issue seems related to the session directory created inside /opt/sge/tmp, let's say /opt/sge/tmp/29.1.smp8.q for example, for job 29 on queue smp8.q. This subdirectory of /opt/sge/tmp is created as nobody:nogroup with drwxr-xr-x permissions, which in turn prevents OpenMPI from creating its subtree inside it (as OpenMPI won't use nobody:nogroup credentials).
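
To confirm this from inside a job, a minimal Python sketch along these lines can report who owns the session directory and whether the job's user can create entries in it. The helper name describe_dir is hypothetical, and the script assumes Python 3 is available on the compute node; it only reads metadata and changes nothing.

```python
import os
import stat


def describe_dir(path):
    """Return (uid, gid, mode string, writable) for a directory.

    'writable' is True when the calling user may create entries inside
    the directory, i.e. has both write and search permission on it.
    """
    st = os.stat(path)
    writable = os.access(path, os.W_OK | os.X_OK)
    return st.st_uid, st.st_gid, stat.filemode(st.st_mode), writable


# Example call from a job script (path taken from the job's environment):
#   print(describe_dir(os.environ["TMPDIR"]))
```

A directory owned by nobody:nogroup with mode drwxr-xr-x would be reported as not writable for any ordinary user, which matches the failure seen here.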

As Ralph suggested, I checked the SGE configuration, but I haven't found anything related to nobody:nogroup so far.

Eloi


Reuti wrote:
Hi,

Am 10.11.2009 um 17:55 schrieb Eloi Gaudry:

Thanks for your help Ralph, I'll double check that.

As for the error message received, there might be some inconsistency: "/opt/sge/tmp/25.1.smp8.q/openmpi-sessions-eg@charlie_0" is the

Often /opt/sge is shared across the nodes, while /tmp (sometimes implemented as /scratch on a partition of its own) should be local to each node.

What is the setting of "tmpdir" in your queue definition?

If you want to share /opt/sge/tmp, everyone must be able to write into this location. For me it's working fine (with the local /tmp), so I assume the nobody/nogroup comes from a squash setting in the /etc/exports of your master node.
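
As a rough illustration of the squash point, the sketch below reads one /etc/exports line and reports which squash mode would apply to each client. It assumes Linux NFS server semantics, where root_squash is the default when neither no_root_squash nor all_squash is given; the function name export_squash_mode is hypothetical. Under root_squash, root on a client is mapped to the anonymous user (typically nobody:nogroup), so a directory created by a root-owned daemon on a compute node would end up with that ownership. Whether the SGE execution daemon creates the job directory as root in this setup is an assumption, not something confirmed in this thread.

```python
import re


def export_squash_mode(export_line):
    """Return (path, {client: squash mode}) for a single /etc/exports line.

    Assumes Linux exports semantics: root_squash applies unless the
    option list contains no_root_squash or all_squash.
    """
    path, *clients = export_line.split()
    modes = {}
    for client in clients:
        # A client entry looks like host(opt1,opt2,...); the options are optional.
        m = re.fullmatch(r"([^()]*)(?:\(([^()]*)\))?", client)
        if m is None:
            continue  # skip entries this simple sketch cannot parse
        opts = (m.group(2) or "").split(",")
        if "all_squash" in opts:
            mode = "all_squash"
        elif "no_root_squash" in opts:
            mode = "no_root_squash"
        else:
            mode = "root_squash"  # default when nothing is specified
        modes[m.group(1)] = mode
    return path, modes


# The export line quoted in this thread:
line = "/opt/sge 192.168.0.0/255.255.255.0(rw,sync,no_subtree_check)"
print(export_squash_mode(line))
```

For an export line with only rw,sync,no_subtree_check, this reports root_squash, since no squash option is given explicitly.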

-- Reuti


parent directory and "/opt/sge/tmp/25.1.smp8.q/openmpi-sessions-eg@charlie_0/53199/0/0" is the subdirectory... not the other way around.

Eloi



Ralph Castain wrote:
Creating a directory with such credentials sounds like a bug in SGE to me...perhaps an SGE config issue?

Only thing you could do is tell OMPI to use some other directory as the root for its session dir tree - check "mpirun -h", or ompi_info for the required option.

But I would first check your SGE config as that just doesn't sound right.

On Nov 10, 2009, at 9:40 AM, Eloi Gaudry wrote:

Hi there,

I'm experiencing some issues using GE6.2U4 and OpenMPI-1.3.3 (with gridengine component).

During any job submission, SGE creates a session directory in $TMPDIR, named after the job id and the computing node name. This session directory is created using nobody/nogroup credentials.

When using OpenMPI with tight-integration, opal creates different subdirectories in this session directory. The issue I'm facing now is that OpenMPI fails to create these subdirectories:

[charlie:03882] opal_os_dirpath_create: Error: Unable to create the sub-directory (/opt/sge/tmp/25.1.smp8.q/openmpi-sessions-eg@charlie_0) of (/opt/sge/tmp/25.1.smp8.q/openmpi-sessions-eg@charlie_0
[charlie:03882] [[53199,0],0] ORTE_ERROR_LOG: Error in file ../../openmpi-1.3.3/orte/util/session_dir.c at line 101
[charlie:03882] [[53199,0],0] ORTE_ERROR_LOG: Error in file ../../openmpi-1.3.3/orte/util/session_dir.c at line 425
[charlie:03882] [[53199,0],0] ORTE_ERROR_LOG: Error in file ../../../../../openmpi-1.3.3/orte/mca/ess/hnp/ess_hnp_module.c at line 273
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

orte_session_dir failed
--> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[charlie:03882] [[53199,0],0] ORTE_ERROR_LOG: Error in file ../../openmpi-1.3.3/orte/runtime/orte_init.c at line 132
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

orte_ess_set_name failed
--> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[charlie:03882] [[53199,0],0] ORTE_ERROR_LOG: Error in file ../../../../openmpi-1.3.3/orte/tools/orterun/orterun.c at line 473

This seems very likely related to the permissions set on $TMPDIR.

I'd like to know if someone might have experienced the same or a similar issue and if any solution was found.

Thanks for your help,
Eloi




--


Eloi Gaudry

Free Field Technologies
Axis Park Louvain-la-Neuve
Rue Emile Francqui, 1
B-1435 Mont-Saint Guibert
BELGIUM

Company Phone: +32 10 487 959
Company Fax:   +32 10 454 626

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
