Re: [OMPI users] openmpi 1.7.4rc1 and f08 interface

2014-02-03 Thread Åke Sandgren

On 02/01/2014 03:12 PM, Jeff Squyres (jsquyres) wrote:

I think that ompi_funloc_variant1 needs an IMPORT statement to have access to 
the callback_variant1 definition before using it to declare "fn", i.e.:

  function ompi_funloc_variant1(fn)
    use, intrinsic :: iso_c_binding, only: c_funptr
    import
    procedure(callback_variant1) :: fn


Reading the spec here at work, it is clear that the IMPORT statement is needed.
You could probably do IMPORT :: callback_variant1 if you want to import as 
little as possible.


--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se


Re: [OMPI users] Use of __float128 with openmpi

2014-02-03 Thread Patrick Boehl
Hello George,

thank you very much!

Everything seems to work now! :)

Best,
Patrick


On 02.02.2014, at 14:15, George Bosilca wrote:

> Just go for the most trivial:
> 
> MPI_Type_contiguous(sizeof(__float128), MPI_BYTE, &my__float128);
> 
> A little bit more info about the optional quad-precision floating-point 
> format is available on Wikipedia 
> (https://en.wikipedia.org/wiki/Double-double_%28arithmetic%29#Double-double_arithmetic).
> 
>  George.
> 
> 
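A minimal sketch of the suggestion above, for completeness; the handle name 
my__float128 is only illustrative, and the type must also be committed before 
it can be used in communication:

-
#include <mpi.h>

MPI_Datatype my__float128;   /* illustrative handle name */

void register_float128_type(void)
{
    /* Describe __float128 to MPI as an opaque run of sizeof(__float128) bytes. */
    MPI_Type_contiguous(sizeof(__float128), MPI_BYTE, &my__float128);
    MPI_Type_commit(&my__float128);
}
-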
> On Feb 2, 2014, at 13:41 , Patrick Boehl 
>  wrote:
> 
>> Hello Jeff,
>> 
>> thank you a lot for your reply!
>> 
>> On 01.02.2014, at 23:07, Jeff Hammond wrote:
>> 
>>> See Section 5.9.5 of MPI-3 or the section named "User-Defined
>>> Reduction Operations" but presumably numbered differently in older
>>> copies of the MPI standard.
>>> 
>>> An older but still relevant online reference is
>>> http://www.mpi-forum.org/docs/mpi-2.2/mpi22-report/node107.htm
>>> 
>> 
>> In this example they construct this "datatype"
>> 
>> -
>> typedef struct {
>>     double real, imag;
>> } Complex;
>> -
>> 
>> and later
>> 
>> -
>> MPI_Datatype ctype;
>> /* explain to MPI how type Complex is defined
>> */
>> MPI_Type_contiguous(2, MPI_DOUBLE, &ctype);
>> -
>> 
>> Do I understand correctly that I have to find out how __float128 is 
>> constructed internally and convert it to a form which is compatible with 
>> the standard MPI datatypes, in an analogous way to what they do in the 
>> example?  Up to now, I have only found out that __float128 should somehow 
>> be the sum of two doubles.
>> 
>> Again, I am grateful for any help!
>> 
>> Best regards,
>> Patrick
>> 
>> 
>> 
>> 
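Putting the two replies together: __float128 does not need to be decomposed at 
all.  The contiguous-over-MPI_BYTE type only tells MPI how many bytes to move, 
so a reduction such as MPI_Allreduce additionally needs a user-defined 
operation (the MPI-3 Section 5.9.5 that Jeff Hammond points to above), because 
the predefined MPI_SUM is not defined for such a derived datatype.  A rough, 
self-contained sketch, assuming GCC with libquadmath on x86_64; all names are 
illustrative:

-
/* Build (illustrative): mpicc float128_sum.c -lquadmath */
#include <stdio.h>
#include <mpi.h>
#include <quadmath.h>   /* quadmath_snprintf, for printing __float128 */

/* User-defined reduction: element-wise sum carried out in quad precision. */
static void sum_float128(void *in, void *inout, int *len, MPI_Datatype *dtype)
{
    __float128 *a = (__float128 *)in;
    __float128 *b = (__float128 *)inout;
    int i;
    for (i = 0; i < *len; i++)
        b[i] += a[i];
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Datatype my__float128;
    MPI_Op float128_sum;
    __float128 local, global;
    char buf[64];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Describe __float128 to MPI as opaque bytes, as suggested above. */
    MPI_Type_contiguous(sizeof(__float128), MPI_BYTE, &my__float128);
    MPI_Type_commit(&my__float128);

    /* Register a commutative user-defined reduction operation. */
    MPI_Op_create(sum_float128, 1, &float128_sum);

    local = (__float128)(rank + 1);
    MPI_Allreduce(&local, &global, 1, my__float128, float128_sum, MPI_COMM_WORLD);

    if (rank == 0) {
        quadmath_snprintf(buf, sizeof buf, "%.33Qg", global);
        printf("sum over %d ranks = %s\n", size, buf);  /* size*(size+1)/2 */
    }

    MPI_Op_free(&float128_sum);
    MPI_Type_free(&my__float128);
    MPI_Finalize();
    return 0;
}
-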
>>> On Sat, Feb 1, 2014 at 2:28 PM, Tim Prince  wrote:
 
 On 02/01/2014 12:42 PM, Patrick Boehl wrote:
> 
> Hi all,
> 
> I have a question on datatypes in openmpi:
> 
> Is there an (easy?) way to use __float128 variables with openmpi?
> 
> Specifically, functions like
> 
> MPI_Allreduce
> 
> seem to give weird results with __float128.
> 
> Essentially all I found was
> 
> http://beige.ucs.indiana.edu/I590/node100.html
> 
> where they state
> 
> MPI_LONG_DOUBLE
> This is a quadruple precision, 128-bit long floating point number.
> 
> 
> But as far as I have seen, MPI_LONG_DOUBLE is only used for long doubles.
> 
> The Open MPI version is 1.6.3 and gcc is 4.7.3 on an x86_64 machine.
> 
It seems unlikely that 10-year-old course notes on an unspecified MPI 
implementation (hinted to be IBM POWER3) would deal with the specific details 
of Open MPI on a different architecture.
Where Open MPI refers to "portable C types", I would take long double to be 
the 80-bit hardware format you would have in a standard build of gcc for 
x86_64.  You should be able to gain some insight by examining your Open MPI 
build logs to see if it builds for both __float80 and __float128 (or 
neither).  gfortran has a 128-bit data type (software floating-point 
real(16), corresponding to __float128); you should be able to see in the 
build logs whether that data type was used.
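A quick way to confirm the difference between the two formats Tim describes 
(a small check, assuming GCC with libquadmath; not part of the original 
exchange):

-
#include <stdio.h>
#include <float.h>
#include <quadmath.h>   /* provides FLT128_MANT_DIG */

int main(void)
{
    /* On x86_64, long double is the 80-bit x87 format (64-bit mantissa,
       stored in 16 bytes), while __float128 is IEEE binary128 with a
       113-bit mantissa, implemented in software. */
    printf("long double: %zu bytes, %d mantissa bits\n",
           sizeof(long double), LDBL_MANT_DIG);
    printf("__float128 : %zu bytes, %d mantissa bits\n",
           sizeof(__float128), FLT128_MANT_DIG);
    return 0;
}
-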
 
 
 ___
 users mailing list
 us...@open-mpi.org
 http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> 
>>> 
>>> 
>>> -- 
>>> Jeff Hammond
>>> jeff.scie...@gmail.com
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



[OMPI users] opal_os_dirpath_create: Error: Unable to create the sub-directory

2014-02-03 Thread Eric Chamberland

Hi,

With Open MPI 1.6.3, I have encountered this error, which appears "randomly":

[compile:20089] opal_os_dirpath_create: Error: Unable to create the 
sub-directory (/tmp/openmpi-sessions-cmpbib@compile_0/55528/0) of 
(/tmp/openmpi-sessions-cmpbib@compile_0/55528/0/0), mkdir failed [1]
[compile:20089] [[55528,0],0] ORTE_ERROR_LOG: Error in file 
util/session_dir.c at line 106


(view full stderr attached)

and also this nearly identical one:

[compile:22876] opal_os_dirpath_create: Error: Unable to create the 
sub-directory (/tmp/openmpi-sessions-cmpbib@compile_0/53197/0) of 
(/tmp/openmpi-sessions-cmpbib@compile_0/53197/0/0), mkdir failed [1]

...

Looking deeper, I have found this in /tmp:

ls -ladtr /tmp/openmpi-sessions-cmpbib\@compile_0/* |grep -v "drwx"
-rw-r--r-- 1 cmpbib bib   93 Jan 31 06:47 
/tmp/openmpi-sessions-cmpbib@compile_0/55528
-rw-r--r-- 1 cmpbib bib   92 Jan 31 06:48 
/tmp/openmpi-sessions-cmpbib@compile_0/41437
-rw-r--r-- 1 cmpbib bib   93 Jan 31 07:01 
/tmp/openmpi-sessions-cmpbib@compile_0/59324
-rw-r--r-- 1 cmpbib bib   92 Jan 31 09:49 
/tmp/openmpi-sessions-cmpbib@compile_0/53197
-rw-r--r-- 1 cmpbib bib   93 Jan 31 11:10 
/tmp/openmpi-sessions-cmpbib@compile_0/54532
-rw-r--r-- 1 cmpbib bib   93 Jan 31 14:18 
/tmp/openmpi-sessions-cmpbib@compile_0/36511
-rw-r--r-- 1 cmpbib bib   93 Feb  1 18:50 
/tmp/openmpi-sessions-cmpbib@compile_0/63980



So there are some *files* in /tmp with the same names as the directories 
that Open MPI tries to create.


The content of the file /tmp/openmpi-sessions-cmpbib@compile_0/55528 is:

4016963584.0;tcp://10.1.1.46:51427;tcp://132.203.7.103:51427;tcp://192.168.122.1:51427
31231

which looks like the content of the "contact.txt" file that appears in a 
successfully created session directory.  Also, these files were created 
well before the executions that aborted...


So, is this a bug in 1.6.3, and is there a "solution" for it?
(I know I can clean up the files, but I would expect Open MPI not to try to 
create a directory if a file with the same name already exists...)


Thanks,

Eric
[compile:20089] opal_os_dirpath_create: Error: Unable to create the 
sub-directory (/tmp/openmpi-sessions-cmpbib@compile_0/55528/0) of 
(/tmp/openmpi-sessions-cmpbib@compile_0/55528/0/0), mkdir failed [1]
[compile:20089] [[55528,0],0] ORTE_ERROR_LOG: Error in file util/session_dir.c 
at line 106
[compile:20089] [[55528,0],0] ORTE_ERROR_LOG: Error in file util/session_dir.c 
at line 399
[compile:20089] [[55528,0],0] ORTE_ERROR_LOG: Error in file ess_hnp_module.c at 
line 320
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_session_dir failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--
[compile:20089] [[55528,0],0] ORTE_ERROR_LOG: Error in file runtime/orte_init.c 
at line 128
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_set_name failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--
[compile:20089] [[55528,0],0] ORTE_ERROR_LOG: Error in file orted/orted_main.c 
at line 353
[compile:20034] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on 
the local node in file ess_singleton_module.c at line 343
[compile:20034] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on 
the local node in file ess_singleton_module.c at line 140
[compile:20034] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on 
the local node in file runtime/orte_init.c at line 128
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_set_name failed
  --> Returned value Unable to start a daemon on the local node (-128) instead 
of ORTE_SUCCESS
--

Re: [OMPI users] opal_os_dirpath_create: Error: Unable to create the sub-directory

2014-02-03 Thread Ralph Castain
Seems rather odd - is your /tmp by any chance network mounted?

On Feb 3, 2014, at 9:41 AM, Eric Chamberland  
wrote:

> Hi,
> 
> with OpenMPI 1.6.3 I have encountered this error which "randomly" appears:
> 
> [compile:20089] opal_os_dirpath_create: Error: Unable to create the 
> sub-directory (/tmp/openmpi-sessions-cmpbib@compile_0/55528/0) of 
> (/tmp/openmpi-sessions-cmpbib@compile_0/55528/0/0), mkdir failed [1]
> [compile:20089] [[55528,0],0] ORTE_ERROR_LOG: Error in file 
> util/session_dir.c at line 106
> 
> (view full stderr attached)
> 
> and also this mostly same one:
> 
> [compile:22876] opal_os_dirpath_create: Error: Unable to create the 
> sub-directory (/tmp/openmpi-sessions-cmpbib@compile_0/53197/0) of 
> (/tmp/openmpi-sessions-cmpbib@compile_0/53197/0/0), mkdir failed [1]
> ...
> 
> Looking deeper, I have found this in /tmp:
> 
> ls -ladtr /tmp/openmpi-sessions-cmpbib\@compile_0/* |grep -v "drwx"
> -rw-r--r-- 1 cmpbib bib   93 Jan 31 06:47 
> /tmp/openmpi-sessions-cmpbib@compile_0/55528
> -rw-r--r-- 1 cmpbib bib   92 Jan 31 06:48 
> /tmp/openmpi-sessions-cmpbib@compile_0/41437
> -rw-r--r-- 1 cmpbib bib   93 Jan 31 07:01 
> /tmp/openmpi-sessions-cmpbib@compile_0/59324
> -rw-r--r-- 1 cmpbib bib   92 Jan 31 09:49 
> /tmp/openmpi-sessions-cmpbib@compile_0/53197
> -rw-r--r-- 1 cmpbib bib   93 Jan 31 11:10 
> /tmp/openmpi-sessions-cmpbib@compile_0/54532
> -rw-r--r-- 1 cmpbib bib   93 Jan 31 14:18 
> /tmp/openmpi-sessions-cmpbib@compile_0/36511
> -rw-r--r-- 1 cmpbib bib   93 Feb  1 18:50 
> /tmp/openmpi-sessions-cmpbib@compile_0/63980
> 
> 
> So there are some *files* in /tmp which are named like the directories which 
> are tried to be created
> 
> The content of the file /tmp/openmpi-sessions-cmpbib@compile_0/55528 is:
> 
> 4016963584.0;tcp://10.1.1.46:51427;tcp://132.203.7.103:51427;tcp://192.168.122.1:51427
> 31231
> 
> which looks like the content of the file "contact.txt" which seems to appear 
> in a successfully created directory.  Also, the files have been created far 
> before the executions which aborted...
> 
> So, is this a bug in 1.6.3 and is there a "solution" for that?
> (I know I can cleanup the files, but I expect OpenMPI to not try to create a 
> directory if a file with the same name exists...)
> 
> Thanks,
> 
> Eric
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] opal_os_dirpath_create: Error: Unable to create the sub-directory

2014-02-03 Thread Eric Chamberland

On 02/03/2014 02:49 PM, Ralph Castain wrote:

Seems rather odd - is your /tmp by any chance network mounted?


No, it is a "normal" /tmp:

"cd /tmp; df -h ." gives:

Filesystem  Size  Used Avail Use% Mounted on
/dev/sda149G   17G   30G  37% /

And there is plenty of disk space...

I agree it is odd, but how should Open MPI react when trying to create a 
directory over an existing file name?  I mean, what is it programmed to do?


Thanks,

Eric



Re: [OMPI users] opal_os_dirpath_create: Error: Unable to create the sub-directory

2014-02-03 Thread Ralph Castain
OMPI will error out in that case, as you originally reported. What seems to be 
happening is that you have a bunch of stale session directories, but I'm 
puzzled because the creation dates are so current - for whatever reason, OMPI 
seems to be getting the same jobid much more often than it should. Can you tell 
me something about the environment - e.g., is it managed or just using hostfile?


On Feb 3, 2014, at 12:00 PM, Eric Chamberland 
 wrote:

> On 02/03/2014 02:49 PM, Ralph Castain wrote:
>> Seems rather odd - is your /tmp by any chance network mounted?
> 
> No it is a "normal" /tmp:
> 
> "cd /tmp; df -h ." gives:
> 
> Filesystem  Size  Used Avail Use% Mounted on
> /dev/sda149G   17G   30G  37% /
> 
> And there is plenty of disk space...
> 
> I agree it is odd, but how should OpenMPI react when trying to create a 
> directory over an existing file name?  I mean what is it programmed to do?
> 
> Thanks,
> 
> Eric
> 



Re: [OMPI users] opal_os_dirpath_create: Error: Unable to create the sub-directory

2014-02-03 Thread Eric Chamberland

Hi,

On 02/03/2014 03:09 PM, Ralph Castain wrote:

OMPI will error out in that case, as you originally reported. What seems to be 
happening is that you have a bunch of stale session directories, but I'm 
puzzled because the creation dates are so current - for whatever reason, OMPI 
seems to be getting the same jobid much more often than it should. Can you tell 
me something about the environment - e.g., is it managed or just using hostfile?


This computer is used about 11 times a day to launch about 1500 
executions of our in-house (finite element) code.


We do launch at most 12 single-process executions at the same time, but 
we use PETSc, which always initializes the MPI environment...


Also, we launch some tests which use between 2 and 128 processes 
(on the same computer) just to ensure proper code testing.  In fact, 
performance is not really an issue in these 128-process tests, and we set 
the following environment variable:


export OMPI_MCA_mpi_yield_when_idle=1

because we encountered timeout problems before...

The whole test run lasts about 1 hour, and the result is used to give 
feedback to users who "pushed" modifications to the code.


So I would add: sometimes the tests may be interrupted by segfaults, 
"kill -TERM", or anything else you can imagine...  The problem now is that 
a run won't even start if a mere file exists...


I can flush those files right now, but I am almost sure they will 
reappear in the following days, leading to false "bad results" for the 
tests... and I will have to set up a cleanup procedure before launching 
all the tests... But that will not prevent those files from being 
created while the first of the 1500 tests are running, causing one or 
more of the remaining tests to fail.


I hope this is the information you wanted... Is it?

Thanks,

Eric





On Feb 3, 2014, at 12:00 PM, Eric Chamberland 
 wrote:


On 02/03/2014 02:49 PM, Ralph Castain wrote:

Seems rather odd - is your /tmp by any chance network mounted?


No it is a "normal" /tmp:

"cd /tmp; df -h ." gives:

Filesystem  Size  Used Avail Use% Mounted on
/dev/sda149G   17G   30G  37% /

And there is plenty of disk space...

I agree it is odd, but how should OpenMPI react when trying to create a 
directory over an existing file name?  I mean what is it programmed to do?

Thanks,

Eric





Re: [OMPI users] opal_os_dirpath_create: Error: Unable to create the sub-directory

2014-02-03 Thread Ralph Castain
Very strange - even if you kill the job with SIGTERM, or have processes that 
segfault, OMPI should clean itself up and remove those session directories. 
Granted, the 1.6 series isn't as good about doing so as the 1.7 series, but it 
at least to-date has done pretty well.

Best I can suggest for now is to do the following in your test script:

(1) set TMPDIR=

(2) run your tests

(3) rm -rf /tmp/regression/*

That will ensure you only blow away the session dirs from your regression 
tests. Hopefully, you'll find the directory empty more often than not...

HTH
Ralph

On Feb 3, 2014, at 12:31 PM, Eric Chamberland 
 wrote:

> Hi,
> 
> On 02/03/2014 03:09 PM, Ralph Castain wrote:
>> OMPI will error out in that case, as you originally reported. What seems to 
>> be happening is that you have a bunch of stale session directories, but I'm 
>> puzzled because the creation dates are so current - for whatever reason, 
>> OMPI seems to be getting the same jobid much more often than it should. Can 
>> you tell me something about the environment - e.g., is it managed or just 
>> using hostfile?
> 
> This computer is used about 11 times a day to launch about 1500 executions on 
> our in-house (finite element) code.
> 
> We do launch at most 12 single process executions at the same time, but we 
> use PETSc, which always initialize the MPI environment...
> 
> Also, we are launching some tests which use between 2 to 128 processes (on 
> the same computer) just to ensure proper code testing.  In fact, performance 
> is not quit an issue in these 128 processes tests and we set the following 
> environment variable:
> 
> export OMPI_MCA_mpi_yield_when_idle=1
> 
> because we encountered timeout problems before...
> 
> The whole testing lasts about 1 hour and the result is used to give a 
> feed-back for users who "pushed" modifications to the code
> 
> So I would add: sometime the tests may be interrupted by segfaults, "kill 
> -TERM" or anything you can imagine...  The problem now is that it won't even 
> start if a mere file exists...
> 
> I can flush those files right now, but I am almost sure they will reappear it 
> the following days, leading to false "bad results" for the tests... and I 
> will have to setup a cleanup procedure before launching all the tests... But 
> that will not prevent the fact that those files may be created while running 
> the firsts of the 1500 tests and have 1 or some of the rest to fail
> 
> I hope this is the information you wanted... Is it?
> 
> Thanks,
> 
> Eric
> 
> 
>> 
>> 
>> On Feb 3, 2014, at 12:00 PM, Eric Chamberland 
>>  wrote:
>> 
>>> On 02/03/2014 02:49 PM, Ralph Castain wrote:
 Seems rather odd - is your /tmp by any chance network mounted?
>>> 
>>> No it is a "normal" /tmp:
>>> 
>>> "cd /tmp; df -h ." gives:
>>> 
>>> Filesystem  Size  Used Avail Use% Mounted on
>>> /dev/sda149G   17G   30G  37% /
>>> 
>>> And there is plenty of disk space...
>>> 
>>> I agree it is odd, but how should OpenMPI react when trying to create a 
>>> directory over an existing file name?  I mean what is it programmed to do?
>>> 
>>> Thanks,
>>> 
>>> Eric
>>> 
> 



Re: [OMPI users] opal_os_dirpath_create: Error: Unable to create the sub-directory

2014-02-03 Thread Eric Chamberland

On 02/03/2014 03:59 PM, Ralph Castain wrote:

Very strange - even if you kill the job with SIGTERM, or have processes that 
segfault, OMPI should clean itself up and remove those session directories. 
Granted, the 1.6 series isn't as good about doing so as the 1.7 series, but it 
at least to-date has done pretty well.


Ok, one more piece of information that may matter: all sequential tests are 
launched *without* mpiexec...  I don't know if the "cleanup" phase is 
done by mpiexec or by the binaries themselves...




Best I can suggest for now is to do the following in your test script:

(1) set TMPDIR=

(2) run your tests

(3) rm -rf /tmp/regression/*

That will ensure you only blow away the session dirs from your regression 
tests. Hopefully, you'll find the directory empty more often than not...


Ok, I just added:

find /tmp/openmpi-sessions-${USER}* -maxdepth 1 -type f -exec rm {} \;

which should delete files that shouldn't exist... ;-)

But, IMHO, I still think Open MPI should "choose" another directory name 
if it cannot create the directory because a mere file exists!


How can all users be aware that they have to clean up such files?

Maybe a good compromise would be to have the error message say that there 
is a file with the same name as the chosen directory?


Or add a new entry to the FAQ to help users find the workaround you 
proposed... ;-)


thanks again!

Eric



HTH
Ralph

On Feb 3, 2014, at 12:31 PM, Eric Chamberland 
 wrote:


Hi,

On 02/03/2014 03:09 PM, Ralph Castain wrote:

OMPI will error out in that case, as you originally reported. What seems to be 
happening is that you have a bunch of stale session directories, but I'm 
puzzled because the creation dates are so current - for whatever reason, OMPI 
seems to be getting the same jobid much more often than it should. Can you tell 
me something about the environment - e.g., is it managed or just using hostfile?


This computer is used about 11 times a day to launch about 1500 executions on 
our in-house (finite element) code.

We do launch at most 12 single process executions at the same time, but we use 
PETSc, which always initialize the MPI environment...

Also, we are launching some tests which use between 2 to 128 processes (on the 
same computer) just to ensure proper code testing.  In fact, performance is not 
quit an issue in these 128 processes tests and we set the following environment 
variable:

export OMPI_MCA_mpi_yield_when_idle=1

because we encountered timeout problems before...

The whole testing lasts about 1 hour and the result is used to give a feed-back for users 
who "pushed" modifications to the code

So I would add: sometime the tests may be interrupted by segfaults, "kill 
-TERM" or anything you can imagine...  The problem now is that it won't even start 
if a mere file exists...

I can flush those files right now, but I am almost sure they will reappear it the 
following days, leading to false "bad results" for the tests... and I will have 
to setup a cleanup procedure before launching all the tests... But that will not prevent 
the fact that those files may be created while running the firsts of the 1500 tests and 
have 1 or some of the rest to fail

I hope this is the information you wanted... Is it?

Thanks,

Eric





On Feb 3, 2014, at 12:00 PM, Eric Chamberland 
 wrote:


On 02/03/2014 02:49 PM, Ralph Castain wrote:

Seems rather odd - is your /tmp by any chance network mounted?


No it is a "normal" /tmp:

"cd /tmp; df -h ." gives:

Filesystem  Size  Used Avail Use% Mounted on
/dev/sda149G   17G   30G  37% /

And there is plenty of disk space...

I agree it is odd, but how should OpenMPI react when trying to create a 
directory over an existing file name?  I mean what is it programmed to do?

Thanks,

Eric







Re: [OMPI users] opal_os_dirpath_create: Error: Unable to create the sub-directory

2014-02-03 Thread Ralph Castain

On Feb 3, 2014, at 1:13 PM, Eric Chamberland  
wrote:

> On 02/03/2014 03:59 PM, Ralph Castain wrote:
>> Very strange - even if you kill the job with SIGTERM, or have processes that 
>> segfault, OMPI should clean itself up and remove those session directories. 
>> Granted, the 1.6 series isn't as good about doing so as the 1.7 series, but 
>> it at least to-date has done pretty well.
> 
> Ok, one more information here that may matter: All sequential tests are 
> launched *without* mpiexec...  I don't know if the "cleanup" phase is done by 
> mpiexec or the binaries...

Ah, yes that would be a source of the problem! We can't guarantee cleanup if 
you just kill the procs or they segfault *unless* mpiexec is used to launch the 
job. What are you using to launch? Most resource managers provide an "epilog" 
capability for precisely this purpose as all MPIs would display the same issue.

> 
>> 
>> Best I can suggest for now is to do the following in your test script:
>> 
>> (1) set TMPDIR=
>> 
>> (2) run your tests
>> 
>> (3) rm -rf /tmp/regression/*
>> 
>> That will ensure you only blow away the session dirs from your regression 
>> tests. Hopefully, you'll find the directory empty more often than not...
> 
> Ok, I just added:
> 
> find /tmp/openmpi-sessions-${USER}* -maxdepth 1 -type f -exec rm {} \;
> 
> which should delete files that shouldn't exists... ;-)
> 
> But, IMHO, I still think OpenMPI should "choose" another directory name if it 
> can't create it because a poor file exists!

We could do that - but now we get into the bottomless pit of trying every 
possible combination of directory names, and ensuring that every process comes 
up with the same answer! Remember, the session dir is where the shared memory 
regions rendezvous, so every process on a node would have to find the same place

> 
> How can all users be aware that they have to cleanup such files?

Given how long 1.6.x has been out there, and that this is about the only time 
I've heard of a problem, I'm not sure this is a general enough issue to merit 
the concern

> 
> Maybe a good compromise would be to have the error message to tell there is a 
> file with the same name of the directory chosen?

I can make that change - good suggestion.

> 
> Or add a new entry to the FAQ to help users find the workaround you 
> proposed... ;-)

we can try to do that too

> 
> thanks again!
> 
> Eric
> 
>> 
>> HTH
>> Ralph
>> 
>> On Feb 3, 2014, at 12:31 PM, Eric Chamberland 
>>  wrote:
>> 
>>> Hi,
>>> 
>>> On 02/03/2014 03:09 PM, Ralph Castain wrote:
 OMPI will error out in that case, as you originally reported. What seems 
 to be happening is that you have a bunch of stale session directories, but 
 I'm puzzled because the creation dates are so current - for whatever 
 reason, OMPI seems to be getting the same jobid much more often than it 
 should. Can you tell me something about the environment - e.g., is it 
 managed or just using hostfile?
>>> 
>>> This computer is used about 11 times a day to launch about 1500 executions 
>>> on our in-house (finite element) code.
>>> 
>>> We do launch at most 12 single process executions at the same time, but we 
>>> use PETSc, which always initialize the MPI environment...
>>> 
>>> Also, we are launching some tests which use between 2 to 128 processes (on 
>>> the same computer) just to ensure proper code testing.  In fact, 
>>> performance is not quit an issue in these 128 processes tests and we set 
>>> the following environment variable:
>>> 
>>> export OMPI_MCA_mpi_yield_when_idle=1
>>> 
>>> because we encountered timeout problems before...
>>> 
>>> The whole testing lasts about 1 hour and the result is used to give a 
>>> feed-back for users who "pushed" modifications to the code
>>> 
>>> So I would add: sometime the tests may be interrupted by segfaults, "kill 
>>> -TERM" or anything you can imagine...  The problem now is that it won't 
>>> even start if a mere file exists...
>>> 
>>> I can flush those files right now, but I am almost sure they will reappear 
>>> it the following days, leading to false "bad results" for the tests... and 
>>> I will have to setup a cleanup procedure before launching all the tests... 
>>> But that will not prevent the fact that those files may be created while 
>>> running the firsts of the 1500 tests and have 1 or some of the rest to 
>>> fail
>>> 
>>> I hope this is the information you wanted... Is it?
>>> 
>>> Thanks,
>>> 
>>> Eric
>>> 
>>> 
 
 
 On Feb 3, 2014, at 12:00 PM, Eric Chamberland 
  wrote:
 
> On 02/03/2014 02:49 PM, Ralph Castain wrote:
>> Seems rather odd - is your /tmp by any chance network mounted?
> 
> No it is a "normal" /tmp:
> 
> "cd /tmp; df -h ." gives:
> 
> Filesystem  Size  Used Avail Use% Mounted on
> /dev/sda149G   17G   30G  37% /
> 
> And there is plenty of disk space...
> 
> I agree it is odd, but how should OpenMPI react when 

Re: [OMPI users] opal_os_dirpath_create: Error: Unable to create the sub-directory

2014-02-03 Thread Eric Chamberland

Hi Ralph,

On 02/03/2014 04:20 PM, Ralph Castain wrote:

On Feb 3, 2014, at 1:13 PM, Eric Chamberland  
wrote:


On 02/03/2014 03:59 PM, Ralph Castain wrote:

Very strange - even if you kill the job with SIGTERM, or have processes that 
segfault, OMPI should clean itself up and remove those session directories. 
Granted, the 1.6 series isn't as good about doing so as the 1.7 series, but it 
at least to-date has done pretty well.

Ok, one more information here that may matter: All sequential tests are launched 
*without* mpiexec...  I don't know if the "cleanup" phase is done by mpiexec or 
the binaries...

Ah, yes that would be a source of the problem! We can't guarantee cleanup if you just 
kill the procs or they segfault *unless* mpiexec is used to launch the job. What are you 
using to launch? Most resource managers provide an "epilog" capability for 
precisely this purpose as all MPIs would display the same issue.

For the sequential jobs, we just launch the tests on the "command 
line"... no resource manager is ever used.  For the jobs which require 
more than 1 process, we have "mpiexec -n ..." added to the command line...



which should delete files that shouldn't exists... ;-)

But, IMHO, I still think OpenMPI should "choose" another directory name if it 
can't create it because a poor file exists!

We could do that - but now we get into the bottomless pit of trying every 
possible combination of directory names, and ensuring that every process comes 
up with the same answer! Remember, the session dir is where the shared memory 
regions rendezvous, so every process on a node would have to find the same place.

Ok, just for my knowledge: does that mean that if I launch 2 processes on a 
single node and they have to communicate, they will do it via the files 
in /tmp?



How can all users be aware that they have to cleanup such files?

Given how long 1.6.x has been out there, and that this is about the only time 
I've heard of a problem, I'm not sure this is a general enough issue to merit 
the concern

Ok.  I just verified on 8 other computers/architectures that are 
running the same tests: only one of them has files at the top level of 
/tmp/openmpi-sessions-${USER}*.
Since we have been doing this kind of testing for many years, I also agree 
it is not a widespread issue...  But it has occurred twice in the last 3 
days!!! :-/



Maybe a good compromise would be to have the error message to tell there is a 
file with the same name of the directory chosen?

I can make that change - good suggestion.

ok, thanks!




Or add a new entry to the FAQ to help users find the workaround you proposed... 
;-)

we can try to do that too


If I may suggest a way to test the behavior of 1.7.x... what about this: have 
a test case that creates a bunch of files (named from 0 to 65536) in 
/tmp/openmpi-sessions-${USER}... before launching an executable without 
mpirun... >:)


Anyway, thanks a lot!

Eric



Re: [OMPI users] opal_os_dirpath_create: Error: Unable to create the sub-directory

2014-02-03 Thread Reuti
Am 03.02.2014 um 23:01 schrieb Eric Chamberland:

> Hi Ralph,
> 
> On 02/03/2014 04:20 PM, Ralph Castain wrote:
>> On Feb 3, 2014, at 1:13 PM, Eric Chamberland 
>>  wrote:
>> 
>>> On 02/03/2014 03:59 PM, Ralph Castain wrote:
 Very strange - even if you kill the job with SIGTERM, or have processes 
 that segfault, OMPI should clean itself up and remove those session 
 directories. Granted, the 1.6 series isn't as good about doing so as the 
 1.7 series, but it at least to-date has done pretty well.
>>> Ok, one more information here that may matter: All sequential tests are 
>>> launched *without* mpiexec...  I don't know if the "cleanup" phase is done 
>>> by mpiexec or the binaries...
>> Ah, yes that would be a source of the problem! We can't guarantee cleanup if 
>> you just kill the procs or they segfault *unless* mpiexec is used to launch 
>> the job. What are you using to launch? Most resource managers provide an 
>> "epilog" capability for precisely this purpose as all MPIs would display the 
>> same issue.
> For the sequential jobs, we just launch the tests on the "command line"... no 
> resource manager is ever used.  For the jobs which requires more than 1 
> process, we have "mpiexec -n ..." added to the command line...
> 
>>> which should delete files that shouldn't exists... ;-)
>>> 
>>> But, IMHO, I still think OpenMPI should "choose" another directory name if 
>>> it can't create it because a poor file exists!
>> We could do that - but now we get into the bottomless pit of trying every 
>> possible combination of directory names, and ensuring that every process 
>> comes up with the same answer! Remember, the session dir is where the shared 
>> memory regions rendezvous, so every process on a node would have to find the 
>> same place
> ok.  Just for my knowledge: that means if I launch 2 processes on a single 
> node and they have to communicate, they will do it by the files in /tmp?
> 
>>> How can all users be aware that they have to cleanup such files?
>> Given how long 1.6.x has been out there, and that this is about the only 
>> time I've heard of a problem, I'm not sure this is a general enough issue to 
>> merit the concern
> Ok.  I did just verified on 8 other computers/architectures that are running 
> the same tests: there is only 1 which have files in the directory level of 
> /tmp/openmpi-sessions-${USER}*
> Since we do that kind of testing since many years, I also agree it is not a 
> widespread issue...  But it just occured 2 times in the last 3 days!!! :-/

What about using a queuing system? Open MPI will put the created files into a 
subdirectory dedicated to the job by the queuing system. Even if Open MPI 
fails to remove the files, the queuing system will.

-- Reuti


>> 
>>> Maybe a good compromise would be to have the error message to tell there is 
>>> a file with the same name of the directory chosen?
>> I can make that change - good suggestion.
> ok, thanks!
> 
>> 
>>> Or add a new entry to the FAQ to help users find the workaround you 
>>> proposed... ;-)
>> we can try to do that too
> 
> If I may suggest to test the behavior of 1.7.x... what about this: Have a 
> test case that creates a bunch of files (from 0 to 65536) in 
> /tmp/openmpi-sessions-${USER}... before launching an executable without 
> mpirun... >:)
> 
> Anyway, thanks a lot!
> 
> Eric
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] opal_os_dirpath_create: Error: Unable to create the sub-directory

2014-02-03 Thread Ralph Castain

On Feb 3, 2014, at 2:01 PM, Eric Chamberland  
wrote:

> Hi Ralph,
> 
> On 02/03/2014 04:20 PM, Ralph Castain wrote:
>> On Feb 3, 2014, at 1:13 PM, Eric Chamberland 
>>  wrote:
>> 
>>> On 02/03/2014 03:59 PM, Ralph Castain wrote:
 Very strange - even if you kill the job with SIGTERM, or have processes 
 that segfault, OMPI should clean itself up and remove those session 
 directories. Granted, the 1.6 series isn't as good about doing so as the 
 1.7 series, but it at least to-date has done pretty well.
>>> Ok, one more information here that may matter: All sequential tests are 
>>> launched *without* mpiexec...  I don't know if the "cleanup" phase is done 
>>> by mpiexec or the binaries...
>> Ah, yes that would be a source of the problem! We can't guarantee cleanup if 
>> you just kill the procs or they segfault *unless* mpiexec is used to launch 
>> the job. What are you using to launch? Most resource managers provide an 
>> "epilog" capability for precisely this purpose as all MPIs would display the 
>> same issue.
> For the sequential jobs, we just launch the tests on the "command line"... no 
> resource manager is ever used.  For the jobs which requires more than 1 
> process, we have "mpiexec -n ..." added to the command line...

Understood. FWIW, if those sequential jobs call "MPI_Init", then they will 
create a session directory tree. I've been removing that in the 1.7 series so 
it only gets created when needed, but not in the 1.6 series.
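To illustrate the point: even a "sequential" run creates the session directory 
tree as soon as it calls MPI_Init, which is what the PETSc-based tests 
described earlier do, since PETSc initializes MPI.  A trivial sketch (not from 
the thread):

-
#include <mpi.h>

/* Launched directly (no mpiexec), MPI_Init in the 1.6 series still creates a
   session directory such as /tmp/openmpi-sessions-<user>@<host>_0/<jobid>/...
   If the process segfaults or is killed before MPI_Finalize, nothing is left
   behind to clean that tree up. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    /* ... purely sequential work (e.g. via PETSc) ... */
    MPI_Finalize();   /* normal termination cleans up the session directory */
    return 0;
}
-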

> 
>>> which should delete files that shouldn't exists... ;-)
>>> 
>>> But, IMHO, I still think OpenMPI should "choose" another directory name if 
>>> it can't create it because a poor file exists!
>> We could do that - but now we get into the bottomless pit of trying every 
>> possible combination of directory names, and ensuring that every process 
>> comes up with the same answer! Remember, the session dir is where the shared 
>> memory regions rendezvous, so every process on a node would have to find the 
>> same place
> ok.  Just for my knowledge: that means if I launch 2 processes on a single 
> node and they have to communicate, they will do it by the files in /tmp?

They won't communicate via the files - they just use the files as a rendezvous 
point to exchange shared memory region pointers.

> 
>>> How can all users be aware that they have to cleanup such files?
>> Given how long 1.6.x has been out there, and that this is about the only 
>> time I've heard of a problem, I'm not sure this is a general enough issue to 
>> merit the concern
> Ok.  I did just verified on 8 other computers/architectures that are running 
> the same tests: there is only 1 which have files in the directory level of 
> /tmp/openmpi-sessions-${USER}*
> Since we do that kind of testing since many years, I also agree it is not a 
> widespread issue...  But it just occured 2 times in the last 3 days!!! :-/

Bummer :-(

>> 
>>> Maybe a good compromise would be to have the error message to tell there is 
>>> a file with the same name of the directory chosen?
>> I can make that change - good suggestion.
> ok, thanks!
> 
>> 
>>> Or add a new entry to the FAQ to help users find the workaround you 
>>> proposed... ;-)
>> we can try to do that too
> 
> If I may suggest to test the behavior of 1.7.x... what about this: Have a 
> test case that creates a bunch of files (from 0 to 65536) in 
> /tmp/openmpi-sessions-${USER}... before launching an executable without 
> mpirun... >:)

Ick - it will actually only conflict if/when the PIDs wrap, so it's a pretty 
rare issue.

> 
> Anyway, thanks a lot!
> 
> Eric
>