[OMPI users] Fatal Error: Cannot read module file 'mpi.mod' opened at (1), because it was created by a different version of GNU Fortran

2015-07-28 Thread Syed Ahsan Ali
I am getting this error during the installation of an application.
The error apparently complains that Open MPI was compiled with a
different version of GNU Fortran, but I am sure it was compiled with
gcc 4.9.2, and the same compiler is being used to build the
application.

I am using openmpi-1.8.4

Ahsan


Re: [OMPI users] Fatal Error: Cannot read module file 'mpi.mod' opened at (1), because it was created by a different version of GNU Fortran

2015-07-28 Thread Gilles Gouaillardet

Hi,

you can run

zcat mpi.mod | head

to confirm which gfortran was used to build the mpi.mod shipped with
your Open MPI installation


GFORTRAN module version '10' => gcc 4.8.3
GFORTRAN module version '12' => gcc 4.9.2
GFORTRAN module version '14' => gcc 5.1.0

I assume the failing command is mpifort ...
so you can run
mpifort -showme ...
to see how gfortran is invoked.

It is likely that mpifort simply runs gfortran, and your PATH does not
point to gfortran 4.9.2.
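
For example, a quick check could look like this (a sketch; the mpi.mod path is a
placeholder for wherever your Open MPI install puts it, and the expected module
version follows the table above):

# which gfortran produced the mpi.mod in the Open MPI install tree?
zcat /opt/openmpi-1.8.4/lib/mpi.mod | head -1    # expect: GFORTRAN module version '12' for a gcc 4.9.2 build

# which compiler does the wrapper really call, and which gfortran is first in PATH?
mpifort -showme
which gfortran && gfortran --version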


Cheers,

Gilles

On 7/28/2015 1:47 PM, Syed Ahsan Ali wrote:

I am getting this error during installation of an application.
Apparently the error seems to be complaining about openmpi being
compiled with different version of gnu fortran but I am sure that it
was compiled with gcc-4.9.2. The same is also being used for
application compilation.

I am using openmpi-1.8.4

Ahsan





[OMPI users] strange behavior of MPI_wait() method

2015-07-28 Thread Cristian RUIZ

Hello,

I'm measuring the overhead of using Linux containers for HPC 
applications. To do so, I compared the execution time of the NAS 
parallel benchmarks on two infrastructures:


1) real: 16 real machines
2) container: 16 containers distributed over 16 real machines

Each machine used is equipped with two Intel Xeon E5-2630v3 processors 
(with 8 cores each), 128 GB of RAM and a 10 Gigabit Ethernet adapter.


In my results, I found a particular performance degradation for the CG.B 
benchmark:


    walltime  numprocess       type       ci1       ci2     overhead
1    6615085          16     native   6473340   6756830    1.1271473
2    6349030          32     native   6315947   6382112    2.2187747
3    5811724          64     native   5771509   5851938    0.8983445
4    4002865         128     native   3966314   4039416  180.7472715
5    4077885         256     native   4044667       403  402.8036531
6    6540523          16  container   6458503   6622543        0.000
7    6208159          32  container   6184888   6231431        0.000
8    5759514          64  container   5719453   5799575        0.000
9   11237935         128  container  10762906  11712963        0.000
10  20503755         256  container  19830425  21177085        0.000

(16 MPI processes per machine/container)

When I use containers everything is fine below 128 MPI processes. I got 
180% and 400% performance degradation with 128 and 256 MPI processes 
respectively. I repeated the measurements and got statistically the 
same results. So, I decided to generate a trace of the execution using 
TAU. I discovered that the source of the overhead is MPI_Wait(), which 
sometimes takes around 0.2 seconds; this happens around 20 times, which 
adds around 4 seconds to the execution time. The routine is called 25992 
times and on average takes between 50 and 300 usecs (values obtained 
with profiling).

This strange behavior was reported in this paper [1] (page 10), which says:

"We can see two outstanding zones of MPI_Send and MPI_Wait. Such 
operations typically take few microseconds to less than a millisecond. 
Here they take 0.2 seconds"


They attributed that strange behavior to packet loss and network 
malfunctioning. In my experiments I measured the number of packets 
dropped and nothing unusual happened.
I used two versions of Open MPI, 1.6.5 and 1.8.5, and in both versions I 
got the same strange behavior. Any clues as to what could be the source 
of that strange behavior? Could you please suggest a method to debug 
this problem?


Thank you in advance

[1] https://hal.inria.fr/hal-00919507/file/smpi_pmbs13.pdf





Re: [OMPI users] strange behavior of MPI_wait() method

2015-07-28 Thread Gilles Gouaillardet
Cristian,

If the message takes some extra time to land at the receiver, then
MPI_Wait will take more time.
Or even worse, if the sender is late, the receiver will spend even more
time in MPI_Wait.

First, how do you run 128 tasks on 16 nodes?
If you do a simple mpirun, then you will use the sm or vader BTL.
Containers can only use the tcp BTL, even within the same physical node.
So I encourage you to run mpirun --mca btl tcp,self -np 128 ...
and see if you observe any degradation.

I know very little about containers, but if I remember correctly, you can
do things such as cgroups (CPU capping, network bandwidth capping, memory
limits). Do you use such things?
A possible explanation is that a container reaches its limit and is given
a very low priority.

Regardless of the containers, you end up having 16 tasks sharing the same
interconnect.
I can imagine that an unfair share can lead to this kind of behaviour.

On the network, did you measure zero errors, or a few?
Even a few errors take some extra time to recover from, and if your
application is communication intensive, these delays get propagated and
you can end up with a huge performance hit.
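
To double-check the interface error counters on each node, something along
these lines can help (a generic sketch; eth0 is a placeholder for the 10 GbE
interface):

ip -s link show eth0                      # per-interface RX/TX errors and dropped packets
ethtool -S eth0 | grep -iE 'err|drop'     # NIC-level counters, if the driver exposes them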

Cheers,

Gilles

On Tuesday, July 28, 2015, Cristian RUIZ  wrote:

>  Hello,
>
> I'm measuring the overhead of using Linux container for HPC applications.
> To do so I was comparing the execution time of NAS parallel benchmarks on
> two infrastructures:
>
> 1) real: 16 real machines
> 2) container: 16 containers distributed over 16 real machines
>
> Each machine used is equipped with two Intel Xeon E5-2630v3 processors
> (with 8 cores each), 128 GB of RAM and a 10 Gigabit Ethernet adapter.
>
> In my results, I found a particular performance degradation for CG.B
> benchmark:
>
>     walltime  numprocess       type       ci1       ci2     overhead
> 1    6615085          16     native   6473340   6756830    1.1271473
> 2    6349030          32     native   6315947   6382112    2.2187747
> 3    5811724          64     native   5771509   5851938    0.8983445
> 4    4002865         128     native   3966314   4039416  180.7472715
> 5    4077885         256     native   4044667       403  402.8036531
> 6    6540523          16  container   6458503   6622543        0.000
> 7    6208159          32  container   6184888   6231431        0.000
> 8    5759514          64  container   5719453   5799575        0.000
> 9   11237935         128  container  10762906  11712963        0.000
> 10  20503755         256  container  19830425  21177085        0.000
>
> (16 MPI processes per machine/container)
>
> When I use containers everything is fine before 128 MPI processes.  I got
> 180% and 400% performance degration with 128  and 256 MPI processes
> respectively. I repeated again the meaures and I had statistically the same
> results. So, I decided to generate a trace of the execution using TAU. I
> discovered that the source of the overhead is the MPI_wait() method that
> sometimes takes around 0.2 seconds and this happens around 20 times which
> adds around 4 seconds to the execution time. The method is called 25992
> times and in avarage takes between 50 and 300 usecs (values obtained with
> profiling).
> This strange behavior was reported in this paper[1] (page 10)  that says:
>
> "We can see two outstanding zones of MPI_Send and MPI_Wait. Such
> operations typically take few microseconds to less than a millisecond. Here
> they take 0.2 seconds"
>
> They attributed that strange behavior to package loss and network
> malfunctioning. In my experiments I measured the number of packets dropped
> and nothing unusual happened.
> I used two versions of OpenMPI 1.6.5 and 1.8.5 and in both versions I got
> the same strange behavior. Any clues of what could be the source of that
> strange behavior? could you please suggest any method to
> debug this problem?
>
>
> Thank you in advance
>
> [1] https://hal.inria.fr/hal-00919507/file/smpi_pmbs13.pdf
>
>
>
>


Re: [OMPI users] strange behavior of MPI_wait() method

2015-07-28 Thread Gilles Gouaillardet
Cristian,

one more thing...
Make sure tasks run on the same physical nodes with and without containers.
For example, if in native mode tasks 0 to 15 run on node 0, then in
container mode tasks 0 to 15 should run on the containers hosted by node 0.
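
A quick way to verify the placement (a sketch; machine_file and cg.B.128 are
the names used in this thread, the rest is generic):

# count how many ranks land on each host/container
mpirun --machinefile machine_file -np 128 hostname | sort | uniq -c

# or let Open MPI 1.8 report the binding chosen for every rank
mpirun --report-bindings --machinefile machine_file -np 128 ./cg.B.128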

Cheers,

Gilles

On Tuesday, July 28, 2015, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> Cristian,
>
> If the message takes some extra time to land into the receiver, then
> MPI_Wait will take more time.
> or even worse, if the sender is late, the receiver will spend even more
> time in MPI_Wait.
>
> First, how do you run 128 tasks on 16 nodes ?
> if you do a simple mpirun, then you will use sm or vader btl.
> containers can only use the tcp btl, even within the same physical node.
> so I encourage you to mpirun --mca btl tcp,self -np 128 ...
> and see if you observe any degradation.
>
> I know very few about containers, but if I remember correctly, you can do
> stuff such as cgroup (cpu caping, network bandwidth caping, memory limit)
> do you use such things ?
> a possible explanation is a container reaches it's limit and is given a
> very low priority.
>
> regardless the containers, you end up having 16 tasks sharing the same
> interconnect.
> I can imagine that an unfair share can lead to this kind of behaviour.
>
> on the network, did you measure zero or few errors ?
> few errors take some extra time to be fixed, and if your application is
> communication intensive, these delays get propagated and you can end up
> with huge performance hit.
>
> Cheers,
>
> Gilles
>
> On Tuesday, July 28, 2015, Cristian RUIZ  > wrote:
>
>>  Hello,
>>
>> I'm measuring the overhead of using Linux container for HPC applications.
>> To do so I was comparing the execution time of NAS parallel benchmarks on
>> two infrastructures:
>>
>> 1) real: 16 real machines
>> 2) container: 16 containers distributed over 16 real machines
>>
>> Each machine used is equipped with two Intel Xeon E5-2630v3 processors
>> (with 8 cores each), 128 GB of RAM and a 10 Gigabit Ethernet adapter.
>>
>> In my results, I found a particular performance degradation for CG.B
>> benchmark:
>>
>>     walltime  numprocess       type       ci1       ci2     overhead
>> 1    6615085          16     native   6473340   6756830    1.1271473
>> 2    6349030          32     native   6315947   6382112    2.2187747
>> 3    5811724          64     native   5771509   5851938    0.8983445
>> 4    4002865         128     native   3966314   4039416  180.7472715
>> 5    4077885         256     native   4044667       403  402.8036531
>> 6    6540523          16  container   6458503   6622543        0.000
>> 7    6208159          32  container   6184888   6231431        0.000
>> 8    5759514          64  container   5719453   5799575        0.000
>> 9   11237935         128  container  10762906  11712963        0.000
>> 10  20503755         256  container  19830425  21177085        0.000
>>
>> (16 MPI processes per machine/container)
>>
>> When I use containers everything is fine before 128 MPI processes.  I got
>> 180% and 400% performance degration with 128  and 256 MPI processes
>> respectively. I repeated again the meaures and I had statistically the same
>> results. So, I decided to generate a trace of the execution using TAU. I
>> discovered that the source of the overhead is the MPI_wait() method that
>> sometimes takes around 0.2 seconds and this happens around 20 times which
>> adds around 4 seconds to the execution time. The method is called 25992
>> times and in avarage takes between 50 and 300 usecs (values obtained with
>> profiling).
>> This strange behavior was reported in this paper[1] (page 10)  that says:
>>
>> "We can see two outstanding zones of MPI_Send and MPI_Wait. Such
>> operations typically take few microseconds to less than a millisecond. Here
>> they take 0.2 seconds"
>>
>> They attributed that strange behavior to package loss and network
>> malfunctioning. In my experiments I measured the number of packets dropped
>> and nothing unusual happened.
>> I used two versions of OpenMPI 1.6.5 and 1.8.5 and in both versions I got
>> the same strange behavior. Any clues of what could be the source of that
>> strange behavior? could you please suggest any method to
>> debug this problem?
>>
>>
>> Thank you in advance
>>
>> [1] https://hal.inria.fr/hal-00919507/file/smpi_pmbs13.pdf
>>
>>
>>
>>


Re: [OMPI users] strange behavior of MPI_wait() method

2015-07-28 Thread Cristian RUIZ

Thank you for answering. I executed the test with the following command
in both setups:

mpirun --mca btl self,sm,tcp --machinefile machine_file cg.B.128

My machine file is composed of 128 lines (each machine hostname is
repeated 16 times). There is just one container per machine and each
container is configured with 16 cores, so they are able to use "sm".
Everything is set up properly; I used LXC [1], and I don't observe any
problem with the other benchmarks I executed.
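
For reference, a machine file with that layout can be generated like this
(a sketch; hosts.txt is a hypothetical file with one placeholder hostname
per line):

# expand each hostname into 16 consecutive lines, one line per MPI rank
while read h; do
  for i in $(seq 16); do echo "$h"; done
done < hosts.txt > machine_file
wc -l machine_file    # each host appears 16 times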


On the network I observe 2 dropped packets on almost all interfaces of 
the participating nodes. I think this is normal because I observe the 
same thing when I use real machines, and the performance in that case is 
much better.


[1] https://linuxcontainers.org/



On 07/28/2015 02:31 PM, Gilles Gouaillardet wrote:

Cristian,

If the message takes some extra time to land into the receiver, then 
MPI_Wait will take more time.
or even worse, if the sender is late, the receiver will spend even 
more time in MPI_Wait.


First, how do you run 128 tasks on 16 nodes ?
if you do a simple mpirun, then you will use sm or vader btl.
containers can only use the tcp btl, even within the same physical node.
so I encourage you to mpirun --mca btl tcp,self -np 128 ...
and see if you observe any degradation.

I know very few about containers, but if I remember correctly, you can 
do stuff such as cgroup (cpu caping, network bandwidth caping, memory 
limit)

do you use such things ?
a possible explanation is a container reaches it's limit and is given 
a very low priority.


regardless the containers, you end up having 16 tasks sharing the same 
interconnect.

I can imagine that an unfair share can lead to this kind of behaviour.

on the network, did you measure zero or few errors ?
few errors take some extra time to be fixed, and if your application 
is communication intensive, these delays get propagated and you can 
end up with huge performance hit.


Cheers,

Gilles

On Tuesday, July 28, 2015, Cristian RUIZ wrote:


Hello,

I'm measuring the overhead of using Linux container for HPC
applications. To do so I was comparing the execution time of NAS
parallel benchmarks on two infrastructures:

1) real: 16 real machines
2) container: 16 containers distributed over 16 real machines

Each machine used is equipped with two Intel Xeon E5-2630v3
processors (with 8 cores each), 128 GB of RAM and a 10 Gigabit
Ethernet adapter.

In my results, I found a particular performance degradation for
CG.B benchmark:

    walltime  numprocess       type       ci1       ci2     overhead
1    6615085          16     native   6473340   6756830    1.1271473
2    6349030          32     native   6315947   6382112    2.2187747
3    5811724          64     native   5771509   5851938    0.8983445
4    4002865         128     native   3966314   4039416  180.7472715
5    4077885         256     native   4044667       403  402.8036531
6    6540523          16  container   6458503   6622543        0.000
7    6208159          32  container   6184888   6231431        0.000
8    5759514          64  container   5719453   5799575        0.000
9   11237935         128  container  10762906  11712963        0.000
10  20503755         256  container  19830425  21177085        0.000

(16 MPI processes per machine/container)

When I use containers everything is fine before 128 MPI
processes.  I got 180% and 400% performance degration with 128 
and 256 MPI processes respectively. I repeated again the meaures

and I had statistically the same results. So, I decided to
generate a trace of the execution using TAU. I discovered that the
source of the overhead is the MPI_wait() method that sometimes
takes around 0.2 seconds and this happens around 20 times which
adds around 4 seconds to the execution time. The method is called
25992 times and in avarage takes between 50 and 300 usecs (values
obtained with profiling).
This strange behavior was reported in this paper[1] (page 10) 
that says:


"We can see two outstanding zones of MPI_Send and MPI_Wait. Such
operations typically take few microseconds to less than a
millisecond. Here they take 0.2 seconds"

They attributed that strange behavior to package loss and network
malfunctioning. In my experiments I measured the number of packets
dropped and nothing unusual happened.
I used two versions of OpenMPI 1.6.5 and 1.8.5 and in both
versions I got the same strange behavior. Any clues of what could
be the source of that strange behavior? could you please suggest
any method to
debug this problem?


Thank you in advance

[1] https://hal.inria.fr/hal-00919507/file/smpi_pmbs13.pdf






Re: [OMPI users] strange behavior of MPI_wait() method

2015-07-28 Thread Gilles Gouaillardet
Thanks for clarifying that there is only one container per host.

Do you always run 16 tasks per host/container,
or do you always run 16 hosts/containers?

Also, does LXC set up iptables rules when you start a container?

Cheers,

Gilles

On Tuesday, July 28, 2015, Cristian RUIZ  wrote:

>  Thank you for answering. I executed the test with the following command:
>
> mpirun  --mca btl self,sm,tcp --machinefile machine_file cg.B.128 in both
> setups. My machine file is composed of 128 lines (each machine hostname is
> repeated 16 times). There is just one container per machine and the
> container is configured with 16 cores. So, they are able to use "sm".
> Everything is set properly I used LXC[1], I dont observe any problem with
> the other benchmarks I executed.
>
> on the network I observe 2 dropped packets over almost all interfaces of
> the participating nodes. I think this is normal becuase I observe the same
> thing when I use real machine and the perfomance in this case is much
> better.
>
> [1] https://linuxcontainers.org/
>
>
>
> On 07/28/2015 02:31 PM, Gilles Gouaillardet wrote:
>
> Cristian,
>
>  If the message takes some extra time to land into the receiver, then
> MPI_Wait will take more time.
> or even worse, if the sender is late, the receiver will spend even more
> time in MPI_Wait.
>
>  First, how do you run 128 tasks on 16 nodes ?
> if you do a simple mpirun, then you will use sm or vader btl.
> containers can only use the tcp btl, even within the same physical node.
> so I encourage you to mpirun --mca btl tcp,self -np 128 ...
> and see if you observe any degradation.
>
>  I know very few about containers, but if I remember correctly, you can do
> stuff such as cgroup (cpu caping, network bandwidth caping, memory limit)
> do you use such things ?
> a possible explanation is a container reaches it's limit and is given a
> very low priority.
>
>  regardless the containers, you end up having 16 tasks sharing the same
> interconnect.
> I can imagine that an unfair share can lead to this kind of behaviour.
>
>  on the network, did you measure zero or few errors ?
> few errors take some extra time to be fixed, and if your application is
> communication intensive, these delays get propagated and you can end up
> with huge performance hit.
>
> Cheers,
>
>  Gilles
>
> On Tuesday, July 28, 2015, Cristian RUIZ  > wrote:
>
>>  Hello,
>>
>> I'm measuring the overhead of using Linux container for HPC applications.
>> To do so I was comparing the execution time of NAS parallel benchmarks on
>> two infrastructures:
>>
>> 1) real: 16 real machines
>> 2) container: 16 containers distributed over 16 real machines
>>
>> Each machine used is equipped with two Intel Xeon E5-2630v3 processors
>> (with 8 cores each), 128 GB of RAM and a 10 Gigabit Ethernet adapter.
>>
>> In my results, I found a particular performance degradation for CG.B
>> benchmark:
>>
>>     walltime  numprocess       type       ci1       ci2     overhead
>> 1    6615085          16     native   6473340   6756830    1.1271473
>> 2    6349030          32     native   6315947   6382112    2.2187747
>> 3    5811724          64     native   5771509   5851938    0.8983445
>> 4    4002865         128     native   3966314   4039416  180.7472715
>> 5    4077885         256     native   4044667       403  402.8036531
>> 6    6540523          16  container   6458503   6622543        0.000
>> 7    6208159          32  container   6184888   6231431        0.000
>> 8    5759514          64  container   5719453   5799575        0.000
>> 9   11237935         128  container  10762906  11712963        0.000
>> 10  20503755         256  container  19830425  21177085        0.000
>>
>> (16 MPI processes per machine/container)
>>
>> When I use containers everything is fine before 128 MPI processes.  I got
>> 180% and 400% performance degration with 128  and 256 MPI processes
>> respectively. I repeated again the meaures and I had statistically the same
>> results. So, I decided to generate a trace of the execution using TAU. I
>> discovered that the source of the overhead is the MPI_wait() method that
>> sometimes takes around 0.2 seconds and this happens around 20 times which
>> adds around 4 seconds to the execution time. The method is called 25992
>> times and in avarage takes between 50 and 300 usecs (values obtained with
>> profiling).
>> This strange behavior was reported in this paper[1] (page 10)  that says:
>>
>> "We can see two outstanding zones of MPI_Send and MPI_Wait. Such
>> operations typically take few microseconds to less than a millisecond. Here
>> they take 0.2 seconds"
>>
>> They attributed that strange behavior to package loss and network
>> malfunctioning. In my experiments I measured the number of packets dropped
>> and nothing unusual happened.
>> I used two versions of OpenMPI 1.6.5 and 1.8.5 and in both versions I got
>> the same strange behavior. Any clues of what could be the source of that
>> strange behavior? could you please suggest any 

Re: [OMPI users] Invalid read of size 4 (Valgrind error) with OpenMPI 1.8.7

2015-07-28 Thread Schlottke-Lakemper, Michael
Hi Ralph,

That’s what I suspected. Thank you for your confirmation.

Michael

On 25 Jul 2015, at 16:10, Ralph Castain <r...@open-mpi.org> wrote:

Looks to me like a false positive - we do malloc some space, and do access 
different parts of it. However, it looks like we are inside the space at all 
times.

I’d suppress it
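
For reference, an untested sketch of a matching suppression entry (the app
path is a placeholder; the '...' frame wildcard keeps it tolerant of inlining
differences between builds):

cat > ompi-dirpath.supp <<'EOF'
{
   ompi_opal_os_dirpath_create_read
   Memcheck:Addr4
   fun:opal_os_dirpath_create
   ...
}
EOF
valgrind --suppressions=ompi-dirpath.supp ./your_app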


On Jul 23, 2015, at 12:47 AM, Schlottke-Lakemper, Michael <m.schlottke-lakem...@aia.rwth-aachen.de> wrote:

Hi folks,

recently we’ve been getting a Valgrind error in PMPI_Init for our suite of 
regression tests:

==5922== Invalid read of size 4
==5922==at 0x61CC5C0: opal_os_dirpath_create (in 
/aia/opt/openmpi-1.8.7/lib64/libopen-pal.so.6.2.2)
==5922==by 0x5F207E5: orte_session_dir (in 
/aia/opt/openmpi-1.8.7/lib64/libopen-rte.so.7.0.6)
==5922==by 0x5F34F04: orte_ess_base_app_setup (in 
/aia/opt/openmpi-1.8.7/lib64/libopen-rte.so.7.0.6)
==5922==by 0x7E96679: rte_init (in 
/aia/opt/openmpi-1.8.7/lib64/openmpi/mca_ess_env.so)
==5922==by 0x5F12A77: orte_init (in 
/aia/opt/openmpi-1.8.7/lib64/libopen-rte.so.7.0.6)
==5922==by 0x509883C: ompi_mpi_init (in 
/aia/opt/openmpi-1.8.7/lib64/libmpi.so.1.6.2)
==5922==by 0x50B843A: PMPI_Init (in 
/aia/opt/openmpi-1.8.7/lib64/libmpi.so.1.6.2)
==5922==by 0xEBA79C: ZFS::run() (in 
/aia/r018/scratch/mic/.zfstester/.zacc_cron/zacc_cron_r9063/zfs_gnu_production)
==5922==by 0x4DC243: main (in 
/aia/r018/scratch/mic/.zfstester/.zacc_cron/zacc_cron_r9063/zfs_gnu_production)
==5922==  Address 0x710f670 is 48 bytes inside a block of size 51 alloc'd
==5922==at 0x4C29110: malloc (in 
/usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==5922==by 0x61CC572: opal_os_dirpath_create (in 
/aia/opt/openmpi-1.8.7/lib64/libopen-pal.so.6.2.2)
==5922==by 0x5F207E5: orte_session_dir (in 
/aia/opt/openmpi-1.8.7/lib64/libopen-rte.so.7.0.6)
==5922==by 0x5F34F04: orte_ess_base_app_setup (in 
/aia/opt/openmpi-1.8.7/lib64/libopen-rte.so.7.0.6)
==5922==by 0x7E96679: rte_init (in 
/aia/opt/openmpi-1.8.7/lib64/openmpi/mca_ess_env.so)
==5922==by 0x5F12A77: orte_init (in 
/aia/opt/openmpi-1.8.7/lib64/libopen-rte.so.7.0.6)
==5922==by 0x509883C: ompi_mpi_init (in 
/aia/opt/openmpi-1.8.7/lib64/libmpi.so.1.6.2)
==5922==by 0x50B843A: PMPI_Init (in 
/aia/opt/openmpi-1.8.7/lib64/libmpi.so.1.6.2)
==5922==by 0xEBA79C: ZFS::run() (in 
/aia/r018/scratch/mic/.zfstester/.zacc_cron/zacc_cron_r9063/zfs_gnu_production)
==5922==by 0x4DC243: main (in 
/aia/r018/scratch/mic/.zfstester/.zacc_cron/zacc_cron_r9063/zfs_gnu_production)
==5922==

What is weird is that it seems to depend on the PBS/Torque session we’re in: 
sometimes the error does not occur at all and all tests run fine (this is in 
fact the only Valgrind error we’re having at the moment). Other times every 
single test we’re running has this error.

Has anyone seen this or might be able to offer an explanation? If it is a 
false-positive, I’d be happy to suppress it :)

Thanks a lot in advance

Michael

P.S.: This error is not covered/suppressed by the default ompi suppression file 
in $PREFIX/share/openmpi.


--
Michael Schlottke-Lakemper

SimLab Highly Scalable Fluids & Solids Engineering
Jülich Aachen Research Alliance (JARA-HPC)
RWTH Aachen University
Wüllnerstraße 5a
52062 Aachen
Germany

Phone: +49 (241) 80 95188
Fax: +49 (241) 80 92257
Mail: 
m.schlottke-lakem...@aia.rwth-aachen.de
Web: http://www.jara.org/jara-hpc




Re: [OMPI users] Fatal Error: Cannot read module file 'mpi.mod' opened at (1), because it was created by a different version of GNU Fortran

2015-07-28 Thread Syed Ahsan Ali
Thanks Gilles

It solved my issue. Your support is much appreciated.

Ahsan
On Tue, Jul 28, 2015 at 10:15 AM, Gilles Gouaillardet  wrote:
> Hi,
>
> you can run
> zcat mpi.mod | head to confirm which gfortran was used to build the
> application
>
> GFORTRAN module version '10' => gcc 4.8.3
> GFORTRAN module version '12' => gcc 4.9.2
> GFORTRAN module version '14' => gcc 5.1.0
>
> i assume the failing command is mpifort ...
> so you can run
> mpifort -showme ...
> to see the how gfortran is invoked.
>
> it is likely mpifort simply run gfortran, and your PATH does not point to
> gfortran 4.9.2
>
> Cheers,
>
> Gilles
>
>
> On 7/28/2015 1:47 PM, Syed Ahsan Ali wrote:
>>
>> I am getting this error during installation of an application.
>> Apparently the error seems to be complaining about openmpi being
>> compiled with different version of gnu fortran but I am sure that it
>> was compiled with gcc-4.9.2. The same is also being used for
>> application compilation.
>>
>> I am using openmpi-1.8.4
>>
>> Ahsan


Re: [OMPI users] SGE segfaulting with OpenMPI 1.8.6

2015-07-28 Thread Dave Love
Ralph Castain  writes:

> I believe qrsh will execute a simple env command across the allocated nodes - 
> have you looked into that?

qrsh -inherit will run something on any node in the allocation that has
a free slot from a tightly integrated parallel environment, but I'm not
sure for various reasons that you could rely on it showing the problem
directly.
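
A hedged sketch of that kind of check, from within a running job (node042 is
a placeholder hostname; this assumes the tightly integrated PE has a free
slot there, and as noted above it may not reproduce the problem directly):

qrsh -inherit node042 env | grep -E '^(PATH|LD_LIBRARY_PATH)='
qrsh -inherit node042 which orted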

> The bottom line is that you simply are not getting the right orted on the 
> remote nodes - you are getting the old one, which doesn’t recognize the new 
> command line option that mpirun is giving.


Re: [OMPI users] File coherence issues with OpenMPI/torque/NFS (?)

2015-07-28 Thread Dave Love
Gilles Gouaillardet  writes:

> Dave,
>
> On 7/24/2015 1:53 AM, Dave Love wrote:
>> ompio in 1.8 only has pvfs2 (== orangefs) and ufs support -- which might
>> be a good reason to use pvfs2.  You'll need an expert to say if you can
>> use ufs correctly over an nfs filesystem.  (I assume you are actually
>> picking up the romio nfs support.)
>
> on my system :
> $ grep FILE_SYSTEM ./ompi/mca/io/romio/romio/config.status
> S["FILE_SYSTEM"]="testfs ufs nfs"
>
> unless i am misunderstanding, nfs is there

How is it related to ompio?  I thought they are completely separate.



Re: [OMPI users] File coherence issues with OpenMPI/torque/NFS (?)

2015-07-28 Thread Gilles Gouaillardet
You are right and I misread your comment.

Michael is using ROMIO, which is independent of ompio.

Cheers,

Gilles
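
If it helps to check or force which io component is in use, something like
this should work (a sketch; the component name can differ between releases,
e.g. romio on 1.8 vs romio314 on master, and ./my_app is a placeholder):

ompi_info | grep "MCA io"                  # list the io components that were built
mpirun --mca io romio -np 4 ./my_app       # force ROMIO
mpirun --mca io ompio -np 4 ./my_app       # force OMPIO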

On Wednesday, July 29, 2015, Dave Love  wrote:

> Gilles Gouaillardet > writes:
>
> > Dave,
> >
> > On 7/24/2015 1:53 AM, Dave Love wrote:
> >> ompio in 1.8 only has pvfs2 (== orangefs) and ufs support -- which might
> >> be a good reason to use pvfs2.  You'll need an expert to say if you can
> >> use ufs correctly over an nfs filesystem.  (I assume you are actually
> >> picking up the romio nfs support.)
> >
> > on my system :
> > $ grep FILE_SYSTEM ./ompi/mca/io/romio/romio/config.status
> > S["FILE_SYSTEM"]="testfs ufs nfs"
> >
> > unless i am misunderstanding, nfs is there
>
> How is it related to ompio?  I thought they are completely separate.
>


Re: [OMPI users] Building OpenMPI 1.8.7 on XC30

2015-07-28 Thread Erik Schnetter
Thank you for all the pointers. I was able to
build openmpi-v2.x-dev-96-g918650a without problems on Edison, and also on
other systems.

I'm circumventing the OS X warning by ignoring it via "grep -v"; the other
suggestion (--mca oob ^usock) did not work for me. I've
tried openmpi-v2.x-dev-100-g26c3f03, but it still leads to the same warning.

-erik


On Mon, Jul 27, 2015 at 10:17 AM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> Eric,
>
> these warnings are not important and you can simply ignore them.
> fwiw, this is a race condition evidenced by recent "asynchrousity".
>
> i will push a fix tomorrow.
>
> in the mean time, you can
> mpirun --mca oob ^tcp ...
> (if you run on one node only)
> or
> mpirun --mca oob ^usock
> (if you have an OS X cluster ...)
>
> Cheers,
>
> Gilles
>
> On Sunday, July 26, 2015, Erik Schnetter  wrote:
>
>> Mark
>>
>> No, it doesn't need to be 1.8.7.
>>
>> I just tried v2.x-dev-96-g918650a. This leads to run-time warnings on OS
>> X; I see messages such as
>>
>> [warn] select: Bad file descriptor
>>
>> Are these important? If not, how can I suppress them?
>>
>> -erik
>>
>>
>> On Sat, Jul 25, 2015 at 7:49 AM, Mark Santcroos <
>> mark.santcr...@rutgers.edu> wrote:
>>
>>> Hi Erik,
>>>
>>> Do you really want 1.8.7, otherwise you might want to give latest master
>>> a try. Other including myself had more luck with that on Cray's, including
>>> Edison.
>>>
>>> Mark
>>>
>>> > On 25 Jul 2015, at 1:35 , Erik Schnetter  wrote:
>>> >
>>> > I want to build OpenMPI 1.8.7 on a Cray XC30 (Edison at NERSC). I've
>>> tried various configuration options, but I am always encountering either
>>> OpenMPI build errors, application build errors, or run-time errors.
>>> >
>>> > I'm currently looking at <
>>> http://www.open-mpi.org/community/lists/users/2015/06/27230.php>, which
>>> seems to describe my case. I'm now configuring OpenMPI without any options,
>>> except setting compilers to clang/gfortran and pointing it to a self-built
>>> hwloc. For completeness, here are my configure options as recorded by
>>> config.status:
>>> >
>>> >
>>> '/project/projectdirs/m152/schnette/edison/software/src/openmpi-1.8.7/src/openmpi-1.8.7/configure'
>>> '--prefix=/project/projectdirs/m152/schnette/edison/software/openmpi-1.8.7'
>>> '--with-hwloc=/project/projectdirs/m152/schnette/edison/software/hwloc-1.11.0'
>>> '--disable-vt'
>>> 'CC=/project/projectdirs/m152/schnette/edison/software/llvm-3.6.2/bin/wrap-clang'
>>> 'CXX=/project/projectdirs/m152/schnette/edison/software/llvm-3.6.2/bin/wrap-clang++'
>>> 'FC=/project/projectdirs/m152/schnette/edison/software/gcc-5.2.0/bin/wrap-gfortran'
>>> 'CFLAGS=-I/opt/ofed/include
>>> -I/project/projectdirs/m152/schnette/edison/software/hwloc-1.11.0/include'
>>> 'CXXFLAGS=-I/opt/ofed/include
>>> -I/project/projectdirs/m152/schnette/edison/software/hwloc-1.11.0/include'
>>> 'LDFLAGS=-L/opt/ofed/lib64
>>> -L/project/projectdirs/m152/schnette/edison/software/hwloc-1.11.0/lib
>>> -Wl,-rpath,/project/projectdirs/m152/schnette/edison/software/hwloc-1.11.0/lib'
>>> 'LIBS=-lhwloc -lpthread -lpthread'
>>> '--with-wrapper-ldflags=-L/project/projectdirs/
>>>  m152/schnette/edison/software/hwloc-1.11.0/lib
>>> -Wl,-rpath,/project/projectdirs/m152/schnette/edison/software/hwloc-1.11.0/lib'
>>> '--with-wrapper-libs=-lhwloc -lpthread'
>>> >
>>> > This builds and installs fine, and works when running on a single
>>> node. However, multi-node runs are stalling: The queue starts the job, but
>>> mpirun produces no output. The "-v" option to mpirun doesn't help.
>>> >
>>> > When I use aprun instead of mpirun to start my application, then all
>>> processes think they are rank 0.
>>> >
>>> > Do you have any pointers for how to debug this?
>>> >
>>> > -erik
>>> >
>>> > --
>>> > Erik Schnetter 
>>> http://www.perimeterinstitute.ca/personal/eschnetter/
>>>
>>
>>
>>
>> --
>> Erik Schnetter 
>> http://www.perimeterinstitute.ca/personal/eschnetter/
>>
>
>



-- 
Erik Schnetter 
http://www.perimeterinstitute.ca/personal/eschnetter/


Re: [OMPI users] Building OpenMPI 1.8.7 on XC30

2015-07-28 Thread Gilles Gouaillardet

Erik,

the OS X warning (which should not be OS X specific) is fixed in 
https://github.com/open-mpi/ompi-release/pull/430

it will land in the v2.x series once reviewed;
in the meantime, feel free to apply the patch manually to the tarball

Cheers,

Gilles

On 7/29/2015 10:35 AM, Erik Schnetter wrote:
Thank you for all the pointers. I was able to 
build openmpi-v2.x-dev-96-g918650a without problems on Edison, and 
also on other systems.


I'm circumventing the OS X warning by ignoring it via "grep -v"; the 
other suggestion (--mca oob ^usock) did not work for me. I've 
tried openmpi-v2.x-dev-100-g26c3f03, but it still leads to the same 
warning.


-erik


On Mon, Jul 27, 2015 at 10:17 AM, Gilles Gouaillardet 
<gilles.gouaillar...@gmail.com> wrote:


Eric,

these warnings are not important and you can simply ignore them.
fwiw, this is a race condition evidenced by recent "asynchrousity".

i will push a fix tomorrow.

in the mean time, you can
mpirun --mca oob ^tcp ...
(if you run on one node only)
or
mpirun --mca oob ^usock
(if you have an OS X cluster ...)

Cheers,

Gilles

On Sunday, July 26, 2015, Erik Schnetter <schnet...@gmail.com> wrote:

Mark

No, it doesn't need to be 1.8.7.

I just tried v2.x-dev-96-g918650a. This leads to run-time
warnings on OS X; I see messages such as

[warn] select: Bad file descriptor

Are these important? If not, how can I suppress them?

-erik


On Sat, Jul 25, 2015 at 7:49 AM, Mark Santcroos
 wrote:

Hi Erik,

Do you really want 1.8.7, otherwise you might want to give
latest master a try. Other including myself had more luck
with that on Cray's, including Edison.

Mark

> On 25 Jul 2015, at 1:35 , Erik Schnetter
 wrote:
>
> I want to build OpenMPI 1.8.7 on a Cray XC30 (Edison at
NERSC). I've tried various configuration options, but I am
always encountering either OpenMPI build errors,
application build errors, or run-time errors.
>
> I'm currently looking at 
,
which seems to describe my case. I'm now configuring
OpenMPI without any options, except setting compilers to
clang/gfortran and pointing it to a self-built hwloc. For
completeness, here are my configure options as recorded by
config.status:
>
>

'/project/projectdirs/m152/schnette/edison/software/src/openmpi-1.8.7/src/openmpi-1.8.7/configure'

'--prefix=/project/projectdirs/m152/schnette/edison/software/openmpi-1.8.7'

'--with-hwloc=/project/projectdirs/m152/schnette/edison/software/hwloc-1.11.0'
'--disable-vt'

'CC=/project/projectdirs/m152/schnette/edison/software/llvm-3.6.2/bin/wrap-clang'

'CXX=/project/projectdirs/m152/schnette/edison/software/llvm-3.6.2/bin/wrap-clang++'

'FC=/project/projectdirs/m152/schnette/edison/software/gcc-5.2.0/bin/wrap-gfortran'
'CFLAGS=-I/opt/ofed/include

-I/project/projectdirs/m152/schnette/edison/software/hwloc-1.11.0/include'
'CXXFLAGS=-I/opt/ofed/include

-I/project/projectdirs/m152/schnette/edison/software/hwloc-1.11.0/include'
'LDFLAGS=-L/opt/ofed/lib64

-L/project/projectdirs/m152/schnette/edison/software/hwloc-1.11.0/lib

-Wl,-rpath,/project/projectdirs/m152/schnette/edison/software/hwloc-1.11.0/lib'
'LIBS=-lhwloc -lpthread -lpthread'
'--with-wrapper-ldflags=-L/project/projectdirs/
 m152/schnette/edison/software/hwloc-1.11.0/lib

-Wl,-rpath,/project/projectdirs/m152/schnette/edison/software/hwloc-1.11.0/lib'
'--with-wrapper-libs=-lhwloc -lpthread'
>
> This builds and installs fine, and works when running on
a single node. However, multi-node runs are stalling: The
queue starts the job, but mpirun produces no output. The
"-v" option to mpirun doesn't help.
>
> When I use aprun instead of mpirun to start my
application, then all processes think they are rank 0.
>
> Do you have any pointers for how to debug this?
>
> -erik
>
> --
> Erik Schnetter 
http://www.perimeterinstitute.ca/personal/eschnetter/
> ___
> users mailing list
> us...@open-mpi.org
> Subscription:
http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
http://www.open-mpi.org/community/lists/users/2015/07/2