Re: [OMPI users] Double free or corruption with OpenMPI 2.0

2017-06-14 Thread ashwin .D
Hello,
  I found a thread with Intel MPI (although I am using gfortran
4.8.5 and OpenMPI 2.1.1) -
https://software.intel.com/en-us/forums/intel-fortran-compiler-for-linux-and-mac-os-x/topic/564266
- but the error the OP gets is the same as mine:

*** glibc detected *** ./a.out: double free or corruption (!prev):
0x7fc6dc80 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3411e75e66]
/lib64/libc.so.6[0x3411e789b3]

So the explanation given in that post is this -
"From their examination our Development team concluded the underlying
problem with openmpi 1.8.6 resulted from mixing out-of-date/incompatible
Fortran RTLs. In short, there were older static Fortran RTL bodies
incorporated in the openmpi library that when mixed with newer Fortran RTL
led to the failure. They found the issue is resolved in the newer
openmpi-1.10.1rc2 and recommend resolving requires using a newer openmpi
release with our 15.0 (or newer) release." Could this be possible with my
version as well?


I am willing to debug this, provided I am given some clue on how to
approach my problem. At the moment I am unable to proceed further; the
only thing I can add is that I ran tests with the sequential form of my
application and it is much slower, although I am using shared memory and
all the cores are on the same machine.

Best regards,
Ashwin.





On Tue, Jun 13, 2017 at 5:52 PM, ashwin .D  wrote:

> Also, when I build and run "make check" I get these errors. Am I clear
> to proceed, or is my installation broken? This is on Ubuntu 16.04 LTS.
>
> ==
>Open MPI 2.1.1: test/datatype/test-suite.log
> ==
>
> # TOTAL: 9
> # PASS:  8
> # SKIP:  0
> # XFAIL: 0
> # FAIL:  1
> # XPASS: 0
> # ERROR: 0
>
> .. contents:: :depth: 2
>
> FAIL: external32
> 
>
> /home/t/openmpi-2.1.1/test/datatype/.libs/lt-external32: symbol lookup
> error: /home/openmpi-2.1.1/test/datatype/.libs/lt-external32: undefined
> symbol: ompi_datatype_pack_external_size
> FAIL external32 (exit status:
>
> On Tue, Jun 13, 2017 at 5:24 PM, ashwin .D  wrote:
>
>> Hello,
>>   I am using OpenMPI 2.0.0 with computational fluid dynamics
>> software, and I am encountering a series of errors when running it
>> with mpirun. This is my lscpu output:
>>
>> CPU(s):                4
>> On-line CPU(s) list:   0-3
>> Thread(s) per core:    2
>> Core(s) per socket:    2
>> Socket(s):             1
>>
>> and I am running OpenMPI's mpirun in the following way:
>>
>> mpirun -np 4 cfd_software
>>
>> and I get double free or corruption every single time.
>>
>> I have two questions -
>>
>> 1) I am unable to capture the standard error that mpirun throws in a
>> file. How can I go about capturing the standard error of mpirun?
>>
>> 2) Has this error, i.e. double free or corruption, been reported by
>> others? Is there a bug fix available?
>>
>> Regards,
>>
>> Ashwin.
>>
>>
>

Re: [OMPI users] Double free or corruption with OpenMPI 2.0

2017-06-14 Thread gilles
Hi,

At first, I suggest you decide which Open MPI version you want to use;
the most up-to-date versions are 2.0.3 and 2.1.1.

Then please provide all the info Jeff previously requested.

Ideally, you would write a simple and standalone program that exhibits
the issue, so we can reproduce and investigate it.
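
For illustration, a minimal sketch of such a reproducer in C is shown
below; the buffer size and the ring-exchange pattern are only
placeholders, to be replaced by whatever MPI calls your CFD code
actually performs.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, n = 100000;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *buf = calloc(n, sizeof(double));   /* placeholder work buffer */
    int right = (rank + 1) % size;             /* ring neighbours         */
    int left  = (rank + size - 1) % size;
    /* placeholder communication: a simple ring exchange */
    MPI_Sendrecv_replace(buf, n, MPI_DOUBLE, right, 0, left, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    free(buf);

    MPI_Finalize();
    return 0;
}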

If not, I suggest you use another MPI library (MVAPICH, Intel MPI, or
any MPICH-based MPI) and see if the issue is still there.

If the double free error still occurs, it is very likely that the issue
comes from your application and not from the MPI library.

If you have a parallel debugger such as Allinea DDT, you can run your
program under the debugger with thorough memory debugging enabled. The
program will halt when the memory corruption occurs, and this will be a
hint (app issue vs. MPI issue).
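
If a parallel debugger is not available, a free alternative (assuming
valgrind is installed on your system) is to run every rank under
valgrind, for example:

mpirun -np 4 valgrind ./cfd_software

valgrind's memcheck reports the stack of the offending free() together
with where the block was previously freed or allocated, although the
output can be noisy.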

If you did not configure Open MPI with --enable-debug, please do so and
try again; you will increase the likelihood of trapping such a memory
corruption error earlier, and you will get a clean Open MPI stack trace
if a crash occurs.
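
For example (the installation prefix below is only a placeholder):

./configure --prefix=$HOME/openmpi-2.1.1-debug --enable-debug
make all install

then rebuild your application against that installation and run again.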

You might also want to try

mpirun --mca btl tcp,self ...

and see if you get a different behavior. This will only use TCP for
inter-process communication, which is much easier to debug than shared
memory or RDMA.
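
With the command line from your earlier message, that would be, for
example:

mpirun --mca btl tcp,self -np 4 cfd_software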

Cheers,

Gilles


Re: [OMPI users] Double free or corruption with OpenMPI 2.0

2017-06-14 Thread Jeff Hammond
The "error *** glibc detected *** $(PROGRAM): double free or corruption" is
ubiquitous and rarely has anything to do with MPI.


As Gilles said, use a debugger to figure out why your application is
corrupting the heap.
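
For example, one common way to do that with Open MPI (assuming an X
display is available) is to start every rank under gdb in its own
terminal:

mpirun -np 4 xterm -e gdb ./cfd_software

then type "run" in each gdb window; the rank that triggers the error
will stop at the failing call.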


Jeff






-- 
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/

Re: [OMPI users] MPI_ABORT, indirect execution of executables by mpirun, Open MPI 2.1.1

2017-06-14 Thread Gilles Gouaillardet

Ted,


FWIW, the 'master' branch has the behavior you expect.


Meanwhile, you can simply edit your 'dum.sh' script and replace

/home/buildadina/src/aborttest02/aborttest02.exe

with

exec /home/buildadina/src/aborttest02/aborttest02.exe

With exec, the shell process is replaced by the executable, so the
process mpirun launched (and will clean up on MPI_ABORT) is the MPI
application itself rather than a wrapper shell.

Cheers,


Gilles


On 6/15/2017 3:01 AM, Ted Sussman wrote:

Hello,

My question concerns MPI_ABORT, indirect execution of executables by
mpirun and Open MPI 2.1.1.  When mpirun runs executables directly,
MPI_ABORT works as expected, but when mpirun runs executables
indirectly, MPI_ABORT does not work as expected.

If Open MPI 1.4.3 is used instead of Open MPI 2.1.1, MPI_ABORT works as
expected in all cases.

The examples given below have been simplified as far as possible to show
the issues.

---

Example 1

Consider an MPI job run in the following way:

mpirun ... -app addmpw1

where the appfile addmpw1 lists two executables:

-n 1 -host gulftown ... aborttest02.exe
-n 1 -host gulftown ... aborttest02.exe

The two executables are executed on the local node gulftown.
aborttest02 calls MPI_ABORT for rank 0, then sleeps.
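
A hypothetical sketch of that pattern in C (not the attached aborttest02
source) would be:

#include <mpi.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        MPI_Abort(MPI_COMM_WORLD, 1);   /* expected: mpirun kills all ranks */
    sleep(3600);                        /* the other rank just sleeps */
    MPI_Finalize();
    return 0;
}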

The above MPI job runs as expected.  Both processes immediately abort
when rank 0 calls MPI_ABORT.

---

Example 2

Now change the above example as follows:

mpirun ... -app addmpw2

where the appfile addmpw2 lists shell scripts:

-n 1 -host gulftown ... dum.sh
-n 1 -host gulftown ... dum.sh

dum.sh invokes aborttest02.exe.  So aborttest02.exe is executed
indirectly by mpirun.

In this case, the MPI job only aborts process 0 when rank 0 calls
MPI_ABORT.  Process 1 continues to run.  This behavior is unexpected.



I have attached all files to this E-mail.  Since there are absolute
pathnames in the files, to reproduce my findings, you will need to
update the pathnames in the appfiles and shell scripts.  To run
example 1,

sh run1.sh

and to run example 2,

sh run2.sh

---

I have tested these examples with Open MPI 1.4.3 and 2.0.3.  In Open MPI
1.4.3, both examples work as expected.  Open MPI 2.0.3 has the same
behavior as Open MPI 2.1.1.

---

I would prefer that Open MPI 2.1.1 aborts both processes, even when the
executables are invoked indirectly by mpirun.  If there is an MCA
setting that is needed to make Open MPI 2.1.1 abort both processes,
please let me know.


Sincerely,

Theodore Sussman


The following sections of this message contain file attachments
prepared for transmission using the Internet MIME message format.
If you are using Pegasus Mail, or any other MIME-compliant system,
you should be able to save them or view them from within your mailer.
If you cannot, please ask your system administrator for assistance.

 File information ---
  File:  config.log.bz2
  Date:  14 Jun 2017, 13:35
  Size:  146548 bytes.
  Type:  Binary

 File information ---
  File:  ompi_info.bz2
  Date:  14 Jun 2017, 13:35
  Size:  24088 bytes.
  Type:  Binary

 File information ---
  File:  aborttest02.tgz
  Date:  14 Jun 2017, 13:52
  Size:  4285 bytes.
  Type:  Binary



___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users