Re: [OMPI users] Double free or corruption problem updated result

2017-06-19 Thread Gilles Gouaillardet

Ashwin,


the valgrind logs clearly indicate you are trying to access some memory 
that was already free'd



for example

[1,0]:==4683== Invalid read of size 4
[1,0]:==4683==    at 0x795DC2: __src_input_MOD_organize_input (src_input.f90:2318)
[1,0]:==4683==  Address 0xb4001d0 is 0 bytes inside a block of size 24 free'd
[1,0]:==4683==    by 0x63F3690: free_NC_var (in /usr/local/lib/libnetcdf.so.11.0.3)
[1,0]:==4683==    by 0x63BB431: nc_close (in /usr/local/lib/libnetcdf.so.11.0.3)
[1,0]:==4683==    by 0x435A9F: __io_utilities_MOD_close_file (io_utilities.f90:995)
[1,0]:==4683==  Block was alloc'd at
[1,0]:==4683==    by 0x63F378C: new_x_NC_var (in /usr/local/lib/libnetcdf.so.11.0.3)
[1,0]:==4683==    by 0x63BAF85: nc_open (in /usr/local/lib/libnetcdf.so.11.0.3)
[1,0]:==4683==    by 0x547E6F6: nf_open_ (nf_control.F90:189)

so the double-free error could be a side effect of this.
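
For reference, this is the classic use-after-free pattern. A minimal C sketch
(purely illustrative, not taken from COSMO or the netCDF library) that produces
the same kind of valgrind report:

    #include <stdlib.h>
    #include <stdio.h>

    int main(void)
    {
        int *block = malloc(6 * sizeof(int));   /* "Block was alloc'd at ..." */
        block[0] = 42;
        free(block);                            /* "... inside a block of size 24 free'd" */
        printf("%d\n", block[0]);               /* valgrind: "Invalid read of size 4" */
        return 0;
    }

In the COSMO trace the same roles are played by nc_open (allocation), nc_close
called from close_file (free), and the later read in organize_input.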

at this stage, i suggest you fix your application, and see if that
resolves your issue.

(e.g. there is no need to try another MPI library and/or version for now)

Cheers,

Gilles

On 6/18/2017 2:41 PM, ashwin .D wrote:

Hello Gilles,
   First of all, I am extremely grateful for this communication from
you on a weekend, and that too only a few hours after I posted my email. I am
not sure I can go on posting log files, as you rightly point out that MPI is
not the source of the problem; still, I have enclosed the valgrind log files as
you requested. I have downloaded the MPICH packages as you suggested and am
going to install them shortly. But before I do that, I think I have a clue
about the source of my problem (double free or corruption) and would really
appreciate your advice.
As I mentioned before, COSMO has been compiled with mpif90 for shared-memory
usage and with gfortran for sequential access. But it depends on a lot of
external third-party software such as zlib, libcurl, hdf5, netcdf and
netcdf-fortran. When I looked at the config.log of those packages, all of them
had been compiled with gfortran and gcc (and in some cases g++) with the
enable-shared option. So my question is: could that be a source of the
"mismatch"?

In other words, I would have to recompile all those packages with mpif90 and
mpicc and then try another test. At the very least there should be no mixing of
gcc/gfortran-compiled code with mpif90-compiled code. Comments?
Best regards,
Ashwin.
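
One quick way to check for an actual compiler mismatch (assuming Open MPI's
wrapper compilers are in use) is to ask the wrappers what they really invoke;
Open MPI's mpif90 and mpicc accept the --showme option:

    mpif90 --showme    # prints the underlying gfortran command line plus the MPI flags
    mpicc  --showme    # same for the C wrapper

If the wrappers report the same gcc/gfortran that built zlib, hdf5, netcdf and
netcdf-fortran, then the same compilers are being used throughout, since the
wrappers only add the MPI include and link flags.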

>Ashwin,

>did you try to run your app with an MPICH-based library (mvapich,
>IntelMPI or even stock mpich)?
>or did you try with Open MPI v1.10 ?
>the stacktrace does not indicate the double free occurs in MPI...
>it seems you ran valgrind against a shell and not against your binary.
>assuming your mpirun command is
>mpirun lmparbin_all
>i suggest you try again with
>mpirun --tag-output valgrind lmparbin_all
>that will generate one valgrind log per task, but these are prefixed
>so it should be easier to figure out what is going wrong

>Cheers,

>Gilles


On Sun, Jun 18, 2017 at 11:41 AM, ashwin .D <winas...@gmail.com> wrote:
> There is a sequential version of the same program COSMO (no reference to
> MPI) that I can run without any problems. Of course it takes a lot longer to
> complete. Now I also ran valgrind (not sure whether that is useful or not)
> and I have enclosed the logs.

On Sun, Jun 18, 2017 at 8:11 AM, ashwin .D wrote:


There is a sequential version of the same program COSMO (no
reference to MPI) that I can run without any problems. Of course
it takes a lot longer to complete. Now I also ran valgrind (not
sure whether that is useful or not) and I have enclosed the logs.

On Sat, Jun 17, 2017 at 7:20 PM, ashwin .D <winas...@gmail.com> wrote:

Hello Gilles,
   I am enclosing all the information you
requested.

1) As an attachment I enclose the log file.
2) I did rebuild OpenMPI 2.1.1 with the --enable-debug feature
and reinstalled it in /usr/lib/local.
I ran all the examples in the examples directory. All passed
except oshmem_strided_puts, where I got this message:

[[48654,1],0][pshmem_iput.c:70:pshmem_short_iput] Target PE #1
is not in valid range

--
SHMEM_ABORT was invoked on rank 0 (pid 13409,
host=a-Vostro-3800) with errorcode -1.

--


3) I deleted all old OpenMPI versions under /usr/local/lib.
4) I am using the COSMO weather model (http://www.cosmo-model.org/)
to run simulations.
The support staff claim they have seen no errors with a
similar setup. They use

1) gfortran 4.8.5
2) OpenMPI 1.10.1

The only difference is I use OpenMPI 2.1.1.

5) I did also try this option: mpirun --mca btl tcp,self -np 4 cosmo,
and I got the same

Re: [OMPI users] MPI_ABORT, indirect execution of executables by mpirun, Open MPI 2.1.1

2017-06-19 Thread Ted Sussman
Hello,

I have rebuilt Open MPI 2.1.1 on the same computer, including --enable-debug.

I have attached the abort test program aborttest10.tgz.  This version sleeps 
for 5 sec before
calling MPI_ABORT, so that I can check the pids using ps.

This is what happens (see run2.sh.out).

Open MPI invokes two instances of dum.sh.  Each instance of dum.sh invokes 
aborttest.exe.

Pid    Process
------------------
19565  dum.sh
19566  dum.sh
19567  aborttest10.exe
19568  aborttest10.exe

When MPI_ABORT is called, Open MPI sends SIGCONT, SIGTERM and SIGKILL to both
instances of dum.sh (pids 19565 and 19566).

ps shows that both the shell processes vanish, and that one of the 
aborttest10.exe processes
vanishes.  But the other aborttest10.exe remains and continues until it is 
finished sleeping.
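
For readers without the attachment, here is a minimal sketch of what such an
abort test might look like (an assumption on my part; the real aborttest10.tgz
is not reproduced here), with dum.sh being a plain wrapper script that simply
runs the executable:

    /* aborttest.c -- sketch only */
    #include <mpi.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("rank %d has pid %d\n", rank, (int)getpid());
        sleep(5);                           /* long enough to inspect the pids with ps */
        if (rank == 0)
            MPI_Abort(MPI_COMM_WORLD, 1);   /* only rank 0 aborts */
        MPI_Finalize();
        return 0;
    }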

Hope that this information is useful.

Sincerely,

Ted Sussman



On 19 Jun 2017 at 23:06, gil...@rist.or.jp wrote:

>
>  Ted,
>  
> some traces are missing because you did not configure with --enable-debug
> i am afraid you have to do it (and you probably want to install that debug
> version in another location since its performance is not good for production)
> in order to get all the logs.
>  
> Cheers,
>  
> Gilles
>  
> - Original Message -
> Hello Gilles,
>
> I retried my example, with the same results as I observed before.  The 
> process with rank 1
> does not get killed by MPI_ABORT.
>
> I have attached to this E-mail:
>
>   config.log.bz2
>   ompi_info.bz2  (uses ompi_info -a)
>   aborttest09.tgz
>
> This testing is done on a computer running Linux 3.10.0.  This is a 
> different computer than
> the computer that I previously used for testing.  You can confirm that I 
> am using Open MPI
> 2.1.1.
>
> tar xvzf aborttest09.tgz
> cd aborttest09
> ./sh run2.sh
>
> run2.sh contains the command
>
> /opt/openmpi-2.1.1-GNU/bin/mpirun -np 2 -mca btl tcp,self --mca 
> odls_base_verbose 10
> ./dum.sh
>
> The output from this run is in aborttest09/run2.sh.out.
>
> The output shows that the "default" component is selected by odls.
>
> The only messages from odls are: odls: launch spawning child ...  (two 
> messages). There
> are no messages from odls with "kill" and I see no SENDING SIGCONT / 
> SIGKILL
> messages.
>
> I am not running from within any batch manager.
>
> Sincerely,
>
> Ted Sussman
>
> On 17 Jun 2017 at 16:02, gil...@rist.or.jp wrote:
>
> > Ted,
> >
> > i do not observe the same behavior you describe with Open MPI 2.1.1
> >
> > # mpirun -np 2 -mca btl tcp,self --mca odls_base_verbose 5 ./abort.sh
> >
> > abort.sh 31361 launching abort
> > abort.sh 31362 launching abort
> > I am rank 0 with pid 31363
> > I am rank 1 with pid 31364
> > 
> > --
> > MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> > with errorcode 1.
> >
> > NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> > You may or may not see output from other processes, depending on
> > exactly when Open MPI kills them.
> > 
> > --
> > [linux:31356] [[18199,0],0] odls:kill_local_proc working on WILDCARD
> > [linux:31356] [[18199,0],0] odls:kill_local_proc checking child process
> > [[18199,1],0]
> > [linux:31356] [[18199,0],0] SENDING SIGCONT TO [[18199,1],0]
> > [linux:31356] [[18199,0],0] odls:default:SENT KILL 18 TO PID 31361
> > SUCCESS
> > [linux:31356] [[18199,0],0] odls:kill_local_proc checking child process
> > [[18199,1],1]
> > [linux:31356] [[18199,0],0] SENDING SIGCONT TO [[18199,1],1]
> > [linux:31356] [[18199,0],0] odls:default:SENT KILL 18 TO PID 31362
> > SUCCESS
> > [linux:31356] [[18199,0],0] SENDING SIGTERM TO [[18199,1],0]
> > [linux:31356] [[18199,0],0] odls:default:SENT KILL 15 TO PID 31361
> > SUCCESS
> > [linux:31356] [[18199,0],0] SENDING SIGTERM TO [[18199,1],1]
> > [linux:31356] [[18199,0],0] odls:default:SENT KILL 15 TO PID 31362
> > SUCCESS
> > [linux:31356] [[18199,0],0] SENDING SIGKILL TO [[18199,1],0]
> > [linux:31356] [[18199,0],0] odls:default:SENT KILL 9 TO PID 31361
> > SUCCESS
> > [linux:31356] [[18199,0],0] SENDING SIGKILL TO [[18199,1],1]
> > [linux:31356] [[18199,0],0] odls:default:SENT KILL 9 TO PID 31362
> > SUCCESS
> > [linux:31356] [[18199,0],0] odls:kill_local_proc working on WILDCARD
> > [linux:31356] [[18199,0],0] odls:kill_local_proc checking child process
> > [[18199,1],0]
> > [linux:31356] [[18199,0],0] odls:kill_local_proc child [[18199,1],0] is
> > not alive
> > [linux:31356] [[18199,0],0] odls:kill_local_proc checking child process
> > [[18199,1],1]
> > [linux:31356] [[18199,0],0] odls:kill_local_proc child [[18199,

Re: [OMPI users] MPI_ABORT, indirect execution of executables by mpirun, Open MPI 2.1.1

2017-06-19 Thread Ted Sussman
If I replace the sleep with an infinite loop, I get the same behavior.  One 
"aborttest" process 
remains after all the signals are sent.

On 19 Jun 2017 at 10:10, r...@open-mpi.org wrote:

> 
> That is typical behavior when you throw something into "sleep" - not much we 
> can do about it, I 
> think.
> 
> On Jun 19, 2017, at 9:58 AM, Ted Sussman  wrote:
> 
> Hello,
> 
> I have rebuilt Open MPI 2.1.1 on the same computer, including 
> --enable-debug.
> 
> I have attached the abort test program aborttest10.tgz.  This version 
> sleeps for 5 sec before
> calling MPI_ABORT, so that I can check the pids using ps.
> 
> This is what happens (see run2.sh.out).
> 
> Open MPI invokes two instances of dum.sh.  Each instance of dum.sh 
> invokes aborttest.exe.
> 
> Pid    Process
> ---
> 19565  dum.sh
> 19566  dum.sh
> 19567 aborttest10.exe
> 19568 aborttest10.exe
> 
> When MPI_ABORT is called, Open MPI sends SIGCONT, SIGTERM and SIGKILL to 
> both
> instances of dum.sh (pids 19565 and 19566).
> 
> ps shows that both the shell processes vanish, and that one of the 
> aborttest10.exe processes
> vanishes.  But the other aborttest10.exe remains and continues until it 
> is finished sleeping.
> 
> Hope that this information is useful.
> 
> Sincerely,
> 
> Ted Sussman
> 
> 
> 
> On 19 Jun 2017 at 23:06,  gil...@rist.or.jp  wrote:
> 
> 
>  Ted,
>  
> some traces are missing  because you did not configure with --enable-debug
> i am afraid you have to do it (and you probably want to install that 
> debug version in an 
> other
> location since its performances are not good for production) in order to 
> get all the logs.
>  
> Cheers,
>  
> Gilles
>  
> - Original Message -
>    Hello Gilles,
> 
>    I retried my example, with the same results as I observed before.  The 
> process with rank 
> 1
>    does not get killed by MPI_ABORT.
> 
>    I have attached to this E-mail:
> 
>  config.log.bz2
>  ompi_info.bz2  (uses ompi_info -a)
>  aborttest09.tgz
> 
>    This testing is done on a computer running Linux 3.10.0.  This is a 
> different computer 
> than
>    the computer that I previously used for testing.  You can confirm that 
> I am using Open 
> MPI
>    2.1.1.
> 
>    tar xvzf aborttest09.tgz
>    cd aborttest09
>    ./sh run2.sh
> 
>    run2.sh contains the command
> 
>    /opt/openmpi-2.1.1-GNU/bin/mpirun -np 2 -mca btl tcp,self --mca 
> odls_base_verbose 
> 10
>    ./dum.sh
> 
>    The output from this run is in aborttest09/run2.sh.out.
> 
>    The output shows that the "default" component is selected by odls.
> 
>    The only messages from odls are: odls: launch spawning child ...  (two 
> messages). 
> There
>    are no messages from odls with "kill" and I see no SENDING SIGCONT / 
> SIGKILL
>    messages.
> 
>    I am not running from within any batch manager.
> 
>    Sincerely,
> 
>    Ted Sussman
> 
>    On 17 Jun 2017 at 16:02, gil...@rist.or.jp wrote:
> 
> Ted,
> 
> i do not observe the same behavior you describe with Open MPI 2.1.1
> 
> # mpirun -np 2 -mca btl tcp,self --mca odls_base_verbose 5 ./abort.sh
> 
> abort.sh 31361 launching abort
> abort.sh 31362 launching abort
> I am rank 0 with pid 31363
> I am rank 1 with pid 31364
> 
> --
> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> with errorcode 1.
> 
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> 
> --
> [linux:31356] [[18199,0],0] odls:kill_local_proc working on WILDCARD
> [linux:31356] [[18199,0],0] odls:kill_local_proc checking child process
> [[18199,1],0]
> [linux:31356] [[18199,0],0] SENDING SIGCONT TO [[18199,1],0]
> [linux:31356] [[18199,0],0] odls:default:SENT KILL 18 TO PID 31361
> SUCCESS
> [linux:31356] [[18199,0],0] odls:kill_local_proc checking child process
> [[18199,1],1]
> [linux:31356] [[18199,0],0] SENDING SIGCONT TO [[18199,1],1]
> [linux:31356] [[18199,0],0] odls:default:SENT KILL 18 TO PID 31362
> SUCCESS
> [linux:31356] [[18199,0],0] SENDING SIGTERM TO [[18199,1],0]
> [linux:31356] [[18199,0],0] odls:default:SENT KILL 15 TO PID 31361
> SUCCESS
> [linux:31356] [[18199,0],0] SENDING SIGTERM TO [[18199,1],1]
> [linux:31356] [[18199,0],0] odls:default:SENT KILL 15 TO PID 31362
> SUCCESS
> [linux:31356] 

Re: [OMPI users] MPI_ABORT, indirect execution of executables by mpirun, Open MPI 2.1.1

2017-06-19 Thread r...@open-mpi.org
When you fork that process off, do you set its process group? Or is it in the 
same process group as the shell script?
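
One way to check this from the shell (using the example pids from the ps
listing above) is to compare the process-group IDs of the wrapper script and
the application:

    ps -o pid,pgid,ppid,comm -p 19566,19568

If dum.sh and aborttest10.exe report the same PGID, a signal sent to the whole
process group would reach both, while a signal sent only to the shell's pid
would not reach the application.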

> On Jun 19, 2017, at 10:19 AM, Ted Sussman  wrote:
> 
> If I replace the sleep with an infinite loop, I get the same behavior.  One 
> "aborttest" process 
> remains after all the signals are sent.
> 
> On 19 Jun 2017 at 10:10, r...@open-mpi.org wrote:
> 
>> 
>> That is typical behavior when you throw something into "sleep" - not much we 
>> can do about it, I 
>> think.
>> 
>>On Jun 19, 2017, at 9:58 AM, Ted Sussman  wrote:
>> 
>>Hello,
>> 
>>I have rebuilt Open MPI 2.1.1 on the same computer, including 
>> --enable-debug.
>> 
>>I have attached the abort test program aborttest10.tgz.  This version 
>> sleeps for 5 sec before
>>calling MPI_ABORT, so that I can check the pids using ps.
>> 
>>This is what happens (see run2.sh.out).
>> 
>>Open MPI invokes two instances of dum.sh.  Each instance of dum.sh 
>> invokes aborttest.exe.
>> 
>>Pid    Process
>>------------------
>>19565  dum.sh
>>19566  dum.sh
>>19567 aborttest10.exe
>>19568 aborttest10.exe
>> 
>>When MPI_ABORT is called, Open MPI sends SIGCONT, SIGTERM and SIGKILL to 
>> both
>>instances of dum.sh (pids 19565 and 19566).
>> 
>>ps shows that both the shell processes vanish, and that one of the 
>> aborttest10.exe processes
>>vanishes.  But the other aborttest10.exe remains and continues until it 
>> is finished sleeping.
>> 
>>Hope that this information is useful.
>> 
>>Sincerely,
>> 
>>Ted Sussman
>> 
>> 
>> 
>>On 19 Jun 2017 at 23:06,  gil...@rist.or.jp  wrote:
>> 
>> 
>> Ted,
>> 
>>some traces are missing  because you did not configure with --enable-debug
>>i am afraid you have to do it (and you probably want to install that 
>> debug version in an 
>>other
>>location since its performances are not good for production) in order to 
>> get all the logs.
>> 
>>Cheers,
>> 
>>Gilles
>> 
>>- Original Message -
>>   Hello Gilles,
>> 
>>   I retried my example, with the same results as I observed before.  The 
>> process with rank 
>>1
>>   does not get killed by MPI_ABORT.
>> 
>>   I have attached to this E-mail:
>> 
>> config.log.bz2
>> ompi_info.bz2  (uses ompi_info -a)
>> aborttest09.tgz
>> 
>>   This testing is done on a computer running Linux 3.10.0.  This is a 
>> different computer 
>>than
>>   the computer that I previously used for testing.  You can confirm that 
>> I am using Open 
>>MPI
>>   2.1.1.
>> 
>>   tar xvzf aborttest09.tgz
>>   cd aborttest09
>>   ./sh run2.sh
>> 
>>   run2.sh contains the command
>> 
>>   /opt/openmpi-2.1.1-GNU/bin/mpirun -np 2 -mca btl tcp,self --mca 
>> odls_base_verbose 
>>10
>>   ./dum.sh
>> 
>>   The output from this run is in aborttest09/run2.sh.out.
>> 
>>   The output shows that the "default" component is selected by odls.
>> 
>>   The only messages from odls are: odls: launch spawning child ...  (two 
>> messages). 
>>There
>>   are no messages from odls with "kill" and I see no SENDING SIGCONT / 
>> SIGKILL
>>   messages.
>> 
>>   I am not running from within any batch manager.
>> 
>>   Sincerely,
>> 
>>   Ted Sussman
>> 
>>   On 17 Jun 2017 at 16:02, gil...@rist.or.jp wrote:
>> 
>>Ted,
>> 
>>i do not observe the same behavior you describe with Open MPI 2.1.1
>> 
>># mpirun -np 2 -mca btl tcp,self --mca odls_base_verbose 5 ./abort.sh
>> 
>>abort.sh 31361 launching abort
>>abort.sh 31362 launching abort
>>I am rank 0 with pid 31363
>>I am rank 1 with pid 31364
>>
>>--
>>MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>>with errorcode 1.
>> 
>>NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>You may or may not see output from other processes, depending on
>>exactly when Open MPI kills them.
>>
>>--
>>[linux:31356] [[18199,0],0] odls:kill_local_proc working on WILDCARD
>>[linux:31356] [[18199,0],0] odls:kill_local_proc checking child process
>>[[18199,1],0]
>>[linux:31356] [[18199,0],0] SENDING SIGCONT TO [[18199,1],0]
>>[linux:31356] [[18199,0],0] odls:default:SENT KILL 18 TO PID 31361
>>SUCCESS
>>[linux:31356] [[18199,0],0] odls:kill_local_proc checking child process
>>[[18199,1],1]
>>[linux:31356] [[18199,0],0] SENDING SIGCONT TO [[18199,1],1]
>>[linux:31356] [[18199,0],0] odls:default:SENT KILL 18 TO PID 31362
>>SUCCESS
>>[linux:31356] [[18199,0],0] SENDING SIGTERM TO [[18199,1],0]
>>[linux:31356] [[18199,0],0] odls:default:SENT KILL 15 TO PID 31361
>>SUCCESS
>>[linux:31356] [[18199,0],0] SENDING SIGTERM TO [[18199,

Re: [OMPI users] MPI_ABORT, indirect execution of executables by mpirun, Open MPI 2.1.1

2017-06-19 Thread Ted Sussman
I don't do any setting of process groups.  dum.sh just invokes the executable:

//aborttest10.exe
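
As an experiment (not something prescribed in this thread), two small variants
of dum.sh change how those signals land; the ./aborttest10.exe path below is
assumed, since the original post shows a truncated path:

    # Variant 1: replace the shell with the application, so the pid that
    # Open MPI signals (the dum.sh pid) becomes the application itself.
    exec ./aborttest10.exe

    # Variant 2: run the application in its own session / process group.
    setsid ./aborttest10.exe

With the exec variant, the SIGTERM/SIGKILL that the odls log shows being sent
to the dum.sh pids would be delivered directly to aborttest10.exe.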


On 19 Jun 2017 at 10:30, r...@open-mpi.org wrote:

> When you fork that process off, do you set its process group? Or is it in the 
> same process group as the shell script?
> 
> > On Jun 19, 2017, at 10:19 AM, Ted Sussman  wrote:
> > 
> > If I replace the sleep with an infinite loop, I get the same behavior.  One 
> > "aborttest" process 
> > remains after all the signals are sent.
> > 
> > On 19 Jun 2017 at 10:10, r...@open-mpi.org wrote:
> > 
> >> 
> >> That is typical behavior when you throw something into "sleep" - not much 
> >> we can do about it, I 
> >> think.
> >> 
> >>On Jun 19, 2017, at 9:58 AM, Ted Sussman  wrote:
> >> 
> >>Hello,
> >> 
> >>I have rebuilt Open MPI 2.1.1 on the same computer, including 
> >> --enable-debug.
> >> 
> >>I have attached the abort test program aborttest10.tgz.  This version 
> >> sleeps for 5 sec before
> >>calling MPI_ABORT, so that I can check the pids using ps.
> >> 
> >>This is what happens (see run2.sh.out).
> >> 
> >>Open MPI invokes two instances of dum.sh.  Each instance of dum.sh 
> >> invokes aborttest.exe.
> >> 
> >>Pid    Process
> >>------------------
> >>19565  dum.sh
> >>19566  dum.sh
> >>19567 aborttest10.exe
> >>19568 aborttest10.exe
> >> 
> >>When MPI_ABORT is called, Open MPI sends SIGCONT, SIGTERM and SIGKILL 
> >> to both
> >>instances of dum.sh (pids 19565 and 19566).
> >> 
> >>ps shows that both the shell processes vanish, and that one of the 
> >> aborttest10.exe processes
> >>vanishes.  But the other aborttest10.exe remains and continues until it 
> >> is finished sleeping.
> >> 
> >>Hope that this information is useful.
> >> 
> >>Sincerely,
> >> 
> >>Ted Sussman
> >> 
> >> 
> >> 
> >>On 19 Jun 2017 at 23:06,  gil...@rist.or.jp  wrote:
> >> 
> >> 
> >> Ted,
> >> 
> >>some traces are missing  because you did not configure with 
> >> --enable-debug
> >>i am afraid you have to do it (and you probably want to install that 
> >> debug version in an 
> >>other
> >>location since its performances are not good for production) in order 
> >> to get all the logs.
> >> 
> >>Cheers,
> >> 
> >>Gilles
> >> 
> >>- Original Message -
> >>   Hello Gilles,
> >> 
> >>   I retried my example, with the same results as I observed before.  
> >> The process with rank 
> >>1
> >>   does not get killed by MPI_ABORT.
> >> 
> >>   I have attached to this E-mail:
> >> 
> >> config.log.bz2
> >> ompi_info.bz2  (uses ompi_info -a)
> >> aborttest09.tgz
> >> 
> >>   This testing is done on a computer running Linux 3.10.0.  This is a 
> >> different computer 
> >>than
> >>   the computer that I previously used for testing.  You can confirm 
> >> that I am using Open 
> >>MPI
> >>   2.1.1.
> >> 
> >>   tar xvzf aborttest09.tgz
> >>   cd aborttest09
> >>   ./sh run2.sh
> >> 
> >>   run2.sh contains the command
> >> 
> >>   /opt/openmpi-2.1.1-GNU/bin/mpirun -np 2 -mca btl tcp,self --mca 
> >> odls_base_verbose 
> >>10
> >>   ./dum.sh
> >> 
> >>   The output from this run is in aborttest09/run2.sh.out.
> >> 
> >>   The output shows that the the "default" component is selected by 
> >> odls.
> >> 
> >>   The only messages from odls are: odls: launch spawning child ...  
> >> (two messages). 
> >>There
> >>   are no messages from odls with "kill" and I see no SENDING SIGCONT / 
> >> SIGKILL
> >>   messages.
> >> 
> >>   I am not running from within any batch manager.
> >> 
> >>   Sincerely,
> >> 
> >>   Ted Sussman
> >> 
> >>   On 17 Jun 2017 at 16:02, gil...@rist.or.jp wrote:
> >> 
> >>Ted,
> >> 
> >>i do not observe the same behavior you describe with Open MPI 2.1.1
> >> 
> >># mpirun -np 2 -mca btl tcp,self --mca odls_base_verbose 5 ./abort.sh
> >> 
> >>abort.sh 31361 launching abort
> >>abort.sh 31362 launching abort
> >>I am rank 0 with pid 31363
> >>I am rank 1 with pid 31364
> >>
> >>--
> >>MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> >>with errorcode 1.
> >> 
> >>NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> >>You may or may not see output from other processes, depending on
> >>exactly when Open MPI kills them.
> >>
> >>--
> >>[linux:31356] [[18199,0],0] odls:kill_local_proc working on WILDCARD
> >>[linux:31356] [[18199,0],0] odls:kill_local_proc checking child process
> >>[[18199,1],0]
> >>[linux:31356] [[18199,0],0] SENDING SIGCONT TO [[18199,1],0]
> >>[linux:31356] [[18199,0],0] odls:default:SENT KILL 18 TO PID 31361
> >>SUCCESS

[OMPI users] Crash in libopen-pal.so

2017-06-19 Thread Justin Luitjens
I have an application that works on other systems, but on the system I'm
currently running on I'm seeing the following crash:

[dt04:22457] *** Process received signal ***
[dt04:22457] Signal: Segmentation fault (11)
[dt04:22457] Signal code: Address not mapped (1)
[dt04:22457] Failing at address: 0x6a1da250
[dt04:22457] [ 0] /lib64/libpthread.so.0(+0xf370)[0x2b353370]
[dt04:22457] [ 1] 
/home/jluitjens/libs/openmpi/lib/libopen-pal.so.13(opal_memory_ptmalloc2_int_free+0x50)[0x2cbcf810]
[dt04:22457] [ 2] 
/home/jluitjens/libs/openmpi/lib/libopen-pal.so.13(opal_memory_ptmalloc2_free+0x9b)[0x2cbcff3b]
[dt04:22457] [ 3] ./hacc_tpm[0x42f068]
[dt04:22457] [ 4] ./hacc_tpm[0x42f231]
[dt04:22457] [ 5] ./hacc_tpm[0x40f64d]
[dt04:22457] [ 6] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2c30db35]
[dt04:22457] [ 7] ./hacc_tpm[0x4115cf]
[dt04:22457] *** End of error message ***


This app is a CUDA app but doesn't use GPU direct so that should be irrelevant.

I'm building with gcc/5.3.0, cuda/8.0.44 and openmpi/1.10.7.

I'm using this on CentOS 7 and am using a vanilla MPI configure line:
./configure --prefix=/home/jluitjens/libs/openmpi/

Currently I'm trying to do this with just a single MPI process but multiple MPI 
processes fail in the same way:

mpirun  --oversubscribe -np 1 ./command

What is odd is that the crash occurs around the same spot in the code, but not
consistently at the same spot. The spot in the code where the single thread is
at the time of the crash is nowhere near MPI code; the code that is crashing is
just using malloc to allocate some memory. This makes me think the crash is due
to a thread outside of the application I'm working on (perhaps in OpenMPI
itself), or perhaps due to OpenMPI hijacking malloc/free.

Does anyone have any ideas of what I could try to work around this issue?

Thanks,
Justin












---
This email message is for the sole use of the intended recipient(s) and may 
contain
confidential information.  Any unauthorized review, use, disclosure or 
distribution
is prohibited.  If you are not the intended recipient, please contact the 
sender by
reply email and destroy all copies of the original message.
---
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] Crash in libopen-pal.so

2017-06-19 Thread Sylvain Jeaugey
Justin, can you try setting mpi_leave_pinned to 0 to disable 
libptmalloc2 and confirm this is related to ptmalloc ?
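
For example (reusing the mpirun line from the original report):

    mpirun --mca mpi_leave_pinned 0 --oversubscribe -np 1 ./command

or, equivalently, via the environment:

    export OMPI_MCA_mpi_leave_pinned=0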


Thanks,
Sylvain

On 06/19/2017 03:05 PM, Justin Luitjens wrote:


I have an application that works on other systems but on the current 
system I’m running I’m seeing the following crash:


[dt04:22457] *** Process received signal ***

[dt04:22457] Signal: Segmentation fault (11)

[dt04:22457] Signal code: Address not mapped (1)

[dt04:22457] Failing at address: 0x6a1da250

[dt04:22457] [ 0] /lib64/libpthread.so.0(+0xf370)[0x2b353370]

[dt04:22457] [ 1] 
/home/jluitjens/libs/openmpi/lib/libopen-pal.so.13(opal_memory_ptmalloc2_int_free+0x50)[0x2cbcf810]


[dt04:22457] [ 2] 
/home/jluitjens/libs/openmpi/lib/libopen-pal.so.13(opal_memory_ptmalloc2_free+0x9b)[0x2cbcff3b]


[dt04:22457] [ 3] ./hacc_tpm[0x42f068]

[dt04:22457] [ 4] ./hacc_tpm[0x42f231]

[dt04:22457] [ 5] ./hacc_tpm[0x40f64d]

[dt04:22457] [ 6] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2c30db35]

[dt04:22457] [ 7] ./hacc_tpm[0x4115cf]

[dt04:22457] *** End of error message ***

This app is a CUDA app but doesn’t use GPU direct so that should be 
irrelevant.


I’m building with gcc/5.3.0  cuda/8.0.44 openmpi/1.10.7

I’m using this on centos 7 and am using a vanilla MPI configure line:  
./configure --prefix=/home/jluitjens/libs/openmpi/


Currently I’m trying to do this with just a single MPI process but 
multiple MPI processes fail in the same way:


mpirun  --oversubscribe -np 1 ./command

What is odd is the crash occurs around the same spot in the code but 
not consistently at the same spot. The spot in the code where the 
single thread is at the time of the crash is nowhere near MPI code. 
 The code where it is crashing is just using malloc to allocate some 
memory. This makes me think the crash is due to a thread outside of 
the application I’m working on (perhaps in OpenMPI itself) or perhaps 
due to openmpi hijacking malloc/free.


Does anyone have any ideas of what I could try to work around this issue?

Thanks,

Justin


This email message is for the sole use of the intended recipient(s) 
and may contain confidential information.  Any unauthorized review, 
use, disclosure or distribution is prohibited.  If you are not the 
intended recipient, please contact the sender by reply email and 
destroy all copies of the original message.






___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] Crash in libopen-pal.so

2017-06-19 Thread Dmitry N. Mikushin
Hi Justin,

If you can build the application in debug mode, try inserting valgrind into
your MPI command. It's usually very good at tracking down the origins of
failing memory allocations.
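
For example, one possible invocation (reusing the command line from the
original report; --log-file is a standard valgrind option and %p expands to
the pid, giving one log per MPI process):

    mpirun --oversubscribe -np 1 valgrind --log-file=vg.%p.log ./command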

Kind regards,
- Dmitry.


2017-06-20 1:10 GMT+03:00 Sylvain Jeaugey :

> Justin, can you try setting mpi_leave_pinned to 0 to disable libptmalloc2
> and confirm this is related to ptmalloc ?
>
> Thanks,
> Sylvain
> On 06/19/2017 03:05 PM, Justin Luitjens wrote:
>
> I have an application that works on other systems but on the current
> system I’m running I’m seeing the following crash:
>
>
>
> [dt04:22457] *** Process received signal ***
>
> [dt04:22457] Signal: Segmentation fault (11)
>
> [dt04:22457] Signal code: Address not mapped (1)
>
> [dt04:22457] Failing at address: 0x6a1da250
>
> [dt04:22457] [ 0] /lib64/libpthread.so.0(+0xf370)[0x2b353370]
>
> [dt04:22457] [ 1] /home/jluitjens/libs/openmpi/lib/libopen-pal.so.13(opal_
> memory_ptmalloc2_int_free+0x50)[0x2cbcf810]
>
> [dt04:22457] [ 2] /home/jluitjens/libs/openmpi/lib/libopen-pal.so.13(opal_
> memory_ptmalloc2_free+0x9b)[0x2cbcff3b]
>
> [dt04:22457] [ 3] ./hacc_tpm[0x42f068]
>
> [dt04:22457] [ 4] ./hacc_tpm[0x42f231]
>
> [dt04:22457] [ 5] ./hacc_tpm[0x40f64d]
>
> [dt04:22457] [ 6] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2c30db35]
>
> [dt04:22457] [ 7] ./hacc_tpm[0x4115cf]
>
> [dt04:22457] *** End of error message ***
>
>
>
>
>
> This app is a CUDA app but doesn’t use GPU direct so that should be
> irrelevant.
>
>
>
> I’m building with gcc/5.3.0  cuda/8.0.44  openmpi/1.10.7
>
>
>
> I’m using this on centos 7 and am using a vanilla MPI configure line:
> ./configure --prefix=/home/jluitjens/libs/openmpi/
>
>
>
> Currently I’m trying to do this with just a single MPI process but
> multiple MPI processes fail in the same way:
>
>
>
> mpirun  --oversubscribe -np 1 ./command
>
>
>
> What is odd is the crash occurs around the same spot in the code but not
> consistently at the same spot.  The spot in the code where the single
> thread is at the time of the crash is nowhere near MPI code.  The code
> where it is crashing is just using malloc to allocate some memory. This
> makes me think the crash is due to a thread outside of the application I’m
> working on (perhaps in OpenMPI itself) or perhaps due to openmpi hijacking
> malloc/free.
>
>
>
> Does anyone have any ideas of what I could try to work around this issue?
>
>
>
> Thanks,
>
> Justin
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> --
> This email message is for the sole use of the intended recipient(s) and
> may contain confidential information.  Any unauthorized review, use,
> disclosure or distribution is prohibited.  If you are not the intended
> recipient, please contact the sender by reply email and destroy all copies
> of the original message.
> --
>
>
>
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users