Nathan tells me that this may well be related to a fix that was literally just pulled into the v1.8 branch today:
https://github.com/open-mpi/ompi-release/pull/56

Would you mind testing any nightly tarball after tonight?  (i.e., the v1.8 tarballs generated tonight will be the first ones to contain this fix)

    http://www.open-mpi.org/nightly/master/


On Oct 24, 2014, at 11:46 AM, <michael.rach...@dlr.de> wrote:

> Dear developers of OPENMPI,
> 
> I am running a small, downsized Fortran test program for shared memory
> allocation (using MPI_WIN_ALLOCATE_SHARED and MPI_WIN_SHARED_QUERY)
> on only 1 node of 2 different Linux clusters with OpenMPI-1.8.3 and
> Intel-14.0.4 / Intel-13.0.1, respectively.
> 
> The program simply allocates a sequence of shared data windows, each
> consisting of one integer*4 array.
> None of the windows is freed, so the amount of data allocated in shared
> windows grows during the course of the execution.
> 
> That worked well on the 1st cluster (Laki, having 8 procs per node) even when
> allocating 1000 shared windows, each having 50000 integer*4 array elements,
> i.e. a total of 200 MBytes.
> On the 2nd cluster (Cluster5, having 24 procs per node) it also worked on the
> login node, but it did NOT work on a compute node.
> In that error case, there seems to be something like an internal storage limit
> of ~140 MB for the total storage allocated in all shared windows.
> When that limit is reached, all later shared memory allocations fail (but
> silently).
> So the first attempt to use such a bad shared data window results in a bus
> error due to the bad storage address encountered.
> 
> That strange behavior could be observed with the small test program but also
> with my large Fortran CFD code.
> If the error occurs, then it occurs with both codes, and both at a storage
> limit of ~140 MB.
> I found that this storage limit depends only weakly on the number of
> processes (for np=2,4,8,16,24 it is: 144.4, 144.0, 141.0, 137.0, 132.2 MB).
> 
> Note that the shared memory storage available on both clusters was very large
> (many GB of free memory).
> 
> Here is the error message when running with np=2 and an array dimension of
> idim_1=50000 for the integer*4 array allocated per shared window,
> on the compute node of Cluster5.
> In that case the error occurred at the 723rd shared window, which is the
> 1st badly allocated window in that case
> (722 successfully allocated shared windows * 50000 array elements * 4
> Bytes/el. = 144.4 MB):
> 
> 
> [1,0]<stdout>: ========on nodemaster:  iwin= 722 :
> [1,0]<stdout>:   total storage [MByte] alloc. in shared windows so far:  144.400000000000
> [1,0]<stdout>: =========== allocation of shared window no. iwin= 723
> [1,0]<stdout>:   starting now with idim_1= 50000
> [1,0]<stdout>: ========on nodemaster for iwin= 723 :  before writing on shared mem
> [1,0]<stderr>:[r5i5n13:12597] *** Process received signal ***
> [1,0]<stderr>:[r5i5n13:12597] Signal: Bus error (7)
> [1,0]<stderr>:[r5i5n13:12597] Signal code: Non-existant physical address (2)
> [1,0]<stderr>:[r5i5n13:12597] Failing at address: 0x7fffe08da000
> [1,0]<stderr>:[r5i5n13:12597] [ 0]
> [1,0]<stderr>:/lib64/libpthread.so.0(+0xf800)[0x7ffff6d67800]
> [1,0]<stderr>:[r5i5n13:12597] [ 1] ./a.out[0x408a8b]
> [1,0]<stderr>:[r5i5n13:12597] [ 2] ./a.out[0x40800c]
> [1,0]<stderr>:[r5i5n13:12597] [ 3]
> [1,0]<stderr>:/lib64/libc.so.6(__libc_start_main+0xe6)[0x7ffff69fec36]
> [1,0]<stderr>:[r5i5n13:12597] [ 4] [1,0]<stderr>:./a.out[0x407f09]
> [1,0]<stderr>:[r5i5n13:12597] *** End of error message ***
> [1,1]<stderr>:forrtl: error (78): process killed (SIGTERM)
> [1,1]<stderr>:Image              PC                Routine   Line     Source
> [1,1]<stderr>:libopen-pal.so.6   00007FFFF4B74580  Unknown   Unknown  Unknown
> [1,1]<stderr>:libmpi.so.1        00007FFFF7267F3E  Unknown   Unknown  Unknown
> [1,1]<stderr>:libmpi.so.1        00007FFFF733B555  Unknown   Unknown  Unknown
> [1,1]<stderr>:libmpi.so.1        00007FFFF727DFFD  Unknown   Unknown  Unknown
> [1,1]<stderr>:libmpi_mpifh.so.2  00007FFFF779BA03  Unknown   Unknown  Unknown
> [1,1]<stderr>:a.out              0000000000408D15  Unknown   Unknown  Unknown
> [1,1]<stderr>:a.out              000000000040800C  Unknown   Unknown  Unknown
> [1,1]<stderr>:libc.so.6          00007FFFF69FEC36  Unknown   Unknown  Unknown
> [1,1]<stderr>:a.out              0000000000407F09  Unknown   Unknown  Unknown
> --------------------------------------------------------------------------
> mpiexec noticed that process rank 0 with PID 12597 on node r5i5n13 exited on
> signal 7 (Bus error).
> --------------------------------------------------------------------------
> 
> 
> The small Ftn test program was built and run with:
>   mpif90 sharedmemtest.f90
>   mpiexec -np 2 -bind-to core -tag-output ./a.out
> 
> Why does it work on Laki (both on the login node and on a compute node) as
> well as on the login node of Cluster5,
> but fail on a compute node of Cluster5?
> 
> Greetings
> Michael Rachner
> 
> 
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/10/25572.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
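For readers who want to reproduce the behavior: the original sharedmemtest.f90 is not attached, but a minimal sketch of the allocation pattern described in the report (a sequence of shared windows, one integer*4 array each, none of them freed) could look like the following, written here against the mpi_f08 bindings. The window count, array size, and all identifiers are illustrative assumptions, not the poster's actual code.

program sharedmem_sketch
  ! Hedged sketch of the reported pattern: allocate a sequence of shared
  ! windows (one integer*4 array each), query rank 0's segment, write to it.
  ! Windows are intentionally never freed, as in the report.
  use mpi_f08
  use, intrinsic :: iso_c_binding, only : c_ptr, c_f_pointer
  implicit none

  integer, parameter :: idim_1 = 50000     ! array elements per window (assumed)
  integer, parameter :: nwin   = 1000      ! number of shared windows (assumed)
  integer :: myrank, iwin, disp_unit
  integer(kind=MPI_ADDRESS_KIND) :: winsize
  type(MPI_Win) :: win
  type(c_ptr)   :: baseptr
  integer, pointer :: iarr(:)

  call MPI_Init()
  call MPI_Comm_rank(MPI_COMM_WORLD, myrank)

  do iwin = 1, nwin
     ! Rank 0 provides the whole array; all other ranks contribute 0 bytes.
     winsize = 0
     if (myrank == 0) winsize = 4_MPI_ADDRESS_KIND * idim_1
     call MPI_Win_allocate_shared(winsize, 4, MPI_INFO_NULL, &
                                  MPI_COMM_WORLD, baseptr, win)

     ! Every rank queries rank 0's segment to obtain a usable local address.
     call MPI_Win_shared_query(win, 0, winsize, disp_unit, baseptr)
     call c_f_pointer(baseptr, iarr, [idim_1])

     if (myrank == 0) iarr = iwin            ! write on the shared memory
     call MPI_Barrier(MPI_COMM_WORLD)
     ! No MPI_Win_free here, matching the reported test.
  end do

  call MPI_Finalize()
end program sharedmem_sketch

It can be built and launched the same way as shown in the report (mpif90 sharedmemtest.f90; mpiexec -np 2 -bind-to core -tag-output ./a.out).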