Nathan tells me that this may well be related to a fix that was literally just 
pulled into the v1.8 branch today:

    https://github.com/open-mpi/ompi-release/pull/56

Would you mind testing any nightly tarball after tonight?  (i.e., the v1.8 
tarballs generated tonight will be the first ones to contain this fix)

    http://www.open-mpi.org/nightly/master/



On Oct 24, 2014, at 11:46 AM, <michael.rach...@dlr.de> wrote:

> Dear developers of OPENMPI,
>  
> I am running a small, downsized Fortran test program for shared memory 
> allocation (using MPI_WIN_ALLOCATE_SHARED and MPI_WIN_SHARED_QUERY)
> on a single node of 2 different Linux clusters with OPENMPI-1.8.3 and 
> Intel-14.0.4 / Intel-13.0.1, respectively.
>  
> The program simply allocates a sequence of shared data windows, each 
> consisting of one integer*4 array.
> None of the windows is freed, so the amount of data allocated in shared 
> windows grows during the course of the execution.
>  
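> In outline, the allocation loop looks like this (a minimal sketch, not the 
> exact code; it assumes the mpi_f08 bindings and that only the nodemaster, 
> i.e. rank 0, contributes memory to each window):
> 
>   program sharedmemtest
>     use mpi_f08
>     use, intrinsic :: iso_c_binding
>     implicit none
>     integer, parameter :: idim_1 = 50000      ! integer*4 elements per window
>     integer, parameter :: nwin   = 1000       ! number of shared windows
>     integer            :: myrank, iwin, disp_unit
>     integer(kind=MPI_ADDRESS_KIND) :: winsize, qsize
>     type(c_ptr)        :: baseptr
>     type(MPI_Win)      :: win(nwin)           ! windows are never freed
>     integer, pointer   :: iarr(:)
> 
>     call MPI_Init()
>     call MPI_Comm_rank(MPI_COMM_WORLD, myrank)
> 
>     do iwin = 1, nwin
>        ! Only the nodemaster contributes memory to this window.
>        winsize = 0
>        if (myrank == 0) winsize = 4_MPI_ADDRESS_KIND * idim_1
>        call MPI_Win_allocate_shared(winsize, 4, MPI_INFO_NULL, &
>                                     MPI_COMM_WORLD, baseptr, win(iwin))
> 
>        ! Every rank maps the nodemaster's segment into its address space.
>        call MPI_Win_shared_query(win(iwin), 0, qsize, disp_unit, baseptr)
>        call c_f_pointer(baseptr, iarr, [idim_1])
> 
>        ! First write into the segment -- in the failure described below,
>        ! this is where the bus error appears.
>        if (myrank == 0) iarr(:) = iwin
>        call MPI_Barrier(MPI_COMM_WORLD)
>     end do
> 
>     call MPI_Finalize()
>   end program sharedmemtest
>  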
> That worked well on the 1st cluster (Laki, with 8 procs per node), even when 
> allocating 1000 shared windows of 50000 integer*4 array elements each,
> i.e. a total of 200 MBytes.
> On the 2nd cluster (Cluster5, having 24 procs per node) it also worked on the 
> login node, but it did NOT work on a compute node.
> In the failing case there appears to be an internal storage limit of ~140 MB 
> on the total storage allocated in all shared windows.
> When that limit is reached, all later shared memory allocations fail, but 
> silently:
> the first attempt to use such a badly allocated shared data window then 
> results in a bus error at the bad storage address.
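>  
> "Silently" means that the allocation itself does not report any error.  
> Checks like the following, added right after the MPI_WIN_SHARED_QUERY call in 
> the sketch above (hypothetical, not part of the original test program), would 
> be one way to look for trouble, although they may well still pass here, since 
> the error only surfaces on the first write:
>  
>      ! Hypothetical sanity checks after MPI_Win_shared_query.
>      if (qsize /= 4_MPI_ADDRESS_KIND * idim_1) &
>         print *, 'unexpected segment size from shared_query:', qsize
>      if (.not. c_associated(baseptr)) &
>         print *, 'shared_query returned a null base pointer'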
>  
> That strange behavior can be observed with the small test program but also 
> with my large Fortran CFD code.
> If the error occurs, then it occurs with both codes, and in both at a storage 
> limit of ~140 MB.
> I found that this storage limit depends only weakly on the number of 
> processes (for np=2,4,8,16,24 it is: 144.4, 144.0, 141.0, 137.0, 132.2 MB).
>  
> Note that the shared memory storage available on both clusters was very large 
> (many GB of free memory).
>  
> Here is the error message when running with np=2 and an array dimension of 
> idim_1=50000 for the integer*4 array allocated per shared window
> on the compute node of Cluster5.
> The error occurred at the 723rd shared window, which is the 1st badly 
> allocated window in that run
> (722 successfully allocated shared windows * 50000 array elements * 4 
> Bytes/el. = 144.4 MB):
>  
>  
> [1,0]<stdout>: ========on nodemaster: iwin=         722 :
> [1,0]<stdout>:  total storage [MByte] alloc. in shared windows so far:   
> 144.400000000000
> [1,0]<stdout>: =========== allocation of shared window no. iwin=         723
> [1,0]<stdout>:  starting now with idim_1=       50000
> [1,0]<stdout>: ========on nodemaster for iwin=         723 : before writing 
> on shared mem
> [1,0]<stderr>:[r5i5n13:12597] *** Process received signal ***
> [1,0]<stderr>:[r5i5n13:12597] Signal: Bus error (7)
> [1,0]<stderr>:[r5i5n13:12597] Signal code: Non-existant physical address (2)
> [1,0]<stderr>:[r5i5n13:12597] Failing at address: 0x7fffe08da000
> [1,0]<stderr>:[r5i5n13:12597] [ 0] 
> [1,0]<stderr>:/lib64/libpthread.so.0(+0xf800)[0x7ffff6d67800]
> [1,0]<stderr>:[r5i5n13:12597] [ 1] ./a.out[0x408a8b]
> [1,0]<stderr>:[r5i5n13:12597] [ 2] ./a.out[0x40800c]
> [1,0]<stderr>:[r5i5n13:12597] [ 3] 
> [1,0]<stderr>:/lib64/libc.so.6(__libc_start_main+0xe6)[0x7ffff69fec36]
> [1,0]<stderr>:[r5i5n13:12597] [ 4] [1,0]<stderr>:./a.out[0x407f09]
> [1,0]<stderr>:[r5i5n13:12597] *** End of error message ***
> [1,1]<stderr>:forrtl: error (78): process killed (SIGTERM)
> [1,1]<stderr>:Image              PC                Routine            Line    
>     Source
> [1,1]<stderr>:libopen-pal.so.6   00007FFFF4B74580  Unknown               
> Unknown  Unknown
> [1,1]<stderr>:libmpi.so.1        00007FFFF7267F3E  Unknown               
> Unknown  Unknown
> [1,1]<stderr>:libmpi.so.1        00007FFFF733B555  Unknown               
> Unknown  Unknown
> [1,1]<stderr>:libmpi.so.1        00007FFFF727DFFD  Unknown               
> Unknown  Unknown
> [1,1]<stderr>:libmpi_mpifh.so.2  00007FFFF779BA03  Unknown               
> Unknown  Unknown
> [1,1]<stderr>:a.out              0000000000408D15  Unknown               
> Unknown  Unknown
> [1,1]<stderr>:a.out              000000000040800C  Unknown               
> Unknown  Unknown
> [1,1]<stderr>:libc.so.6          00007FFFF69FEC36  Unknown               
> Unknown  Unknown
> [1,1]<stderr>:a.out              0000000000407F09  Unknown               
> Unknown  Unknown
> --------------------------------------------------------------------------
> mpiexec noticed that process rank 0 with PID 12597 on node r5i5n13 exited on 
> signal 7 (Bus error).
> --------------------------------------------------------------------------
>  
>  
> The small Fortran test program was built and run with:
>   mpif90 sharedmemtest.f90
>   mpiexec -np 2 -bind-to core -tag-output ./a.out
>  
> Why does it work on Laki (both on the login node and on a compute node) as 
> well as on the login node of Cluster5,
> but fail on a compute node of Cluster5?
>  
> Greetings
>    Michael Rachner
>  
>  
>  


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
