Michael,

The available space must be greater than the requested size + 5%.
From the logs, the error message makes sense to me: there is not enough space in /tmp.
(That is also why the first abort below fires even though the 208896 B available exceeds the 204256 B requested: with the 5% margin the check needs roughly 204256 * 1.05 = ~214469 B.)
Since the compute nodes have a lot of memory, you might want to try using /dev/shm instead of /tmp for the backing files.
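One way to try that, assuming the orte_tmpdir_base MCA parameter is what controls where the openmpi-sessions-* directory (and hence these backing files) is created on your installation, would be something like:

    mpiexec --mca orte_tmpdir_base /dev/shm -np 2 -bind-to core -tag-output ./a.out

Exporting TMPDIR=/dev/shm in the job environment before launching should have a similar effect.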
Cheers,

Gilles

michael.rach...@dlr.de wrote:
>Dear developers of OPENMPI,
>
>We have now installed and tested the bug-fixed OPENMPI nightly tarball of
>2014-10-24 (openmpi-dev-176-g9334abc.tar.gz) on Cluster5.
>As before (with the OPENMPI-1.8.3 release version) the small Ftn-testprogram
>runs correctly on the login-node.
>As before, the program aborts on the compute node, but now with a different
>error message:
>
>The following message appears when launching the program with 2 processes:
>mpiexec -np 2 -bind-to core -tag-output ./a.out
>************************************************************************************************
>[1,0]<stdout>: ========on nodemaster: iwin= 685 :
>[1,0]<stdout>: total storage [MByte] alloc. in shared windows so far: 137.000000000000
>[1,0]<stdout>: =========== allocation of shared window no. iwin= 686
>[1,0]<stdout>: starting now with idim_1= 50000
>--------------------------------------------------------------------------
>It appears as if there is not enough space for
>/tmp/openmpi-sessions-rachner@r5i5n13_0/48127/1/shared_window_688.r5i5n13
>(the shared-memory backing file). It is likely that your MPI job will now
>either abort or experience performance degradation.
>
>  Local host:      r5i5n13
>  Space Requested: 204256 B
>  Space Available: 208896 B
>--------------------------------------------------------------------------
>[r5i5n13:26917] *** An error occurred in MPI_Win_allocate_shared
>[r5i5n13:26917] *** reported by process [3154051073,140733193388032]
>[r5i5n13:26917] *** on communicator MPI_COMM_WORLD
>[r5i5n13:26917] *** MPI_ERR_INTERN: internal error
>[r5i5n13:26917] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>[r5i5n13:26917] *** and potentially your MPI job)
>rachner@r5i5n13:~/dat>
>************************************************************************************************
>
>
>When I repeat the run using 24 processes (on the same compute node) the same
>kind of abort message occurs, but earlier:
>************************************************************************************************
>[1,0]<stdout>: ========on nodemaster: iwin= 231 :
>[1,0]<stdout>: total storage [MByte] alloc. in shared windows so far: 46.2000000000000
>[1,0]<stdout>: =========== allocation of shared window no. iwin= 232
>[1,0]<stdout>: starting now with idim_1= 50000
>--------------------------------------------------------------------------
>It appears as if there is not enough space for
>/tmp/openmpi-sessions-rachner@r5i5n13_0/48029/1/shared_window_234.r5i5n13
>(the shared-memory backing file). It is likely that your MPI job will now
>either abort or experience performance degradation.
>
>  Local host:      r5i5n13
>  Space Requested: 204784 B
>  Space Available: 131072 B
>--------------------------------------------------------------------------
>[r5i5n13:26947] *** An error occurred in MPI_Win_allocate_shared
>[r5i5n13:26947] *** reported by process [3147628545,140733193388032]
>[r5i5n13:26947] *** on communicator MPI_COMM_WORLD
>[r5i5n13:26947] *** MPI_ERR_INTERN: internal error
>[r5i5n13:26947] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>[r5i5n13:26947] *** and potentially your MPI job)
>rachner@r5i5n13:~/dat>
>************************************************************************************************
>
>So the problem is not yet resolved.
>
>Greetings
>  Michael Rachner
>
>
>-----Original Message-----
>From: Rachner, Michael
>Sent: Monday, 27 October 2014 11:49
>To: 'Open MPI Users'
>Subject: RE: [OMPI users] Bug in OpenMPI-1.8.3: storage limitation in shared memory allocation (MPI_WIN_ALLOCATE_SHARED) in Ftn-code
>
>Dear Mr. Squyres,
>
>We will try to install your bug-fixed nightly tarball of 2014-10-24 on
>Cluster5 to see whether it works or not.
>The installation, however, will take some time. I will get back to you when I
>know more.
>
>Let me add the information that on the Laki each node has 16 GB of shared
>memory (there it worked), the login-node on Cluster5 has 64 GB (there it
>worked too), whereas the compute nodes on Cluster5 have 128 GB (there it did
>not work).
>So possibly the bug might have something to do with the size of the physical
>shared memory available on the node.
>
>Greetings
>Michael Rachner
>
>-----Original Message-----
>From: users [mailto:users-boun...@open-mpi.org] On behalf of Jeff Squyres (jsquyres)
>Sent: Friday, 24 October 2014 22:45
>To: Open MPI User's List
>Subject: Re: [OMPI users] Bug in OpenMPI-1.8.3: storage limitation in shared memory allocation (MPI_WIN_ALLOCATE_SHARED) in Ftn-code
>
>Nathan tells me that this may well be related to a fix that was literally just
>pulled into the v1.8 branch today:
>
>  https://github.com/open-mpi/ompi-release/pull/56
>
>Would you mind testing any nightly tarball after tonight? (i.e., the v1.8
>tarballs generated tonight will be the first ones to contain this fix)
>
>  http://www.open-mpi.org/nightly/master/
>
>
>
>On Oct 24, 2014, at 11:46 AM, <michael.rach...@dlr.de> <michael.rach...@dlr.de> wrote:
>
>> Dear developers of OPENMPI,
>>
>> I am running a small downsized Fortran-testprogram for shared memory
>> allocation (using MPI_WIN_ALLOCATE_SHARED and MPI_WIN_SHARED_QUERY)
>> on only 1 node of 2 different Linux-clusters with OPENMPI-1.8.3 and
>> Intel-14.0.4 / Intel-13.0.1, respectively.
>>
>> The program simply allocates a sequence of shared data windows, each
>> consisting of 1 integer*4-array.
>> None of the windows is freed, so the amount of data allocated in shared
>> windows grows during the course of the execution.
>>
>> That worked well on the 1st cluster (Laki, having 8 procs per node)
>> when allocating even 1000 shared windows, each having 50000 integer*4 array
>> elements, i.e. a total of 200 MBytes.
>> On the 2nd cluster (Cluster5, having 24 procs per node) it also worked on
>> the login node, but it did NOT work on a compute node.
>> In that error case, there occurs something like an internal storage limit of
>> ~140 MB for the total storage allocated in all shared windows.
>> When that limit is reached, all later shared memory allocations fail (but
>> silently).
>> So the first attempt to use such a bad shared data window results in a bus
>> error due to the bad storage address encountered.
>>
>> That strange behavior could be observed in the small testprogram but also
>> with my large Fortran CFD-code.
>> If the error occurs, then it occurs with both codes, and both at a storage
>> limit of ~140 MB.
>> I found that this storage limit depends only weakly on the number of
>> processes (for np=2, 4, 8, 16, 24 it is 144.4, 144.0, 141.0, 137.0, 132.2 MB,
>> respectively).
>>
>> Note that the shared memory storage available on both clusters was very
>> large (many GB of free memory).
>>
>> Here is the error message when running with np=2 and an array dimension of
>> idim_1=50000 for the integer*4 array allocated per shared window on the
>> compute node of Cluster5:
>> In that case, the error occurred at the 723rd shared window, which is the
>> 1st badly allocated window in that case:
>> (722 successfully allocated shared windows * 50000 array elements * 4
>> Bytes/el. = 144.4 MB)
>>
>>
>> [1,0]<stdout>: ========on nodemaster: iwin= 722 :
>> [1,0]<stdout>: total storage [MByte] alloc. in shared windows so far: 144.400000000000
>> [1,0]<stdout>: =========== allocation of shared window no. iwin= 723
>> [1,0]<stdout>: starting now with idim_1= 50000
>> [1,0]<stdout>: ========on nodemaster for iwin= 723 : before writing on shared mem
>> [1,0]<stderr>:[r5i5n13:12597] *** Process received signal ***
>> [1,0]<stderr>:[r5i5n13:12597] Signal: Bus error (7)
>> [1,0]<stderr>:[r5i5n13:12597] Signal code: Non-existant physical address (2)
>> [1,0]<stderr>:[r5i5n13:12597] Failing at address: 0x7fffe08da000
>> [1,0]<stderr>:[r5i5n13:12597] [ 0] /lib64/libpthread.so.0(+0xf800)[0x7ffff6d67800]
>> [1,0]<stderr>:[r5i5n13:12597] [ 1] ./a.out[0x408a8b]
>> [1,0]<stderr>:[r5i5n13:12597] [ 2] ./a.out[0x40800c]
>> [1,0]<stderr>:[r5i5n13:12597] [ 3] /lib64/libc.so.6(__libc_start_main+0xe6)[0x7ffff69fec36]
>> [1,0]<stderr>:[r5i5n13:12597] [ 4] ./a.out[0x407f09]
>> [1,0]<stderr>:[r5i5n13:12597] *** End of error message ***
>> [1,1]<stderr>:forrtl: error (78): process killed (SIGTERM)
>> [1,1]<stderr>:Image              PC                 Routine    Line     Source
>> [1,1]<stderr>:libopen-pal.so.6   00007FFFF4B74580   Unknown    Unknown  Unknown
>> [1,1]<stderr>:libmpi.so.1        00007FFFF7267F3E   Unknown    Unknown  Unknown
>> [1,1]<stderr>:libmpi.so.1        00007FFFF733B555   Unknown    Unknown  Unknown
>> [1,1]<stderr>:libmpi.so.1        00007FFFF727DFFD   Unknown    Unknown  Unknown
>> [1,1]<stderr>:libmpi_mpifh.so.2  00007FFFF779BA03   Unknown    Unknown  Unknown
>> [1,1]<stderr>:a.out              0000000000408D15   Unknown    Unknown  Unknown
>> [1,1]<stderr>:a.out              000000000040800C   Unknown    Unknown  Unknown
>> [1,1]<stderr>:libc.so.6          00007FFFF69FEC36   Unknown    Unknown  Unknown
>> [1,1]<stderr>:a.out              0000000000407F09   Unknown    Unknown  Unknown
>> --------------------------------------------------------------------------
>> mpiexec noticed that process rank 0 with PID 12597 on node r5i5n13 exited
>> on signal 7 (Bus error).
>> --------------------------------------------------------------------------
>>
>>
>> The small Ftn-testprogram was built and run by
>>   mpif90 sharedmemtest.f90
>>   mpiexec -np 2 -bind-to core -tag-output ./a.out
>>
>> Why does it work on the Laki (both on the login-node and on a compute
>> node) as well as on the login-node of Cluster5, but fails on a compute
>> node of Cluster5?
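>>
>> For reference, the allocation loop of the testprogram follows essentially the
>> pattern below. This is only a simplified sketch (not the exact sharedmemtest.f90);
>> it assumes that only the nodemaster (rank 0) provides the memory while the other
>> ranks merely query and map it, and it omits any window synchronization for brevity.
>>
>>   program sharedmem_sketch
>>     use mpi
>>     use, intrinsic :: iso_c_binding, only: c_ptr, c_f_pointer
>>     implicit none
>>     integer, parameter :: idim_1   = 50000   ! integer*4 elements per shared window
>>     integer, parameter :: nwin_max = 1000    ! number of shared windows to allocate
>>     integer :: ierr, myrank, iwin, disp_unit
>>     integer :: win(nwin_max)
>>     integer(kind=MPI_ADDRESS_KIND) :: winsize
>>     type(c_ptr) :: baseptr
>>     integer, pointer :: iarr(:)
>>
>>     call MPI_Init(ierr)
>>     call MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)
>>
>>     do iwin = 1, nwin_max
>>        ! only the nodemaster contributes memory; the other ranks pass size 0
>>        if (myrank == 0) then
>>           winsize = int(idim_1, MPI_ADDRESS_KIND) * 4_MPI_ADDRESS_KIND
>>        else
>>           winsize = 0_MPI_ADDRESS_KIND
>>        end if
>>        call MPI_Win_allocate_shared(winsize, 4, MPI_INFO_NULL, MPI_COMM_WORLD, &
>>                                     baseptr, win(iwin), ierr)
>>        ! every rank maps the segment owned by the nodemaster (rank 0)
>>        call MPI_Win_shared_query(win(iwin), 0, winsize, disp_unit, baseptr, ierr)
>>        call c_f_pointer(baseptr, iarr, [idim_1])
>>        ! touching the memory is what produces the bus error once the backing file is short
>>        if (myrank == 0) iarr = iwin
>>        ! the windows are deliberately never freed, so the allocated shared storage accumulates
>>     end do
>>
>>     call MPI_Finalize(ierr)
>>   end program sharedmem_sketch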
>>
>> Greetings
>>   Michael Rachner
>
>
>--
>Jeff Squyres
>jsquy...@cisco.com
>For corporate legal information go to:
>http://www.cisco.com/web/about/doing_business/legal/cri/