Hi Tim,
Many thanks for the fix! Everything works fine now with the current
trunk version.

(Two short C sketches, one illustrating the 32-bit overflow and one
reconstructing the failing access pattern, are appended after the
quoted thread below.)

Best regards,
stefan

Tim Prins wrote:
| The fix I previously sent to the list has been committed in r17400.
|
| Thanks,
|
| Tim
|
| Tim Prins wrote:
| | Hi Stefan,
| |
| | I was able to verify the problem. It turns out this is a problem
| | with other one-sided operations as well. Attached is a simple test
| | case I made in C using MPI_Put that also fails.
| |
| | The problem is that the target count and displacement are both sent
| | as signed 32-bit integers. The receiver then multiplies them
| | together and adds the product to the window base. However, this
| | multiplication is done using the signed 32-bit integers, which
| | overflows; the wrapped result is then added to the 64-bit pointer,
| | which of course yields a bad address.
| |
| | I have attached a patch against a recent development version that
| | fixes this for me. I am also copying Brian Barrett, who did all the
| | work on the one-sided code.
| |
| | Brian: if possible, please take a look at the attached patch and
| | test case.
| |
| | Thanks for the report!
| |
| | Tim Prins
| |
| | Stefan Knecht wrote:
| | | Hi all,
| | |
| | | I encounter a problem with the routine MPI_ACCUMULATE when trying
| | | to sum up MPI_REAL8's on a large memory window with a large
| | | offset. My program, running on a single processor (x86_64
| | | architecture), crashes with an error message like:
| | |
| | | [node14:16236] *** Process received signal ***
| | | [node14:16236] Signal: Segmentation fault (11)
| | | [node14:16236] Signal code: Address not mapped (1)
| | | [node14:16236] Failing at address: 0x2aaa32b16000
| | | [node14:16236] [ 0] /lib64/libpthread.so.0 [0x32e080de00]
| | | [node14:16236] [ 1] /home/stefan/bin/openmpi-1.2.5/lib/libmpi.so.0(ompi_mpi_op_sum_double+0x10) [0x2aaaaaf15530]
| | | [node14:16236] [ 2] /home/stefan/bin/openmpi-1.2.5/lib/openmpi/mca_osc_pt2pt.so(ompi_osc_pt2pt_process_op+0x2d7) [0x2aaab1a47257]
| | | [node14:16236] [ 3] /home/stefan/bin/openmpi-1.2.5/lib/openmpi/mca_osc_pt2pt.so [0x2aaab1a45432]
| | | [node14:16236] [ 4] /home/stefan/bin/openmpi-1.2.5/lib/openmpi/mca_osc_pt2pt.so(ompi_osc_pt2pt_passive_unlock+0x93) [0x2aaab1a48243]
| | | [node14:16236] [ 5] /home/stefan/bin/openmpi-1.2.5/lib/openmpi/mca_osc_pt2pt.so [0x2aaab1a43436]
| | | [node14:16236] [ 6] /home/stefan/bin/openmpi-1.2.5/lib/openmpi/mca_osc_pt2pt.so(ompi_osc_pt2pt_progress+0xff) [0x2aaab1a42e0f]
| | | [node14:16236] [ 7] /home/stefan/bin/openmpi-1.2.5/lib/libopen-pal.so.0(opal_progress+0x4a) [0x2aaaab3dfa0a]
| | | [node14:16236] [ 8] /home/stefan/bin/openmpi-1.2.5/lib/openmpi/mca_osc_pt2pt.so(ompi_osc_pt2pt_module_unlock+0x2a9) [0x2aaab1a48629]
| | | [node14:16236] [ 9] /home/stefan/bin/openmpi-1.2.5/lib/libmpi.so.0(PMPI_Win_unlock+0xe1) [0x2aaaaaf4a291]
| | | [node14:16236] [10] /home/stefan/bin/openmpi-1.2.5/lib/libmpi_f77.so.0(mpi_win_unlock_+0x25) [0x2aaaaacdd8c5]
| | | [node14:16236] [11] /home/stefan/calc/mpi2_test/a.out(MAIN__+0x809) [0x401851]
| | | [node14:16236] [12] /home/stefan/calc/mpi2_test/a.out(main+0xe) [0x401bbe]
| | | [node14:16236] [13] /lib64/libc.so.6(__libc_start_main+0xf4) [0x32dfc1dab4]
| | | [node14:16236] [14] /home/stefan/calc/mpi2_test/a.out [0x400f99]
| | | [node14:16236] *** End of error message ***
| | | mpirun noticed that job rank 0 with PID 16236 on node node14
| | | exited on signal 11 (Segmentation fault).
| | |
| | | The relevant part of my FORTRAN source code reads as:
| | |
| | |       program accumulate_test
| | |       IMPLICIT REAL*8 (A-H,O-Z)
| | |       include 'mpif.h'
| | |       INTEGER(KIND=MPI_OFFSET_KIND) MX_SIZE_M
| | | C     dummy size parameter
| | |       PARAMETER (MX_SIZE_M = 1 000 000)
| | |       INTEGER MPIerr, MYID, NPROC
| | |       INTEGER ITARGET, MY_X_WIN, JCOUNT, JCOUNT_T
| | |       INTEGER(KIND=MPI_ADDRESS_KIND) MEM_X, MEM_Y
| | |       INTEGER(KIND=MPI_ADDRESS_KIND) IDISPL_WIN
| | |       INTEGER(KIND=MPI_ADDRESS_KIND) PTR1, PTR2
| | |       INTEGER(KIND=MPI_INTEGER_KIND) ISIZE_REAL8
| | |       INTEGER*8 NELEMENT_X, NELEMENT_Y
| | |       POINTER (PTR1, XMAT(MX_SIZE_M))
| | |       POINTER (PTR2, YMAT(MX_SIZE_M))
| | | C
| | |       CALL MPI_INIT( MPIerr )
| | |       CALL MPI_COMM_RANK( MPI_COMM_WORLD, MYID, MPIerr)
| | |       CALL MPI_COMM_SIZE( MPI_COMM_WORLD, NPROC, MPIerr)
| | | C
| | |       NELEMENT_X = 400 000 000
| | |       NELEMENT_Y = 10 000
| | | C
| | |       CALL MPI_TYPE_EXTENT(MPI_REAL8, ISIZE_REAL8, MPIerr)
| | |       MEM_X = NELEMENT_X * ISIZE_REAL8
| | |       MEM_Y = NELEMENT_Y * ISIZE_REAL8
| | | C
| | | C     allocate memory
| | | C
| | |       CALL MPI_ALLOC_MEM( MEM_X, MPI_INFO_NULL, PTR1, MPIerr)
| | |       CALL MPI_ALLOC_MEM( MEM_Y, MPI_INFO_NULL, PTR2, MPIerr)
| | | C
| | | C     fill vectors with 0.0D0 and 1.0D0
| | | C
| | |       CALL DZERO(XMAT,NELEMENT_X)
| | |       CALL DONE(YMAT,NELEMENT_Y)
| | | C
| | | C     open memory window
| | | C
| | |       CALL MPI_WIN_CREATE( XMAT, MEM_X, ISIZE_REAL8,
| | |      &                     MPI_INFO_NULL, MPI_COMM_WORLD,
| | |      &                     MY_X_WIN, MPIerr )
| | | C
| | | C     lock window (MPI_LOCK_SHARED mode)
| | | C     select target ==> if itarget == myid: no 1-sided communication
| | | C
| | |       ITARGET = MYID
| | |       CALL MPI_WIN_LOCK( MPI_LOCK_SHARED, ITARGET, MPI_MODE_NOCHECK,
| | |      &                   MY_X_WIN, MPIerr)
| | | C
| | | C     transfer data to target ITARGET
| | | C
| | |       JCOUNT_T = 10 000
| | |       JCOUNT   = JCOUNT_T
| | | C     set displacement in memory window
| | |       IDISPL_WIN = 300 000 000
| | | C
| | |       CALL MPI_ACCUMULATE( YMAT, JCOUNT, MPI_REAL8, ITARGET,
| | |      &                     IDISPL_WIN, JCOUNT_T, MPI_REAL8,
| | |      &                     MPI_SUM, MY_X_WIN, MPIerr)
| | | C
| | | C     unlock
| | | C
| | |       CALL MPI_WIN_UNLOCK( ITARGET, MY_X_WIN, MPIerr)
| | |       ...
| | |
| | | The complete source code (accumulate_test.F) is attached to this
| | | e-mail, as well as the config.log of my Open MPI installation.
| | |
| | | The program only(!) fails for values of IDISPL_WIN > 268 435 455!
| | | For all lower offset values it finishes normally.
| | |
| | | Therefore, I assume that after the internal multiplication (in
| | | MPI_ACCUMULATE) of IDISPL_WIN by the window scaling factor
| | | ISIZE_REAL8 (== 8 bytes), an INTEGER*4 overflow occurs, although
| | | IDISPL_WIN is declared with KIND=MPI_ADDRESS_KIND (INTEGER*8).
| | | Might that be the reason?
| | |
| | | Running this program with MPI_GET instead of MPI_ACCUMULATE,
| | | using the same offsets, is no problem at all.
| | |
| | | Thanks in advance for any help,
| | |
| | | stefan
| | |
| | | --
| | | -------------------------------------------
| | | Dipl. Chem. Stefan Knecht
| | | Institute for Theoretical and Computational Chemistry
| | | Heinrich-Heine University Düsseldorf
| | | Universitätsstraße 1
| | | Building 26.32, Room 03.33
| | | 40225 Düsseldorf
| | |
| | | phone: +49-(0)211-81-11439
| | | e-mail: ste...@theochem.uni-duesseldorf.de
| | | http://www.theochem.uni-duesseldorf.de/users/stefan
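
For the archives, the arithmetic behind the diagnosis is easy to check:
Stefan's threshold of 268 435 455 is exactly 2^28 - 1, and the first
failing displacement, 268 435 456, times the 8-byte displacement unit
is 2 147 483 648 = 2^31, one past the largest signed 32-bit integer.
The sketch below is illustrative only (the variable names are invented;
this is not the actual Open MPI code or patch): it shows the 32-bit
multiplication wrapping before the 64-bit pointer addition, and the
usual fix of widening one operand first.

    /* overflow_sketch.c - illustrative only, not Open MPI code */
    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        int32_t target_disp = 300000000;  /* Stefan's IDISPL_WIN  */
        int32_t disp_unit   = 8;          /* bytes per MPI_REAL8  */

        /* Buggy path: both values arrive as signed 32-bit integers
         * and are multiplied as such.  300 000 000 * 8 =
         * 2 400 000 000 exceeds INT32_MAX, so the product wraps
         * (formally undefined behaviour in C; on x86_64 it wraps)
         * before being widened and added to the 64-bit window base,
         * hence the bad address.                                   */
        int32_t wrapped = target_disp * disp_unit;

        /* Fixed path: widen one operand first, so the multiplication
         * itself is carried out in 64 bits.                        */
        int64_t correct = (int64_t)target_disp * disp_unit;

        printf("32-bit product: %d\n", wrapped);
        printf("64-bit product: %lld\n", (long long)correct);
        return 0;
    }

With a typical gcc build on x86_64 this prints -1894967296 for the
32-bit product and 2400000000 for the 64-bit one.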
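And a rough single-rank reconstruction of the failing pattern with
MPI_Put. Tim's actual attachment is not reproduced here, so the names
and sizes below are assumptions mirroring Stefan's Fortran reproducer
(note it allocates about 3.2 GB). On an unpatched build the byte offset
300 000 000 * 8 exceeds 2^31 and should trigger the same "Address not
mapped" segfault.

    /* put_overflow.c - illustrative reconstruction, not Tim's test case */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        const MPI_Aint nelem = 400000000;  /* 400M doubles, ~3.2 GB */
        const MPI_Aint disp  = 300000000;  /* 300M * 8 bytes > 2^31 */
        double *base, src[10];
        MPI_Win win;
        int i, rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Alloc_mem(nelem * sizeof(double), MPI_INFO_NULL, &base);
        for (i = 0; i < 10; i++)
            src[i] = 1.0;

        MPI_Win_create(base, nelem * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        /* Target ourselves, as in the Fortran reproducer, where a
         * single process targeting itself still crashed: the self
         * path also goes through the displacement arithmetic.      */
        MPI_Win_lock(MPI_LOCK_SHARED, rank, MPI_MODE_NOCHECK, win);
        MPI_Put(src, 10, MPI_DOUBLE, rank, disp, 10, MPI_DOUBLE, win);
        MPI_Win_unlock(rank, win);

        MPI_Win_free(&win);
        MPI_Free_mem(base);
        MPI_Finalize();
        return 0;
    }

Compile with mpicc and run on a single rank, e.g.
"mpirun -np 1 ./put_overflow".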