Thanks for the info, I updated https://github.com/open-mpi/ompi/issues/1809
accordingly.

FWIW, the bug occurs when addresses do not fit in 32 bits.
For some reason, I always run into it on OS X but not on Linux, unless I
use dmalloc.
In the test case, I replaced malloc with alloca (and removed free) so I
always hit the bug on Linux.
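
To illustrate the truncation, here is a minimal sketch (not the ROMIO code
itself). With MPI_BOTTOM the datatype displacements are absolute addresses,
so squeezing them through an int drops the upper 32 bits. The names
offset64_t, index_via_int and index_full_width, and the sample address, are
made up for this example. Since an out-of-range cast to int is
implementation-defined in C, the helper spells out the wrap-around that
common two's-complement platforms perform.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a 64-bit offset type such as ADIO_Offset. */
typedef int64_t offset64_t;

/* Mimics the old "i = (int) (...)" assignment: only the low 32 bits
 * survive, and the value is sign-extended when it is widened again. */
static offset64_t index_via_int(offset64_t idx)
{
    uint32_t low = (uint32_t) idx;            /* well-defined modulo 2^32 */
    if (low >= UINT32_C(0x80000000)) {
        /* bit 31 set: a 32-bit int would hold a negative value */
        return (offset64_t) low - INT64_C(4294967296);
    }
    return (offset64_t) low;
}

/* Mimics the patched code: the index keeps its full 64-bit width. */
static offset64_t index_full_width(offset64_t idx)
{
    return idx;
}

/* Small self-check; the address below is an assumed example value. */
static void run_checks(void)
{
    offset64_t addr = INT64_C(0x00007f7fb1309240);

    assert(index_full_width(addr) == addr);   /* unchanged */
    assert(index_via_int(addr) != addr);      /* upper bits lost */
    assert(index_via_int(1024) == 1024);      /* small values are fine */
}
```

Calling run_checks() from a main() should pass silently. An address of this
shape would come back as 0xffffffffb1309240 after the round trip through
int, which has the same form as the "Failing at address" values in the
report below.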

Cheers,

Gilles

On Wednesday, June 22, 2016, Nicolas Joly <nj...@pasteur.fr> wrote:

> On Wed, Jun 22, 2016 at 11:58:25AM +0900, Gilles Gouaillardet wrote:
> > Nicolas,
> >
> > can you please give the attached patch a try ?
> >
> > in my environment, it fixes your test case.
>
> Yes ! It does here too ...
>
> Just patched ADIOI_NFS_WriteStrided() using the same fix. And the
> original tool that crashed first on read, and later on write with
> MPI_BOTTOM now succeeds.
>
> > based on previous tests posted here, it is likely a similar bug should
> > be fixed for other filesystems.
>
> Thanks a lot.
>
> > Gilles
> >
> >
> > On 6/15/2016 12:42 AM, Nicolas Joly wrote:
> > >Hi,
> > >
> > >At work, I have some MPI codes that make use of custom datatypes to
> > >call MPI_File_read with MPI_BOTTOM ... It mostly works, except when
> > >the underlying filesystem is NFS, where it crashes with SIGSEGV.
> > >
> > >The attached sample (code + data) works just fine with 1.10.1 on my
> > >NetBSD/amd64 workstation using the UFS romio backend, but crashes if
> > >switched to NFS:
> > >
> > >njoly@issan [~]> mpirun --version
> > >mpirun (Open MPI) 1.10.1
> > >njoly@issan [~]> mpicc -g -Wall -o sample sample.c
> > >njoly@issan [~]> mpirun -n 2 ./sample ufs:data.txt
> > >rank1 ... 111111111133333333335555555555
> > >rank0 ... 000000000022222222224444444444
> > >njoly@issan [~]> mpirun -n 2 ./sample nfs:data.txt
> > >[issan:20563] *** Process received signal ***
> > >[issan:08879] *** Process received signal ***
> > >[issan:20563] Signal: Segmentation fault (11)
> > >[issan:20563] Signal code: Address not mapped (1)
> > >[issan:20563] Failing at address: 0xffffffffb1309240
> > >[issan:08879] Signal: Segmentation fault (11)
> > >[issan:08879] Signal code: Address not mapped (1)
> > >[issan:08879] Failing at address: 0xffffffff881b0420
> > >[issan:08879] [ 0] [issan:20563] [ 0] 0x7dafb14a52b0
> > ><__sigtramp_siginfo_2> at /usr/lib/libc.so.12
> > >[issan:20563] *** End of error message ***
> > >0x78b9886a52b0 <__sigtramp_siginfo_2> at /usr/lib/libc.so.12
> > >[issan:08879] *** End of error message ***
> > >--------------------------------------------------------------------------
> > >mpirun noticed that process rank 0 with PID 20563 on node issan exited on
> > >signal 11 (Segmentation fault).
> > >--------------------------------------------------------------------------
> > >njoly@issan [~]> gdb sample sample.core
> > >GNU gdb (GDB) 7.10.1
> > >[...]
> > >Core was generated by `sample'.
> > >Program terminated with signal SIGSEGV, Segmentation fault.
> > >#0  0x000078b98871971f in memcpy () from /usr/lib/libc.so.12
> > >[Current thread is 1 (LWP 1)]
> > >(gdb) bt
> > >#0  0x000078b98871971f in memcpy () from /usr/lib/libc.so.12
> > >#1  0x000078b974010edf in ADIOI_NFS_ReadStrided () from
> > >/usr/pkg/lib/openmpi/mca_io_romio.so
> > >#2  0x000078b97400bacf in MPIOI_File_read () from
> > >/usr/pkg/lib/openmpi/mca_io_romio.so
> > >#3  0x000078b97400bc72 in mca_io_romio_dist_MPI_File_read () from
> > >/usr/pkg/lib/openmpi/mca_io_romio.so
> > >#4  0x000078b988e72b38 in PMPI_File_read () from /usr/pkg/lib/libmpi.so.12
> > >#5  0x00000000004013a4 in main (argc=2, argv=0x7f7fff7b0f00) at sample.c:63
> > >
> > >Thanks.
> > >
> > >
> > >
> > >_______________________________________________
> > >users mailing list
> > >us...@open-mpi.org
> > >Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> > >Link to this post:
> > >http://www.open-mpi.org/community/lists/users/2016/06/29434.php
> >
>
> > diff --git a/ompi/mca/io/romio/romio/adio/ad_nfs/ad_nfs_read.c b/ompi/mca/io/romio/romio/adio/ad_nfs/ad_nfs_read.c
> > index 16f3a4d..2577f13 100644
> > --- a/ompi/mca/io/romio/romio/adio/ad_nfs/ad_nfs_read.c
> > +++ b/ompi/mca/io/romio/romio/adio/ad_nfs/ad_nfs_read.c
> > @@ -457,13 +457,14 @@ void ADIOI_NFS_ReadStrided(ADIO_File fd, void *buf, int count,
> >       }
> >       else {
> >  /* noncontiguous in memory as well as in file */
> > +            ADIO_Offset i;
> >
> >           ADIOI_Flatten_datatype(datatype);
> >           flat_buf = ADIOI_Flatlist;
> >           while (flat_buf->type != datatype) flat_buf = flat_buf->next;
> >
> >           k = num = buf_count = 0;
> > -         i = (int) (flat_buf->indices[0]);
> > +         i = flat_buf->indices[0];
> >           j = st_index;
> >           off = offset;
> >           n_filetypes = st_n_filetypes;
> > @@ -508,8 +509,8 @@ void ADIOI_NFS_ReadStrided(ADIO_File fd, void *buf, int count,
> >
> >                   k = (k + 1)%flat_buf->count;
> >                   buf_count++;
> > -                 i = (int) (buftype_extent*(buf_count/flat_buf->count) +
> > -                     flat_buf->indices[k]);
> > +                 i = buftype_extent*(buf_count/flat_buf->count) +
> > +                     flat_buf->indices[k];
> >                   new_brd_size = flat_buf->blocklens[k];
> >                   if (size != frd_size) {
> >                       off += size;
>
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> > Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/06/29494.php
>
> --
> Nicolas Joly
>
> Cluster & Computing Group
> Biology IT Center
> Institut Pasteur, Paris.
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/06/29504.php
>
