Nicolas,
Can you please give the attached patch a try?
In my environment, it fixes your test case.
Based on previous tests posted here, a similar bug likely needs to be
fixed for other filesystems as well.
Cheers,
Gilles
On 6/15/2016 12:42 AM, Nicolas Joly wrote:
Hi,
At work, I have some MPI codes that use custom datatypes to
call MPI_File_read with MPI_BOTTOM ... It mostly works, except when
the underlying filesystem is NFS, where it crashes with SIGSEGV.
The attached sample (code + data) works just fine with 1.10.1 on my
NetBSD/amd64 workstation using the UFS romio backend, but crashes when
switched to NFS:
njoly@issan [~]> mpirun --version
mpirun (Open MPI) 1.10.1
njoly@issan [~]> mpicc -g -Wall -o sample sample.c
njoly@issan [~]> mpirun -n 2 ./sample ufs:data.txt
rank1 ... 111111111133333333335555555555
rank0 ... 000000000022222222224444444444
njoly@issan [~]> mpirun -n 2 ./sample nfs:data.txt
[issan:20563] *** Process received signal ***
[issan:08879] *** Process received signal ***
[issan:20563] Signal: Segmentation fault (11)
[issan:20563] Signal code: Address not mapped (1)
[issan:20563] Failing at address: 0xffffffffb1309240
[issan:08879] Signal: Segmentation fault (11)
[issan:08879] Signal code: Address not mapped (1)
[issan:08879] Failing at address: 0xffffffff881b0420
[issan:08879] [ 0] [issan:20563] [ 0] 0x7dafb14a52b0 <__sigtramp_siginfo_2> at
/usr/lib/libc.so.12
[issan:20563] *** End of error message ***
0x78b9886a52b0 <__sigtramp_siginfo_2> at /usr/lib/libc.so.12
[issan:08879] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 20563 on node issan exited on
signal 11 (Segmentation fault).
--------------------------------------------------------------------------
njoly@issan [~]> gdb sample sample.core
GNU gdb (GDB) 7.10.1
[...]
Core was generated by `sample'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x000078b98871971f in memcpy () from /usr/lib/libc.so.12
[Current thread is 1 (LWP 1)]
(gdb) bt
#0 0x000078b98871971f in memcpy () from /usr/lib/libc.so.12
#1 0x000078b974010edf in ADIOI_NFS_ReadStrided () from /usr/pkg/lib/openmpi/mca_io_romio.so
#2 0x000078b97400bacf in MPIOI_File_read () from /usr/pkg/lib/openmpi/mca_io_romio.so
#3 0x000078b97400bc72 in mca_io_romio_dist_MPI_File_read () from /usr/pkg/lib/openmpi/mca_io_romio.so
#4 0x000078b988e72b38 in PMPI_File_read () from /usr/pkg/lib/libmpi.so.12
#5 0x00000000004013a4 in main (argc=2, argv=0x7f7fff7b0f00) at sample.c:63
Thanks.
Link to this post:
http://www.open-mpi.org/community/lists/users/2016/06/29434.php
diff --git a/ompi/mca/io/romio/romio/adio/ad_nfs/ad_nfs_read.c b/ompi/mca/io/romio/romio/adio/ad_nfs/ad_nfs_read.c
index 16f3a4d..2577f13 100644
--- a/ompi/mca/io/romio/romio/adio/ad_nfs/ad_nfs_read.c
+++ b/ompi/mca/io/romio/romio/adio/ad_nfs/ad_nfs_read.c
@@ -457,13 +457,14 @@ void ADIOI_NFS_ReadStrided(ADIO_File fd, void *buf, int count,
}
else {
/* noncontiguous in memory as well as in file */
+ ADIO_Offset i;
ADIOI_Flatten_datatype(datatype);
flat_buf = ADIOI_Flatlist;
while (flat_buf->type != datatype) flat_buf = flat_buf->next;
k = num = buf_count = 0;
- i = (int) (flat_buf->indices[0]);
+ i = flat_buf->indices[0];
j = st_index;
off = offset;
n_filetypes = st_n_filetypes;
@@ -508,8 +509,8 @@ void ADIOI_NFS_ReadStrided(ADIO_File fd, void *buf, int count,
k = (k + 1)%flat_buf->count;
buf_count++;
- i = (int) (buftype_extent*(buf_count/flat_buf->count) +
- flat_buf->indices[k]);
+ i = buftype_extent*(buf_count/flat_buf->count) +
+ flat_buf->indices[k];
new_brd_size = flat_buf->blocklens[k];
if (size != frd_size) {
off += size;