My apologies for taking so long to respond.

We do have a JIRA issue on file for HDF5 1.10.x with OpenMPI, and your issue is 
probably related. We do not think this is a Lustre issue; from a technical lead 
at Intel:


This isn’t a Lustre issue. From the trace, this definitely seems like a bug in 
openmpi / romio.

The romio in openmpi is pretty outdated, and we had bugs fixed in the mpich repo 
that haven't carried over to the ompi repo.

…

Meanwhile, depending on the ompi version he is using, he can try to use 
--mca io ompio, or report this to the openmpi mailing list.
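
For example, the io component can be selected on the mpirun command line (the 
executable name and process count below are placeholders, and whether the ompio 
component is available depends on your OpenMPI build):

    mpirun --mca io ompio -np 4 ./mytest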

I tried your test program with Cray MPI on Lustre (2.6), using HDF5 1.10.1 and 
the develop branch, and did not see any issues. So if you need to use 1.10.x, 
you will for now need to use an alternate MPI implementation.


Scot

On Dec 7, 2017, at 3:44 AM, Bertini, Denis Dr. <d.bert...@gsi.de> wrote:

Hi,
I am facing a problem with data chunking on a Lustre 2.10 (and 2.6) filesystem, 
using HDF5 1.10.1 in parallel mode.
I attached to my mail a simple C program which crashes immediately at line 94, 
when trying to create the dataset collectively.
I observe the crash when I simply set the chunk size to be the same as the 
dataset size. I know that this is one of the "non recommended" setups according 
to your documentation ("Pitfalls"):
https://support.hdfgroup.org/HDF5/doc1.8/Advanced/Chunking/index.html
But leaving the performance penalty aside, it should not cause a complete crash 
of the program.
Furthermore, testing the same program with the older HDF5 version 1.8.16 does 
not cause any crash on the same Lustre 2.10 (or 2.6) version. So it seems that 
something has changed in the data chunking implementation between the two major 
HDF5 versions, 1.8.x and 1.10.x.

Could you please tell me how the data chunk size in the program should be 
changed when using the new HDF5 1.10.x versions?
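
(The attached mytest.c is not reproduced in this thread. Purely for 
illustration, a minimal sketch of the kind of setup described above could look 
like the following; the file name, dataset name, 1-D shape, and sizes are all 
hypothetical, and the commented-out line shows the more usual choice of a chunk 
smaller than the dataset.)

/* Illustrative sketch only -- not the attached mytest.c.  A 1-D dataset is
 * created collectively through the MPI-IO driver, with the chunk dimensions
 * equal to the full dataset dimensions, as described above.
 * Error checking is omitted for brevity. */
#include <mpi.h>
#include <hdf5.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* File access property list: use the MPI-IO file driver. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("chunk_test.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    hsize_t dims[1]  = {176000};   /* illustrative 1-D dataset size               */
    hsize_t chunk[1] = {176000};   /* chunk dims == dataset dims: the setup above */
    /* chunk[0] = dims[0] / 8; */  /* a smaller chunk is the usual recommendation */

    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);

    /* Dataset creation must be called collectively in parallel HDF5; per the
     * backtrace below, the reported failure occurs inside MPI_File_set_view /
     * ADIOI_Flatten of the romio314 component during this step. */
    hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_FLOAT, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    H5Dclose(dset);
    H5Pclose(dcpl);
    H5Sclose(space);
    H5Fclose(file);
    H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}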

Thanks in advance,
Denis Bertini

PS:
Here is the core dump that I observe as soon as I use more than one MPI process:
H5Pcreate access succeed
H5Pcreate access succeed
-I- Chunk size 176000:
-I- Chunk size 176000:
[lxbk0341:39368] *** Process received signal ***
[lxbk0341:39368] Signal: Segmentation fault (11)
[lxbk0341:39368] Signal code: Address not mapped (1)
[lxbk0341:39368] Failing at address: (nil)
[lxbk0341:39368] [ 0] 
/lib/x86_64-linux-gnu/libpthread.so.0(+0xf890)[0x7f7742122890]
[lxbk0341:39368] [ 1] 
/lustre/hebe/rz/dbertini/plasma/softw/lib/openmpi/mca_io_romio314.so(ADIOI_Flatten+0x1577)[0x7f772e8ac657]
[lxbk0341:39368] [ 2] 
/lustre/hebe/rz/dbertini/plasma/softw/lib/openmpi/mca_io_romio314.so(ADIOI_Flatten_datatype+0xe3)[0x7f772e8ad363]
[lxbk0341:39368] [ 3] 
/lustre/hebe/rz/dbertini/plasma/softw/lib/openmpi/mca_io_romio314.so(ADIO_Set_view+0x1fd)[0x7f772e8a2f5d]
[lxbk0341:39368] [ 4] 
/lustre/hebe/rz/dbertini/plasma/softw/lib/openmpi/mca_io_romio314.so(mca_io_romio314_dist_MPI_File_set_view+0x2f6)[0x7f772e889e06]
[lxbk0341:39368] [ 5] 
/lustre/hebe/rz/dbertini/plasma/softw/lib/openmpi/mca_io_romio314.so(mca_io_romio314_file_set_view+0x22)[0x7f772e883802]
[lxbk0341:39368] [ 6] 
/lustre/hebe/rz/dbertini/plasma/softw/lib/libmpi.so.40(MPI_File_set_view+0xdd)[0x7f77423bfb2d]
<mytest.c>
