Hi Daniel,
As Mohamad alluded to, we have developed a framework for auto-tuning HDF5
applications which is going to be presented at this year's
Supercomputing conference:
http://sc13.supercomputing.org/schedule/event_detail.php?evid=pap511
I have recently installed this framework on Blue Waters. If you are
interested in further improving your application's I/O performance, I'd
be happy to help; feel free to contact me directly to follow up.
Thanks,
Babak
On 09/03/2013 11:14 AM, Mohamad Chaarawi wrote:
Hi Daniel,
I'm not sure what the issue with the forum email list is, but nobody seems to
have this problem. Just make sure you are always sending your messages and
replies to [email protected]; not another address.
I'll ask the sysadmins to look into this issue more.
Now, to your results: the multiple-file strategy is, at least in most
cases, going to be the fastest strategy, since there is no lock contention
and no inter-process communication overhead.
The difference in performance with the single file strategy still seems a bit
high in your case, but again I'm saying this with a total lack of knowledge on
how your benchmark/application is accessing the file. I do not believe chunking
will help here.
One thing worth trying is varying the number of MPI aggregators. Which MPI
library are you using? The MPI-IO layer is most probably ROMIO, so it should
accept info hints (I'm not sure whether the top-level implementation ignores
those hints, but you can check anyway).
So use an MPI info object, passed to H5Pset_fapl_mpio(), to set the number
of MPI aggregators (cb_nodes) and the collective buffer size
(cb_buffer_size). A full list of ROMIO hints can be found here:
http://www.mcs.anl.gov/research/projects/romio/doc/users-guide.pdf
I would set cb_nodes to the stripe count, and try cb_buffer_size equal to the
stripe size. Those are not necessarily the ideal values, but it's best to start there.
I know that all this tuning is a burden for an application user of HDF5, but
that is what needs to be done today to get good performance. There has been
some work aimed at auto-tuning this parameter space with a separate tool, but
it is not yet user-friendly enough for someone to simply grab, deploy, and run.
Thanks,
Mohamad
-----Original Message-----
From: Hdf-forum [mailto:[email protected]] On Behalf Of
Daniel Langr
Sent: Tuesday, September 03, 2013 10:38 AM
To: [email protected]
Subject: Re: [Hdf-forum] Very poor performance of pHDF5 when using single
(shared) file
Mohamad,
I really do not understand how to reply to this forum :(. I tried to reply to
your post, which I received via e-mail. In this e-mail, there was the following
note:
"
If you reply to this email, your message will be added to the discussion
below:
http://hdf-forum.184993.n3.nabble.com/Very-poor-performance-of-pHDF5-when-using-single-shared-file-tp4026443p4026449.html
"
So, I replied to this e-mail, and received another one:
Subject: Delivery Status Notification (Failure)
"
Delivery to the following recipient failed permanently:
[email protected]
Your email to [email protected] has been rejected
because you are not allowed to post to
http://hdf-forum.184993.n3.nabble.com/Very-poor-performance-of-pHDF5-when-using-single-shared-file-tp4026443p4026449.html
. Please contact the owner about permissions or visit the Nabble Support
forum.
"
What the hell... why does it say I should reply and then that I am not
allowed to post to my own thread???
Anyway, I tried to post the following information:
I did some experiments yesterday using the Blue Waters cluster. The
stripe count is limited there to 160. For runs with 256 MPI
processes/cores and fixed datasets, the writing times were:
separate files: 1.36 [s]
single file, 1 stripe: 133.6 [s]
single file, best result: 17.2 [s]
(I did multiple runs with various combinations of stripe count and size;
I am presenting the best results I obtained.)
Increasing the number of stripes obviously helped a lot, but compared
with the separate-files strategy, writing is still more than ten times
slower. Do you think that is "normal"?
Might chunking help here?
Thanks,
Daniel
On 30. 8. 2013 16:05, Daniel Langr wrote:
I've run a benchmark in which, within an MPI program, each process wrote
3 plain 1D arrays to 3 datasets of an HDF5 file. I used the following
writing strategies:
1) each process writes to its own file,
2) each process writes to its own dataset in a single shared file,
3) all processes write to the same dataset in a single shared file.
I've tested 1)-3) for both fixed/chunked datasets (chunk size 1024), and
I've tested 2)-3) for both independent/collective options of the MPI
driver. I've also used 3 different clusters for measurements (all quite
modern).
As a result, the running (storage) times of the same-file strategies, i.e.
2) and 3), were orders of magnitude longer than the running times of
the separate-files strategy. For illustration:
cluster #1, 512 MPI processes, each process stores 100 MB of data, fixed
data sets:
1) separate files: 2.73 [s]
2) single file, independent calls, separate data sets: 88.54 [s]
cluster #2, 256 MPI processes, each process stores 100 MB of data,
chunked data sets (chunk size 1024):
1) separate files: 10.40 [s]
2) single file, independent calls, shared data sets: 295 [s]
3) single file, collective calls, shared data sets: 3275 [s]
Any idea why the single-file strategy gives so poor writing performance?
Daniel
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org