On Fri, Sep 20, 2013 at 01:34:24PM +0000, Biddiscombe, John A. wrote:

> This morning, I did some poking around and found that the cmake-based
> configure of hdf has a nasty bug that causes H5_HAVE_GPFS to be set
> to false, so no GPFS optimizations are compiled in (libgpfs is not
> detected). Having tweaked that, you can imagine my happiness when I
> recompiled everything and now I'm getting even worse bandwidth.
Thanks for the report on those hints. HDF5 contains, outside of
GPFS-specific benchmarks, one of the few implementations of all the
gpfs_fcntl() tuning parameters. Given your experience, it's probably
best to turn those hints off.

Also, cmake works on bluegene? wow. Don't forget that bluegene
requires cross compilation.

> In fact if I enable collective IO, the app coredumps on me, so the
> situation is worse than I had feared. I'm using too much memory in
> my test, I suspect, and collectives are pushing me over the limit.
> The only test I can run with collective enabled is the one that uses
> only one rank and writes 16MB!

How many processes per node are you using on your BGQ? If you are
loading up with 64 procs per node, that will give each one about
200-230 MiB of scratch space. I wonder if you have built some or all
of your HDF5 library for the front-end nodes, and some or none for the
compute nodes? How many processes are you running here?

A month back I ran some one-rack experiments:
https://www.dropbox.com/s/89wmgmf1b1ung0s/mira_hinted_api_compare.png

Here's my IOR config file. Note two tuning parameters here:

- "bg_nodes_pset", which showed up on Blue Gene/L, is way, way too low
  for Blue Gene/Q.
- the 'bglockless' prefix is "robl's secret turbo button". It was fun
  to pull that rabbit out of the hat... for the first few years. (It's
  not the default because in one specific case performance is
  shockingly poor.)

IOR START
    numTasks=65536
    repetitions=3
    reorderTasksConstant=1024
    fsync=1
    transferSize=6M
    blockSize=6M
    collective=1
    showHints=1
    hintsFileName=IOR-hints-bg_nodes_pset.64

    testFile=bglockless:/gpfs/mira-fs0/projects/SSSPPg/robl/ior-shared/io-api.mpi
    api=MPIIO
RUN
    api=HDF5
    testFile=bglockless:/gpfs/mira-fs0/projects/SSSPPg/robl/ior-shared/io-api.h5
RUN
    api=NCMPI
    testFile=bglockless:/gpfs/mira-fs0/projects/SSSPPg/robl/ior-shared/io-api.nc
RUN
IOR STOP

> Rob: you mentioned some fcntl functions were deprecated etc. Do I
> need to remove these to stop the coredumps?
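The 200-230 MiB scratch estimate for a fully loaded node checks out
arithmetically. A quick sketch (16 GiB of RAM per BG/Q compute node is
standard; the node-wide overhead range assumed here for CNK, the
application image, and MPI buffers is a rough guess, not a measured
figure):

```python
# Back-of-the-envelope check of the per-rank scratch-space estimate.
# Assumes 16 GiB of RAM per BG/Q compute node; the overhead range for
# CNK, the application image, and MPI buffers is a rough guess.
NODE_RAM_MIB = 16 * 1024
RANKS_PER_NODE = 64

for overhead_mib in (1664, 3584):   # ~1.6 GiB to ~3.5 GiB node-wide
    scratch = (NODE_RAM_MIB - overhead_mib) / RANKS_PER_NODE
    print(f"overhead {overhead_mib} MiB -> {scratch:.0f} MiB per rank")
# prints 230 and 200 MiB per rank, bracketing the 200-230 MiB estimate
```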
> (I'm very much hoping something has gone wrong with my tests because
> the performance is shockingly bad ...) (NB. my version is
> 1.8.12-snap17)

Unless you are running BGQ system software driver V1R2M1, the
gpfs_fcntl hints do not get forwarded to storage, and return an error.
It's possible HDF5 responds to that error with a core dump?

==rob

> JB
>
> > -----Original Message-----
> > From: Hdf-forum [mailto:[email protected]]
> > On Behalf Of Daniel Langr
> > Sent: 20 September 2013 13:46
> > To: HDF Users Discussion List
> > Subject: Re: [Hdf-forum] Very poor performance of pHDF5 when using
> > single (shared) file
> >
> > Rob,
> >
> > thanks a lot for the hints. I will look at the suggested option and
> > try some experiments with it :).
> >
> > Daniel
> >
> > On 17. 9. 2013 15:34, Rob Latham wrote:
> > > On Tue, Sep 17, 2013 at 11:15:02AM +0200, Daniel Langr wrote:
> > >> separate files: 1.36 [s]
> > >> single file, 1 stripe: 133.6 [s]
> > >> single file, best result: 17.2 [s]
> > >>
> > >> (I did multiple runs with various combinations of stripe count
> > >> and size, presenting the best results I obtained.)
> > >>
> > >> Increasing the number of stripes obviously helped a lot, but
> > >> compared with the separate-files strategy, the writing time is
> > >> still more than ten times slower. Do you think this is "normal"?
> > >
> > > It might be "normal" for Lustre, but it's not good. I wish I had
> > > more experience tuning the Cray/MPI-IO/Lustre stack, but I do
> > > not. The ADIOS folks report that tuned HDF5 to a single shared
> > > file runs about 60% slower than ADIOS to multiple files, not 10x
> > > slower, so it seems there is room for improvement.
> > >
> > > I've asked them about the kinds of things "tuned HDF5" entails,
> > > and they didn't know (!).
> > >
> > > There are quite a few settings documented in the intro_mpi(3)
> > > man page. MPICH_MPIIO_CB_ALIGN will probably be the most
> > > important thing you can try.
> > > I'm sorry to report that in my limited experience, the
> > > documentation and reality are sometimes out of sync, especially
> > > with respect to which settings are default or not.
> > >
> > > ==rob
> > >
> > >> Thanks, Daniel
> > >>
> > >> On 30. 8. 2013 16:05, Daniel Langr wrote:
> > >>> I've run some benchmarks where, within an MPI program, each
> > >>> process wrote 3 plain 1D arrays to 3 datasets of an HDF5 file.
> > >>> I used the following writing strategies:
> > >>>
> > >>> 1) each process writes to its own file,
> > >>> 2) each process writes to the same file, to its own dataset,
> > >>> 3) each process writes to the same file, to a shared dataset.
> > >>>
> > >>> I tested 1)-3) for both fixed/chunked datasets (chunk size
> > >>> 1024), and I tested 2)-3) for both independent/collective
> > >>> options of the MPI driver. I also used 3 different clusters
> > >>> for the measurements (all quite modern).
> > >>>
> > >>> As a result, the running (storage) times of the same-file
> > >>> strategy, i.e. 2) and 3), were orders of magnitude longer than
> > >>> the running times of the separate-files strategy. For
> > >>> illustration:
> > >>>
> > >>> cluster #1, 512 MPI processes, each process stores 100 MB of
> > >>> data, fixed datasets:
> > >>>
> > >>> 1) separate files: 2.73 [s]
> > >>> 2) single file, independent calls, separate datasets: 88.54 [s]
> > >>>
> > >>> cluster #2, 256 MPI processes, each process stores 100 MB of
> > >>> data, chunked datasets (chunk size 1024):
> > >>>
> > >>> 1) separate files: 10.40 [s]
> > >>> 2) single file, independent calls, shared datasets: 295 [s]
> > >>> 3) single file, collective calls, shared datasets: 3275 [s]
> > >>>
> > >>> Any idea why the single-file strategy gives such poor writing
> > >>> performance?
> > >>>
> > >>> Daniel
> > >>
> > >> _______________________________________________
> > >> Hdf-forum is for HDF software users discussion.
> > >> [email protected] > > >> http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgr > > >> oup.org > > > > > > > _______________________________________________ Hdf-forum is for > > HDF software users discussion. [email protected] > > http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org > > _______________________________________________ Hdf-forum is for HDF > software users discussion. [email protected] > http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org -- Rob Latham Mathematics and Computer Science Division Argonne National Lab, IL USA _______________________________________________ Hdf-forum is for HDF software users discussion. [email protected] http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
