Yes, I agree with you. I believe I already ran that test, using one file per MPI process: each process opens a file whose name is suffixed with its rank, via MPI_File_open(MPI_COMM_SELF, ...). That showed several times better performance (with np = 4 or 8 on my workstation) than a single MPI process (np = 1) can achieve. As I mentioned before: "As for the local disk, at least 2 times faster than a single MPI process can achieve. As for the ramdisk, at least 5 times faster. Lustre, I know, is at least 7-8 times or more faster depending on the configuration." However, when a single file is shared by multiple MPI processes (np > 1), the sum of the write speeds of all MPI processes is at most the performance of a single-process run (np = 1).
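For reference, the per-process-file variant looks roughly like this (a sketch only: the file-name prefix, sizes, and omitted error checks are illustrative, not the exact benchmark code):

    /* One file per process, opened with MPI_COMM_SELF, so every rank
     * writes independently. Sizes and names are illustrative. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define CHUNK   (1 << 20)   /* 1 MiB per write */
    #define NCHUNKS 1000        /* 1000 writes -> ~1 GiB per process */

    int main(int argc, char **argv)
    {
        int rank;
        char fname[64];
        char *buf;
        MPI_File fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* file name followed by the rank, e.g. testfile.3 */
        snprintf(fname, sizeof(fname), "testfile.%d", rank);
        buf = malloc(CHUNK);
        memset(buf, rank & 0xff, CHUNK);

        MPI_File_open(MPI_COMM_SELF, fname,
                      MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);
        for (int i = 0; i < NCHUNKS; i++)
            MPI_File_write(fh, buf, CHUNK, MPI_BYTE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        free(buf);
        MPI_Finalize();
        return 0;
    }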
I expected simple MPI file I/O to scale, at least for a small number of processes, but I don't see that at all now. I ran it on a shared-memory machine with tens of cores and saw the same results. Any idea?

David

On Mon, Apr 6, 2020 at 10:47 AM Gabriel, Edgar <egabr...@central.uh.edu> wrote:

> The one test that would give you a good idea of the upper bound for your scenario would be to write a benchmark where each process writes to a separate file, and look at the overall bandwidth achieved across all processes. The MPI I/O performance will be less than or equal to the bandwidth achieved in this scenario, as long as the number of processes is moderate.
>
> Thanks
> Edgar
>
> From: Dong-In Kang <dik...@gmail.com>
> Sent: Monday, April 6, 2020 9:34 AM
> To: Collin Strassburger <cstrassbur...@bihrle.com>
> Cc: Open MPI Users <users@lists.open-mpi.org>; Gabriel, Edgar <egabr...@central.uh.edu>
> Subject: Re: [OMPI users] Slow collective MPI File IO
>
> Hi Collin,
>
> It is written in C. So, I think it is OK.
>
> Thank you,
> David
>
> On Mon, Apr 6, 2020 at 10:19 AM Collin Strassburger <cstrassbur...@bihrle.com> wrote:
>
> Hello,
>
> Just a quick comment on this; is your code written in C/C++ or Fortran? Fortran has issues writing at a decent speed regardless of the MPI setup, and as such should be avoided for file I/O (yet I still occasionally see it implemented).
>
> Collin
>
> From: users <users-boun...@lists.open-mpi.org> On Behalf Of Dong-In Kang via users
> Sent: Monday, April 6, 2020 10:02 AM
> To: Gabriel, Edgar <egabr...@central.uh.edu>
> Cc: Dong-In Kang <dik...@gmail.com>; Open MPI Users <users@lists.open-mpi.org>
> Subject: Re: [OMPI users] Slow collective MPI File IO
>
> Thank you Edgar for the information.
>
> I also tried MPI_File_write_at_all(), but it usually makes the performance worse.
>
> My program is very simple. Each MPI process writes a consecutive portion of a file, with no interleaving among the MPI processes. I think in this case I can use MPI_File_write_at().
>
> I tested the maximum bandwidth of the target devices, and it is at least a few times higher than what a single process can achieve. I tested it using the same program, but opening individual files with MPI_COMM_SELF. I tested 32MB and 512MB chunks; there are performance differences between the two, but neither makes multi-process file I/O exceed the performance of single-process file I/O. As for the local disk, it is at least 2 times faster than a single MPI process can achieve; for the ramdisk, at least 5 times faster; for Lustre, I know it is at least 7-8 times or more, depending on the configuration.
>
> About the caching effect: that would apply to MPI_File_read(). I can see very high bandwidth with MPI_File_read(), which I believe comes from caches in RAM. But as for MPI_File_write(), I don't think it is affected by caching. And I create a new file for each test and remove it at the end of the test.
>
> I may be making a very simple mistake, but I don't know what it is. I have seen a few reports on the internet that MPI file I/O can achieve a multiple of single-process file I/O speed when a faster file system such as Lustre is used.
>
> I started this experiment because I couldn't get a speedup on the Lustre file system. I then moved the experiment to a ramdisk and a local disk, because that removes the issue of Lustre configuration.
>
> Any comments are welcome.
>
> David
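For concreteness, the shared-file pattern described in the message above is essentially the following fragment (reusing rank, buf, CHUNK, and NCHUNKS from the earlier sketch; the file name is again illustrative, not from the actual benchmark):

    /* Single shared file: rank r writes NCHUNKS consecutive chunks
     * starting at byte offset r * NCHUNKS * CHUNK, so the per-rank
     * regions are contiguous and never overlap. */
    MPI_File fh;
    MPI_Offset my_base = (MPI_Offset)rank * NCHUNKS * CHUNK;

    MPI_File_open(MPI_COMM_WORLD, "shared_testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);
    for (int i = 0; i < NCHUNKS; i++)
        MPI_File_write_at(fh, my_base + (MPI_Offset)i * CHUNK,
                          buf, CHUNK, MPI_BYTE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);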
> On Mon, Apr 6, 2020 at 9:03 AM Gabriel, Edgar <egabr...@central.uh.edu> wrote:
>
> Hi,
>
> A couple of comments. First, if you use MPI_File_write_at, this is usually not considered collective I/O, even if executed by multiple processes. MPI_File_write_at_all would be collective I/O.
>
> Second, MPI I/O cannot do 'magic', but is bound by the hardware that you provide. If a single process is already able to saturate the bandwidth of your file system and hardware, you will not see performance improvements from multiple processes (with some minor exceptions due to caching effects, but only for smaller problem sizes; the larger the amount of data you try to write, the smaller the caching effects become in file I/O). So the first question you have to answer is: what is the sustained bandwidth of your hardware, and are you able to saturate it already with a single process? If you are using a single hard drive (or even 2 or 3 hard drives in a RAID 0 configuration), this is almost certainly the case.
>
> Lastly, the configuration parameters of your tests also play a major role. As a general rule, the larger the amount of data you provide per file I/O call, the better the performance will be. 1MB of data per call is probably on the smaller side. The ompio implementation of MPI I/O internally breaks large individual I/O operations (e.g. MPI_File_write_at) into chunks of 512MB for performance reasons. Large collective I/O operations (e.g. MPI_File_write_at_all) are broken into chunks of 32MB. This gives you some hints on the quantities of data you would have to use for performance reasons.
>
> Along the same lines, one final comment. You say you did 1000 writes of 1MB each. For a single process that is about 1GB of data. Depending on how much main memory your PC has, this amount of data can still be cached on modern systems, and you might have an unrealistically high bandwidth value for the 1-process case that you are comparing against (it depends a bit on what your benchmark does, and whether you force flushing the data to disk inside your measurement loop).
>
> Hope this gives you some pointers on where to start to look.
>
> Thanks
> Edgar
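Edgar's last point about caching can be checked by forcing the flush inside the timed region. A minimal sketch of such a harness, continuing the fragments above (the MPI_File_sync() call is a suggested addition for an honest measurement, not something from the original benchmark described below):

    /* Time the writes between two barriers, flushing before the second
     * barrier so cached-but-unwritten data does not inflate the number. */
    int nprocs;
    double t0, t1, mibps;

    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();

    for (int i = 0; i < NCHUNKS; i++)
        MPI_File_write_at(fh, my_base + (MPI_Offset)i * CHUNK,
                          buf, CHUNK, MPI_BYTE, MPI_STATUS_IGNORE);
    MPI_File_sync(fh);   /* force the data out to the storage device */

    MPI_Barrier(MPI_COMM_WORLD);
    t1 = MPI_Wtime();

    /* aggregate bandwidth across all ranks */
    mibps = ((double)nprocs * NCHUNKS * CHUNK) / (t1 - t0) / (1024.0 * 1024.0);
    if (rank == 0)
        printf("aggregate write bandwidth: %.1f MiB/s\n", mibps);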
> From: users <users-boun...@lists.open-mpi.org> On Behalf Of Dong-In Kang via users
> Sent: Monday, April 6, 2020 7:14 AM
> To: users@lists.open-mpi.org
> Cc: Dong-In Kang <dik...@gmail.com>
> Subject: [OMPI users] Slow collective MPI File IO
>
> Hi,
>
> I am running an MPI program where N processes write to a single file on a single shared-memory machine. I'm using Open MPI v4.0.2. Each MPI process writes a 1MB chunk of data 1K times, sequentially. There is no overlap in the file between any two MPI processes. I ran the program for np = {1, 2, 4, 8}. I am seeing that the speed of the collective write to a file for np = {2, 4, 8} never exceeds the speed for np = {1}. I did the experiment with a few different file systems {local disk, ramdisk, Lustre}. For all of them, I see similar results: the speed of a collective write to a single shared file never exceeds the speed of the single-MPI-process case. Any tips or suggestions?
>
> I used the MPI_File_write_at() routine with the proper offset for each MPI process. (I also tried the MPI_File_write_at_all() routine, which makes the performance worse as np gets bigger.) Before writing, MPI_Barrier() is called. The start time is taken right after MPI_Barrier() using MPI_Wtime(); the end time is taken right after another MPI_Barrier(). The speed of the collective write is calculated as (total amount of data written to the file) / (time between the first MPI_Barrier() and the second MPI_Barrier()).
>
> Any idea how to increase the speed?
>
> Thanks,
> David
>
> --
> =========
> Jesus is My Lord!

--
=========
Jesus is My Lord!