Re: [OMPI users] file/process write speed is not scalable
Hi David,

could you specify which version of Open MPI you are using? I also have some parallel I/O trouble with one code but have not yet investigated.

Thanks

Patrick

On 13/04/2020 at 17:11, Dong-In Kang via users wrote:
>
> Thank you for your suggestion.
> I am more concerned about the poor performance of the one-MPI-process-per-socket case.
> That model fits my real workload better.
> The performance that I see is much worse than what the underlying hardware can support.
> The best case (all MPI processes on a single socket) is pretty good, about 80+% of the underlying hardware's speed.
> However, the one-MPI-process-per-socket model achieves only 30% of what I get with all MPI processes on a single socket.
> Both are doing the same thing - independent file writes.
> I used all the OSTs available.
>
> As a reference point, I ran the same test on a ramdisk.
> In both cases the performance scales very well, and the results are close.
>
> There seems to be extra overhead when multiple sockets are used for independent file I/O with Lustre.
> I don't know what causes that overhead.
>
> Thanks,
> David
>
>
> On Thu, Apr 9, 2020 at 11:07 PM Gilles Gouaillardet via users <users@lists.open-mpi.org> wrote:
>
>     Note there could be some NUMA-I/O effect, so I suggest you compare
>     running every MPI task on socket 0, then every MPI task on socket 1,
>     and so on, and then compare to running one MPI task per socket.
>
>     Also, what performance do you measure?
>     - Is it in line with the filesystem/network expectation?
>     - Or is it much higher (in which case you are benchmarking the I/O cache)?
>
>     FWIW, I usually write files whose cumulative size is four times the
>     node memory to avoid local caching effects
>     (if you have a lot of RAM, that might take a while ...).
>
>     Keep in mind Lustre is also sensitive to the file layout.
>     If you write one file per task, you likely want to use all the
>     available OSTs, but no striping.
>     If you want to write into a single file with 1 MB blocks per MPI task,
>     you likely want to stripe with 1 MB blocks,
>     and use the same number of OSTs as MPI tasks (so each MPI task ends
>     up writing to its own OST).
>
>     Cheers,
>
>     Gilles
>
>     On Fri, Apr 10, 2020 at 6:41 AM Dong-In Kang via users <users@lists.open-mpi.org> wrote:
>     >
>     > Hi,
>     >
>     > I'm running the IOR benchmark on a big shared-memory machine with a Lustre file system.
>     > I set up IOR to use an independent file per process so that the aggregated bandwidth is maximized.
>     > I ran N MPI processes, where N < the number of cores in a socket.
>     > When I put those N MPI processes on a single socket, the write performance is scalable.
>     > However, when I put those N MPI processes on N sockets (so, 1 MPI process/socket),
>     > the performance does not scale, and stays the same for more than 4 MPI processes.
>     > I expected it to be as scalable as the case of N processes on a single socket.
>     > But it is not.
>     >
>     > I think that if each MPI process writes to an independent file, there should be no file locking among MPI processes. However, there seems to be some. Is there any way to avoid that locking or overhead? It may not be a file-lock issue, but I don't know the exact reason for the poor performance.
>     >
>     > Any help will be appreciated.
>     >
>     > David
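To make the shared-file layout advice above concrete, here is a minimal C sketch (not from the thread) that requests 1 MB stripes and one OST per MPI task through the MPI-IO reserved hints "striping_unit" and "striping_factor". The Lustre path is a placeholder, and it assumes the MPI-IO layer in use (ROMIO or OMPIO) forwards these hints to Lustre and that the file is being created; for the file-per-process case, the layout is normally set on the output directory with lfs setstripe instead.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Shared-file layout: 1 MB stripes and one OST per MPI task,
           so each task ends up writing to its own OST. */
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "striping_unit", "1048576");      /* 1 MB stripe size */
        char nost[16];
        snprintf(nost, sizeof(nost), "%d", nprocs);
        MPI_Info_set(info, "striping_factor", nost);          /* stripe count = number of tasks */

        MPI_File fh;
        /* Placeholder path on the Lustre filesystem; striping hints only
           take effect when the file is created. */
        MPI_File_open(MPI_COMM_WORLD, "/lustre/scratch/shared.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

        /* Each rank writes one 1 MB block at a rank-aligned offset. */
        size_t blk = 1048576;
        char *buf = calloc(blk, 1);
        MPI_File_write_at(fh, (MPI_Offset)rank * (MPI_Offset)blk, buf, (int)blk,
                          MPI_BYTE, MPI_STATUS_IGNORE);

        free(buf);
        MPI_File_close(&fh);
        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
    }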
Re: [OMPI users] file/process write speed is not scalable
I'm using Open MPI v4.0.2. Is your problem similar to mine?

Thanks,
David

On Tue, Apr 14, 2020 at 7:33 AM Patrick Bégou via users <users@lists.open-mpi.org> wrote:

> Hi David,
>
> could you specify which version of Open MPI you are using?
> I also have some parallel I/O trouble with one code but have not yet investigated.
> Thanks
>
> Patrick
Re: [OMPI users] Meaning of mpiexec error flags
Then those flags are correct. I suspect mpirun is executing on n006, yes?

The "location verified" flag just means that the daemon of rank N reported back from the node we expected it to be on - Slurm and Cray sometimes renumber the ranks. Torque doesn't, and so you should never see a problem. Since mpirun itself isn't launched, its node is never "verified", though I probably should alter that, as it is obviously in the "right" place.

I don't know what you mean by your app not behaving correctly on the remote nodes - my best guess is that some environment variable it needs isn't being forwarded?

On Apr 14, 2020, at 2:04 AM, Mccall, Kurt E. (MSFC-EV41) <kurt.e.mcc...@nasa.gov> wrote:

CentOS, Torque.

From: Ralph Castain <r...@open-mpi.org>
Sent: Monday, April 13, 2020 5:44 PM
To: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mcc...@nasa.gov>
Subject: [EXTERNAL] Re: [OMPI users] Meaning of mpiexec error flags

What kind of system are you running on? Slurm? Cray? ...?

On Apr 13, 2020, at 3:11 PM, Mccall, Kurt E. (MSFC-EV41) <kurt.e.mcc...@nasa.gov> wrote:

Thanks Ralph. So the difference between the working node's flags (0x11) and the non-working nodes' flags (0x13) is the flag PRRTE_NODE_FLAG_LOC_VERIFIED. What does that imply? That the location of the daemon has NOT been verified?

Kurt

From: users <users-boun...@lists.open-mpi.org> On Behalf Of Ralph Castain via users
Sent: Monday, April 13, 2020 4:47 PM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Ralph Castain <r...@open-mpi.org>
Subject: [EXTERNAL] Re: [OMPI users] Meaning of mpiexec error flags

I updated the message to explain the flags (instead of a numerical value) for OMPI v5. In brief:

#define PRRTE_NODE_FLAG_DAEMON_LAUNCHED  0x01  // whether or not the daemon on this node has been launched
#define PRRTE_NODE_FLAG_LOC_VERIFIED     0x02  // whether or not the location has been verified - used for
                                               // environments where the daemon's final destination is uncertain
#define PRRTE_NODE_FLAG_OVERSUBSCRIBED   0x04  // whether or not this node is oversubscribed
#define PRRTE_NODE_FLAG_MAPPED           0x08  // whether we have been added to the current map
#define PRRTE_NODE_FLAG_SLOTS_GIVEN      0x10  // the number of slots was specified - used only in non-managed environments
#define PRRTE_NODE_NON_USABLE            0x20  // the node is hosting a tool and is NOT to be used for jobs

On Apr 13, 2020, at 2:15 PM, Mccall, Kurt E. (MSFC-EV41) via users <users@lists.open-mpi.org> wrote:

My application is behaving correctly on node n006, and incorrectly on the lower-numbered nodes. The flags in the error message below may give a clue as to why. What is the meaning of the flag values 0x11 and 0x13?

== ALLOCATED NODES ==
n006: flags=0x11 slots=3 max_slots=0 slots_inuse=2 state=UP
n005: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
n004: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
n003: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
n002: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
n001: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP

I'm using Open MPI 4.0.3.

Thanks,
Kurt
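For readers decoding the hex values by hand, the following small stand-alone C program (not part of Open MPI; the bit values are simply copied from Ralph's list above) prints which flags are set, confirming that 0x11 and 0x13 differ only by PRRTE_NODE_FLAG_LOC_VERIFIED.

    #include <stdio.h>

    /* Bit values copied from the PRRTE node-flag list above. */
    #define PRRTE_NODE_FLAG_DAEMON_LAUNCHED 0x01
    #define PRRTE_NODE_FLAG_LOC_VERIFIED    0x02
    #define PRRTE_NODE_FLAG_OVERSUBSCRIBED  0x04
    #define PRRTE_NODE_FLAG_MAPPED          0x08
    #define PRRTE_NODE_FLAG_SLOTS_GIVEN     0x10
    #define PRRTE_NODE_NON_USABLE           0x20

    static void decode(unsigned int flags)
    {
        printf("0x%02x =", flags);
        if (flags & PRRTE_NODE_FLAG_DAEMON_LAUNCHED) printf(" DAEMON_LAUNCHED");
        if (flags & PRRTE_NODE_FLAG_LOC_VERIFIED)    printf(" LOC_VERIFIED");
        if (flags & PRRTE_NODE_FLAG_OVERSUBSCRIBED)  printf(" OVERSUBSCRIBED");
        if (flags & PRRTE_NODE_FLAG_MAPPED)          printf(" MAPPED");
        if (flags & PRRTE_NODE_FLAG_SLOTS_GIVEN)     printf(" SLOTS_GIVEN");
        if (flags & PRRTE_NODE_NON_USABLE)           printf(" NON_USABLE");
        printf("\n");
    }

    int main(void)
    {
        decode(0x11);  /* n006: DAEMON_LAUNCHED | SLOTS_GIVEN */
        decode(0x13);  /* n001-n005: additionally LOC_VERIFIED */
        return 0;
    }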
Re: [OMPI users] Meaning of mpiexec error flags
Darn, I was hoping the flags would give a clue to the malfunction, which I've been trying to solve for weeks.

MPI_Comm_spawn() correctly spawns a worker on the node that mpirun is executing on, but on other nodes it says the following:

There are no allocated resources for the application:
    /home/kmccall/mav/9.15_mpi/mav
that match the requested mapping:
    -host: n002.cluster.com:3

Verify that you have mapped the allocated resources properly for the indicated specification.

[n002:08645] *** An error occurred in MPI_Comm_spawn
[n002:08645] *** reported by process [1225916417,4]
[n002:08645] *** on communicator MPI_COMM_SELF
[n002:08645] *** MPI_ERR_SPAWN: could not spawn processes
[n002:08645] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[n002:08645] *** and potentially your MPI job)

As you suggested several weeks ago, I added a process count to the host name (n001.cluster.com:3), but it didn't help. Here is how I set up the "info" argument to MPI_Comm_spawn to spawn a single worker:

char info_str[64], host_str[64];

sprintf(info_str, "ppr:%d:node", 1);
sprintf(host_str, "%s:%d", host_name_.c_str(), 3);   // added ":3" to the host name

MPI_Info_create(&info);
MPI_Info_set(info, "host", host_str);
MPI_Info_set(info, "map-by", info_str);
MPI_Info_set(info, "ompi_non_mpi", "true");
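For context, a self-contained sketch of the spawn call around the info keys shown above might look like the following. It is not Kurt's actual application: the host string and worker binary path are taken from the error output and stand in for whatever the real code uses, and "ompi_non_mpi" is an Open MPI-specific info key mirrored from the snippet above.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* Same info keys as in the snippet above: pin one worker to a given
           host (with a slot count appended) and map it one-per-node. */
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "host", "n002.cluster.com:3");   /* placeholder host:slots */
        MPI_Info_set(info, "map-by", "ppr:1:node");
        MPI_Info_set(info, "ompi_non_mpi", "true");          /* Open MPI-specific key */

        MPI_Comm intercomm;
        int errcode;
        MPI_Comm_spawn("/home/kmccall/mav/9.15_mpi/mav",     /* worker path from the error output */
                       MPI_ARGV_NULL, 1 /* one worker */, info,
                       0 /* root */, MPI_COMM_SELF, &intercomm, &errcode);

        if (errcode == MPI_SUCCESS)
            printf("spawned one worker\n");

        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
    }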
[OMPI users] Hwloc library problem
I am attempting to build the latest stable version of Open MPI (openmpi-4.0.3) on Mac OS 10.15.4 using the latest Intel compilers ifort, icc, icpc (19.1.1.216 20200306). I am using the configuration

./configure --prefix=/opt/openmpi CC=icc CXX=icpc F77=ifort FC=ifort --with-hwloc=internal --with-libevent=internal

My initial attempt to make Open MPI failed with the error "-lhwloc not found". I then read the FAQ on the Open MPI homepage, which stated that the problem often arises when compilers are mixed (they are not here). The FAQ also suggested that the options --with-hwloc=internal --with-libevent=internal would force the build process to use an internal version of hwloc. There were no other errors in the make process, so presumably the internal hwloc library was successfully built. The error messages are reprinted below (they match the hwloc FAQ exactly). I don't understand how to work around the problem, however, and suggestions would be welcome. I do have Homebrew installed, but I would think that the explicit options to use the internal hwloc library would avoid referencing other libraries. Note that the same problem occurs when configure is run with gcc-9 and gfortran (from Homebrew). Thanks for your help in advance.

make
.
.
.
make[1]: Nothing to be done for `all'.
Making all in mpi/fortran/use-mpi-f08
  FCLD     libmpi_usempif08.la
ifort: command line warning #10006: ignoring unknown option '-force_load,mod/.libs/libforce_usempif08_internal_modules_to_be_built.a'
ifort: command line warning #10006: ignoring unknown option '-force_load,bindings/.libs/libforce_usempif08_internal_bindings_to_be_built.a'
ifort: command line warning #10006: ignoring unknown option '-force_load,../../../../ompi/mpiext/pcollreq/use-mpi-f08/.libs/libmpiext_pcollreq_usempif08.a'
ifort: command line warning #10006: ignoring unknown option '-force_load,base/.libs/libusempif08_ccode.a'
ld: library not found for -lhwloc
make[1]: *** [libmpi_usempif08.la] Error 1
make: *** [all-recursive] Error 1


Paul Fons

Keio University, Faculty of Science and Technology, Department of Electronics and Information Engineering
3-14-1 Hiyoshi, Kohoku-ku, Yokohama, Kanagawa 223-8522, Japan
Keio University Faculty of Science and Technology, Yagami Campus
Re: [OMPI users] Hwloc library problem
Paul,

this issue is likely the one already reported at https://github.com/open-mpi/ompi/issues/7615

Several workarounds are documented there; feel free to try some of them and report back (either on GitHub or on this mailing list).

Cheers,

Gilles
[OMPI users] Inquiry about pml layer
Hello,

I'm using CUDA-aware Open MPI v4.0.3 with UCX to run some apps. Most of them have worked seamlessly, but one breaks and returns the error:

memtype_cache.c:299  UCX  ERROR failed to set UCM memtype event handler: Unsupported operation
--------------------------------------------------------------------------
No components were able to be opened in the pml framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

  Host:      laghos-2
  Framework: pml
--------------------------------------------------------------------------
[laghos-2:00328] PML ucx cannot be selected

The full discussion is posted at https://github.com/openucx/ucx/issues/4988.

I don't fully understand the internals of Open MPI, and my question is specific to the 'pml' layer. Does it make any difference to how Open MPI handles the data and, more specifically, memory access, whether the network is Ethernet or InfiniBand?

Thanks,
Arturo