Re: [OMPI users] file/process write speed is not scalable

2020-04-14 Thread Patrick Bégou via users
Hi David,

could you specify which version of OpenMPI you are using?
I also have some parallel I/O trouble with one code but have not
investigated it yet.
Thanks

Patrick

On 13/04/2020 at 17:11, Dong-In Kang via users wrote:
>
>  Thank you for your suggestion.
> I am more concerned about the poor performance of the
> one-MPI-process-per-socket case.
> That model better matches my real workload.
> The performance that I see is a lot worse than what the underlying
> hardware can support.
> The best case (all MPI processes on a single socket) is pretty good,
> which is about 80+% of the underlying hardware's speed.
> However, the one-MPI-process-per-socket model achieves only 30% of
> what I get with all MPI processes on a single socket.
> Both are doing the same thing: independent file writes.
> I used all the OSTs available.
>
> As a reference point, I did the same test on ramdisk.
> In both cases the performance scales very well, and the performances
> are close.
>
> There seems to be extra overhead when multiple sockets are used for
> independent file I/O with Lustre.
> I don't know what causes that overhead.
>
> Thanks,
> David
>
>
> On Thu, Apr 9, 2020 at 11:07 PM Gilles Gouaillardet via users
> <users@lists.open-mpi.org> wrote:
>
> Note there could be some NUMA-IO effect, so I suggest you compare
> running all MPI tasks on socket 0, then all MPI tasks on socket 1,
> and so on, and then compare that to running one MPI task per
> socket.
>
> Also, what performance do you measure?
> - Is this something in line with the filesystem/network expectation?
> - Or is this much higher (and in this case, you are benchmarking
> the i/o cache)?
>
> FWIW, I usually write files whose cumulative size is four times the
> node memory to avoid local caching effects
> (if you have a lot of RAM, that might take a while ...)
>
> Keep in mind Lustre is also sensitive to the file layout.
> If you write one file per task, you likely want to use all the
> available OSTs, but no striping.
> If you want to write into a single file with 1MB blocks per MPI task,
> you likely want to stripe with 1MB blocks,
> and use the same number of OSTs as MPI tasks (so each MPI task ends
> up writing to its own OST).
>
> Cheers,
>
> Gilles
>
> On Fri, Apr 10, 2020 at 6:41 AM Dong-In Kang via users
> <users@lists.open-mpi.org> wrote:
> >
> > Hi,
> >
> > I'm running IOR benchmark on a big shared memory machine with
> Lustre file system.
> > I set up IOR to use an independent file/process so that the
> aggregated bandwidth is maximized.
> > I ran N MPI processes where N < # of cores in a socket.
> > When I put those N MPI processes on a single socket, its write
> performance is scalable.
> > However, when I put those N MPI processes on N sockets (so, 1
> MPI process/socket),
> > its performance does not scale, and stays the same beyond 4 MPI
> processes.
> > I expected it would be as scalable as the case of N processes on
> a single socket.
> > But, it is not.
> >
> > I think that if each MPI process writes to its own independent file,
> there should be no file locking among MPI processes. However, there
> seems to be some. Is there any way to avoid that locking or
> overhead? It may not be a file-locking issue, but I don't know the
> exact reason for the poor performance.
> >
> > Any help will be appreciated.
> >
> > David
>
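
For reference, here is a minimal sketch of the pattern discussed above: each
rank opens its own file on MPI_COMM_SELF (so no shared-file locking is
involved) and passes the reserved MPI-IO hints "striping_factor" and
"striping_unit", which some ROMIO/OMPIO Lustre drivers honor at file creation.
The path, transfer size, and hint values are illustrative assumptions, not
taken from David's IOR runs.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* One independent file per rank, as in IOR's file-per-process mode. */
    char filename[256];
    snprintf(filename, sizeof(filename), "/lustre/scratch/testfile.%06d", rank);

    /* Reserved MPI-IO hints; whether the Lustre driver applies them at
     * file creation should be verified on the target system. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "1");      /* one OST per file */
    MPI_Info_set(info, "striping_unit", "1048576");  /* 1 MB stripes     */

    MPI_File fh;
    MPI_File_open(MPI_COMM_SELF, filename,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* Write 8 MB from this rank; a real benchmark should write far more
     * than the node's memory to avoid measuring the page cache. */
    size_t nbytes = 8UL * 1048576UL;
    char *buf = malloc(nbytes);
    memset(buf, rank & 0xff, nbytes);
    MPI_File_write(fh, buf, (int)nbytes, MPI_BYTE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    free(buf);
    MPI_Finalize();
    return 0;
}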



Re: [OMPI users] file/process write speed is not scalable

2020-04-14 Thread Dong-In Kang via users
I'm using OpenMPI v.4.0.2.
Is your problem similar to mine?

Thanks,
David




Re: [OMPI users] Meaning of mpiexec error flags

2020-04-14 Thread Ralph Castain via users
Then those flags are correct. I suspect mpirun is executing on n006, yes? The 
"location verified" just means that the daemon of rank N reported back from the 
node we expected it to be on - Slurm and Cray sometimes renumber the ranks. 
Torque doesn't and so you should never see a problem. Since mpirun isn't 
launched by itself, its node is never "verified", though I probably should 
alter that as it is obviously in the "right" place.

I don't know what you mean by your app isn't behaving correctly on the remote 
nodes - best guess is that perhaps some envar they need isn't being forwarded?


On Apr 14, 2020, at 2:04 AM, Mccall, Kurt E. (MSFC-EV41) 
<kurt.e.mcc...@nasa.gov> wrote:

CentOS, Torque.
From: Ralph Castain <r...@open-mpi.org>
Sent: Monday, April 13, 2020 5:44 PM
To: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mcc...@nasa.gov>
Subject: [EXTERNAL] Re: [OMPI users] Meaning of mpiexec error flags

 What kind of system are you running on? Slurm? Cray? ...?
 

On Apr 13, 2020, at 3:11 PM, Mccall, Kurt E. (MSFC-EV41) 
<kurt.e.mcc...@nasa.gov> wrote:
Thanks Ralph.  So the difference between the working node's flags (0x11) and the
non-working nodes' flags (0x13) is the flag PRRTE_NODE_FLAG_LOC_VERIFIED.
What does that imply?  That the location of the daemon has NOT been verified?
 Kurt
From: users <users-boun...@lists.open-mpi.org> On Behalf Of Ralph Castain via users
Sent: Monday, April 13, 2020 4:47 PM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Ralph Castain <r...@open-mpi.org>
Subject: [EXTERNAL] Re: [OMPI users] Meaning of mpiexec error flags

I updated the message to explain the flags (instead of a numerical value) for
OMPI v5. In brief:

#define PRRTE_NODE_FLAG_DAEMON_LAUNCHED  0x01  // whether or not the daemon on this node has been launched
#define PRRTE_NODE_FLAG_LOC_VERIFIED     0x02  // whether or not the location has been verified - used for environments where the daemon's final destination is uncertain
#define PRRTE_NODE_FLAG_OVERSUBSCRIBED   0x04  // whether or not this node is oversubscribed
#define PRRTE_NODE_FLAG_MAPPED           0x08  // whether we have been added to the current map
#define PRRTE_NODE_FLAG_SLOTS_GIVEN      0x10  // the number of slots was specified - used only in non-managed environments
#define PRRTE_NODE_NON_USABLE            0x20  // the node is hosting a tool and is NOT to be used for jobs
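
For illustration, a small decoder built from the flag values above (a minimal
sketch, not code from the PRRTE sources) shows how the two values in the node
listing below break down: 0x11 is DAEMON_LAUNCHED | SLOTS_GIVEN, and 0x13
additionally has LOC_VERIFIED set.

#include <stdio.h>

#define PRRTE_NODE_FLAG_DAEMON_LAUNCHED 0x01
#define PRRTE_NODE_FLAG_LOC_VERIFIED    0x02
#define PRRTE_NODE_FLAG_OVERSUBSCRIBED  0x04
#define PRRTE_NODE_FLAG_MAPPED          0x08
#define PRRTE_NODE_FLAG_SLOTS_GIVEN     0x10
#define PRRTE_NODE_NON_USABLE           0x20

/* Print the symbolic names packed into a node flags value. */
static void decode(int flags)
{
    printf("0x%02x:", flags);
    if (flags & PRRTE_NODE_FLAG_DAEMON_LAUNCHED) printf(" DAEMON_LAUNCHED");
    if (flags & PRRTE_NODE_FLAG_LOC_VERIFIED)    printf(" LOC_VERIFIED");
    if (flags & PRRTE_NODE_FLAG_OVERSUBSCRIBED)  printf(" OVERSUBSCRIBED");
    if (flags & PRRTE_NODE_FLAG_MAPPED)          printf(" MAPPED");
    if (flags & PRRTE_NODE_FLAG_SLOTS_GIVEN)     printf(" SLOTS_GIVEN");
    if (flags & PRRTE_NODE_NON_USABLE)           printf(" NON_USABLE");
    printf("\n");
}

int main(void)
{
    decode(0x11);  /* n006: DAEMON_LAUNCHED | SLOTS_GIVEN                      */
    decode(0x13);  /* n001-n005: DAEMON_LAUNCHED | LOC_VERIFIED | SLOTS_GIVEN  */
    return 0;
}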
  


On Apr 13, 2020, at 2:15 PM, Mccall, Kurt E. (MSFC-EV41) via users 
<users@lists.open-mpi.org> wrote:
My application is behaving correctly on node n006, and incorrectly on the
lower-numbered nodes. The flags in the error message below may give a clue as
to why. What is the meaning of the flag values 0x11 and 0x13?
 ==   ALLOCATED NODES   ==
    n006: flags=0x11 slots=3 max_slots=0 slots_inuse=2 state=UP
    n005: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
    n004: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
    n003: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
    n002: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
    n001: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
 I’m using OpenMPI 4.0.3.
 Thanks,
Kurt



Re: [OMPI users] Meaning of mpiexec error flags

2020-04-14 Thread Mccall, Kurt E. (MSFC-EV41) via users
Darn, I was hoping the flags would give a clue to the malfunction, which I've
been trying to solve for weeks. MPI_Comm_spawn() correctly spawns a worker on
the node that mpirun is executing on, but on other nodes it fails with the following:



There are no allocated resources for the application:
  /home/kmccall/mav/9.15_mpi/mav
that match the requested mapping:
  -host: n002.cluster.com:3

Verify that you have mapped the allocated resources properly for the
indicated specification.

[n002:08645] *** An error occurred in MPI_Comm_spawn
[n002:08645] *** reported by process [1225916417,4]
[n002:08645] *** on communicator MPI_COMM_SELF
[n002:08645] *** MPI_ERR_SPAWN: could not spawn processes
[n002:08645] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now 
abort,
[n002:08645] ***and potentially your MPI job)

As you suggested several weeks ago, I added a process count to the host name
(n001.cluster.com:3), but it didn't help. Here is how I set up the "info"
argument to MPI_Comm_spawn to spawn a single worker:

char info_str[64], host_str[64];
MPI_Info info;

sprintf(info_str, "ppr:%d:node", 1);
sprintf(host_str, "%s:%d", host_name_.c_str(), 3);   // appended ":3" to the host name

MPI_Info_create(&info);
MPI_Info_set(info, "host", host_str);
MPI_Info_set(info, "map-by", info_str);
MPI_Info_set(info, "ompi_non_mpi", "true");
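
For completeness, here is a self-contained sketch of how an info object like
the one above might feed into the MPI_Comm_spawn call itself. The worker path,
host name, and error handling are placeholders, not Kurt's actual code, and
whether the "host" and "map-by" keys behave as intended under Torque is
exactly what is in question here.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    char info_str[64], host_str[64];
    const char *host_name = "n002.cluster.com";      /* placeholder host */

    snprintf(info_str, sizeof(info_str), "ppr:%d:node", 1);
    snprintf(host_str, sizeof(host_str), "%s:%d", host_name, 3);  /* ":3" = slots */

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "host", host_str);
    MPI_Info_set(info, "map-by", info_str);
    MPI_Info_set(info, "ompi_non_mpi", "true");

    /* Spawn one worker; errcodes has one entry per spawned process. */
    MPI_Comm intercomm;
    int errcodes[1];
    MPI_Comm_spawn("/path/to/worker", MPI_ARGV_NULL, 1, info,
                   0, MPI_COMM_SELF, &intercomm, errcodes);
    printf("spawn error code: %d\n", errcodes[0]);

    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}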



[OMPI users] Hwlock library problem

2020-04-14 Thread フォンスポール J via users
I am attempting to build the latest stable version of openmpi (openmpi-4.0.3)
on Mac OS 10.15.4 using the latest Intel compilers ifort, icc, icpc (19.1.1.216
20200306). I am using the configuration


./configure --prefix=/opt/openmpi CC=icc CXX=icpc F77=ifort FC=ifort 
--with-hwloc=internal --with-libevent=internal 

My initial attempt to make openmpi failed with the error -lhwloc not found. I
then read the FAQ on the open-mpi homepage, which stated that the problem often
arises when compilers are mixed (they are not here). The FAQ also suggested
that the options --with-hwloc=internal --with-libevent=internal would force the
build process to use an internal version of hwloc. There were no other errors
in the make process, so presumably the internal hwloc library was successfully
built. The error messages are reprinted below (they match the hwloc FAQ
exactly). I don't understand how to work around the problem, however, and
suggestions would be welcome. I do have homebrew installed, but I would think
that the explicit options to use the internal hwloc library would avoid
referencing other libraries. Note that the same problem occurs when configure
is specified with gcc-9 and gfortran (from homebrew). Thanks for your help in
advance.

make
.
.
.

make[1]: Nothing to be done for `all'.
Making all in mpi/fortran/use-mpi-f08
  FCLD libmpi_usempif08.la
ifort: command line warning #10006: ignoring unknown option 
'-force_load,mod/.libs/libforce_usempif08_internal_modules_to_be_built.a'
ifort: command line warning #10006: ignoring unknown option 
'-force_load,bindings/.libs/libforce_usempif08_internal_bindings_to_be_built.a'
ifort: command line warning #10006: ignoring unknown option 
'-force_load,../../../../ompi/mpiext/pcollreq/use-mpi-f08/.libs/libmpiext_pcollreq_usempif08.a'
ifort: command line warning #10006: ignoring unknown option 
'-force_load,base/.libs/libusempif08_ccode.a'
ld: library not found for -lhwloc
make[1]: *** [libmpi_usempif08.la] Error 1
make: *** [all-recursive] Error 1





Paul Fons


Keio University, Faculty of Science and Technology, Department of Electronics 
and Information Engineering

〒223-8522 3-14-1 Hiyoshi, Kohoku-ku, Yokohama, Kanagawa 223-8522, Japan
Keio University Faculty of Science and Technology Yagami Campus.







Re: [OMPI users] Hwlock library problem

2020-04-14 Thread Gilles Gouaillardet via users
Paul,

this issue is likely the one already reported at
https://github.com/open-mpi/ompi/issues/7615

Several workarounds are documented, feel free to try some of them and
report back
(either on GitHub or this mailing list)

Cheers,

Gilles



[OMPI users] Inquiry about pml layer

2020-04-14 Thread Arturo Fernandez via users
Hello,

I'm using CUDA-aware OMPI v4.0.3 with UCX to run some apps. Most of them have
worked seamlessly, but one breaks and returns the error:

memtype_cache.c:299  UCX  ERROR failed to set UCM memtype event handler: Unsupported operation
--------------------------------------------------------------------------
No components were able to be opened in the pml framework.
This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

  Host:      laghos-2
  Framework: pml
--------------------------------------------------------------------------
[laghos-2:00328] PML ucx cannot be selected

The full discussion is posted at https://github.com/openucx/ucx/issues/4988.
I don't fully understand the internals of OpenMPI, and my question is specific
to the 'pml' layer. Does it make any difference to how OpenMPI handles the
data and, more specifically, memory access, whether the network is Ethernet
or IB?

Thanks,
Arturo