Re: [OMPI users] file/process write speed is not scalable

2020-04-13 Thread Dong-In Kang via users
 Thank you for your suggestion.
I am more concerned about the poor performance of the one-MPI-process-per-socket
case.
That model better fits my real workload.
The performance that I see is a lot worse than what the underlying hardware
can support.
The best case (all MPI processes on a single socket) is pretty good,
achieving 80+% of the underlying hardware's speed.
However, the one-MPI-process-per-socket model achieves only 30% of what I get
with all MPI processes on a single socket.
Both are doing the same thing - independent file write.
I used all the OSTs available.
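
Each MPI process is doing essentially the following (a simplified sketch of
the file-per-process pattern, not the actual IOR code; the path, block size,
and block count are placeholders):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank;
    char path[256];
    const size_t blk = 1 << 20;               /* 1 MB per write (placeholder) */
    char *buf = malloc(blk);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    memset(buf, rank, blk);

    /* one independent file per rank, so no file is shared across processes */
    snprintf(path, sizeof(path), "/lustre/testdir/file.%d", rank);
    FILE *fp = fopen(path, "wb");
    for (int i = 0; i < 1024; i++)            /* 1 GB per rank (placeholder) */
        fwrite(buf, 1, blk, fp);
    fclose(fp);

    free(buf);
    MPI_Finalize();
    return 0;
}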

As a reference point, I ran the same test on a ramdisk.
In both cases the performance scales very well, and the two results are
close.

There seems to be extra overhead when multiple sockets are used for
independent file I/O with Lustre.
I don't know what causes that overhead.

Thanks,
David


On Thu, Apr 9, 2020 at 11:07 PM Gilles Gouaillardet via users <
users@lists.open-mpi.org> wrote:

> Note there could be some NUMA-IO effect, so I suggest you compare
> running all MPI tasks on socket 0, then all MPI tasks on socket 1, and
> so on, and then compare with running one MPI task per socket.
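>
> A small Linux-specific sketch like the following can help double-check
> where each task actually lands (sched_getcpu() reports the CPU id; map
> CPU ids to sockets with lscpu):
>
> #define _GNU_SOURCE
> #include <sched.h>
> #include <stdio.h>
> #include <mpi.h>
>
> int main(int argc, char **argv)
> {
>     int rank;
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     /* Linux-specific: print the CPU this rank is currently running on */
>     printf("rank %d is on cpu %d\n", rank, sched_getcpu());
>     MPI_Finalize();
>     return 0;
> }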
>
> Also, what performance do you measure?
> - Is this something in line with the filesystem/network expectation?
> - Or is this much higher (and in this case, you are benchmarking the i/o
> cache)?
>
> FWIW, I usually write files whose cumulative size is four times the
> node memory in order to avoid local caching effects
> (if you have a lot of RAM, that might take a while ...)
>
> Keep in mind Lustre is also sensitive to the file layout.
> If you write one file per task, you likely want to use all the
> available OSTs, but no striping.
> If you want to write into a single file with 1MB blocks per MPI task,
> you likely want to stripe with 1MB blocks
> and use the same number of OSTs as MPI tasks (so each MPI task ends
> up writing to its own OST).
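>
> For the single-shared-file case, that layout can also be requested via
> MPI-IO hints, e.g. something along these lines (an untested sketch;
> "striping_factor" / "striping_unit" are the usual ROMIO hints, they only
> take effect when the file is created, and whether they are honored
> depends on your MPI-IO stack):
>
> #include <mpi.h>
>
> int main(int argc, char **argv)
> {
>     MPI_File fh;
>     MPI_Info info;
>
>     MPI_Init(&argc, &argv);
>     MPI_Info_create(&info);
>     MPI_Info_set(info, "striping_unit", "1048576");  /* 1 MB stripes */
>     MPI_Info_set(info, "striping_factor", "16");     /* e.g. 16 OSTs for 16 tasks */
>
>     MPI_File_open(MPI_COMM_WORLD, "/lustre/testdir/shared_file",
>                   MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
>     /* ... each MPI task writes its own 1 MB blocks here ... */
>     MPI_File_close(&fh);
>
>     MPI_Info_free(&info);
>     MPI_Finalize();
>     return 0;
> }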
>
> Cheers,
>
> Gilles
>
> On Fri, Apr 10, 2020 at 6:41 AM Dong-In Kang via users
>  wrote:
> >
> > Hi,
> >
> > I'm running the IOR benchmark on a big shared-memory machine with a
> > Lustre file system.
> > I set up IOR to use an independent file per process so that the
> > aggregate bandwidth is maximized.
> > I ran N MPI processes where N < # of cores in a socket.
> > When I put those N MPI processes on a single socket, the write
> > performance scales.
> > However, when I put those N MPI processes on N sockets (so, 1 MPI
> > process/socket), the performance does not scale, and stays the same
> > for more than 4 MPI processes.
> > I expected it would be as scalable as the case of N processes on a
> > single socket.
> > But, it is not.
> >
> > I think that if each MPI process writes to its own independent file,
> > there should be no file locking among MPI processes. However, there
> > seems to be some. Is there any way to avoid that locking or overhead?
> > It may not be a file locking issue, but I don't know the exact reason
> > for the poor performance.
> >
> > Any help will be appreciated.
> >
> > David
>


[OMPI users] Meaning of mpiexec error flags

2020-04-13 Thread Mccall, Kurt E. (MSFC-EV41) via users
My application is behaving correctly on node n006, and incorrectly on the
lower-numbered nodes. The flags in the error message below may give a clue as
to why. What is the meaning of the flag values 0x11 and 0x13?

==   ALLOCATED NODES   ==
n006: flags=0x11 slots=3 max_slots=0 slots_inuse=2 state=UP
n005: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
n004: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
n003: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
n002: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
n001: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP

I'm using OpenMPI 4.0.3.

Thanks,
Kurt


Re: [OMPI users] Meaning of mpiexec error flags

2020-04-13 Thread Ralph Castain via users
I updated the message to explain the flags (instead of a numerical value) for 
OMPI v5. In brief:

#define PRRTE_NODE_FLAG_DAEMON_LAUNCHED   0x01   // whether or not the daemon on this node has been launched
#define PRRTE_NODE_FLAG_LOC_VERIFIED      0x02   // whether or not the location has been verified - used for
                                                 //     environments where the daemon's final destination is uncertain
#define PRRTE_NODE_FLAG_OVERSUBSCRIBED    0x04   // whether or not this node is oversubscribed
#define PRRTE_NODE_FLAG_MAPPED            0x08   // whether we have been added to the current map
#define PRRTE_NODE_FLAG_SLOTS_GIVEN       0x10   // the number of slots was specified - used only in non-managed environments
#define PRRTE_NODE_NON_USABLE             0x20   // the node is hosting a tool and is NOT to be used for jobs
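
So for the values in your output, 0x11 is DAEMON_LAUNCHED | SLOTS_GIVEN, and
0x13 additionally has LOC_VERIFIED set. A quick sketch to decode such values
(just re-using the bit definitions above):

#include <stdio.h>

/* bit definitions copied from above so this compiles standalone */
static const struct { int bit; const char *name; } node_flags[] = {
    { 0x01, "DAEMON_LAUNCHED" },
    { 0x02, "LOC_VERIFIED" },
    { 0x04, "OVERSUBSCRIBED" },
    { 0x08, "MAPPED" },
    { 0x10, "SLOTS_GIVEN" },
    { 0x20, "NON_USABLE" },
};

static void decode(int value)
{
    printf("0x%02x:", value);
    for (size_t i = 0; i < sizeof(node_flags) / sizeof(node_flags[0]); i++)
        if (value & node_flags[i].bit)
            printf(" %s", node_flags[i].name);
    printf("\n");
}

int main(void)
{
    decode(0x11);   /* n006:        DAEMON_LAUNCHED SLOTS_GIVEN */
    decode(0x13);   /* n001 - n005: DAEMON_LAUNCHED LOC_VERIFIED SLOTS_GIVEN */
    return 0;
}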



On Apr 13, 2020, at 2:15 PM, Mccall, Kurt E. (MSFC-EV41) via users
<users@lists.open-mpi.org> wrote:

My application is behaving correctly on node n006, and incorrectly on the lower 
numbered nodes.   The flags in the error message below may give a clue as to 
why.   What is the meaning of the flag values 0x11 and 0x13?
 ==   ALLOCATED NODES   ==
    n006: flags=0x11 slots=3 max_slots=0 slots_inuse=2 state=UP
    n005: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
    n004: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
    n003: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
    n002: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
    n001: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
 I’m using OpenMPI 4.0.3.
 Thanks,
Kurt



Re: [OMPI users] Meaning of mpiexec error flags

2020-04-13 Thread Mccall, Kurt E. (MSFC-EV41) via users
Thanks Ralph. So the difference between the working node's flags (0x11) and the
non-working nodes' flags (0x13) is the bit PRRTE_NODE_FLAG_LOC_VERIFIED.
What does that imply? That the location of the daemon on the working node has
NOT been verified?

Kurt

From: users  On Behalf Of Ralph Castain via 
users
Sent: Monday, April 13, 2020 4:47 PM
To: Open MPI Users 
Cc: Ralph Castain 
Subject: [EXTERNAL] Re: [OMPI users] Meaning of mpiexec error flags

I updated the message to explain the flags (instead of a numerical value) for 
OMPI v5. In brief:

#define PRRTE_NODE_FLAG_DAEMON_LAUNCHED   0x01   // whether or not the daemon on this node has been launched
#define PRRTE_NODE_FLAG_LOC_VERIFIED      0x02   // whether or not the location has been verified - used for
                                                 //     environments where the daemon's final destination is uncertain
#define PRRTE_NODE_FLAG_OVERSUBSCRIBED    0x04   // whether or not this node is oversubscribed
#define PRRTE_NODE_FLAG_MAPPED            0x08   // whether we have been added to the current map
#define PRRTE_NODE_FLAG_SLOTS_GIVEN       0x10   // the number of slots was specified - used only in non-managed environments
#define PRRTE_NODE_NON_USABLE             0x20   // the node is hosting a tool and is NOT to be used for jobs




On Apr 13, 2020, at 2:15 PM, Mccall, Kurt E. (MSFC-EV41) via users
<users@lists.open-mpi.org> wrote:

My application is behaving correctly on node n006, and incorrectly on the lower 
numbered nodes.   The flags in the error message below may give a clue as to 
why.   What is the meaning of the flag values 0x11 and 0x13?

==   ALLOCATED NODES   ==
n006: flags=0x11 slots=3 max_slots=0 slots_inuse=2 state=UP
n005: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
n004: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
n003: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
n002: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
n001: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP

I’m using OpenMPI 4.0.3.

Thanks,
Kurt