Re: [OMPI users] file/process write speed is not scalable
Thank you for your suggestion. I am more concerned about the poor performance of the one MPI process per socket case, since that model is a better fit for my real workload. The performance I see is far below what the underlying hardware can support. The best case (all MPI processes on a single socket) is quite good, at 80+% of the hardware's speed. However, the one MPI process per socket model achieves only about 30% of what I get with all MPI processes on a single socket. Both runs do the same thing - independent file writes - and I used all the available OSTs.

As a reference point, I ran the same test against ramdisk. In both cases the performance scales very well, and the results are close to each other. There seems to be extra overhead when multiple sockets are used for independent file I/O with Lustre, and I don't know what causes it.

Thanks,
David

On Thu, Apr 9, 2020 at 11:07 PM Gilles Gouaillardet via users <
users@lists.open-mpi.org> wrote:

> Note there could be some NUMA-IO effect, so I suggest you compare
> running every MPI task on socket 0, then every MPI task on socket 1,
> and so on, and then compare that to running one MPI task per socket.
>
> Also, what performance do you measure?
> - Is it in line with the filesystem/network expectation?
> - Or is it much higher (in which case you are benchmarking the I/O cache)?
>
> FWIW, I usually write files whose cumulative size is four times the
> node memory to avoid local caching effects
> (if you have a lot of RAM, that might take a while ...).
>
> Keep in mind Lustre is also sensitive to the file layout.
> If you write one file per task, you likely want to use all the
> available OSTs, but no striping.
> If you want to write into a single file with 1MB blocks per MPI task,
> you likely want to stripe with 1MB blocks
> and use the same number of OSTs as MPI tasks (so each MPI task ends
> up writing to its own OST).
>
> Cheers,
>
> Gilles
>
> On Fri, Apr 10, 2020 at 6:41 AM Dong-In Kang via users wrote:
> >
> > Hi,
> >
> > I'm running the IOR benchmark on a big shared-memory machine with a
> > Lustre file system.
> > I set up IOR to use an independent file per process so that the
> > aggregate bandwidth is maximized.
> > I ran N MPI processes where N < # of cores in a socket.
> > When I put those N MPI processes on a single socket, write performance
> > is scalable.
> > However, when I put those N MPI processes on N sockets (so, 1 MPI
> > process/socket), performance does not scale, and stays the same beyond
> > 4 MPI processes.
> > I expected it to be as scalable as the case of N processes on a single
> > socket, but it is not.
> >
> > I think that if each MPI process writes to its own independent file,
> > there should be no file locking among MPI processes. However, there
> > seems to be some. Is there any way to avoid that locking or overhead?
> > It may not be a file lock issue, but I don't know the exact reason for
> > the poor performance.
> >
> > Any help will be appreciated.
> >
> > David
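[For reference, the access pattern under discussion is essentially the following: each rank opens its own file and streams fixed-size blocks into it, so no two ranks ever touch the same file. This is only a minimal sketch of IOR's file-per-process POSIX mode, not the actual benchmark; the /lustre/testdir path, the 1 MiB transfer size, and the per-rank volume are made-up placeholders. The placements being compared can be reproduced with Open MPI's --map-by socket or --map-by core options, and the target directory's stripe layout would normally be set beforehand with lfs setstripe.]

/*
 * Minimal sketch of a file-per-process independent write test.
 * Not the IOR source; paths and sizes are illustrative only.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Hypothetical per-rank file on the Lustre mount. */
    char path[256];
    snprintf(path, sizeof(path), "/lustre/testdir/file.%d", rank);

    const size_t block   = 1 << 20;   /* 1 MiB transfer size */
    const size_t nblocks = 4096;      /* 4 GiB per rank; size it to exceed RAM */
    char *buf = malloc(block);
    if (!buf) MPI_Abort(MPI_COMM_WORLD, 1);
    memset(buf, rank & 0xff, block);

    FILE *fp = fopen(path, "wb");
    if (!fp) { perror("fopen"); MPI_Abort(MPI_COMM_WORLD, 1); }

    double t0 = MPI_Wtime();
    for (size_t i = 0; i < nblocks; i++)
        fwrite(buf, 1, block, fp);
    fclose(fp);                       /* flush stdio buffers before stopping the clock */
    double t1 = MPI_Wtime();

    /* Report the slowest rank, as aggregate-bandwidth tools typically do. */
    double dt = t1 - t0, dtmax;
    MPI_Reduce(&dt, &dtmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("per-rank %.1f MiB written, slowest rank took %.2f s\n",
               (double)block * nblocks / 1048576.0, dtmax);

    free(buf);
    MPI_Finalize();
    return 0;
}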
[OMPI users] Meaning of mpiexec error flags
My application is behaving correctly on node n006 and incorrectly on the lower-numbered nodes. The flags in the error message below may give a clue as to why. What is the meaning of the flag values 0x11 and 0x13?

== ALLOCATED NODES ==
n006: flags=0x11 slots=3 max_slots=0 slots_inuse=2 state=UP
n005: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
n004: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
n003: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
n002: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
n001: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP

I'm using Open MPI 4.0.3.

Thanks,
Kurt
Re: [OMPI users] Meaning of mpiexec error flags
I updated the message to explain the flags (instead of showing a numerical value) for OMPI v5. In brief:

#define PRRTE_NODE_FLAG_DAEMON_LAUNCHED  0x01  // whether or not the daemon on this node has been launched
#define PRRTE_NODE_FLAG_LOC_VERIFIED     0x02  // whether or not the location has been verified - used for
                                               // environments where the daemon's final destination is uncertain
#define PRRTE_NODE_FLAG_OVERSUBSCRIBED   0x04  // whether or not this node is oversubscribed
#define PRRTE_NODE_FLAG_MAPPED           0x08  // whether we have been added to the current map
#define PRRTE_NODE_FLAG_SLOTS_GIVEN      0x10  // the number of slots was specified - used only in non-managed environments
#define PRRTE_NODE_NON_USABLE            0x20  // the node is hosting a tool and is NOT to be used for jobs

On Apr 13, 2020, at 2:15 PM, Mccall, Kurt E. (MSFC-EV41) via users <users@lists.open-mpi.org> wrote:

My application is behaving correctly on node n006 and incorrectly on the lower-numbered nodes. The flags in the error message below may give a clue as to why. What is the meaning of the flag values 0x11 and 0x13?

== ALLOCATED NODES ==
n006: flags=0x11 slots=3 max_slots=0 slots_inuse=2 state=UP
n005: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
n004: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
n003: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
n002: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
n001: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP

I'm using Open MPI 4.0.3.

Thanks,
Kurt
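[To make the bit arithmetic explicit: the two values in the ALLOCATED NODES output decompose as 0x11 = DAEMON_LAUNCHED | SLOTS_GIVEN and 0x13 = DAEMON_LAUNCHED | LOC_VERIFIED | SLOTS_GIVEN. Below is a small stand-alone decoder; the flag values are copied from the defines above, but the decoder itself is just an illustration, not PRRTE code.]

/* Stand-alone decoder for the node flag bitmask (illustrative only). */
#include <stdio.h>

#define PRRTE_NODE_FLAG_DAEMON_LAUNCHED 0x01
#define PRRTE_NODE_FLAG_LOC_VERIFIED    0x02
#define PRRTE_NODE_FLAG_OVERSUBSCRIBED  0x04
#define PRRTE_NODE_FLAG_MAPPED          0x08
#define PRRTE_NODE_FLAG_SLOTS_GIVEN     0x10
#define PRRTE_NODE_NON_USABLE           0x20

static void decode(int flags)
{
    printf("0x%02x =", flags);
    if (flags & PRRTE_NODE_FLAG_DAEMON_LAUNCHED) printf(" DAEMON_LAUNCHED");
    if (flags & PRRTE_NODE_FLAG_LOC_VERIFIED)    printf(" LOC_VERIFIED");
    if (flags & PRRTE_NODE_FLAG_OVERSUBSCRIBED)  printf(" OVERSUBSCRIBED");
    if (flags & PRRTE_NODE_FLAG_MAPPED)          printf(" MAPPED");
    if (flags & PRRTE_NODE_FLAG_SLOTS_GIVEN)     printf(" SLOTS_GIVEN");
    if (flags & PRRTE_NODE_NON_USABLE)           printf(" NON_USABLE");
    printf("\n");
}

int main(void)
{
    decode(0x11);   /* n006: DAEMON_LAUNCHED | SLOTS_GIVEN */
    decode(0x13);   /* n001-n005: DAEMON_LAUNCHED | LOC_VERIFIED | SLOTS_GIVEN */
    return 0;
}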
Re: [OMPI users] Meaning of mpiexec error flags
Thanks Ralph. So the difference between the working node's flags (0x11) and the non-working nodes' flags (0x13) is PRRTE_NODE_FLAG_LOC_VERIFIED. What does that imply - that the location of the daemon on the working node has NOT been verified?

Kurt

From: users On Behalf Of Ralph Castain via users
Sent: Monday, April 13, 2020 4:47 PM
To: Open MPI Users
Cc: Ralph Castain
Subject: [EXTERNAL] Re: [OMPI users] Meaning of mpiexec error flags

I updated the message to explain the flags (instead of showing a numerical value) for OMPI v5. In brief:

#define PRRTE_NODE_FLAG_DAEMON_LAUNCHED  0x01  // whether or not the daemon on this node has been launched
#define PRRTE_NODE_FLAG_LOC_VERIFIED     0x02  // whether or not the location has been verified - used for
                                               // environments where the daemon's final destination is uncertain
#define PRRTE_NODE_FLAG_OVERSUBSCRIBED   0x04  // whether or not this node is oversubscribed
#define PRRTE_NODE_FLAG_MAPPED           0x08  // whether we have been added to the current map
#define PRRTE_NODE_FLAG_SLOTS_GIVEN      0x10  // the number of slots was specified - used only in non-managed environments
#define PRRTE_NODE_NON_USABLE            0x20  // the node is hosting a tool and is NOT to be used for jobs

On Apr 13, 2020, at 2:15 PM, Mccall, Kurt E. (MSFC-EV41) via users <users@lists.open-mpi.org> wrote:

My application is behaving correctly on node n006 and incorrectly on the lower-numbered nodes. The flags in the error message below may give a clue as to why. What is the meaning of the flag values 0x11 and 0x13?

== ALLOCATED NODES ==
n006: flags=0x11 slots=3 max_slots=0 slots_inuse=2 state=UP
n005: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
n004: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
n003: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
n002: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
n001: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP

I'm using Open MPI 4.0.3.

Thanks,
Kurt