Re: [lustre-discuss] Poor(?) Lustre performance

Andreas Dilger via lustre-discuss Wed, 20 Apr 2022 16:53:01 -0700

Finn,
I can't really say for sure where the performance limitation in your system is 
coming from.

You'd have to re-run the tests against the local ldiskfs filesystem to see how 
the performance compares with that of Lustre.  The important part of 
benchmark testing is to systematically build a complete picture from the ground 
up to see what the capabilities of the various components of the storage stack 
are, and then determine where any bottlenecks are being hit.

That is what the "lustre-iokit" is intended to do - benchmark starting on the 
raw storage (sgpdd-survey), on the local disk filesystem (obdfilter-survey for 
local OSDs), then the network (lnet-selftest), and finally on the client 
(obdfilter-survey for network OSDs).
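
As a rough illustration of that bottom-up approach, the sketch below takes one 
throughput figure per layer, listed in stack order, and reports the step with 
the largest relative drop.  The function name find_bottleneck and the numbers 
in the usage line are made up for illustration; they are not measurements from 
this thread.

```shell
# Hypothetical helper: arguments are "layer=GB/s" pairs, ordered from the raw
# devices up to the client. Prints the step with the largest relative drop.
find_bottleneck() {
  printf '%s\n' "$@" | awk -F= '
    NR > 1 && prev > 0 {
      drop = (prev - $2) / prev        # fraction lost versus the previous layer
      if (drop > worst) { worst = drop; from = prev_name; to = $1 }
    }
    { prev = $2 + 0; prev_name = $1 }
    END {
      if (to == "") print "no drop between layers"
      else printf "%s -> %s: %.0f%% lower\n", from, to, worst * 100
    }'
}
```

For example, "find_bottleneck raw=9.0 ldiskfs=8.1 lnet=10.0 client=2.0" prints 
"lnet -> client: 80% lower", which would point at the client side rather than 
the disks or the network.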

For example, run sgpdd-survey (or "fio") with small and large IO sizes 
against the storage devices, individually *AND IN PARALLEL*, to determine their 
performance characteristics.  Running in parallel is critical, since you 
may see e.g. 3GB/s reads, 2GB/s writes from a single NVMe device, but *not* see 
4x that performance when running on 4x NVMe devices because of CPU and/or PCI 
and/or memory bandwidth limitations.  Similarly, you may see reasonable 
performance from a single OSS, but network congestion (on the client, 
switch(es), or server) may prevent the performance from scaling as more servers 
are added.
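
A minimal sketch of the individual-versus-parallel comparison, assuming fio 
with the libaio engine is available.  It only prints a read-only fio command 
line (one job per device, all started together) and computes how close a 
parallel result comes to N times the single-device result.  The function 
names, the queue depth, and the 30 second runtime are arbitrary choices for 
illustration, and the device paths you pass in are your own.

```shell
# Print (do not run) a read-only fio command with one job per block device.
# Usage: fio_read_cmd BLOCKSIZE DEVICE [DEVICE...]
fio_read_cmd() (
  bs=$1; shift
  cmd="fio --name=global --rw=read --bs=$bs --direct=1 --ioengine=libaio"
  cmd="$cmd --iodepth=32 --runtime=30 --time_based --group_reporting"
  for dev in "$@"; do
    cmd="$cmd --name=${dev##*/} --filename=$dev"   # job named after the device
  done
  echo "$cmd"
)

# Percentage of ideal scaling actually achieved.
# Usage: scaling_efficiency SINGLE_GBPS PARALLEL_GBPS NDEVICES
scaling_efficiency() {
  awk -v s="$1" -v p="$2" -v n="$3" 'BEGIN { printf "%.0f\n", 100 * p / (s * n) }'
}
```

Run the printed command once per device and once with all devices, with a 
small and a large block size (e.g. 4k and 1M).  If a single device reads at 
3GB/s and four together reach only 9GB/s, "scaling_efficiency 3 9 4" reports 
75, meaning a quarter of the ideal bandwidth is being lost somewhere other 
than the devices themselves.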

This is described in some detail at 
https://github.com/DDNStorage/lustre_manual_markdown/blob/master/04.02-Benchmarking%20Lustre%20File%20System%20Performance%20(Lustre%20IO%20Kit).md

Cheers, Andreas

On Apr 20, 2022, at 12:03, Finn Rawles Malliagh wrote:

Hi Andreas,

Thank you for taking the time to reply with such a detailed response.
I have taken your advice on board and made some changes. Firstly, I have 
swapped from ZFS and am now using striped LVM groups (Including the P4800X 
instead of using it as a cache drive). I have also modified io500.sh to include 
the optimisation listed above. Rerunning the IO500 benchmark provides the 
metadata results below:

With ZFS:
[RESULT]    mdtest-easy-write        0.931693 kIOPS : time 31.028 seconds [INVALID]
[RESULT]    mdtest-hard-write        0.427000 kIOPS : time 31.070 seconds [INVALID]
[RESULT]                 find       25.311534 kIOPS : time 1.631 seconds
[RESULT]     mdtest-easy-stat        0.570021 kIOPS : time 50.067 seconds
[RESULT]     mdtest-hard-stat        1.834985 kIOPS : time 7.998 seconds
[RESULT]   mdtest-easy-delete        1.715750 kIOPS : time 17.308 seconds
[RESULT]     mdtest-hard-read        1.006240 kIOPS : time 13.759 seconds
[RESULT]   mdtest-hard-delete        1.624117 kIOPS : time 8.910 seconds
[SCORE ] Bandwidth 2.271383 GiB/s : IOPS 1.526825 kiops : TOTAL 1.862258 [INVALID]

With LVM:
[RESULT]    mdtest-easy-write        3.057249 kIOPS : time 27.177 seconds [INVALID]
[RESULT]    mdtest-hard-write        1.576865 kIOPS : time 51.740 seconds [INVALID]
[RESULT]                 find       71.979457 kIOPS : time 2.234 seconds
[RESULT]     mdtest-easy-stat        1.841655 kIOPS : time 44.443 seconds
[RESULT]     mdtest-hard-stat        1.779211 kIOPS : time 45.967 seconds
[RESULT]   mdtest-easy-delete        1.559825 kIOPS : time 52.301 seconds
[RESULT]     mdtest-hard-read        0.631109 kIOPS : time 127.765 seconds
[RESULT]   mdtest-hard-delete        0.856858 kIOPS : time 94.372 seconds
[SCORE ] Bandwidth 0.948100 GiB/s : IOPS 2.359024 kiops : TOTAL 1.495524 [INVALID]

I believe these scores are more in line with what I should expect. However, it 
seems that my throughput performance is still lacking(?). In your expert 
opinion, do you think this is just a case of tuning IO500/LVM parameters 
further, or is there something more fundamental about the configuration of this 
Lustre cluster?

With LVM:
[RESULT]       ior-easy-write        2.127026 GiB/s : time 122.305 seconds [INVALID]
[RESULT]       ior-hard-write        1.408638 GiB/s : time 1.246 seconds [INVALID]
[RESULT]        ior-easy-read        1.549550 GiB/s : time 167.881 seconds
[RESULT]        ior-hard-read        0.174036 GiB/s : time 10.063 seconds


Kind Regards,
Finn

On Wed, 20 Apr 2022 at 09:24, Andreas Dilger wrote:
On Apr 16, 2022, at 22:51, Finn Rawles Malliagh via lustre-discuss wrote:

Hi all,

I have just set up a three-node Lustre configuration, and initial testing shows 
what I think are slow results. The current configuration is 2 OSS, 1 MDS-MGS; 
each OSS/MGS has 4x Intel P3600, 1x Intel P4800, Intel E810 100GbE Ethernet, 2x 
6252, 380GB DRAM.
I am using Lustre 2.12.8, ZFS 0.7.13, ice-1.8.3, rdma-core-35.0 (RoCEv2 is 
enabled).
All zpools are set up identically for OST1, OST2, and MDT1.

[root@stor3 ~]# zpool status
  pool: osstank
 state: ONLINE
  scan: none requested
config:
        NAME        STATE     READ WRITE CKSUM
        osstank     ONLINE       0     0     0
          nvme1n1   ONLINE       0     0     0
          nvme2n1   ONLINE       0     0     0
          nvme3n1   ONLINE       0     0     0
        cache
          nvme0n1   ONLINE       0     0     0

It's been a while since I've done anything with ZFS, but I see a few potential 
issues here:
- firstly, it doesn't make sense IMHO to have an NVMe cache device when the
  main storage pool is also NVMe.  You could better use that capacity/bandwidth
  for storing more data instead of duplicating it into the cache device.
  Also, Lustre cannot use the ZIL.
- in general ZFS is not very good at IOPS workloads because of the high
  overhead per block.  Lustre can't use the ZIL, so no opportunity to
  accelerate heavy IOPS workloads.

When running "./io500 ./config-minimalLUST.ini" on my lustre client, I get 
these performance numbers:
IO500 version io500-isc22_v1 (standard)
[RESULT]       ior-easy-write        1.173435 GiB/s : time 31.703 seconds [INVALID]
[RESULT]       ior-hard-write        0.821624 GiB/s : time 1.070 seconds [INVALID]
[RESULT]        ior-easy-read        5.177930 GiB/s : time 7.187 seconds
[RESULT]        ior-hard-read        5.331791 GiB/s : time 0.167 seconds

When running "./io500 ./config-minimalLOCAL.ini" on a singular locally mounted 
ZFS pool I get the following performance numbers:
IO500 version io500-isc22_v1 (standard)
[RESULT]       ior-easy-write        1.304500 GiB/s : time 33.302 seconds [INVALID]
[RESULT]       ior-hard-write        0.485283 GiB/s : time 1.806 seconds [INVALID]
[RESULT]        ior-easy-read        3.078668 GiB/s : time 14.111 seconds
[RESULT]        ior-hard-read        3.183521 GiB/s : time 0.275 seconds

There are definitely some file layout tunables that can improve IO500 
performance for these workloads.
See the default io500.sh file, where they are commented out by default:

  # Example commands to create output directories for Lustre.  Creating
  # top-level directories is allowed, but not the whole directory tree.
  #if (( $(lfs df $workdir | grep -c MDT) > 1 )); then
  #  lfs setdirstripe -D -c -1 $workdir
  #fi
  #lfs setstripe -c 1 $workdir
  #mkdir $workdir/ior-easy $workdir/ior-hard
  #mkdir $workdir/mdtest-easy $workdir/mdtest-hard
  #local osts=$(lfs df $workdir | grep -c OST)
  # Try overstriping for ior-hard to improve scaling, or use wide striping
  #lfs setstripe -C $((osts * 4)) $workdir/ior-hard ||
  #  lfs setstripe -c -1 $workdir/ior-hard
  # Try to use DoM if available, otherwise use default for small files
  #lfs setstripe -E 64k -L mdt $workdir/mdtest-easy || true #DoM?
  #lfs setstripe -E 64k -L mdt $workdir/mdtest-hard || true #DoM?
  #lfs setstripe -E 64k -L mdt $workdir/mdtest-rnd
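
For reference, here is a sketch of what the single-MDT part of those commands 
might look like once uncommented and wrapped in a function.  The run wrapper, 
the DRY_RUN variable, and the setup_lustre_dirs name are additions for 
illustration and are not part of io500.sh; the lfs commands themselves are the 
ones quoted above.  The setdirstripe step is left out because it only applies 
with more than one MDT.

```shell
# With DRY_RUN=1 the commands are printed instead of executed (assumption:
# this wrapper is not part of io500.sh, it is only here for safe previewing).
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi
}

# Usage: setup_lustre_dirs WORKDIR [OST_COUNT]
# OST_COUNT defaults to the number of OSTs reported by "lfs df".
setup_lustre_dirs() (
  workdir=$1
  osts=${2:-$(lfs df "$workdir" | grep -c OST)}
  run lfs setstripe -c 1 "$workdir"
  run mkdir "$workdir/ior-easy" "$workdir/ior-hard"
  run mkdir "$workdir/mdtest-easy" "$workdir/mdtest-hard"
  # Try overstriping for ior-hard, otherwise fall back to wide striping
  run lfs setstripe -C $((osts * 4)) "$workdir/ior-hard" ||
    run lfs setstripe -c -1 "$workdir/ior-hard"
  # Try Data-on-MDT for the first 64k of each small file, if available
  run lfs setstripe -E 64k -L mdt "$workdir/mdtest-easy" || true
  run lfs setstripe -E 64k -L mdt "$workdir/mdtest-hard" || true
)
```

Running "DRY_RUN=1 setup_lustre_dirs /mnt/lustre/io500 2" (the path is a 
placeholder) shows the commands without touching the filesystem, which is a 
cheap way to check the stripe counts before a real run.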


As you can see above, the IO performance of Lustre isn't really much different 
than the local storage performance of ZFS.  You are always going to lose some 
percentage over the network and because of the added distributed locking.  
That said, for the hardware that you have, it should be getting about 2-3GB/s 
per NVMe device, and up to 10GB/s over the network, so the limitation here is 
really ZFS.  It would be useful to test with ldiskfs on the same hardware, 
maybe with LVM aggregating the NVMes.
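
To put rough numbers on that, the sketch below works out a per-OSS ceiling as 
the smaller of (devices x per-device bandwidth) and the network limit, and 
then what fraction of it a measured result reaches.  The function names are 
made up, and the inputs are only the ballpark figures from this thread, so 
treat the output as an estimate rather than a specification.

```shell
# Usage: expected_ceiling NDEVICES PER_DEVICE_GBPS NETWORK_GBPS
# Prints min(NDEVICES * PER_DEVICE_GBPS, NETWORK_GBPS).
expected_ceiling() {
  awk -v n="$1" -v d="$2" -v net="$3" 'BEGIN {
    disk = n * d
    printf "%.1f\n", (disk < net) ? disk : net
  }'
}

# Usage: achieved_pct OBSERVED_GBPS CEILING_GBPS
achieved_pct() {
  awk -v o="$1" -v c="$2" 'BEGIN { printf "%.0f\n", 100 * o / c }'
}
```

With the three data devices shown in the zpool status, "expected_ceiling 3 2 
10" gives 6.0 and "expected_ceiling 3 3 10" gives 9.0, i.e. 6-9GB/s per OSS 
before the network becomes the limit.  The 1.17 GiB/s ior-easy-write result is 
roughly 1.26GB/s, and "achieved_pct 1.26 6.0" puts that at about 21% of even 
the low-end estimate.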

When running "./io500 ./config-minimalLUST.ini" on my lustre client, I get 
these performance numbers:
IO500 version io500-isc22_v1 (standard)
[RESULT]    mdtest-easy-write        0.931693 kIOPS : time 31.028 seconds [INVALID]
[RESULT]    mdtest-hard-write        0.427000 kIOPS : time 31.070 seconds [INVALID]
[RESULT]                 find       25.311534 kIOPS : time 1.631 seconds
[RESULT]     mdtest-easy-stat        0.570021 kIOPS : time 50.067 seconds
[RESULT]     mdtest-hard-stat        1.834985 kIOPS : time 7.998 seconds
[RESULT]   mdtest-easy-delete        1.715750 kIOPS : time 17.308 seconds
[RESULT]     mdtest-hard-read        1.006240 kIOPS : time 13.759 seconds
[RESULT]   mdtest-hard-delete        1.624117 kIOPS : time 8.910 seconds
[SCORE ] Bandwidth 2.271383 GiB/s : IOPS 1.526825 kiops : TOTAL 1.862258 [INVALID]

When running "./io500 ./config-minimalLOCAL.ini" on a singular locally mounted 
ZFS pool I get the following performance numbers:
IO500 version io500-isc22_v1 (standard)
[RESULT]    mdtest-easy-write       47.979181 kIOPS : time 1.838 seconds [INVALID]
[RESULT]    mdtest-hard-write       27.801814 kIOPS : time 2.443 seconds [INVALID]
[RESULT]                 find     1384.774433 kIOPS : time 0.074 seconds
[RESULT]     mdtest-easy-stat      343.232733 kIOPS : time 1.118 seconds
[RESULT]     mdtest-hard-stat      333.241620 kIOPS : time 1.123 seconds
[RESULT]   mdtest-easy-delete       45.723381 kIOPS : time 1.884 seconds
[RESULT]     mdtest-hard-read       73.637312 kIOPS : time 1.546 seconds
[RESULT]   mdtest-hard-delete       42.191867 kIOPS : time 1.956 seconds
[SCORE ] Bandwidth 1.578256 GiB/s : IOPS 114.726763 kiops : TOTAL 13.456159 [INVALID]

Definitely the metadata performance is lower here, because each Lustre file 
create has to create (at least) two objects (one on the MDT, one or more on 
the OST(s)) and then write and access them again.  Lustre metadata performance 
would definitely benefit from enabling PFL and Data-on-MDT (per the default 
commands above), since it then only needs to do the MDT create/access.
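
To see the size of that gap per test, a small awk sketch like the one below 
can compare two saved IO500 outputs.  It assumes each result is on a line 
beginning with "[RESULT]" followed by the test name and the value, as in the 
listings in this thread; the io500_ratio name and the idea of saving each run 
to a file are assumptions for illustration.

```shell
# Usage: io500_ratio FILE_A FILE_B
# For every [RESULT] test present in both files, print "name value_B/value_A".
# Lines that do not start with [RESULT] (e.g. a wrapped [INVALID]) are ignored.
io500_ratio() {
  awk '
    $1 == "[RESULT]" {
      if (NR == FNR) a[$2] = $3 + 0            # first file: remember the value
      else if (($2 in a) && a[$2] > 0)         # second file: print the ratio
        printf "%s %.1f\n", $2, ($3 + 0) / a[$2]
    }' "$1" "$2"
}
```

Feeding it the Lustre run and the local ZFS run from this thread gives 
"mdtest-easy-write 51.5", i.e. the local pool creates files about 50 times 
faster than the Lustre mount in that test.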

I have run an iperf3 test and I was able to reach speeds of around 40Gb/s, so I 
don't think the network links are the issue (maybe it's something to do with 
lnet?)
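
One way to check LNet separately from iperf3 is lnet-selftest.  The sketch 
below only prints an "lst" command sequence for a 1M bulk-write test between a 
client group and a server group, so it can be reviewed before running.  The 
lst_script name, the session and batch names, and any NIDs passed in are 
placeholders; the lnet_selftest module has to be loaded on every node involved 
before the printed commands will work.

```shell
# Usage: lst_script CLIENT_NIDS SERVER_NIDS
# Prints an lnet-selftest session; run the output on the node driving the test.
lst_script() {
  cat <<EOF
export LST_SESSION=\$\$
lst new_session rw_check
lst add_group clients $1
lst add_group servers $2
lst add_batch bulk
lst add_test --batch bulk --from clients --to servers brw write size=1M
lst run bulk
lst stat servers & sleep 30; kill \$!
lst end_session
EOF
}
```

With RoCE the NIDs would normally be on an o2ib network, for example 
"lst_script 10.0.0.10@o2ib 10.0.0.1@o2ib" (made-up addresses).  Comparing the 
rate that "lst stat" reports with the iperf3 figure shows whether LNet itself 
is losing bandwidth.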

Could anyone more knowledgeable than me explain why the local three-disk ZFS 
pool performs better than the Lustre FS?
I'm very new to this kind of benchmarking, so it may also be that I am 
misinterpreting the results or not applying the test correctly.


Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud

_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
