Andreas,

Thank you again for your detailed reply and your time. I will have a further look at the lustre IO kit and hopefully get to the bottom of things.
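
As a first step I will probably test the raw NVMe devices with fio, one at a time and then all in parallel, along the lines below (device names and run parameters are placeholders I still need to adapt, and these raw-device writes are destructive, so only against devices with no data on them):

# single device, large sequential writes
fio --name=nvme1 --filename=/dev/nvme1n1 --rw=write --bs=1M --ioengine=libaio \
    --iodepth=32 --direct=1 --runtime=60 --time_based --group_reporting

# all devices in parallel, to expose any CPU/PCIe/memory bandwidth ceiling
for dev in /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1; do
    fio --name=$(basename $dev) --filename=$dev --rw=write --bs=1M --ioengine=libaio \
        --iodepth=32 --direct=1 --runtime=60 --time_based --group_reporting &
done
wait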

Cheers,
Finn

On Thu, 21 Apr 2022 at 00:52, Andreas Dilger <[email protected]> wrote:
> Finn,
> I can't really say for sure where the performance limitation in your system is coming from.
>
> You'd have to re-run the tests against the local ldiskfs filesystem to see how the performance compares with that of Lustre. The important part of benchmark testing is to systematically build a complete picture from the ground up, to see what the capabilities of the various components of the storage stack are, and then determine where any bottlenecks are being hit.
>
> That is what the "lustre-iokit" is intended to do - benchmark starting on the raw storage (sgpdd-survey), then the local disk filesystem (obdfilter-survey for local OSDs), then the network (lnet-selftest), and finally on the client (obdfilter-survey for network OSDs).
>
> For example, run sgpdd-survey (or "fio") with small and large IO sizes against the storage devices, individually *AND IN PARALLEL*, to determine their performance characteristics. Running in parallel is critical, since you may see e.g. 3GB/s reads and 2GB/s writes from a single NVMe device, but *not* see 4x that performance when running on 4x NVMe devices, because of CPU and/or PCI and/or memory bandwidth limitations. Similarly, you may see reasonable performance from a single OSS, but network congestion (on the client, switch(es), or server) may prevent the performance from scaling as more servers are added.
>
> This is described in some detail at
> https://github.com/DDNStorage/lustre_manual_markdown/blob/master/04.02-Benchmarking%20Lustre%20File%20System%20Performance%20(Lustre%20IO%20Kit).md
>
> Cheers, Andreas
>
> On Apr 20, 2022, at 12:03, Finn Rawles Malliagh <[email protected]> wrote:
>
> Hi Andreas,
>
> Thank you for taking the time to reply with such a detailed response.
> I have taken your advice on board and made some changes. Firstly, I have swapped from ZFS and am now using striped LVM groups (including the P4800X instead of using it as a cache drive). I have also modified io500.sh to include the optimisations listed above.
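>
> The volumes were created roughly like this (device names and stripe size here are illustrative rather than my exact commands):
>
> pvcreate /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
> vgcreate ostvg /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
> # one logical volume striped across all of the NVMe devices, P4800X included
> lvcreate --stripes 4 --stripesize 256k --extents 100%FREE --name ost0 ostvg
>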
> Rerunning the IO500 benchmark provides the metadata results below:
>
> With ZFS:
> [RESULT] mdtest-easy-write 0.931693 kIOPS : time 31.028 seconds [INVALID]
> [RESULT] mdtest-hard-write 0.427000 kIOPS : time 31.070 seconds [INVALID]
> [RESULT] find 25.311534 kIOPS : time 1.631 seconds
> [RESULT] mdtest-easy-stat 0.570021 kIOPS : time 50.067 seconds
> [RESULT] mdtest-hard-stat 1.834985 kIOPS : time 7.998 seconds
> [RESULT] mdtest-easy-delete 1.715750 kIOPS : time 17.308 seconds
> [RESULT] mdtest-hard-read 1.006240 kIOPS : time 13.759 seconds
> [RESULT] mdtest-hard-delete 1.624117 kIOPS : time 8.910 seconds
> [SCORE ] Bandwidth 2.271383 GiB/s : IOPS 1.526825 kiops : TOTAL 1.862258 [INVALID]
>
> With LVM:
> [RESULT] mdtest-easy-write 3.057249 kIOPS : time 27.177 seconds [INVALID]
> [RESULT] mdtest-hard-write 1.576865 kIOPS : time 51.740 seconds [INVALID]
> [RESULT] find 71.979457 kIOPS : time 2.234 seconds
> [RESULT] mdtest-easy-stat 1.841655 kIOPS : time 44.443 seconds
> [RESULT] mdtest-hard-stat 1.779211 kIOPS : time 45.967 seconds
> [RESULT] mdtest-easy-delete 1.559825 kIOPS : time 52.301 seconds
> [RESULT] mdtest-hard-read 0.631109 kIOPS : time 127.765 seconds
> [RESULT] mdtest-hard-delete 0.856858 kIOPS : time 94.372 seconds
> [SCORE ] Bandwidth 0.948100 GiB/s : IOPS 2.359024 kiops : TOTAL 1.495524 [INVALID]
>
> I believe these scores are more in line with what I should expect; however, it seems that my throughput performance is still lacking(?). In your expert opinion, do you think this is just a case of tuning IO500/LVM parameters further, or something more fundamental about the configuration of this Lustre cluster?
>
> With LVM:
> [RESULT] ior-easy-write 2.127026 GiB/s : time 122.305 seconds [INVALID]
> [RESULT] ior-hard-write 1.408638 GiB/s : time 1.246 seconds [INVALID]
> [RESULT] ior-easy-read 1.549550 GiB/s : time 167.881 seconds
> [RESULT] ior-hard-read 0.174036 GiB/s : time 10.063 seconds
>
> Kind Regards,
> Finn
>
> On Wed, 20 Apr 2022 at 09:24, Andreas Dilger <[email protected]> wrote:
>> On Apr 16, 2022, at 22:51, Finn Rawles Malliagh via lustre-discuss <[email protected]> wrote:
>>
>> Hi all,
>>
>> I have just set up a three-node Lustre configuration, and initial testing shows what I think are slow results. The current configuration is 2 OSS, 1 MDS-MGS; each OSS/MGS has 4x Intel P3600, 1x Intel P4800, Intel E810 100GbE Ethernet, 2x 6252, and 380GB DRAM.
>> I am using Lustre 2.12.8, ZFS 0.7.13, ice-1.8.3, rdma-core-35.0 (RoCEv2 is enabled).
>> All zpools are set up identically for OST1, OST2, and MDT1.
>>
>> [root@stor3 ~]# zpool status
>>   pool: osstank
>>  state: ONLINE
>>   scan: none requested
>> config:
>>         NAME       STATE     READ WRITE CKSUM
>>         osstank    ONLINE       0     0     0
>>           nvme1n1  ONLINE       0     0     0
>>           nvme2n1  ONLINE       0     0     0
>>           nvme3n1  ONLINE       0     0     0
>>         cache
>>           nvme0n1  ONLINE       0     0     0
>>
>> It's been a while since I've done anything with ZFS, but I see a few potential issues here:
>> - firstly, it doesn't make sense IMHO to have an NVMe cache device when the main storage pool is also NVMe. You could better use that capacity/bandwidth for storing more data instead of duplicating it into the cache device. Also, Lustre cannot use the ZIL.
>> - in general, ZFS is not very good at IOPS workloads because of the high overhead per block. Lustre can't use the ZIL, so there is no opportunity to accelerate heavy IOPS workloads.
>>
>> When running "./io500 ./config-minimalLUST.ini" on my Lustre client, I get these performance numbers:
>> IO500 version io500-isc22_v1 (standard)
>> [RESULT] ior-easy-write 1.173435 GiB/s : time 31.703 seconds [INVALID]
>> [RESULT] ior-hard-write 0.821624 GiB/s : time 1.070 seconds [INVALID]
>> [RESULT] ior-easy-read 5.177930 GiB/s : time 7.187 seconds
>> [RESULT] ior-hard-read 5.331791 GiB/s : time 0.167 seconds
>>
>> When running "./io500 ./config-minimalLOCAL.ini" on a single locally mounted ZFS pool, I get the following performance numbers:
>> IO500 version io500-isc22_v1 (standard)
>> [RESULT] ior-easy-write 1.304500 GiB/s : time 33.302 seconds [INVALID]
>> [RESULT] ior-hard-write 0.485283 GiB/s : time 1.806 seconds [INVALID]
>> [RESULT] ior-easy-read 3.078668 GiB/s : time 14.111 seconds
>> [RESULT] ior-hard-read 3.183521 GiB/s : time 0.275 seconds
>>
>> There are definitely some file layout tunables that can improve IO500 performance for these workloads.
>> See the default io500.sh file, where they are commented out by default:
>>
>> # Example commands to create output directories for Lustre. Creating
>> # top-level directories is allowed, but not the whole directory tree.
>> #if (( $(lfs df $workdir | grep -c MDT) > 1 )); then
>> #  lfs setdirstripe -D -c -1 $workdir
>> #fi
>> #lfs setstripe -c 1 $workdir
>> #mkdir $workdir/ior-easy $workdir/ior-hard
>> #mkdir $workdir/mdtest-easy $workdir/mdtest-hard
>> #local osts=$(lfs df $workdir | grep -c OST)
>> # Try overstriping for ior-hard to improve scaling, or use wide striping
>> #lfs setstripe -C $((osts * 4)) $workdir/ior-hard ||
>> #  lfs setstripe -c -1 $workdir/ior-hard
>> # Try to use DoM if available, otherwise use default for small files
>> #lfs setstripe -E 64k -L mdt $workdir/mdtest-easy || true #DoM?
>> #lfs setstripe -E 64k -L mdt $workdir/mdtest-hard || true #DoM?
>> #lfs setstripe -E 64k -L mdt $workdir/mdtest-rnd
>>
>> As you can see above, the IO performance of Lustre isn't really much different than the local storage performance of ZFS. You are always going to lose some percentage over the network and because of the added distributed locking. That said, for the hardware that you have, it should be getting about 2-3GB/s per NVMe device, and up to 10GB/s over the network, so the limitation here is really ZFS.
>> It would be useful to test with ldiskfs on the same hardware, maybe with LVM aggregating the NVMes.
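>>
>> For example, something along these lines (the fsname, MGS NID, and LVM volume name below are only placeholders):
>>
>> mkfs.lustre --ost --backfstype=ldiskfs --fsname=testfs --index=0 \
>>     --mgsnode=mgs@tcp /dev/ostvg/ost0
>> mount -t lustre /dev/ostvg/ost0 /mnt/ost0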
>>
>> When running "./io500 ./config-minimalLUST.ini" on my Lustre client, I get these performance numbers:
>> IO500 version io500-isc22_v1 (standard)
>> [RESULT] mdtest-easy-write 0.931693 kIOPS : time 31.028 seconds [INVALID]
>> [RESULT] mdtest-hard-write 0.427000 kIOPS : time 31.070 seconds [INVALID]
>> [RESULT] find 25.311534 kIOPS : time 1.631 seconds
>> [RESULT] mdtest-easy-stat 0.570021 kIOPS : time 50.067 seconds
>> [RESULT] mdtest-hard-stat 1.834985 kIOPS : time 7.998 seconds
>> [RESULT] mdtest-easy-delete 1.715750 kIOPS : time 17.308 seconds
>> [RESULT] mdtest-hard-read 1.006240 kIOPS : time 13.759 seconds
>> [RESULT] mdtest-hard-delete 1.624117 kIOPS : time 8.910 seconds
>> [SCORE ] Bandwidth 2.271383 GiB/s : IOPS 1.526825 kiops : TOTAL 1.862258 [INVALID]
>>
>> When running "./io500 ./config-minimalLOCAL.ini" on a single locally mounted ZFS pool, I get the following performance numbers:
>> IO500 version io500-isc22_v1 (standard)
>> [RESULT] mdtest-easy-write 47.979181 kIOPS : time 1.838 seconds [INVALID]
>> [RESULT] mdtest-hard-write 27.801814 kIOPS : time 2.443 seconds [INVALID]
>> [RESULT] find 1384.774433 kIOPS : time 0.074 seconds
>> [RESULT] mdtest-easy-stat 343.232733 kIOPS : time 1.118 seconds
>> [RESULT] mdtest-hard-stat 333.241620 kIOPS : time 1.123 seconds
>> [RESULT] mdtest-easy-delete 45.723381 kIOPS : time 1.884 seconds
>> [RESULT] mdtest-hard-read 73.637312 kIOPS : time 1.546 seconds
>> [RESULT] mdtest-hard-delete 42.191867 kIOPS : time 1.956 seconds
>> [SCORE ] Bandwidth 1.578256 GiB/s : IOPS 114.726763 kiops : TOTAL 13.456159 [INVALID]
>>
>> The metadata performance is definitely lower here, because each Lustre file has to create (at least) two objects (one on the MDT, one or more on the OST(s)) and then write and access them again.
>> Lustre metadata performance would definitely benefit from enabling PFL and Data-on-MDT (per the default commands above), since it then only needs to do the MDT create/access.
>>
>> I have run an iperf3 test and was able to reach speeds of around 40Gb/s, so I don't think the network links are the issue (maybe it's something to do with LNet?).
>>
>> I would appreciate it if anyone more knowledgeable than me could explain why the local three-disk ZFS pool is more performant than the Lustre filesystem.
>> I'm very new to this kind of benchmarking, so it may also be that I am misinterpreting the results or not applying the tests correctly.
>>
>> Cheers, Andreas
>> --
>> Andreas Dilger
>> Lustre Principal Architect
>> Whamcloud
>
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Principal Architect
> Whamcloud
