On Fri, Jan 10, 2020 at 10:21:25PM +0000, Andrew Doran wrote:
> Hi Frank,
>
> On Fri, Jan 10, 2020 at 01:10:02PM +0100, Frank Kardel wrote:
>
> > Hi !
> >
> > With this state of January 2nd we ran some tests for robustness and
> > timing with our database setup:
> >
> > Machine:
> >
> > Mainboard: S2600WFT
> >
> > CPU: 2 x Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz
> >
> > machdep.spectre_v1.mitigated = 0
> > machdep.spectre_v2.hwmitigated = 1
> > machdep.spectre_v2.swmitigated = 1
> > machdep.spectre_v2.method = [GCC retpoline] + [Intel IBRS]
> > machdep.spectre_v4.mitigated = 0
> > machdep.spectre_v4.method = (none)
> > machdep.mds.mitigated = 0
> > machdep.mds.method = (none)
> > machdep.taa.mitigated = 0
> > machdep.taa.method = [MDS]
> >
> > Memory:
> >
> > hw.physmem64 = 549446447104
> > hw.usermem64 = 549438365696
> >
> > This machine is/has been a challenge to NetBSD as it has 0.5 TB of
> > memory and 32 cores.
> >
> > The test case is restoring a 1 TB Postgresql-11 database with varying
> > degrees of Postgresql pg_restore parallelism.
> >
> > Why did we do the tests? The machine was installed with 8.99.24 as that
> > supported the memory setup.
> >
> > The machine was not able to reliably cope with many db/restore
> > processes and large memory - see
> >
> > PR kern/54209: NetBSD 8 large memory performance extremely low
> > PR kern/54210: NetBSD-8 processes presumably not exiting
> >
> > for details.
> >
> > With Andrew Doran's work on the vm system we restarted the tests.
> >
> > The baseline is 8.99.24 from around Sep 3 04:10:20 UTC 2018:
> > TEST 1
> > FRESH BOOT
> > time pg_restore -Upgsql -p5433 -Fd -d db -j5 20200103-db.dmpdir
> > 1826.599u 1752.878s 10:36:03.83 9.3% 0+0k 397+0io 1789pf+0w
> >
> > Higher levels of restore parallelism lead to a higher probability of a
> > catatonic system. Trouble starts around -j8 and gets worse at higher
> > levels.
> >
> > TEST 2
> > 9.99.33 from around Fri Jan 3 16:14:02 CET 2020
> > FRESH BOOT
> > time pg_restore -Upgsql -p5433 -Fd -d db -j28 20200103-db.dmpdir
> > 2047.925u 1191.878s 14:24:15.23 6.2% 0+0k 0+0io 5784pf+0w
> >
> > This survived a -j28 run that was not possible with 8.99.24 - this is a
> > big step forward, but ~4h slower in real time.
> >
> > TEST 3
> > FRESH BOOT
> > 9.99.34 from around Mon Jan 6 14:43:01
> > time pg_restore -Upgsql -p5433 -Fd -d db -j28 20200103-db.dmpdir
> > 1816.348u 1792.530s 10:56:02.56 9.1% 0+0k 395+0io 5620pf+0w
> >
> > -j5 run to compare to 9.99.33 - big improvement in real run time though
> > system time went up.
> >
> > TEST 4
> > State after the TEST 3 run, to compare to 8.99.24
> > time pg_restore -Upgsql -p5433 -Fd -d db -j5 20200103-db.dmpdir
> > 1706.548u 1748.623s 11:26:38.87 8.3% 0+0k 0+0io 1420pf+0w
> >
> > This ran faster than -j28 - probably due to less contention, but 50 min
> > slower than 8.99.24 after a fresh boot.
> >
> > TEST 5:
> > re-run of TEST 4 with a fresh boot for the 8.99.24 comparison
> > time pg_restore -Upgsql -p5433 -Fd -d db -j5 20200103-db.dmpdir
> > 1710.665u 1611.083s 9:14:56.86 9.9% 0+0k 398+0io 1504pf+0w
> >
> > Better than 8.99.24 for real time.
> >
> > There seems to be no big difference in system time between 8.99.24 and
> > 9.99.34, but a big improvement in robustness. The lockups don't seem to
> > happen any more, there are fewer short-term system freezes, and the
> > system remains responsive with 9.99.34.
> >
> > The big differences in real time are interesting but the cause for that
> > may not be easy to pinpoint.
> > The database runs on an nvme:
> > nvme0 at pci10 dev 0 function 0: Intel SSD DC P4500 (rev. 0x00)
> > nvme0: NVMe 1.2
> > nvme0: for admin queue interrupting at msix4 vec 0
> > nvme0: INTEL SSDPE2KX040T8, firmware VDV10131, serial ...
> > nvme0: for io queue 1 interrupting at msix4 vec 1 affinity to cpu0
> > [...]
> > nvme0: for io queue 32 interrupting at msix4 vec 32 affinity to cpu31
> > ld0 at nvme0 nsid 1
> > ld0: 3726 GB, 486401 cyl, 255 head, 63 sec, 512 bytes/sect x 7814037168 sectors
> >
> > And we are seeing transfer rates up to 300Mb/s and up to 80% busy on
> > the complex I/O (load) and CPU (build index) workload.
> >
> > So in summary we see a big step forward in robustness.
> >
> > Thanks to Andrew for the big improvements here.
>
> Thank you for the detailed testing and report.
>
> Many of the changes to the VM system came from Takashi Yamamoto's
> yamt-pagecache branch, so it's not all my work.
>
> I'm glad to hear that this has worked well for you. There are a couple
> of things that, time permitting, I would like to get in place over the
> next few weeks which should help a little with this workload (and then
> I am done, for now).
>
> The first is enabling Jaromir Dolecek's vm.ubc_direct by default, which
> may help with such a high I/O rate. There is a possible deadlock
> condition with this that needs to be fixed first.
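
For testing it can already be toggled at run time - this assumes the
experimental sysctl node is present in the running kernel and takes a
boolean value, and the deadlock caveat above still applies:

    # enable direct-map UBC I/O; revert with vm.ubc_direct=0
    sysctl -w vm.ubc_direct=1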
ubc_direct didn't make it in yet. As an interim measure I bumped the UBC
defaults and the TLB shootdown limits for amd64, which should help with the
I/O transfer rate.

I also recommend disabling ACPI idle, at least until it can be made less
aggressive by default. It causes a significant slowdown. It can be done by
detaching all acpicpu devices, using "drvctl -d" on each (a rough sketch
follows at the end of this mail).

> The second is pulling in efficient tracking of page clean/dirty status
> from the yamt-pagecache branch. This reduces the amount of work fsync()
> needs to do, which should be of benefit to the DBMS.

This is now in place. In my tests it makes a big improvement to fsync()
time: if there's not much data to write back it's something like
100x-1000x faster for multi-GB files.

Andrew
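
P.S. A rough, untested sketch of the acpicpu detach loop mentioned above,
assuming one acpicpu device per logical CPU, numbered from acpicpu0 upwards:

    # detach every acpicpu instance so ACPI idle is no longer used
    ncpu=$(sysctl -n hw.ncpu)
    i=0
    while [ "$i" -lt "$ncpu" ]; do
        drvctl -d "acpicpu$i"
        i=$((i + 1))
    done

The detach is not persistent, so this needs to be repeated after a reboot.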