On 08/27/2018 03:59 AM, Thomas Munro wrote:
> On Mon, Aug 27, 2018 at 10:14 AM Tomas Vondra
> <tomas.von...@2ndquadrant.com> wrote:
>> zfs (Linux)
>> -----------
>> On scale 200, there's pretty much no difference.
>
> Speculation: It could be that the dnode and/or indirect blocks that
> point to data blocks are falling out of memory in my test setup[1]
> but not in yours. I don't know, but I guess those blocks compete
> with regular data blocks in the ARC? If so it might come down to
> ARC size and the amount of other data churning through it.
>

Not sure, but I'd expect this to matter on the largest scale. The
machine has 64GB of RAM, and scale 8000 is ~120GB with mostly random
access. I've repeated the tests with scale 6000 to give ZFS a bit more
free space and prevent the issues when there's less than 20% of free
space (results later), but I still don't see any massive improvement.
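For the record, a quick way to check whether ARC metadata is being
squeezed out is to watch the kstat counters while the benchmark runs.
A sketch, assuming ZFS on Linux (field names are from 0.7-era arcstats
and may differ in other versions):

    # ARC target/current size vs. metadata usage and limit
    $ grep -E '^(c|c_max|size|arc_meta_used|arc_meta_limit) ' \
        /proc/spl/kstat/zfs/arcstats

If arc_meta_used keeps bumping into arc_meta_limit during the run, the
dnode/indirect blocks really are competing with regular data blocks,
per the speculation above.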
> Further speculation: Other filesystems have equivalent data
> structures, but for example XFS jams that data into the inode itself
> in a compact "extent list" format[2] if it can, to avoid the need
> for an external btree. Hmm, I wonder if that format tends to be used
> for our segment files. Since cached inodes are reclaimed in a
> different way than cached data pages, I wonder if that makes them
> more sticky in the face of high data churn rates (or I guess less,
> depending on your Linux vfs_cache_pressure setting and number of
> active files). I suppose the combination of those two things, sticky
> inodes with internalised extent lists, might make it more likely
> that we can overwrite an old file without having to fault anything
> in.
>

That's possible. The question is how that affects the cases where
disabling WAL reuse is worthwhile, and why you observe better
performance while I don't.

> One big difference between your test rig and mine is that your
> Optane 900P claims to do about half a million random IOPS. That is
> about half a million more IOPS than my spinning disks. (Actually I
> used my 5400RPM steam powered machine deliberately for that test: I
> disabled fsync so that commit rate wouldn't be slowed down but cache
> misses would be obvious. I guess Joyent's storage is somewhere
> between these two extremes...)
>

Yeah. It seems very much like a CPU vs. I/O trade-off, where disabling
WAL reuse saves a bit of I/O but increases the CPU cost. On the SSD
the reduction in I/O requests is not noticeable, but the extra CPU
cost does matter (thanks to the high tps values). On slower devices
the I/O savings will probably matter more.
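For anyone wanting to reproduce the trade-off: the knob under test
here corresponds to what eventually shipped as the wal_recycle GUC
(with wal_init_zero as a companion from the same work); the patch
version benchmarked in this thread may have used different naming.
A minimal sketch of the "no reuse" configuration:

    # postgresql.conf -- sketch, assuming the released GUC names
    wal_recycle = off       # don't rename old WAL segments for reuse
    wal_init_zero = off     # don't zero-fill new segments up front

On copy-on-write filesystems like ZFS the zero-filling is largely
wasted work as well, which is why the two knobs tend to be flipped
together.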
>> On scale 2000, the throughput actually decreased a bit, by about 5%
>> - from the chart it seems disabling the WAL reuse somewhat
>> amplifies the impact of checkpoints, for some reason.
>
> Huh.
>

Not sure what's causing this. In the SATA results it's not visible,
though.

>> I have no idea what happened at the largest scale (8000) - on
>> master there's a huge drop after ~120 minutes, which somewhat
>> recovers at ~220 minutes (but not fully). Without WAL reuse there's
>> no such drop, although there seems to be some degradation after
>> ~220 minutes (i.e. at about the same time the master partially
>> recovers). I'm not sure what to think about this, I wonder if it
>> might be caused by almost filling the disk space, or something like
>> that. I'm rerunning this with scale 6000.
>
> There are lots of reports of ZFS performance degrading when free
> space gets below something like 20%.
>

I've repeated the benchmarks on the Optane SSD with the largest scale
reduced to 6000, to see if it prevents the performance drop with less
than 20% of free space. It apparently does (see zfs2.pdf), although it
does not change the behavior - with WAL reuse disabled it's still a
bit slower.
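The free-space headroom is also easy to keep an eye on during a run;
a sketch (pool and dataset names are whatever your setup uses):

    # overall pool utilization, capacity and fragmentation
    $ zpool list -o name,size,allocated,free,capacity,fragmentation

    # per-dataset usage
    $ zfs list -o name,used,avail

Reportedly the allocator switches to a slower best-fit strategy once
the pool gets sufficiently full, which would match the ~20% free space
rule of thumb mentioned above.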
I've also done the tests with SATA devices (3x 7.2k drives), to see if
the I/O vs. CPU trade-off changes the behavior. And it seems to be the
case (see zfs-sata.pdf), to some extent. For the smallest scale (200)
there's not much difference. For medium (2000) there seems to be a
clear improvement, although the behavior is not particularly smooth.
On the largest scale (8000) there seems to be a slight improvement, or
at least it's no longer slower, as it was before.

regards

-- 
Tomas Vondra                      http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:
  zfs-sata.pdf (Adobe PDF document)
  zfs2.pdf (Adobe PDF document)