On 08/27/2018 03:59 AM, Thomas Munro wrote:
> On Mon, Aug 27, 2018 at 10:14 AM Tomas Vondra
> <tomas.von...@2ndquadrant.com> wrote:
>> zfs (Linux)
>> -----------
>> On scale 200, there's pretty much no difference.
>
> Speculation: It could be that the dnode and/or indirect blocks that
> point to data blocks are falling out of memory in my test setup[1]
> but not in yours. I don't know, but I guess those blocks compete
> with regular data blocks in the ARC? If so it might come down to
> ARC size and the amount of other data churning through it.
>

Not sure, but I'd expect this to matter on the largest scale. The
machine has 64GB of RAM, and scale 8000 is ~120GB with mostly random
access. I've repeated the tests with scale 6000 to give ZFS a bit more
free space and prevent the issues when there's less than 20% of free
space (results later), but I still don't see any massive improvement.
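For the record, a quick way to check whether ARC metadata is being
squeezed out is to watch the kstat counters while the benchmark runs.
A sketch, assuming ZFS on Linux (field names are from 0.7-era arcstats
and may differ in other versions):

    # ARC target/current size vs. metadata usage and limit
    $ grep -E '^(c|c_max|size|arc_meta_used|arc_meta_limit) ' \
        /proc/spl/kstat/zfs/arcstats

If arc_meta_used keeps bumping into arc_meta_limit during the run, the
dnode/indirect blocks really are competing with regular data blocks,
per the speculation above.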
> Further speculation: Other filesystems have equivalent data
> structures, but for example XFS jams that data into the inode itself
> in a compact "extent list" format[2] if it can, to avoid the need
> for an external btree. Hmm, I wonder if that format tends to be used
> for our segment files. Since cached inodes are reclaimed in a
> different way than cached data pages, I wonder if that makes them
> more sticky in the face of high data churn rates (or I guess less,
> depending on your Linux vfs_cache_pressure setting and number of
> active files). I suppose the combination of those two things, sticky
> inodes with internalised extent lists, might make it more likely
> that we can overwrite an old file without having to fault anything
> in.
>

That's possible. The question is how that affects the cases where
disabling WAL reuse is worthwhile, and why you observe better
performance while I don't.

> One big difference between your test rig and mine is that your
> Optane 900P claims to do about half a million random IOPS. That is
> about half a million more IOPS than my spinning disks. (Actually I
> used my 5400RPM steam powered machine deliberately for that test: I
> disabled fsync so that commit rate wouldn't be slowed down but cache
> misses would be obvious. I guess Joyent's storage is somewhere
> between these two extremes...)
>

Yeah. It seems very much like a CPU vs. I/O trade-off, where disabling
WAL reuse saves a bit of I/O but increases the CPU cost. On the SSD
the reduction in I/O requests is not noticeable, but the extra CPU
cost does matter (thanks to the high tps values). On slower devices
the I/O savings will probably matter more.
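For anyone wanting to reproduce the trade-off: the knob under test
here corresponds to what eventually shipped as the wal_recycle GUC
(with wal_init_zero as a companion from the same work); the patch
version benchmarked in this thread may have used different naming.
A minimal sketch of the "no reuse" configuration:

    # postgresql.conf -- sketch, assuming the released GUC names
    wal_recycle = off       # don't rename old WAL segments for reuse
    wal_init_zero = off     # don't zero-fill new segments up front

On copy-on-write filesystems like ZFS the zero-filling is largely
wasted work as well, which is why the two knobs tend to be flipped
together.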
>> On scale 2000, the throughput actually decreased a bit, by about 5%
>> - from the chart it seems disabling the WAL reuse somewhat
>> amplifies the impact of checkpoints, for some reason.
>
> Huh.
>

Not sure what's causing this. In the SATA results it's not visible,
though.

>> I have no idea what happened at the largest scale (8000) - on
>> master there's a huge drop after ~120 minutes, which somewhat
>> recovers at ~220 minutes (but not fully). Without WAL reuse there's
>> no such drop, although there seems to be some degradation after
>> ~220 minutes (i.e. at about the same time the master partially
>> recovers). I'm not sure what to think about this, I wonder if it
>> might be caused by almost filling the disk space, or something like
>> that. I'm rerunning this with scale 6000.
>
> There are lots of reports of ZFS performance degrading when free
> space gets below something like 20%.
>

I've repeated the benchmarks on the Optane SSD with the largest scale
reduced to 6000, to see if it prevents the performance drop with less
than 20% of free space. It apparently does (see zfs2.pdf), although it
does not change the behavior - with WAL reuse disabled it's still a
bit slower.
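The free-space headroom is also easy to keep an eye on during a run;
a sketch (pool and dataset names are whatever your setup uses):

    # overall pool utilization, capacity and fragmentation
    $ zpool list -o name,size,allocated,free,capacity,fragmentation

    # per-dataset usage
    $ zfs list -o name,used,avail

Reportedly the allocator switches to a slower best-fit strategy once
the pool gets sufficiently full, which would match the ~20% free space
rule of thumb mentioned above.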
I've also done the tests with SATA devices (3x 7.2k drives), to see if
the I/O vs. CPU trade-off changes the behavior. And it seems to be the
case (see zfs-sata.pdf), to some extent. For the smallest scale (200)
there's not much difference. For medium (2000) there seems to be a
clear improvement, although the behavior is not particularly smooth.
On the largest scale (8000) there seems to be a slight improvement, or
at least it's no longer slower, as it was before.

regards

-- 
Tomas Vondra                      http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:
  zfs-sata.pdf (Adobe PDF document)
  zfs2.pdf (Adobe PDF document)