On Tue, Jul 16, 2019 at 10:53:46AM -0700, Andres Freund wrote:
Hi,
On 2019-07-12 15:47:02 +0200, Tomas Vondra wrote:
I've done a bit of benchmarking / testing on this, so let me report some
basic results. I haven't done any significant code review; I've simply
run a bunch of pgbench runs on different systems with different scales.
Thanks!
System #1
---------
* CPU: Intel i5
* RAM: 8GB
* storage: 6 x SATA SSD RAID0 (Intel S3700)
* autovacuum_analyze_scale_factor = 0.1
* autovacuum_vacuum_cost_delay = 2
* autovacuum_vacuum_cost_limit = 1000
* autovacuum_vacuum_scale_factor = 0.01
* bgwriter_delay = 100
* bgwriter_lru_maxpages = 10000
* checkpoint_timeout = 30min
* max_wal_size = 64GB
* shared_buffers = 1GB
What's the controller situation here? Can the full SATA3 bandwidth on
all of those drives be employed concurrently?
There's just an on-board SATA controller, so it might be a bottleneck.
A single drive can do ~440 MB/s reads sequentially, and the whole RAID0
array (Linux sw raid) does ~1.6GB/s, so not exactly 6x that. But I don't
think we're generating that many writes during the test.
System #2
---------
* CPU: 2x Xeon E5-2620v5
* RAM: 64GB
* storage: 3 x 7.2k SATA RAID0, 1x NVMe
* autovacuum_analyze_scale_factor = 0.1
* autovacuum_vacuum_cost_delay = 2
* autovacuum_vacuum_cost_limit = 1000
* autovacuum_vacuum_scale_factor = 0.01
* bgwriter_delay = 100
* bgwriter_lru_maxpages = 10000
* checkpoint_completion_target = 0.9
* checkpoint_timeout = 15min
* max_wal_size = 32GB
* shared_buffers = 8GB
What type of NVMe disk is this? I'm mostly wondering whether it's fast
enough that there's no conceivable way that IO scheduling is going to
make a meaningful difference, given other bottlenecks in postgres.
In some preliminary benchmark runs I've seen fairly significant gains on
SATA and SAS SSDs, as well as spinning rust, but I've not yet
benchmarked on a decent NVMe SSD.
Intel Optane 900P 280GB (model SSDPED1D280GA) [1].
[1] https://ssd.userbenchmark.com/SpeedTest/315555/INTEL-SSDPED1D280GA
I think one of the main improvements in this generation of drives is
good performance with low queue depth. See for example [2].
[2]
https://www.anandtech.com/show/12136/the-intel-optane-ssd-900p-480gb-review/5
Not sure if that plays a role here, but I've seen this affect prefetch
and similar things.
For each config I've done tests with three scales - small (fits into
shared buffers), medium (fits into RAM) and large (at least 2x the RAM).
Aside from the basic metrics (throughput etc.) I've also sampled data
about 5% of transactions, to be able to look at latency stats.
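For illustration, a single run with that kind of sampling might be driven
roughly like the sketch below - the client count, duration, scale and
database name are placeholders, not the exact values used:

    # Hypothetical pgbench driver for one run, logging ~5% of transactions
    # for later latency analysis. All parameters below are placeholders.
    import subprocess

    SCALE = 1000          # e.g. the "medium" scale
    DURATION = 4 * 3600   # 4h run
    CLIENTS = 16          # placeholder client/thread count

    # initialize the test database at the chosen scale
    subprocess.run(["pgbench", "-i", "-s", str(SCALE), "bench"], check=True)

    # run the benchmark, sampling ~5% of transactions into the log
    subprocess.run(
        ["pgbench",
         "-c", str(CLIENTS), "-j", str(CLIENTS),
         "-T", str(DURATION),
         "-l", "--sampling-rate=0.05",
         "-P", "10",
         "bench"],
        check=True,
    )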
The tests were done on master and patched code (both in the 'legacy' and
new mode).
I haven't done any temporal analysis yet (i.e. I'm only looking at global
summaries, not tps over time etc).
FWIW, I'm working on a tool that generates correlated graphs of OS, PG,
pgbench stats. Especially being able to correlate the kernel's
'Writeback' stats (grep Writeback: /proc/meminfo) and latency is very
valuable. Sampling wait events over time also is worthwhile.
Good to know, although I don't think it's difficult to fetch the data
from sar and plot it. I might even already have ugly bash scripts doing
that, somewhere.
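Something along these lines would cover the Writeback and wait-event
sampling, too (just a sketch - the 1s interval, output format and dbname
are arbitrary):

    # Minimal sampler: kernel Writeback (from /proc/meminfo) plus a
    # wait-event histogram from pg_stat_activity, once per second.
    # Assumes psycopg2 is available; dbname is a placeholder.
    import time
    import psycopg2

    def writeback_kb():
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("Writeback:"):
                    return int(line.split()[1])
        return 0

    conn = psycopg2.connect("dbname=bench")
    conn.autocommit = True

    while True:
        with conn.cursor() as cur:
            cur.execute("""
                SELECT wait_event_type, wait_event, count(*)
                  FROM pg_stat_activity
                 WHERE wait_event IS NOT NULL
                 GROUP BY 1, 2
            """)
            waits = cur.fetchall()
        print(int(time.time()), writeback_kb(), waits)
        time.sleep(1)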
When running on the 7.2k SATA RAID, the throughput improves with the
medium scale - from ~340tps to ~439tps, which is a pretty significant
jump. But on the large scale the improvement disappears (in fact, throughput
seems to be a bit lower than in the master/legacy cases). Of course, all this
is just from a single run (although a 4h one, so the noise should even out).
Any chance there's an order-of-test factor here? In my tests I found two
related issues very important: 1) the first few tests are slower,
because WAL segments don't yet exist. 2) Some poor bugger of a later
test will get hit with anti-wraparound vacuums, even if otherwise not
necessary.
Not sure - I'll check, but I find it unlikely. I need to repeat the
tests to have multiple runs.
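A cheap way to check would be to poll for autovacuum workers while a test
is running, e.g. (sketch only, the dbname is a placeholder):

    # Report running autovacuum workers and flag anti-wraparound ones;
    # dbname is a placeholder.
    import psycopg2

    conn = psycopg2.connect("dbname=bench")
    with conn.cursor() as cur:
        cur.execute("""
            SELECT pid, xact_start, query
              FROM pg_stat_activity
             WHERE query LIKE 'autovacuum:%'
        """)
        for pid, xact_start, query in cur.fetchall():
            kind = "WRAPAROUND" if "to prevent wraparound" in query else "regular"
            print(pid, kind, xact_start, query)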
The fact that the master and "legacy" numbers differ significantly
e.g. in the "xeon sata scale 1000" latency CDF does make me wonder
whether there's an effect like that. While there might be some small
performance difference due to different stats message sizes, and a few
additional branches, I don't see how it could be that noticeable.
That's about the one case where things like anti-wraparound are pretty
much impossible, because the SATA storage is so slow ...
I've also computed latency CDF (from the 5% sample) - I've attached this
for the two interesting cases mentioned in the previous paragraph. This
shows that with the medium scale the latencies move down (with the patch,
both in the legacy and "new" modes), while on the large scale the "new" mode
moves a bit to the right (to higher values).
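(For the record, the CDFs are nothing fancy - just the empirical
distribution of the sampled latencies from the pgbench transaction log,
computed along these lines, assuming the third field is the latency in
microseconds:)

    # Empirical latency CDF from a pgbench per-transaction log produced
    # with "-l --sampling-rate=0.05"; assumes the third field is the
    # transaction latency in microseconds.
    import sys

    def latency_cdf(path):
        lat = []
        with open(path) as f:
            for line in f:
                lat.append(int(line.split()[2]) / 1000.0)   # us -> ms
        lat.sort()
        n = len(lat)
        # (latency in ms, fraction of transactions at or below it)
        return [(v, (i + 1) / n) for i, v in enumerate(lat)]

    if __name__ == "__main__":
        for x, p in latency_cdf(sys.argv[1])[::1000]:   # thin out for plotting
            print("%.3f %.5f" % (x, p))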
Hm. I can't yet explain that.
And finally, I've looked at buffer stats, i.e. number of buffers written
in various ways (checkpoint, bgwriter, backends) etc. Interestingly
enough, these numbers did not change very much - especially on the flash
storage. Maybe that's expected, though.
Some of that is expected, e.g. because file extensions count as backend
writes, and are going to be roughly correlated with throughput, and not
much else. But they're more similar than I'd actually expect.
I do see a pretty big difference in the number of bgwriter-written
buffers in the "new" case for scale 10000, on the nvme?
Right.
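(For context, those buffer stats are simply before/after snapshots of
pg_stat_bgwriter, collected roughly like this - the dbname is a
placeholder:)

    # Snapshot pg_stat_bgwriter before/after a run and print the deltas;
    # dbname is a placeholder.
    import psycopg2

    COLS = ["buffers_checkpoint", "buffers_clean", "buffers_backend",
            "buffers_alloc", "maxwritten_clean"]

    def snapshot():
        conn = psycopg2.connect("dbname=bench")
        with conn.cursor() as cur:
            cur.execute("SELECT %s FROM pg_stat_bgwriter" % ", ".join(COLS))
            row = cur.fetchone()
        conn.close()
        return dict(zip(COLS, row))

    before = snapshot()
    input("run the benchmark, then press enter...")
    after = snapshot()
    for c in COLS:
        print("%-20s %12d" % (c, after[c] - before[c]))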
For the SATA SSD case, I wonder if the throughput bottleneck is WAL
writes. I see much more noticeable differences if I enable
wal_compression or disable full_page_writes, because otherwise the bulk
of the volume is WAL data. But even in that case, I see a latency
stddev reduction with the new bgwriter around checkpoints.
I may try that during the next round of tests.
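Flipping those settings between runs should only need a reload, not a
restart - something like this (sketch, dbname is a placeholder):

    # Toggle the WAL-related settings for the next round of tests.
    # ALTER SYSTEM cannot run inside a transaction block, hence autocommit.
    import psycopg2

    conn = psycopg2.connect("dbname=bench")
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute("ALTER SYSTEM SET wal_compression = on")
        cur.execute("ALTER SYSTEM SET full_page_writes = off")
        cur.execute("SELECT pg_reload_conf()")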
The one case where it did change is the "medium" scale on SATA storage,
where the throughput improved with the patch. But the change is kinda
strange, because the number of buffers evicted by the bgwriter decreased
(and instead they got evicted by the checkpointer). That might explain the
higher throughput, because the checkpointer is probably more efficient.
Well, one problem with the current bgwriter implementation is that the
victim selection isn't good. Because it doesn't perform clock sweep, and
doesn't clean buffers with a usagecount, it'll often run until it finds
dirty buffers that are pretty far ahead of the clock hand, and cleans
those. But with a random test like pgbench it's somewhat likely that
those buffers will get re-dirtied before backends actually get to
reusing them (that's a problem with the new implementation too, the
window is just smaller). But I'm far from sure that that's the cause here.
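Just to make that window concrete, a toy model of the scan (an
illustration only, not the actual bufmgr.c logic):

    # Toy model: the bgwriter scans ahead of the clock hand and writes only
    # dirty buffers that are already reusable (usage_count == 0, unpinned);
    # it neither decrements usage counts nor advances the hand, so it may
    # have to look far ahead, and those cleaned buffers can be re-dirtied
    # before any backend reuses them.
    import random
    from dataclasses import dataclass

    @dataclass
    class Buf:
        dirty: bool = False
        pinned: bool = False
        usage_count: int = 0

    def bgwriter_scan(buffers, hand, to_clean):
        pos, cleaned = hand, 0
        while cleaned < to_clean and pos - hand < len(buffers):
            buf = buffers[pos % len(buffers)]
            if buf.dirty and not buf.pinned and buf.usage_count == 0:
                buf.dirty = False          # "write" it out
                cleaned += 1
            pos += 1                       # note: the real hand is not moved
        return pos - hand                  # how far ahead it had to scan

    # random workload: many buffers recently used, some dirty
    bufs = [Buf(dirty=random.random() < 0.3,
                usage_count=random.randint(0, 5)) for _ in range(10000)]
    print("scanned ahead:", bgwriter_scan(bufs, hand=0, to_clean=100))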
OK.
Time for more tests, I guess.
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services