Hello Andres,

And here's v14. It's not entirely ready yet.

I'm going to have a careful look at it.

A lot of details have changed; unfortunately I don't remember them all. But there are more important things than the details of the patch.

I've played *a lot* with this patch. I found a bunch of issues:

1) The FileFlushContext infrastructure isn't actually
  correct. There are two problems: First, using the actual 'fd' number to
  reference a to-be-flushed file isn't meaningful. If there are lots
  of files open, fds get reused within fd.c.

Hmm.

My assumption is that a file in use (i.e. with modified pages, being used for writes...) would not be closed before everything is cleared...

After some poking around in the code, I think that this issue may indeed be there, although the probability of hitting it is close to 0, but alas not 0 :-)

To fix it, ISTM that it is enough to hold a "do not close" lock on the file while a flush is in progress (a short time), which would prevent mdclose from doing its stuff.
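
To make the idea concrete, here is a minimal standalone sketch of such a pin, assuming a simplified fd.c-like virtual file entry; the names (SketchVfd, flush_begin, flush_end, file_close) are invented for illustration and this is not the actual fd.c/md.c code:

/*
 * Standalone sketch of a "do not close" pin: while a flush over a file
 * is in progress, a close request (as mdclose would issue) is deferred
 * until the flush ends.  Names and structure are invented.
 */
#include <stdbool.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

typedef struct SketchVfd
{
    int   fd;              /* kernel file descriptor, -1 if closed */
    bool  flush_pinned;    /* "do not close" pin held during a flush */
    bool  close_requested; /* a close arrived while the pin was held */
} SketchVfd;

/* Begin a flush: pin the file so a concurrent close is deferred. */
static void
flush_begin(SketchVfd *vfd)
{
    vfd->flush_pinned = true;
}

/* End a flush: drop the pin and perform any deferred close. */
static void
flush_end(SketchVfd *vfd)
{
    vfd->flush_pinned = false;
    if (vfd->close_requested && vfd->fd >= 0)
    {
        close(vfd->fd);
        vfd->fd = -1;
        vfd->close_requested = false;
    }
}

/* What mdclose-like code would call: defer if a flush is in progress. */
static void
file_close(SketchVfd *vfd)
{
    if (vfd->flush_pinned)
    {
        vfd->close_requested = true;    /* defer until flush_end() */
        return;
    }
    if (vfd->fd >= 0)
    {
        close(vfd->fd);
        vfd->fd = -1;
    }
}

int
main(void)
{
    SketchVfd vfd = { open("/tmp/flush-pin-demo", O_RDWR | O_CREAT, 0600),
                      false, false };

    flush_begin(&vfd);
    file_close(&vfd);       /* deferred: the pin is held */
    printf("fd still open during flush: %d\n", vfd.fd);
    flush_end(&vfd);        /* the deferred close happens here */
    printf("fd after flush: %d\n", vfd.fd);
    return 0;
}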

That part is easy enough to fix by referencing the File instead of the fd. The bigger problem is that the infrastructure doesn't deal with files being closed. There can, which isn't that hard to trigger, be smgr invalidations causing the smgr handle and thus the file to be closed.

I think this means that the entire flushing infrastructure actually
needs to be hoisted up, onto the smgr/md level.

Hmmm. I'm not sure that it is necessary; see my suggestion above.

2) I noticed that sync_file_range() blocked far more often than I'd
  expected. Reading the kernel code, that turned out to be caused by a
  pessimization in the kernel introduced years ago - in many situations
  SFR_WRITE waited for the writes. A fix for this will be in the 4.4
  kernel.

Alas, Pg cannot do much about issues in the kernel.
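
For reference, here is a minimal Linux-only illustration of the flag combinations in question; the file name and buffer size are arbitrary and this is not the patch's code, just what SYNC_FILE_RANGE_WRITE (which is only supposed to initiate writeback) versus the fully synchronous WAIT_BEFORE|WRITE|WAIT_AFTER variant looks like:

/*
 * Minimal Linux-only illustration of the sync_file_range() flags being
 * discussed.  SYNC_FILE_RANGE_WRITE alone is only supposed to *initiate*
 * writeback; the blocking observed above happens inside the kernel
 * despite that.  File name and sizes here are arbitrary.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    char  buf[8192];
    int   fd = open("/tmp/sfr-demo", O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd < 0) { perror("open"); return 1; }

    memset(buf, 'x', sizeof(buf));
    if (write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf))
    { perror("write"); return 1; }

    /* Ask the kernel to start writeback of the range, ideally without
     * waiting for it to complete. */
    if (sync_file_range(fd, 0, sizeof(buf), SYNC_FILE_RANGE_WRITE) != 0)
        perror("sync_file_range(WRITE)");

    /* The fully synchronous variant, by contrast, explicitly waits both
     * before and after issuing the writes. */
    if (sync_file_range(fd, 0, sizeof(buf),
                        SYNC_FILE_RANGE_WAIT_BEFORE |
                        SYNC_FILE_RANGE_WRITE |
                        SYNC_FILE_RANGE_WAIT_AFTER) != 0)
        perror("sync_file_range(WAIT|WRITE|WAIT)");

    close(fd);
    return 0;
}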

3) I found that latency wasn't improved much for workloads that are
  significantly bigger than shared buffers. The problem here is that
  neither bgwriter nor the backends have, so far, done
  sync_file_range() calls. That meant that the old problem of having
  gigabytes of dirty data that periodically get flushed out still
  exists. Having these do flushes mostly attacks that problem.

I'm conscious that the patch only addresses *checkpointer* writes, not those from the bgwriter or the backends. I agree that these should be addressed at some point as well, but given the time it takes to get a patch through, and that the more complex the patch the slower it goes (sorting proposals are 10 years old), I think this should be postponed for later.

Benchmarking revealed that for workloads where the hot data set mostly
fits into shared buffers, flushing and sorting is anywhere from a small
to a massive improvement, both in throughput and latency. Even without
the patch from 2), although fixing that improves things further.

This is consistent with my experiments: sorting improves things, and flushing on top of sorting improves things further.

What I did not expect, and what confounded me for a long while, is that
for workloads where the hot data set does *NOT* fit into shared buffers,
sorting often led to a noticeable reduction in throughput. Up to
30%.

I did not see such behavior in the many tests I ran. Could you share more precise details so that I can try to reproduce this performance regression? (available memory, shared buffers, db size, ...).

The performance was still much more regular than before, i.e. no
more multi-second periods without any transactions happening.

By now I think I know what's going on: Before the sorting portion of the
patch, the write-loop in BufferSync() starts at the current clock hand,
by using StrategySyncStart(). But after the sorting that obviously
doesn't happen anymore - buffers are accessed in their sort order. By
starting at the current clock hand and moving on from there, the
checkpointer basically makes it less likely that victim buffers
need to be written either by the backends themselves or by
bgwriter. That means that the sorted checkpoint writes can, indirectly,
increase the number of unsorted writes by other processes :(

I'm quite surprised at such a large effect on throughput, though.

This explanation seems to suggest that if bgwriter/worker writes were sorted and/or somehow coordinated with the checkpointer, then all would be well?

ISTM that this explanation could be checked by looking at whether bgwriter/worker writes are especially large compared to checkpointer writes in those cases with reduced throughput? The data is in the logs.
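
For clarity, here is a toy standalone sketch of the two iteration orders described above; the buffer ids, the clock-hand position and the file/block sort keys are invented, and this is not the actual BufferSync()/StrategySyncStart() code, just the shape of the change:

/*
 * Toy illustration of the iteration-order change under discussion.  The
 * buffer ids, the clock-hand position and the "file/block" sort keys are
 * made up; only the shape of the two loops matters here.
 */
#include <stdio.h>
#include <stdlib.h>

#define NBUFFERS 8

typedef struct ToyBuf
{
    int buf_id;
    int file;    /* stands in for the relation part of the sort key */
    int block;   /* stands in for the block number part */
} ToyBuf;

static int
cmp_file_block(const void *a, const void *b)
{
    const ToyBuf *x = a, *y = b;

    if (x->file != y->file)
        return x->file - y->file;
    return x->block - y->block;
}

int
main(void)
{
    ToyBuf bufs[NBUFFERS];
    int    clock_hand = 5;   /* pretend the strategy point is here */
    int    i;

    for (i = 0; i < NBUFFERS; i++)
    {
        bufs[i].buf_id = i;
        bufs[i].file = rand() % 3;
        bufs[i].block = rand() % 100;
    }

    /* Pre-sorting order: start at the clock hand and wrap around, so the
     * checkpointer tends to clean soon-to-be-victim buffers first. */
    printf("clock-hand order:");
    for (i = 0; i < NBUFFERS; i++)
        printf(" %d", (clock_hand + i) % NBUFFERS);
    printf("\n");

    /* Post-sorting order: purely by file/block, unrelated to the hand. */
    qsort(bufs, NBUFFERS, sizeof(ToyBuf), cmp_file_block);
    printf("sorted order:    ");
    for (i = 0; i < NBUFFERS; i++)
        printf(" %d", bufs[i].buf_id);
    printf("\n");
    return 0;
}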



My benchmarking suggests that the effect is larger the shorter the
checkpoint timeout is.

Hmmm. The shorter the timeout, the less likely the sorting is to be effective, the more likely writes go back to being random I/Os, and maybe the more likely we are to see some effect of the sync strategy stuff.

That seems to intuitively make sense, given the above explanation attempt. If the checkpoint takes longer, the clock hand will almost certainly soon overtake the checkpoint's 'implicit' hand.

I'm not sure if we can really do anything about this problem. While I'm
pretty jet-lagged, I still spent a fair amount of time thinking about
it. That seems to suggest that we need to bring back the setting to
enable/disable sorting :(


What I think needs to happen next with the patch is:
1) Hoist up the FileFlushContext stuff into the smgr layer. Carefully
  handling the issue of smgr invalidations.

Not sure that much is necessary; see my suggestion above.

2) Replace the boolean checkpoint_flush_to_disk GUC with a list guc that
  later can contain multiple elements like checkpoint, bgwriter,
  backends, ddl, bulk-writes. That seems better than adding GUCs for
  these separately. Then make the flush locations in the patch
  configurable using that.

My 0.02€ on this point: I have not seen much of this style of GUC elsewhere. The only ones I found while scanning the postgres configuration file are *_path and *_libraries. It seems to me that this would depart significantly from the usual style, so one GUC per case, or one shared GUC with only on/off, would blend in more cleanly with the existing style.
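
For concreteness, here is a tiny standalone sketch of what parsing such a list value could boil down to internally; the element names come from the proposal above, the value string is hypothetical, and the parsing is illustrative rather than the actual GUC machinery. In the end it reduces to a handful of booleans, which the one-GUC-per-case style would expose directly:

/*
 * Standalone sketch of parsing a list-style GUC value such as the one
 * proposed above.  The value string and element names are illustrative;
 * this is not PostgreSQL's actual GUC code.
 */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
    char  value[] = "checkpoint, bgwriter";   /* hypothetical setting */
    bool  flush_checkpoint = false;
    bool  flush_bgwriter = false;
    bool  flush_backends = false;
    char *tok;

    /* Split on commas and spaces and set one flag per known element. */
    for (tok = strtok(value, ", "); tok != NULL; tok = strtok(NULL, ", "))
    {
        if (strcmp(tok, "checkpoint") == 0)
            flush_checkpoint = true;
        else if (strcmp(tok, "bgwriter") == 0)
            flush_bgwriter = true;
        else if (strcmp(tok, "backends") == 0)
            flush_backends = true;
        else
            fprintf(stderr, "unrecognized element: %s\n", tok);
    }

    printf("checkpoint=%d bgwriter=%d backends=%d\n",
           flush_checkpoint, flush_bgwriter, flush_backends);
    return 0;
}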

3) I think we should remove the sort timing from the checkpoint logging
  before commit. It'll always be pretty short.

I added it to show that it was really short, in response to concerns that my approach of just sorting through indexes (to reduce the memory needed, instead of copying the data to be sorted) might induce significant performance issues. I proved my point, but peer pressure made me switch to the larger-memory approach anyway.

I think it should be kept while the features are under testing. I do not think that it harms in any way.

--
Fabien.