On 2015-10-19 21:14:55 +0200, Fabien COELHO wrote: > >In my performance testing it showed that calling PerformFileFlush() only > >at segment boundaries and in CheckpointWriteDelay() can lead to rather > >spikey IO - not that surprisingly. The sync in CheckpointWriteDelay() is > >problematic because it only is triggered while on schedule, and not when > >behind. > > When behind, the PerformFileFlush should be called on segment > boundaries.
That means it's flushing up to a gigabyte of data at once. Far too much. The implementation pretty always will go behind schedule for some time. Since sync_file_range() doesn't flush in the foreground I don't think it's important to do the flushing in concert with sleeping. > >My testing seems to show that just adding a limit of 32 buffers to > >FileAsynchronousFlush() leads to markedly better results. > > Hmmm. 32 buffers means 256 KB, which is quite small. Why? The aim is to not overwhelm the request queue - which is where the coalescing is done. And usually that's rather small. If you flush much more sync_file_range starts to do work in the foreground. > >I wonder if mmap() && msync(MS_ASYNC) isn't a better replacement for > >sync_file_range(SYNC_FILE_RANGE_WRITE) than posix_fadvise(DONTNEED). It > >might even be possible to later approximate that on windows using > >FlushViewOfFile(). > > I'm not sure that mmap/msync can be used for this purpose, because there is > no real control it seems about where the file is mmapped. I'm not following? Why does it matter where a file is mapped? I have had a friend (Christian Kruse, thanks!) confirm that at least on OSX msync(MS_ASYNC) triggers writeback. A freebsd dev confirmed that that should be the case on freebsd too. > >Having defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE) in > >so many places looks ugly, I want to push that to the underlying > >functions. If we add a different flushing approach we shouldn't have to > >touch several places that don't actually really care. > > I agree that it is pretty ugly, but I do not think that you can remove them > all. Sure, never said all. But most. > >I've replaced the NextBufferToWrite() logic with a binaryheap.h heap - > >seems to work well, with a bit less code actually. > > Hmmm. I'll check. I'm still unconvinced that using a tree for a 2-3 element > set in most case is an improvement. Yes, it'll not matter that much in many cases. But I rather disliked the NextBufferToWrite() implementation, especially that it walkes the array multiple times. And I did see setups with ~15 tablespaces. > >I've also noticed that sleeping logic in CheckpointWriteDelay() isn't > >particularly good. In high throughput workloads the 100ms sleep is too > >long, leading to bursty IO behaviour. If 1k+ buffers a written out a > >second 100ms is a rather long sleep. For another that we only sleep > >100ms when the write rate is low makes the checkpoint finish rather > >quickly - on a slow disk (say microsd) that can cause unneccesary > >slowdowns for concurrent activity. ISTM we should calculate the sleep > >time in a better way. > > I also noted this point, but I'm not sure how to have a better approach, so > I let it as it is. I tried 50 ms & 200 ms on some runs, without significant > effect on performance for the test I ran then. The point of having not too > small a value is that it provide some significant work to the IO subsystem > without overflowing it. I don't think that makes much sense. All a longer sleep achieves is creating a larger burst of writes afterwards. We should really sleep adaptively. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers