Hi Thomas / hackers,

>> The append-only bottleneck appears to be limited by syscalls/s due to small 
>> block size even with everything in FS cache (but not in shared buffers, 
>> please compare with TEST1 as there was no such bottleneck at all):
>>
>>     29.62%  postgres  [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
>>             ---copy_user_enhanced_fast_string
>>                |--17.98%--copyin
>> [..]
>>                |          __pwrite_nocancel
>>                |          FileWrite
>>                |          mdwrite
>>                |          FlushBuffer
>>                |          ReadBuffer_common
>>                |          ReadBufferWithoutRelcache
>>                |          XLogReadBufferExtended
>>                |          XLogReadBufferForRedoExtended
>>                |           --17.57%--btree_xlog_insert
>
> To move these writes out of recovery's way, we should probably just
> run the bgwriter process during crash recovery.  I'm going to look
> into that.

Sounds awesome. Also, since this thread is starting to derail the SLRU fsync topic, 
maybe we should change the subject? Anyway, to add some data on the separate 
bgwriter: on 14master (already with lseek() caching, SLRU fsyncs out of the way, 
better sorting), I've measured the same configuration as last time, with the same 
append-only WAL workload on NVMe, compared across various shared_buffers settings. 
The buffer descriptors sizing comes from pg_shmem_allocations (which, as you stated, 
is reported wrongly(?); I'm including it only for reference just in case; a query 
sketch follows the table). The first number per line is the redo time in seconds:

shared_buffers=128MB  buffers_desc=1024kB  96.778, 0.438  [a]
shared_buffers=256MB  buffers_desc=2048kB  62.755, 0.577  [a]
shared_buffers=512MB  buffers_desc=4096kB  33.167, 0.62   [a]
shared_buffers=1GB    buffers_desc=8192kB  27.303, 0.929
shared_buffers=4GB    buffers_desc=32MB    27.185, 1.166
shared_buffers=8GB    buffers_desc=64MB    27.649, 1.088
shared_buffers=16GB   buffers_desc=128MB   27.584, 1.201  [b]
shared_buffers=32GB   buffers_desc=256MB   32.314, 1.171  [b]
shared_buffers=48GB   buffers_desc=384MB   31.95,  1.217
shared_buffers=64GB   buffers_desc=512MB   31.276, 1.349
shared_buffers=72GB   buffers_desc=576MB   31.925, 1.284
shared_buffers=80GB   buffers_desc=640MB   31.809, 1.413
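
For reference, buffers_desc above is the "Buffer Descriptors" entry in 
pg_shmem_allocations, i.e. something along these lines (exact query from memory, 
adjust as needed):

    psql -c "SELECT name, size, allocated_size
             FROM pg_shmem_allocations WHERE name = 'Buffer Descriptors';"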

The amount of WAL to be replayed was ~2.8GB. To me it looks like:
a) flushing dirty buffers by StartupXLOG might be a real problem, but please read on;
b) there seems to be some impact from the L2/L3 cache hypothesis you mention (?), but 
it's not that big and it doesn't degrade linearly as one would expect (??). The 
L1d:L1i:L2:L3 cache sizes on this machine are 32K/32K/256K/46080K. Maybe the DB size 
is too small.
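
For reference, those cache sizes are simply what the OS reports, e.g.:

    lscpu | grep -i cache     # or: getconf -a | grep -i CACHE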

I've double-checked that under condition [a] (shared_buffers=128MB) there were a lot 
of FlushBuffer() invocations per second (perf stat -e probe_postgres:FlushBuffer 
-I 1000; see the probe setup sketch below), e.g.:
#           time             counts unit events
     1.000485217             79,494      probe_postgres:FlushBuffer
     2.000861366             75,305      probe_postgres:FlushBuffer
     3.001263163             79,662      probe_postgres:FlushBuffer
     4.001650795             80,317      probe_postgres:FlushBuffer
     5.002033273             79,951      probe_postgres:FlushBuffer
     6.002418646             79,993      probe_postgres:FlushBuffer
while at 1GB shared_buffers it sits near zero all the time.  So something like a 3x 
speed-up seems possible if StartupXLOG didn't have to care about dirty buffers in its 
critical path, at least in append-only scenarios (bgwriter would take care of them, 
as you say?).  But... I've also checked some real systems (even on older PostgreSQL 
versions, not doing that much replication), and indeed FlushBuffer() does happen 
there at e.g. 8-12k/s with pretty huge shared_buffers (>100GB, on 0.5TB RAM boxes). 
However those are *system-wide* numbers: the work is really performed by bgwriter, 
not by the startup/recovery process as in this redo-bench case, because the DB is 
open for reads while replicating. So it appears this isn't an optimization for the 
hot standby case, but for the DB-closed startup recovery/restart/disaster/PITR 
scenario.
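
In case anyone wants to reproduce the FlushBuffer counting: the uprobe can be set up 
roughly like below (binary path and pgrep pattern are placeholders for whatever your 
installation uses):

    # create a userspace probe on the FlushBuffer symbol in the postgres binary
    perf probe -x /usr/pgsql-14/bin/postgres FlushBuffer
    # count hits per second in the recovery (startup) process; use -a for system-wide
    perf stat -e probe_postgres:FlushBuffer -I 1000 -p $(pgrep -f 'postgres: startup')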

As for the 24GB shared_buffers scenario, where dirty buffers are not a problem at 
all, the top of the profile looks as expected (output trimmed):

    13.41%  postgres  postgres           [.] hash_search_with_hash_value
               |--8.31%--BufTableLookup <- ReadBuffer_common <- ReadBufferWithoutRelcache
                --5.11%--smgropen 
                          |--2.77%--XLogReadBufferExtended
                           --2.34%--ReadBufferWithoutRelcache
     7.88%  postgres  postgres           [.] MarkBufferDirty

I've tried to get the cache-miss ratio via PMCs, but apparently on EC2 (even on 
bigger instances) they report as not-supported or zeros. However, interestingly, the 
workload has an IPC of 1.40 (instruction bound), which is strange to me as I would 
expect BufTableLookup() to be heavily memory bound (?). Maybe I'll try on some 
different hardware one day.
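
For reference, IPC and the cache counters can be grabbed with plain perf stat against 
the recovery process, e.g. (pgrep pattern is a placeholder; on EC2 it's the cache 
events that come back as <not supported> or zero):

    perf stat -e cycles,instructions,cache-references,cache-misses \
        -p $(pgrep -f 'postgres: startup') -- sleep 30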

>>                           __pread_nocancel
>>                            --11.44%--FileRead
>>                                      mdread
>>                                      ReadBuffer_common
>>                                      ReadBufferWithoutRelcache
>>                                      XLogReadBufferExtended
>>                                      XLogReadBufferForRedoExtended
>
> For these reads, the solution should be WAL prefetching,(..) But... when 
> combined with Andres's work-in-progress AIO stuff (..)

Yes, I've heard a thing or two about those :) I hope I'll be able to deliver some 
measurements of those two together (AIO + WAL prefetch) sooner or later.

-Jakub Wartak.
