On 2016-01-16 10:01:25 +0100, Fabien COELHO wrote:
> Hello Andres,
>
> > I measured it in a different number of cases, both on SSDs and spinning
> > rust. I just reproduced it with:
> >
> > postgres-ckpt14 \
> >   -D /srv/temp/pgdev-dev-800/ \
> >   -c maintenance_work_mem=2GB \
> >   -c fsync=on \
> >   -c synchronous_commit=off \
> >   -c shared_buffers=2GB \
> >   -c wal_level=hot_standby \
> >   -c max_wal_senders=10 \
> >   -c max_wal_size=100GB \
> >   -c checkpoint_timeout=30s
> >
> > Using a fresh cluster each time (copied from a "template" to save time)
> > and using
> > pgbench -M prepared -c 16 -j 16 -T 300 -P 1
So, I've analyzed the problem further, and I think I found something rather interesting. I profiled the kernel, looking at where it blocks in the IO request queues, and found that the wal writer was involved surprisingly often.

So, in a workload where everything (checkpoint, bgwriter, backend writes) is flushed:
2995 tps
After I kill the wal writer with -STOP:
10887 tps

Stracing the wal writer shows:

17:29:02.001517 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17857, si_uid=1000} ---
17:29:02.001538 rt_sigreturn({mask=[]}) = 0
17:29:02.001582 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
17:29:02.001615 write(3, "\210\320\5\0\1\0\0\0\0@\330_/\0\0\0w\f\0\0\0\0\0\0\0\4\0\2\t\30\0\372"..., 49152) = 49152
17:29:02.001671 fdatasync(3) = 0
17:29:02.005022 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17825, si_uid=1000} ---
17:29:02.005043 rt_sigreturn({mask=[]}) = 0
17:29:02.005081 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
17:29:02.005111 write(3, "\210\320\5\0\1\0\0\0\0\0\331_/\0\0\0\7\26\0\0\0\0\0\0T\251\0\0\0\0\0\0"..., 8192) = 8192
17:29:02.005147 fdatasync(3) = 0
17:29:02.008688 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17866, si_uid=1000} ---
17:29:02.008705 rt_sigreturn({mask=[]}) = 0
17:29:02.008730 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
17:29:02.008757 write(3, "\210\320\5\0\1\0\0\0\0 \331_/\0\0\0\267\30\0\0\0\0\0\0 "..., 98304) = 98304
17:29:02.008822 fdatasync(3) = 0
17:29:02.016125 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17865, si_uid=1000} ---
17:29:02.016141 rt_sigreturn({mask=[]}) = 0
17:29:02.016174 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
17:29:02.016204 write(3, "\210\320\5\0\1\0\0\0\0\240\332_/\0\0\0s\5\0\0\0\0\0\0\t\30\0\2|8\2u"..., 57344) = 57344
17:29:02.016281 fdatasync(3) = 0
17:29:02.019181 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17865, si_uid=1000} ---
17:29:02.019199 rt_sigreturn({mask=[]}) = 0
17:29:02.019226 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
17:29:02.019249 write(3, "\210\320\5\0\1\0\0\0\0\200\333_/\0\0\0\307\f\0\0\0\0\0\0 "..., 73728) = 73728
17:29:02.019355 fdatasync(3) = 0
17:29:02.022680 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17865, si_uid=1000} ---
17:29:02.022696 rt_sigreturn({mask=[]}) = 0

I.e. we're fdatasync()ing small numbers of pages, roughly 500 times a second. As soon as the wal writer is stopped, the flushes happen in much bigger chunks, on the order of 50-130 pages. And, not that surprisingly, that improves performance, because there are far fewer cache flushes submitted to the hardware (see the little write/fdatasync sketch at the bottom of this mail).

> I'm running some tests similar to those above...
> Do you do some warmup when testing? I guess the answer is "no".

Doesn't make a difference here, I tried both. As long as before/after benchmarks start from the same state...

> I understand that you have 8 cores/16 threads on your host?

On one of them; 4 cores/8 threads on the laptop.

> Loading scale 800 data for 300 seconds tests takes much more than 300
> seconds (init takes ~360 seconds, vacuum & index are slow). With 30 seconds
> checkpoint cycles and without any warmup, I feel that these tests are really
> on the very short (too short) side, so I'm not sure how much I can trust
> such results as significant. The data I reported were with more real life
> like parameters.

I see exactly the same with 300s or 1000s checkpoint cycles, it just takes a lot longer to repeat.

The numbers are also similar (although obviously both before/after patch are higher) if I disable full_page_writes, thereby eliminating a lot of other IO.
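To make that flush-pattern difference a bit more concrete, here's a rough standalone sketch. It's not PostgreSQL code, and the file name, chunk sizes and data volume are made up purely for illustration; it just writes the same amount of data twice, once fdatasync()ing after every 8kB page (as in the strace above) and once after 1MB chunks:

/*
 * Toy comparison of the two I/O patterns discussed above: many small
 * write()+fdatasync() calls vs. the same data flushed in larger chunks.
 * NOT PostgreSQL code - file name, sizes and counts are made up.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

static double
elapsed_ms(struct timeval a, struct timeval b)
{
    return (b.tv_sec - a.tv_sec) * 1000.0 + (b.tv_usec - a.tv_usec) / 1000.0;
}

/* write 'total' bytes in 'chunk'-sized pieces, fdatasync()ing after each */
static void
write_flush(int fd, size_t total, size_t chunk, const char *buf)
{
    for (size_t done = 0; done < total; done += chunk)
    {
        if (write(fd, buf, chunk) != (ssize_t) chunk || fdatasync(fd) != 0)
        {
            perror("write/fdatasync");
            exit(1);
        }
    }
}

int
main(void)
{
    const size_t total = 32 * 1024 * 1024;    /* 32MB of fake "WAL" */
    char *buf = calloc(1, 1024 * 1024);
    int fd = open("fdatasync-test", O_WRONLY | O_CREAT | O_TRUNC, 0600);
    struct timeval start, stop;

    if (fd < 0 || buf == NULL)
    {
        perror("setup");
        return 1;
    }

    /* pattern 1: one 8kB page per flush, like the strace output */
    gettimeofday(&start, NULL);
    write_flush(fd, total, 8192, buf);
    gettimeofday(&stop, NULL);
    printf("8kB per fdatasync: %.0f ms\n", elapsed_ms(start, stop));

    /* pattern 2: ~128 pages per flush, as with the wal writer stopped */
    if (lseek(fd, 0, SEEK_SET) < 0)
        return 1;
    gettimeofday(&start, NULL);
    write_flush(fd, total, 1024 * 1024, buf);
    gettimeofday(&stop, NULL);
    printf("1MB per fdatasync: %.0f ms\n", elapsed_ms(start, stop));

    close(fd);
    return 0;
}

Obviously that ignores everything else that's going on (the real WAL write sizes, concurrent writers, and so on); it just boils the strace output down to the part that matters, namely how many cache flushes get sent to the drive for the same amount of data.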
Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers