Hi,

On 2021-04-28 19:24:53 -0400, Tom Lane wrote:
> But I happened to notice the accumulated CPU time for the background
> processes:
>
> USER    PID %CPU %MEM    VSZ   RSS  TT  STAT STARTED     TIME COMMAND
> tgl   19048  0.0  4.4 229952 92196  ??  Ss   3:19PM  19:59.19 postgres: startup recovering 000000010000001400000022
> tgl   19051  0.0  0.1 229656  1696  ??  Ss   3:19PM  27:09.14 postgres: walreceiver streaming 14/227D8F14
> tgl   19052  0.0  0.1 229904  2516  ??  Ss   3:19PM  17:38.17 postgres: walsender tgl [local] streaming 14/227D8F14
>
> IOW, we've spent over twice as many CPU cycles shipping data to the
> standby as we did in applying the WAL on the standby.  Is this
> expected?  I've got wal_consistency_checking = all, which is bloating
> the WAL volume quite a bit, but still it seems like the walsender and
> walreceiver have little excuse for spending more cycles per byte
> than the startup process.
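(An aside before I reply: on Linux the TIME column shown there is just accumulated
CPU, i.e. utime + stime out of /proc/<pid>/stat. Purely as an illustration, and
not something anyone in this thread actually ran, a rough sketch of totalling
that per postgres process looks like the following; it says nothing about what
macOS counts, which is exactly the question below.)

#!/usr/bin/env python3
# Illustrative sketch only, not from this thread: on Linux, ps's TIME column
# is utime + stime from /proc/<pid>/stat, i.e. CPU time only, not time spent
# blocked on IO.  This totals that figure for every "postgres:" process.
import os

CLK_TCK = os.sysconf('SC_CLK_TCK')          # clock ticks per second

def cpu_seconds(pid):
    # comm (field 2) is "(...)" and can contain spaces, so split after the
    # closing paren; utime/stime are fields 14 and 15 of the stat line.
    with open(f'/proc/{pid}/stat') as f:
        fields = f.read().rsplit(')', 1)[1].split()
    return (int(fields[11]) + int(fields[12])) / CLK_TCK

for pid in (p for p in os.listdir('/proc') if p.isdigit()):
    try:
        with open(f'/proc/{pid}/cmdline', 'rb') as f:
            cmd = f.read().replace(b'\0', b' ').decode(errors='replace').strip()
        if cmd.startswith('postgres:'):
            print(f'{pid:>8} {cpu_seconds(pid):9.2f}s  {cmd}')
    except OSError:
        continue                            # process went away, skip it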
I don't really know how the time calculation works on mac. Is there a
chance it includes time spent doing IO? On the primary the WAL IO is
done by a lot of backends, but on the standby it's all going to be the
walreceiver. And the walreceiver does fsyncs in a not particularly
efficient manner.

FWIW, on my linux workstation no such difference is visible:

USER        PID %CPU %MEM     VSZ    RSS TTY STAT START  TIME COMMAND
andres  2910540  9.4  0.0 2237252 126680 ?   Ss   16:55  0:20 postgres: dev assert standby: startup recovering 00000001000000020000003F
andres  2910544  5.2  0.0 2236724   9260 ?   Ss   16:55  0:11 postgres: dev assert standby: walreceiver streaming 2/3FDCF118
andres  2910545  2.1  0.0 2237036  10672 ?   Ss   16:55  0:04 postgres: dev assert: walsender andres [local] streaming 2/3FDCF118

> (This is testing b3ee4c503, so if Thomas' WAL changes improved
> efficiency of the replay process at all, the discrepancy could be
> even worse in HEAD.)

The prefetching isn't enabled by default, so I'd not expect meaningful
differences... And even with the prefetching enabled, our normal
regression tests largely are resident in s_b, so there shouldn't be much
prefetching.

Oh! I was about to ask how much shared_buffers your primary / standby
have. And I think I may actually have reproduced a variant of the issue!
I previously had played around with different settings that I thought
might increase the likelihood of reproducing the problem. But this time
I set shared_buffers lower than before, and got:

2021-04-28 17:03:22.174 PDT [2913840][] LOG:  database system was shut down in recovery at 2021-04-28 17:03:11 PDT
2021-04-28 17:03:22.174 PDT [2913840][] LOG:  entering standby mode
2021-04-28 17:03:22.178 PDT [2913840][1/0] LOG:  redo starts at 2/416C6278
2021-04-28 17:03:37.628 PDT [2913840][1/0] LOG:  consistent recovery state reached at 4/7F5C3200
2021-04-28 17:03:37.628 PDT [2913840][1/0] FATAL:  invalid memory alloc request size 3053455757
2021-04-28 17:03:37.628 PDT [2913839][] LOG:  database system is ready to accept read only connections
2021-04-28 17:03:37.636 PDT [2913839][] LOG:  startup process (PID 2913840) exited with exit code 1

This reproduces across restarts. Yay, I guess.

Isn't it odd that we get a "database system is ready to accept read only
connections", given that the startup process had just died with a FATAL
error?

Greetings,

Andres Freund