Hello, I have a database running on i3.8xlarge (256GB RAM, 32 CPU cores, 4x 1.9TB NVMe drive) AWS instance with about 5TB of disk space occupied, ext4, Ubuntu 16.04.
Multi-tenant DB with about 40000 tables, insert heavy. I started a new slave with identical HW specs, SR. DB started syncing from master, which took about 4 hours, then it started applying the WALs. However, it seems it can't catch up. Delay is still around 3 hours (measured with now() - pg_last_xact_replay_timestamp()), even a day later. It goes a few 100s up and down, but it seems to float around 3h mark. Disk IO is low at about 10%, measured with iostat, no connected clients, recovery process is at around 90% CPU single core usage. Tried tuning the various parameters, but with no avail. Only thing I found suspicious is stracing the recovery process constantly produces many errors such as: lseek(428, 0, SEEK_END) = 780124160 lseek(30, 0, SEEK_END) = 212992 read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily unavailable) lseek(680, 0, SEEK_END) = 493117440 read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily unavailable) lseek(774, 0, SEEK_END) = 583368704 ...[snip]... read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily unavailable) lseek(774, 0, SEEK_END) = 583368704 read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily unavailable) lseek(277, 0, SEEK_END) = 502882304 lseek(6, 516096, SEEK_SET) = 516096 read(6, "\227\320\5\0\1\0\0\0\0\340\7\246\26\274\0\0\315\0\0\0\0\0\0\0}\0178\5&/\260\r"..., 8192) = 8192 read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily unavailable) lseek(735, 0, SEEK_END) = 272809984 read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily unavailable) lseek(277, 0, SEEK_END) = 502882304 ls -l fd/9 lr-x------ 1 postgres postgres 64 Oct 21 06:21 fd/9 -> pipe:[46358] Perf top on recovery produces: 27.76% postgres [.] pglz_decompress 9.90% [kernel] [k] entry_SYSCALL_64_after_swapgs 7.09% postgres [.] hash_search_with_hash_value 4.26% libpthread-2.23.so [.] llseek 3.64% libpthread-2.23.so [.] __read_nocancel 2.80% [kernel] [k] __fget_light 2.67% postgres [.] 0x000000000034d3ba 1.85% [kernel] [k] ext4_llseek 1.84% postgres [.] pg_comp_crc32c_sse42 1.44% postgres [.] hash_any 1.35% postgres [.] 0x000000000036afad 1.29% postgres [.] MarkBufferDirty 1.21% postgres [.] XLogReadRecord [...] Tried changing the process limits with prlimit to unlimited, but no change. I can turn off the WAL compression but I doubt this is the main culprit. Any ideas appreciated. Regards, Boris