Yes, as stated, the lag dropped considerably after disabling wal compression; it's at least manageable now. The network is 10Gb.
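For anyone hitting the same problem, a quick way to tell whether a standby is behind on receiving WAL (network) or on replaying it (the single recovery process). This is only a sketch using the same functions as the query quoted below; the pg_stat_replication lag columns assume PostgreSQL 10 or later:

    -- On the standby: gap between WAL received and WAL replayed, in bytes.
    SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                           pg_last_wal_replay_lsn()) AS replay_gap_bytes;

    -- On the primary: per-standby time lag (PostgreSQL 10+).
    SELECT application_name, write_lag, flush_lag, replay_lag
    FROM pg_stat_replication;

A small byte gap combined with a large time lag points at replay speed rather than the network.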
Best regards,

Boris Sagadin
InfoSplet, informacijske tehnologije, d.o.o.
www.infosplet.com | Tel: 0590 / 45 800, GSM: 041 / 337 848

On Wed, Oct 24, 2018 at 3:34 PM, Hellmuth Vargas <hiv...@gmail.com> wrote:

> Hi
>
> On Wed, Oct 24, 2018 at 00:39, Boris Sagadin (bo...@infosplet.com) wrote:
>
>> Yes, times are all identical, set to UTC, ntpd is used.
>>
>>  log_delay
>> -----------
>>  15.788175
>>
>> This is the delay at this moment, but we graph replication delay and it's
>> fluctuating between 0 and 30s.
>
> But the fluctuation is between 0 and 30s, not 4 hours, fortunately.
> Apart from the wal compression issue, I think you should check the network.
>
>> Before I turned off wal compression, the lag was much bigger (from 0 up
>> to 8 minutes). We have lots of tables (40k) and many upserts.
>>
>> Boris
>>
>> On Tue, Oct 23, 2018 at 8:24 PM, Hellmuth Vargas <hiv...@gmail.com> wrote:
>>
>>> Hi
>>>
>>> Are both servers configured with the same date, time, and time zone
>>> settings?
>>>
>>> On Tue, Oct 23, 2018 at 13:16, Hellmuth Vargas (hiv...@gmail.com) wrote:
>>>
>>>> Hi
>>>>
>>>> What result do you get from the following query?
>>>>
>>>> SELECT CASE WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn()
>>>>        THEN 0
>>>>        ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp())
>>>>        END AS log_delay;
>>>>
>>>> Source:
>>>> https://severalnines.com/blog/postgresql-streaming-replication-deep-dive
>>>>
>>>> On Tue, Oct 23, 2018 at 11:28, Boris Sagadin (bo...@infosplet.com) wrote:
>>>>
>>>>> Nothing special, just:
>>>>>
>>>>> standby_mode = 'on'
>>>>> primary_conninfo = 'host=... user=repmgr application_name=nodex'
>>>>> recovery_target_timeline = 'latest'
>>>>>
>>>>> Boris
>>>>>
>>>>> On Tue, Oct 23, 2018 at 3:10 PM, Hellmuth Vargas <hiv...@gmail.com> wrote:
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> Can you share your recovery.conf settings?
>>>>>>
>>>>>> On Tue, Oct 23, 2018 at 00:28, Boris Sagadin (bo...@infosplet.com) wrote:
>>>>>>
>>>>>>> Yes, turning wal_compression off improves things. The slave that was
>>>>>>> mentioned unfortunately lagged too much before this setting was
>>>>>>> applied, so it was turned off. However, the remaining slave lags less
>>>>>>> now, although still occasionally up to a few minutes. I think the
>>>>>>> single-threadedness of recovery is a big slowdown for write-heavy
>>>>>>> databases. Maybe the option to increase the WAL segment size beyond
>>>>>>> 16MB in v11 will help.
>>>>>>>
>>>>>>> In the meantime we'll solve this by splitting the DB into 2 or 3
>>>>>>> clusters, or maybe by trying out a sharding solution like Citus.
>>>>>>>
>>>>>>> Boris
>>>>>>>
>>>>>>> On Sun, Oct 21, 2018 at 9:06 AM, Boris Sagadin <bo...@infosplet.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I have a database running on an i3.8xlarge AWS instance (256GB RAM,
>>>>>>>> 32 CPU cores, 4x 1.9TB NVMe drives) with about 5TB of disk space
>>>>>>>> occupied, ext4, Ubuntu 16.04.
>>>>>>>>
>>>>>>>> Multi-tenant DB with about 40000 tables, insert heavy.
>>>>>>>>
>>>>>>>> I started a new slave with identical HW specs, using streaming
>>>>>>>> replication. The DB started syncing from the master, which took
>>>>>>>> about 4 hours, then it started applying the WALs. However, it seems
>>>>>>>> it can't catch up. The delay is still around 3 hours (measured with
>>>>>>>> now() - pg_last_xact_replay_timestamp()), even a day later. It goes
>>>>>>>> a few hundred seconds up and down, but it seems to float around the
>>>>>>>> 3h mark.
>>>>>>>>
>>>>>>>> Disk I/O is low, at about 10% measured with iostat, there are no
>>>>>>>> connected clients, and the recovery process is at around 90% CPU
>>>>>>>> usage on a single core.
>>>>>>>>
>>>>>>>> I tried tuning various parameters, but to no avail. The only thing I
>>>>>>>> found suspicious is that stracing the recovery process constantly
>>>>>>>> produces many errors such as:
>>>>>>>>
>>>>>>>> lseek(428, 0, SEEK_END)    = 780124160
>>>>>>>> lseek(30, 0, SEEK_END)     = 212992
>>>>>>>> read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily unavailable)
>>>>>>>> lseek(680, 0, SEEK_END)    = 493117440
>>>>>>>> read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily unavailable)
>>>>>>>> lseek(774, 0, SEEK_END)    = 583368704
>>>>>>>>
>>>>>>>> ...[snip]...
>>>>>>>>
>>>>>>>> read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily unavailable)
>>>>>>>> lseek(774, 0, SEEK_END)    = 583368704
>>>>>>>> read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily unavailable)
>>>>>>>> lseek(277, 0, SEEK_END)    = 502882304
>>>>>>>> lseek(6, 516096, SEEK_SET) = 516096
>>>>>>>> read(6, "\227\320\5\0\1\0\0\0\0\340\7\246\26\274\0\0\315\0\0\0\0\0\0\0}\0178\5&/\260\r"..., 8192) = 8192
>>>>>>>> read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily unavailable)
>>>>>>>> lseek(735, 0, SEEK_END)    = 272809984
>>>>>>>> read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily unavailable)
>>>>>>>> lseek(277, 0, SEEK_END)    = 502882304
>>>>>>>>
>>>>>>>> ls -l fd/9
>>>>>>>> lr-x------ 1 postgres postgres 64 Oct 21 06:21 fd/9 -> pipe:[46358]
>>>>>>>>
>>>>>>>> Perf top on the recovery process produces:
>>>>>>>>
>>>>>>>> 27.76%  postgres            [.] pglz_decompress
>>>>>>>>  9.90%  [kernel]            [k] entry_SYSCALL_64_after_swapgs
>>>>>>>>  7.09%  postgres            [.] hash_search_with_hash_value
>>>>>>>>  4.26%  libpthread-2.23.so  [.] llseek
>>>>>>>>  3.64%  libpthread-2.23.so  [.] __read_nocancel
>>>>>>>>  2.80%  [kernel]            [k] __fget_light
>>>>>>>>  2.67%  postgres            [.] 0x000000000034d3ba
>>>>>>>>  1.85%  [kernel]            [k] ext4_llseek
>>>>>>>>  1.84%  postgres            [.] pg_comp_crc32c_sse42
>>>>>>>>  1.44%  postgres            [.] hash_any
>>>>>>>>  1.35%  postgres            [.] 0x000000000036afad
>>>>>>>>  1.29%  postgres            [.] MarkBufferDirty
>>>>>>>>  1.21%  postgres            [.] XLogReadRecord
>>>>>>>>  [...]
>>>>>>>>
>>>>>>>> I tried changing the process limits with prlimit to unlimited, but
>>>>>>>> there was no change.
>>>>>>>>
>>>>>>>> I can turn off WAL compression, but I doubt this is the main culprit.
>>>>>>>> Any ideas appreciated.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Boris
>
> --
> Kind regards,
>
> Ing. Hellmuth I. Vargas S.
> Esp. Telemática y Negocios por Internet
> Oracle Database 10g Administrator Certified Associate
> EnterpriseDB Certified PostgreSQL 9.3 Associate
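PS: since the perf output quoted above has pglz_decompress at the top, here is a minimal sketch of how wal_compression can be turned off without a restart, using the standard ALTER SYSTEM route (nothing specific to our setup):

    -- On the primary: wal_compression is reloadable, no restart needed.
    ALTER SYSTEM SET wal_compression = off;
    SELECT pg_reload_conf();

    -- Verify the running value.
    SHOW wal_compression;

Note that this only affects newly generated WAL records, so the standby still has to decompress the already-written backlog while it catches up.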