Yes, as stated, the lag dropped considerably after disabling wal compression; it's at least manageable now. The network is 10Gb.
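For anyone hitting the same problem, a quick way to tell whether a standby is behind on receiving WAL (network) or on replaying it (the single recovery process). This is only a sketch using the same functions as the query quoted below; the pg_stat_replication lag columns assume PostgreSQL 10 or later:

    -- On the standby: gap between WAL received and WAL replayed, in bytes.
    SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                           pg_last_wal_replay_lsn()) AS replay_gap_bytes;

    -- On the primary: per-standby time lag (PostgreSQL 10+).
    SELECT application_name, write_lag, flush_lag, replay_lag
    FROM pg_stat_replication;

A small byte gap combined with a large time lag points at replay speed rather than the network.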
Best regards,

Boris Sagadin
InfoSplet, informacijske tehnologije, d.o.o.
www.infosplet.com | Tel: 0590 / 45 800, GSM: 041 / 337 848

On Wed, Oct 24, 2018 at 3:34 PM, Hellmuth Vargas <hiv...@gmail.com> wrote:

> Hi
>
> On Wed, Oct 24, 2018 at 00:39, Boris Sagadin (bo...@infosplet.com) wrote:
>
>> Yes, times are all identical, set to UTC, ntpd is used.
>>
>>  log_delay
>> -----------
>>  15.788175
>>
>> This is the delay at this moment, but we graph replication delay and it's
>> fluctuating between 0 and 30s.
>
> But the fluctuation is between 0 and 30s, not 4 hours, fortunately.
> Apart from the wal compression issue, I think you should check the network.
>
>> Before I turned off wal compression, the lag was much bigger (from 0 up
>> to 8 minutes). We have lots of tables (40k) and many upserts.
>>
>> Boris
>>
>> On Tue, Oct 23, 2018 at 8:24 PM, Hellmuth Vargas <hiv...@gmail.com> wrote:
>>
>>> Hi
>>>
>>> Are both servers configured with the same date, time, and time zone
>>> settings?
>>>
>>> On Tue, Oct 23, 2018 at 13:16, Hellmuth Vargas (hiv...@gmail.com) wrote:
>>>
>>>> Hi
>>>>
>>>> What result do you get from the following query?
>>>>
>>>> SELECT CASE WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn()
>>>>        THEN 0
>>>>        ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp())
>>>>        END AS log_delay;
>>>>
>>>> Source:
>>>> https://severalnines.com/blog/postgresql-streaming-replication-deep-dive
>>>>
>>>> On Tue, Oct 23, 2018 at 11:28, Boris Sagadin (bo...@infosplet.com) wrote:
>>>>
>>>>> Nothing special, just:
>>>>>
>>>>> standby_mode = 'on'
>>>>> primary_conninfo = 'host=... user=repmgr application_name=nodex'
>>>>> recovery_target_timeline = 'latest'
>>>>>
>>>>> Boris
>>>>>
>>>>> On Tue, Oct 23, 2018 at 3:10 PM, Hellmuth Vargas <hiv...@gmail.com> wrote:
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> Can you share your recovery.conf settings?
>>>>>>
>>>>>> On Tue, Oct 23, 2018 at 00:28, Boris Sagadin (bo...@infosplet.com) wrote:
>>>>>>
>>>>>>> Yes, turning wal_compression off improves things. The slave that was
>>>>>>> mentioned unfortunately lagged too much before this setting was
>>>>>>> applied, so it was turned off. However, the remaining slave lags less
>>>>>>> now, although still occasionally up to a few minutes. I think the
>>>>>>> single-threadedness of recovery is a big slowdown for write-heavy
>>>>>>> databases. Maybe the option to increase the WAL segment size beyond
>>>>>>> 16MB in v11 will help.
>>>>>>>
>>>>>>> In the meantime we'll solve this by splitting the DB into 2 or 3
>>>>>>> clusters, or maybe by trying out a sharding solution like Citus.
>>>>>>>
>>>>>>> Boris
>>>>>>>
>>>>>>> On Sun, Oct 21, 2018 at 9:06 AM, Boris Sagadin <bo...@infosplet.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I have a database running on an i3.8xlarge AWS instance (256GB RAM,
>>>>>>>> 32 CPU cores, 4x 1.9TB NVMe drives) with about 5TB of disk space
>>>>>>>> occupied, ext4, Ubuntu 16.04.
>>>>>>>>
>>>>>>>> Multi-tenant DB with about 40000 tables, insert heavy.
>>>>>>>>
>>>>>>>> I started a new slave with identical HW specs, using streaming
>>>>>>>> replication. The DB started syncing from the master, which took
>>>>>>>> about 4 hours, then it started applying the WALs. However, it seems
>>>>>>>> it can't catch up. The delay is still around 3 hours (measured with
>>>>>>>> now() - pg_last_xact_replay_timestamp()), even a day later. It goes
>>>>>>>> a few hundred seconds up and down, but it seems to float around the
>>>>>>>> 3h mark.
>>>>>>>>
>>>>>>>> Disk I/O is low, at about 10% measured with iostat, there are no
>>>>>>>> connected clients, and the recovery process is at around 90% CPU
>>>>>>>> usage on a single core.
>>>>>>>>
>>>>>>>> I tried tuning various parameters, but to no avail. The only thing I
>>>>>>>> found suspicious is that stracing the recovery process constantly
>>>>>>>> produces many errors such as:
>>>>>>>>
>>>>>>>> lseek(428, 0, SEEK_END)    = 780124160
>>>>>>>> lseek(30, 0, SEEK_END)     = 212992
>>>>>>>> read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily unavailable)
>>>>>>>> lseek(680, 0, SEEK_END)    = 493117440
>>>>>>>> read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily unavailable)
>>>>>>>> lseek(774, 0, SEEK_END)    = 583368704
>>>>>>>>
>>>>>>>> ...[snip]...
>>>>>>>>
>>>>>>>> read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily unavailable)
>>>>>>>> lseek(774, 0, SEEK_END)    = 583368704
>>>>>>>> read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily unavailable)
>>>>>>>> lseek(277, 0, SEEK_END)    = 502882304
>>>>>>>> lseek(6, 516096, SEEK_SET) = 516096
>>>>>>>> read(6, "\227\320\5\0\1\0\0\0\0\340\7\246\26\274\0\0\315\0\0\0\0\0\0\0}\0178\5&/\260\r"..., 8192) = 8192
>>>>>>>> read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily unavailable)
>>>>>>>> lseek(735, 0, SEEK_END)    = 272809984
>>>>>>>> read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily unavailable)
>>>>>>>> lseek(277, 0, SEEK_END)    = 502882304
>>>>>>>>
>>>>>>>> ls -l fd/9
>>>>>>>> lr-x------ 1 postgres postgres 64 Oct 21 06:21 fd/9 -> pipe:[46358]
>>>>>>>>
>>>>>>>> Perf top on the recovery process produces:
>>>>>>>>
>>>>>>>> 27.76%  postgres            [.] pglz_decompress
>>>>>>>>  9.90%  [kernel]            [k] entry_SYSCALL_64_after_swapgs
>>>>>>>>  7.09%  postgres            [.] hash_search_with_hash_value
>>>>>>>>  4.26%  libpthread-2.23.so  [.] llseek
>>>>>>>>  3.64%  libpthread-2.23.so  [.] __read_nocancel
>>>>>>>>  2.80%  [kernel]            [k] __fget_light
>>>>>>>>  2.67%  postgres            [.] 0x000000000034d3ba
>>>>>>>>  1.85%  [kernel]            [k] ext4_llseek
>>>>>>>>  1.84%  postgres            [.] pg_comp_crc32c_sse42
>>>>>>>>  1.44%  postgres            [.] hash_any
>>>>>>>>  1.35%  postgres            [.] 0x000000000036afad
>>>>>>>>  1.29%  postgres            [.] MarkBufferDirty
>>>>>>>>  1.21%  postgres            [.] XLogReadRecord
>>>>>>>>  [...]
>>>>>>>>
>>>>>>>> I tried changing the process limits with prlimit to unlimited, but
>>>>>>>> there was no change.
>>>>>>>>
>>>>>>>> I can turn off WAL compression, but I doubt this is the main culprit.
>>>>>>>> Any ideas appreciated.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Boris
>
> --
> Kind regards,
>
> Ing. Hellmuth I. Vargas S.
> Esp. Telemática y Negocios por Internet
> Oracle Database 10g Administrator Certified Associate
> EnterpriseDB Certified PostgreSQL 9.3 Associate
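PS: since the perf output quoted above has pglz_decompress at the top, here is a minimal sketch of how wal_compression can be turned off without a restart, using the standard ALTER SYSTEM route (nothing specific to our setup):

    -- On the primary: wal_compression is reloadable, no restart needed.
    ALTER SYSTEM SET wal_compression = off;
    SELECT pg_reload_conf();

    -- Verify the running value.
    SHOW wal_compression;

Note that this only affects newly generated WAL records, so the standby still has to decompress the already-written backlog while it catches up.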