The replicator is OK and so is the replica.

%Cpu(s): 0.2 us, 1.0 sy, 0.0 ni, 94.8 id, 4.0 wa, 0.0 hi, 0.0 si, 0.0 st

CPU is really low on both. I am running pg_basebackup again every time. Any other suggestions?
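One thing I can try on the next attempt (only a sketch; the slot name below is made up): pre-create a permanent physical replication slot on the upstream and point pg_basebackup at it with --slot, instead of relying on the temporary slot it creates, so the upstream keeps the WAL the backup needs even if the stream connection blips. I'm not sure it addresses the timeout itself, but it removes one variable:

    -- run on the server the backup is taken from; 'dallas_rebuild' is only a placeholder
    SELECT pg_create_physical_replication_slot('dallas_rebuild');

    -- confirm it exists, and remember to drop it once the standby is rebuilt,
    -- otherwise it will retain WAL indefinitely
    SELECT slot_name, active, restart_lsn FROM pg_replication_slots;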
On Wed, Mar 11, 2020 at 11:13 PM Adrian Klaver <adrian.kla...@aklaver.com> wrote:

> On 3/11/20 2:12 PM, Nicola Contu wrote:
> > CPU load on the server to be built? No.
>
> CPU load, I/O load on the servers in the replication chain.
>
> Basically you just recently, it seems, imposed extra overhead on the
> process by encrypting/decrypting. From what I gather from an earlier post,
> your replication then started failing. I am trying to establish whether
> the former is causing the latter.
>
> Not sure what you are referring to in regards to the server to be built?
>
> > System logs don't show anything relevant, unfortunately.
> >
> > On Wed, Mar 11, 2020, 21:34 Adrian Klaver <adrian.kla...@aklaver.com> wrote:
> >
> > > On 3/11/20 11:59 AM, Nicola Contu wrote:
> > > > I am actually cascading.
> > > > The master is in nyh, the first slave is in Dallas, and the one having
> > > > problems is in Dallas as well, on the same switch as the one replicating
> > > > from the master.
> > > >
> > > > It always worked; not sure what is wrong now. We just encrypted the
> > > > disks on all servers.
> > >
> > > Do you have before and after on CPU load, I/O throughput?
> > >
> > > Do system logs show anything relevant during the replication drop out?
> > >
> > > > On Wed, Mar 11, 2020, 18:57 Adrian Klaver <adrian.kla...@aklaver.com> wrote:
> > > >
> > > > > On 3/11/20 2:54 AM, Nicola Contu wrote:
> > > > > > These are the lines before:
> > > > > >
> > > > > > 2020-03-11 09:05:08 GMT [127.0.0.1(40214)] [43853]: [1-1] db=cmdv3,user=zabbix_check ERROR: recovery is in progress
> > > > > > 2020-03-11 09:05:08 GMT [127.0.0.1(40214)] [43853]: [2-1] db=cmdv3,user=zabbix_check HINT: WAL control functions cannot be executed during recovery.
> > > > > > 2020-03-11 09:05:08 GMT [127.0.0.1(40214)] [43853]: [3-1] db=cmdv3,user=zabbix_check STATEMENT: select greatest(0,pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) from pg_stat_replication where client_addr ='10.150.20.22'
> > > > > >
> > > > > > That query is made by Zabbix. So I stopped the zabbix agent and
> > > > > > tested again, but it is still failing, now because of this:
> > > > > >
> > > > > > pg_basebackup: starting background WAL receiver
> > > > > > pg_basebackup: created temporary replication slot "pg_basebackup_51199"
> > > > > > pg_basebackup: could not receive data from WAL stream: SSL SYSCALL error: EOF detected
> > > > > > ^C4699810/504983062 kB (70%), 0/1 tablespace (...ql11/data/base/16401/231363544.2)
> > > > >
> > > > > So you started over with a pg_basebackup?
> > > > >
> > > > > Also from below:
> > > > >
> > > > > 2020-03-11 09:43:53 GMT [10.150.20.22(54906)] [51199]: [1-1] db=[unknown],user=replicator LOG: terminating walsender process due to replication timeout
> > > > >
> > > > > Where are the master and standby in relation to each other, network wise?
> > > > >
> > > > > Intervening firewalls, network latency issues?
> > > > > > Here is the full log starting right before the last try:
> > > > > >
> > > > > > 2020-03-11 09:22:44 GMT [] [12598]: [4508-1] db=,user= LOG: restartpoint complete: wrote 19565 buffers (0.2%); 0 WAL file(s) added, 0 removed, 7 recycled; write=270.014 s, sync=0.009 s, total=270.036 s; sync files=804, longest=0.001 s, average=0.000 s; distance=131239 kB, estimate=725998 kB
> > > > > > 2020-03-11 09:22:44 GMT [] [12598]: [4509-1] db=,user= LOG: recovery restart point at 643A/D8C05F70
> > > > > > 2020-03-11 09:22:44 GMT [] [12598]: [4510-1] db=,user= DETAIL: Last completed transaction was at log time 2020-03-11 09:22:44.050084+00.
> > > > > > 2020-03-11 09:23:14 GMT [] [12598]: [4511-1] db=,user= LOG: restartpoint starting: time
> > > > > > 2020-03-11 09:27:44 GMT [] [12598]: [4512-1] db=,user= LOG: restartpoint complete: wrote 17069 buffers (0.2%); 0 WAL file(s) added, 0 removed, 17 recycled; write=269.879 s, sync=0.006 s, total=269.902 s; sync files=811, longest=0.001 s, average=0.000 s; distance=120469 kB, estimate=665445 kB
> > > > > > 2020-03-11 09:27:44 GMT [] [12598]: [4513-1] db=,user= LOG: recovery restart point at 643A/E01AB438
> > > > > > 2020-03-11 09:27:44 GMT [] [12598]: [4514-1] db=,user= DETAIL: Last completed transaction was at log time 2020-03-11 09:27:43.945485+00.
> > > > > > 2020-03-11 09:27:44 GMT [] [12598]: [4515-1] db=,user= LOG: restartpoint starting: force wait
> > > > > > 2020-03-11 09:29:24 GMT [10.222.8.2(47834)] [50961]: [1-1] db=cmdv3,user=nis LOG: duration: 1402.004 ms statement: SELECT id, name, parent_id, parent, short_name, sales_rep_id FROM mmx_clients;
> > > > > > 2020-03-11 09:29:34 GMT [10.222.8.2(47834)] [50961]: [2-1] db=cmdv3,user=nis LOG: duration: 9493.259 ms statement: SELECT slid, gnid, sof_id, client_id, product FROM mmx_slids;
> > > > > > 2020-03-11 09:32:14 GMT [] [12598]: [4516-1] db=,user= LOG: restartpoint complete: wrote 71260 buffers (0.8%); 0 WAL file(s) added, 0 removed, 13 recycled; write=269.953 s, sync=0.012 s, total=269.979 s; sync files=760, longest=0.002 s, average=0.000 s; distance=123412 kB, estimate=611242 kB
> > > > > > 2020-03-11 09:32:14 GMT [] [12598]: [4517-1] db=,user= LOG: recovery restart point at 643A/E7A30498
> > > > > > 2020-03-11 09:32:14 GMT [] [12598]: [4518-1] db=,user= DETAIL: Last completed transaction was at log time 2020-03-11 09:32:13.916101+00.
> > > > > > 2020-03-11 09:32:44 GMT [] [12598]: [4519-1] db=,user= LOG: restartpoint starting: time
> > > > > > 2020-03-11 09:37:14 GMT [] [12598]: [4520-1] db=,user= LOG: restartpoint complete: wrote 27130 buffers (0.3%); 0 WAL file(s) added, 0 removed, 12 recycled; write=270.026 s, sync=0.007 s, total=270.052 s; sync files=814, longest=0.001 s, average=0.000 s; distance=280595 kB, estimate=578177 kB
> > > > > > 2020-03-11 09:37:14 GMT [] [12598]: [4521-1] db=,user= LOG: recovery restart point at 643A/F8C351C8
> > > > > > 2020-03-11 09:37:14 GMT [] [12598]: [4522-1] db=,user= DETAIL: Last completed transaction was at log time 2020-03-11 09:37:14.067443+00.
> > > > > > 2020-03-11 09:37:44 GMT [] [12598]: [4523-1] db=,user= LOG: restartpoint starting: time
> > > > > > 2020-03-11 09:42:14 GMT [] [12598]: [4524-1] db=,user= LOG: restartpoint complete: wrote 26040 buffers (0.3%); 0 WAL file(s) added, 0 removed, 9 recycled; write=269.850 s, sync=0.019 s, total=269.886 s; sync files=834, longest=0.002 s, average=0.000 s; distance=236392 kB, estimate=543999 kB
> > > > > > 2020-03-11 09:42:14 GMT [] [12598]: [4525-1] db=,user= LOG: recovery restart point at 643B/730F3F8
> > > > > > 2020-03-11 09:42:14 GMT [] [12598]: [4526-1] db=,user= DETAIL: Last completed transaction was at log time 2020-03-11 09:42:13.900088+00.
> > > > > > 2020-03-11 09:42:44 GMT [] [12598]: [4527-1] db=,user= LOG: restartpoint starting: time
> > > > > > 2020-03-11 09:43:53 GMT [10.150.20.22(54906)] [51199]: [1-1] db=[unknown],user=replicator LOG: terminating walsender process due to replication timeout
> > > > > > 2020-03-11 09:47:14 GMT [] [12598]: [4528-1] db=,user= LOG: restartpoint complete: wrote 20966 buffers (0.2%); 0 WAL file(s) added, 0 removed, 9 recycled; write=270.048 s, sync=0.014 s, total=270.085 s; sync files=852, longest=0.001 s, average=0.000 s; distance=183749 kB, estimate=507974 kB
> > > > > > 2020-03-11 09:47:14 GMT [] [12598]: [4529-1] db=,user= LOG: recovery restart point at 643B/12680A80
> > > > > > 2020-03-11 09:47:14 GMT [] [12598]: [4530-1] db=,user= DETAIL: Last completed transaction was at log time 2020-03-11 09:47:14.069731+00.
> > > > > > 2020-03-11 09:47:44 GMT [] [12598]: [4531-1] db=,user= LOG: restartpoint starting: time
> > > > > >
> > > > > > On Wed, Mar 11, 2020 at 1:53 AM Adrian Klaver <adrian.kla...@aklaver.com> wrote:
> > > > > >
> > > > > > > On 3/10/20 8:17 AM, Nicola Contu wrote:
> > > > > > >
> > > > > > > Please post to list also. Ccing list.
> > > > > > >
> > > > > > > What came immediately before the temporary file error?
> > > > > > > > 2020-03-10 15:10:17 GMT [[local]] [28171]: [1-1] db=postgres,user=postgres LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp28171.0", size 382474936
> > > > > > > > 2020-03-10 15:10:17 GMT [[local]] [28171]: [4-1] db=postgres,user=postgres LOG: could not send data to client: Broken pipe
> > > > > > > > 2020-03-10 15:10:17 GMT [[local]] [28171]: [5-1] db=postgres,user=postgres FATAL: connection to client lost
> > > > > > > > 2020-03-10 15:10:26 GMT [] [12598]: [3544-1] db=,user= LOG: restartpoint complete: wrote 37315 buffers (0.4%); 0 WAL file(s) added, 0 removed, 16 recycled; write=269.943 s, sync=0.039 s, total=269.999 s; sync files=1010, longest=0.001 s, average=0.000 s; distance=175940 kB, estimate=416149 kB
> > > > > > > > 2020-03-10 15:10:26 GMT [] [12598]: [3545-1] db=,user= LOG: recovery restart point at 6424/1D7DEDE8
> > > > > > > >
> > > > > > > > It is a cascade replication.
> > > > > > > >
> > > > > > > > On Tue, Mar 10, 2020 at 3:58 PM Adrian Klaver <adrian.kla...@aklaver.com> wrote:
> > > > > > > >
> > > > > > > > > On 3/10/20 2:26 AM, Nicola Contu wrote:
> > > > > > > > > > Hello,
> > > > > > > > > > I have two servers connected to the same switch running postgres 11.5.
> > > > > > > > > >
> > > > > > > > > > I am trying to replicate one of those servers after planned work on
> > > > > > > > > > the master, so the replica has been lost. It has always worked, but
> > > > > > > > > > now I get this:
> > > > > > > > > >
> > > > > > > > > > pg_basebackup: could not receive data from WAL stream: server closed the connection unexpectedly
> > > > > > > > > >         This probably means the server terminated abnormally before or while processing the request.
> > > > > > > > > >
> > > > > > > > > > I don't really understand what the issue is.
> > > > > > > > >
> > > > > > > > > I would start with the logs from the Postgres server you are taking
> > > > > > > > > the backup from.
> > > > > > > > >
> > > > > > > > > > I had this issue last week as well in another DC, and I had to reboot
> > > > > > > > > > the slave to make it work (not sure why it helped).
> > > > > > > > > >
> > > > > > > > > > Do you know what can cause this?
> > > > > > > > > >
> > > > > > > > > > Thank you,
> > > > > > > > > > Nicola
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Adrian Klaver
> > > > > > > > > adrian.kla...@aklaver.com
> > > > > > >
> > > > > > > --
> > > > > > > Adrian Klaver
> > > > > > > adrian.kla...@aklaver.com
> > > > >
> > > > > --
> > > > > Adrian Klaver
> > > > > adrian.kla...@aklaver.com
> > >
> > > --
> > > Adrian Klaver
> > > adrian.kla...@aklaver.com
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
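A couple of follow-up notes on the errors in the quoted logs above; both are only sketches, not something tested here.

The Zabbix check fails because pg_current_wal_lsn() cannot be executed while the server is in recovery. On a cascading standby the same lag check could presumably be based on the last received position instead, something like:

    -- untested variant of the Zabbix query for a server that is itself in recovery
    SELECT greatest(0, pg_wal_lsn_diff(pg_last_wal_receive_lsn(), replay_lsn))
    FROM pg_stat_replication
    WHERE client_addr = '10.150.20.22';

And since the upstream is terminating the walsender with a replication timeout, one knob to experiment with while investigating the encryption overhead (the value below is arbitrary, not a recommendation) is wal_sender_timeout on the server feeding the standby being rebuilt:

    -- wal_sender_timeout can be changed with a reload; no restart needed
    ALTER SYSTEM SET wal_sender_timeout = '5min';
    SELECT pg_reload_conf();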