Hello: My customer encountered some connection timeout, while using one primary-one standby streaming replication.
The original log is japanese, because there are no error-code like oracle's ora-xxx, I tried to translate the japanese information into English, But that might be not correct English for PG. The most important part is: 2013-09-22 09:52:47 JST[28297][51d1fbcb.6e89-2][0][XX000]FATAL: Could not receive data from WAL stream: could not receive data from server: connection timeout scp: /opt/PostgresPlus/9.2AS/data/arch/000000AC000001F10000004A: No such file or directory I was asked about: In what occasion will the above fatal error occur? I looked into the postgresql.conf file for the primary and standby server. And made some experiences. I found: Senario I: If the wal file wanted is removed manually: Both in primary and standby, log will be like this: FATAL: could not receive data from WAL stream: FATAL: requested WAL segment 000000010000000000000011 has already been removed Senario II: If replication_timeout=60, wal_receiver_status_interval=0, log will be like this: On primary side: LOG: terminating walsender process due to replication timeout On standby side: FATAL: could not receive data from WAL stream: LOG: streaming replication successfully connected to primary But they are not the same to my customer's case. And I also noticed that my customer's recovery.conf: In it it has primary_conninfo as this: host=DB1 port=5432 application_name=testpg user=postgres connect_timeout=10 keepalives_idle=5 keepalives_interval=1 That is to say: keepalives_idle=5 keepalives_interval=1 keepalives_count=3 With that setting, I think that For every 5 seconds, In order to see whether the connection is alive, 3 keepalives package will be sent, And if there is no repsonse in 1 seconds, then he connection will be invalid. If my unserstanding is right, when master is busy at network, the three keepalives_ parameter will also cause the connection time out error. But I haven't found a good explanation fitting the logs' FATAL error. Can anybody give me some info? 2013-09-22 09:40:42 JST[27987][412d33c5.2d23-95024][0][00000]LOG: Recovery replay at point 1F1/45EF8E20 2013-09-22 09:40:42 JST[27987][412d33c5.2d23-95025][0][00000]DETAIL: Last transaction log time 2013-09-22 09:40:42.673062+09 2013-09-22 09:44:00 JST[27987][412d33c5.2d23-95026][0][00000]LOG: restartpoint starting: time 2013-09-22 09:45:45 JST[27987][412d33c5.2d23-95027][0][00000]LOG: restartpoint complete: wrote 1051 buffers (0.3%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=104.773 s, sync=0.003 s, total=104.781 s; sync files=90, longest=0.001 s, average=0.000 s 2013-09-22 09:45:45 JST[27987][412d33c5.2d23-95028][0][00000]LOG: Recovery replay at point 1F1/45EF8E20 2013-09-22 09:45:45 JST[27987][412d33c5.2d23-95029][0][00000]DETAIL: Last transaction log time 2013-09-22 09:45:44.77683+09 2013-09-22 09:49:00 JST[27987][412d33c5.2d23-95030][0][00000]LOG: restartpoint starting: time 2013-09-22 09:50:36 JST[27987][412d33c5.2d23-95031][0][00000]LOG: restartpoint complete: wrote 963 buffers (0.3%); 0 transaction log file(s) added, 0 removed, 2 recycled; write=96.096 s, sync=0.006 s, total=96.106 s; sync files=98, longest=0.002 s, average=0.000 s 2013-09-22 09:50:36 JST[27987][412d33c5.2d23-95032][0][00000]LOG: Recovery replay at point 1F1/46CE8EA0 2013-09-22 09:50:36 JST[27987][412d33c5.2d23-95033][0][00000]DETAIL: Last transaction log time 2013-09-22 09:50:36.564966+09 2013-09-22 09:52:47 JST[28297][51d1fbcb.6e89-2][0][XX000]FATAL: Could not receive data from WAL stream: could not receive data from server: connection timeout scp: /opt/PostgresPlus/9.2AS/data/arch/000000AC000001F10000004A: No such file or directory 2013-09-22 09:52:50 JST[27806][51d1fbc5.6c9e-12][0][00000]LOG: Log file 497、Segment 74、Offset 12263424 Page address 1F0/8DBB2000 is not expected scp: /opt/PostgresPlus/9.2AS/data/arch/000000AC000001F10000004A: No such file or directory scp: /opt/PostgresPlus/9.2AS/data/arch/000000AD.history: No such file or directory 2013-09-22 09:52:51 JST[2878][523e3f63.b3e-1][0][00000]LOG: Streaming replication connected to primary server successfully 2013-09-22 09:54:00 JST[27987][412d33c5.2d23-95034][0][00000]LOG: restartpoint starting: time 2013-09-22 09:56:30 JST[27987][412d33c5.2d23-95035][0][00000]LOG: restartpoint complete: wrote 9929 buffers (3.2%); 0 transaction log file(s) added, 0 removed, 10 recycled; write=149.436 s, sync=0.032 s, total=149.472 s; sync files=90, longest=0.012 s, average=0.000 s 2013-09-22 09:56:30 JST[27987][412d33c5.2d23-95036][0][00000]LOG: Recovery replay at point 1F1/48E9D238 2013-09-22 09:56:30 JST[27987][412d33c5.2d23-95037][0][00000]DETAIL: Last transaction log time 2013-09-22 09:56:29.015514+09 Thanks in advance jian gao