On 2017-02-28 07:38, Erik Rijkers wrote:
On 2017-02-27 15:08, Petr Jelinek wrote:

0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch + 0002-Fix-after-trigger-execution-in-logical-replication.patch + 0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch + snapbuild-v4-0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch +
snapbuild-v4-0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch+
snapbuild-v4-0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch + snapbuild-v4-0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch + snapbuild-v4-0005-Skip-unnecessary-snapshot-builds.patch +
0001-Logical-replication-support-for-initial-data-copy-v6.patch

This is the most frequent error that happens while doing pgbench-runs over logical replication: I run it continuously all day, and every few hours an error occurs of the kind seen below: a table (pgbench_history, mostly) ends up 1 row short (673466 instead of 673467). I have the script wait a long time before calling it an error (because in theory it could still 'finish', and end successfully (although that has not happened yet, once the system got into this state).

-- pgbench -c 16 -j 8 -T 120 -P 24 -n -M simple  --  scale 25

[...]
6972 a,b,t,h: 2500000 25 250 673467 e53236c09 643235708 f952814c3 559d618cd master 6973 a,b,t,h: 2500000 25 250 673466 e53236c09 643235708 f952814c3 4b09337e3 replica NOK a22fb00a6
-- wait another 5 s   (total 20 s) (unchanged 1)
-- getting md5 (cb)
6972 a,b,t,h: 2500000 25 250 673467 e53236c09 643235708 f952814c3 559d618cd master 6973 a,b,t,h: 2500000 25 250 673466 e53236c09 643235708 f952814c3 4b09337e3 replica NOK a22fb00a6
-- wait another 5 s   (total 25 s) (unchanged 2)
-- getting md5 (cb)
6972 a,b,t,h: 2500000 25 250 673467 e53236c09 643235708 f952814c3 559d618cd master 6973 a,b,t,h: 2500000 25 250 673466 e53236c09 643235708 f952814c3 4b09337e3 replica NOK a22fb00a6
-- wait another 5 s   (total 30 s) (unchanged 3)
-- getting md5 (cb)
6972 a,b,t,h: 2500000 25 250 673467 e53236c09 643235708 f952814c3 559d618cd master 6973 a,b,t,h: 2500000 25 250 673466 e53236c09 643235708 f952814c3 4b09337e3 replica NOK a22fb00a6
-- wait another 5 s   (total 35 s) (unchanged 4)


I gathered some info in this (proabably deadlocked) state in the hope there is something suspicious in there:


UID         PID   PPID  C STIME TTY      STAT   TIME CMD
rijkers 71203 1 0 20:06 pts/57 S 0:00 postgres -D /var/data1/pg_stuff/pg_installations/pgsql.logical_replication2/data -p 6973 -c wal_level=replica [...] rijkers 71214 71203 0 20:06 ? Ss 0:00 \_ postgres: logger process rijkers 71216 71203 0 20:06 ? Ss 0:00 \_ postgres: checkpointer process rijkers 71217 71203 0 20:06 ? Ss 0:00 \_ postgres: writer process rijkers 71218 71203 0 20:06 ? Ss 0:00 \_ postgres: wal writer process rijkers 71219 71203 0 20:06 ? Ss 0:00 \_ postgres: autovacuum launcher process rijkers 71220 71203 0 20:06 ? Ss 0:00 \_ postgres: stats collector process rijkers 71221 71203 0 20:06 ? Ss 0:00 \_ postgres: bgworker: logical replication launcher rijkers 71222 71203 0 20:06 ? Ss 0:00 \_ postgres: bgworker: logical replication worker 30042

rijkers 71201 1 0 20:06 pts/57 S 0:00 postgres -D /var/data1/pg_stuff/pg_installations/pgsql.logical_replication/data -p 6972 -c wal_level=logical [...] rijkers 71206 71201 0 20:06 ? Ss 0:00 \_ postgres: logger process rijkers 71208 71201 0 20:06 ? Ss 0:00 \_ postgres: checkpointer process rijkers 71209 71201 0 20:06 ? Ss 0:00 \_ postgres: writer process rijkers 71210 71201 0 20:06 ? Ss 0:00 \_ postgres: wal writer process rijkers 71211 71201 0 20:06 ? Ss 0:00 \_ postgres: autovacuum launcher process rijkers 71212 71201 0 20:06 ? Ss 0:00 \_ postgres: stats collector process rijkers 71213 71201 0 20:06 ? Ss 0:00 \_ postgres: bgworker: logical replication launcher rijkers 71223 71201 0 20:06 ? Ss 0:00 \_ postgres: wal sender process rijkers [local] idle




-- replica:
 port | shared_buffers | work_mem | m_w_m | e_c_s
------+----------------+----------+-------+-------
 6973 | 100MB          | 50MB     | 2GB   | 64GB
(1 row)

select  current_setting('port') as port
, datname                               as db
,  to_char(pg_database_size(datname), '9G999G999G999G999')
|| ' (' || pg_size_pretty(pg_database_size(datname)) || ')' as dbsize
, pid
, application_name                      as app
, xact_start
, query_start
, regexp_replace( cast(now() - query_start as text), E'\.[[:digit:]]+\$', '') as duration
, state
, wait_event_type                               as wt_evt_type
, wait_event                                    as wt_evt
from pg_stat_activity
where   query !~ 'pg_stat_activity'
 -- and query != '<IDLE>'
port | db | dbsize | pid | app | xact_start | query_start | duration | state | wt_evt_type | wt_evt
------+--------+-----------------------------+-------+----------------------------------+------------+-------------+----------+-------+-------------+------------------
6973 | testdb | 817,243,955 (779 MB) | 71222 | logical replication worker 30042 | | | | idle | Activity | LogicalApplyMain
(1 row)

select * from pg_locks;
locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid | mode | granted | fastpath
------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-------+-----------------+---------+----------
relation | 16384 | 11547 | | | | | | | | 4/568 | 77713 | AccessShareLock | t | t virtualxid | | | | | 4/568 | | | | | 4/568 | 77713 | ExclusiveLock | t | t
(2 rows)

select * from pg_stat_subscription;
subid | subname | pid | relid | received_lsn | last_msg_send_time | last_msg_receipt_time | latest_end_lsn | latest_end_time
-------+---------+-------+-------+--------------+-------------------------------+-------------------------------+----------------+-------------------------------
30042 | sub1 | 71222 | | 29/6C91C980 | 2017-02-28 20:11:50.818774+01 | 2017-02-28 20:11:50.818845+01 | 29/6C91C980 | 2017-02-28 20:11:50.818774+01
(1 row)

select * from pg_subscription;
subdbid | subname | subowner | subenabled | subconninfo | subslotname | subpublications
---------+---------+----------+------------+-------------+-------------+-----------------
16384 | sub1 | 10 | t | port=6972 | sub1 | {pub1}
(1 row)

select * from pg_subscription_rel;
 srsubid | srrelid | srsubstate | srsublsn
---------+---------+------------+----------
   30042 |   30022 | r          |
   30042 |   30019 | r          |
   30042 |   30030 | r          |
   30042 |   30025 | r          |
(4 rows)

-- master
\c - - - 6972
select pid , usesysid /* , usename */ , application_name /* , client_addr */ /* , client_hostname */ /* , client_port */ , backend_start , backend_xmin , state , sent_location , write_location , flush_location , replay_location , sync_priority , sync_state from pg_stat_replication; pid | usesysid | application_name | backend_start | backend_xmin | state | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state
-------+----------+------------------+-----------------------------+--------------+-----------+---------------+----------------+----------------+-----------------+---------------+------------
71223 | 10 | sub1 | 2017-02-28 20:06:50.3357+01 | | streaming | 29/88BC9BB0 | 29/88BC9BB0 | 29/88BC9BB0 | 29/88BC9BB0 | 0 | async
(1 row)



Erik Rijkers


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to