To the wonderful BDR folks, I have a 5 node pg+bdr cluster that I've spun up. I have verified that each node can connect via psql to each other node. I am having an issue creating the bdr group and having the nodes join each other.
The steps I am taking are as follows (I used pg compiled using getting started guide instructions): 1. initdb 2. start postgres 3. create database pgbdr 4. create extension btree_gist (actually I'm doing this on template1 for my purposes) 5. create extension bdr 6. on node 0 I am doing: SELECT bdr.bdr_group_create( local_node_name := 'pgbdr0', node_external_dsn := 'host=10.244.2.2 port=5432 user=postgres dbname=pgbdr' ); 7. on nodes 1-4 I am doing (adjusted for the nodes IP): SELECT bdr.bdr_group_join( local_node_name := 'pgbdr1', node_external_dsn := 'host=10.244.2.6 port=5432 user=postgres dbname=pgbdr', join_using_dsn := 'host=10.244.2.2 port=5432 user=postgres dbname=pgbdr' ); Sometimes this works and all nodes are joined, but it isn't consistent and I'm not sure at the moment how to debug why. >From the most recent attempt on node 0: pgbdr=# SELECT * FROM bdr.bdr_nodes; node_sysid | node_timeline | node_dboid | node_status | node_name | node_local_dsn | node_init_from_dsn ---------------------+---------------+------------+-------------+-----------+------------------------------------------------------+-------------------- 6147857338297172162 | 1 | 16908 | r | rdpg0 | host=10.244.2.2 port=5432 user=postgres dbname=pgbdr | (1 row) >From the most recent attempt on node 1: pgbdr=# SELECT * FROM bdr.bdr_nodes; node_sysid | node_timeline | node_dboid | node_status | node_name | node_local_dsn | node_init_from_d sn ---------------------+---------------+------------+-------------+-----------+------------------------------------------------------+---------------------------------- -------------------- 6147857345042854082 | 1 | 16908 | i | rdpg1 | host=10.244.2.6 port=5432 user=postgres dbname=pgbdr | host=10.244.2.2 port=5432 user=po stgres dbname=pgbdr (1 row) I have pasted some relevant logs from node 0 further below. Even if I use 'SELECT bdr.bdr_node_join_wait_for_ready();' it will usually just hang. How can I debug why the nodes are not connecting to each other find out where they getting stuck trying to join? ~Wayne Wayne E. Seguin wayneeseg...@gmail.com wayneeseguin on irc.freenode.net http://twitter.com/wayneeseguin/ https://github.com/wayneeseguin/ d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init DEBUG: 00000: received replication command: IDENTIFY_SYSTEM [171/773] d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION: exec_replication_command, walsender.c:1291 d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init DEBUG: 00000: received replication command: CREATE_REPLICATION_SLOT "bdr_16908_6147857345042854082_1_16908__" LOGICAL bdr d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION: exec_replication_command, walsender.c:1291 d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init DEBUG: 00000: searching for logical decoding starting point, starting at 0/18E87A0 d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION: DecodingContextFindStartpoint, logical.c:469 d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOG: 00000: logical decoding found consistent point at 0/18E87A0 d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init DETAIL: There are no running transactions. d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION: SnapBuildFindSnapshot, snapbuild.c:1271 d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOG: 00000: exported logical decoding snapshot: "0000071E-1" with 0 transaction IDs d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION: SnapBuildExportSnapshot, snapbuild.c:604 d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DEBUG: 00000: received replication command: IDENTIFY_SYSTEM d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION: exec_replication_command, walsender.c:1291 d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DEBUG: 00000: received replication command: CREATE_REPLICATION_SLOT "bdr_16908_6147857344392790209_1_16908__" LOGICAL bdr d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION: exec_replication_command, walsender.c:1291 d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DEBUG: 00000: searching for logical decoding starting point, starting at 0/18EC970 d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION: DecodingContextFindStartpoint, logical.c:469 d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOG: 00000: logical decoding found initial starting point at 0/18EC970 d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DETAIL: 1 transaction needs to finish. d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION: SnapBuildFindSnapshot, snapbuild.c:1338 d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DEBUG: 00000: received replication command: IDENTIFY_SYSTEM d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION: exec_replication_command, walsender.c:1291 d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DEBUG: 00000: received replication command: CREATE_REPLICATION_SLOT "bdr_16908_6147857340000055490_1_16908__" LOGICAL bdr d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION: exec_replication_command, walsender.c:1291 d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOG: 08006: unexpected EOF on client connection with an open transaction d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION: SocketBackend, postgres.c:344 d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DEBUG: 00000: searching for logical decoding starting point, starting at 0/18EC9F0 d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION: DecodingContextFindStartpoint, logical.c:469 d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOG: 00000: logical decoding found initial starting point at 0/18EC9F0 d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DETAIL: 1 transaction needs to finish. d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION: SnapBuildFindSnapshot, snapbuild.c:1338 d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOG: 00000: logical decoding found consistent point at 0/18ECA30 d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DETAIL: Transaction ID 1822 finished; no more running transactions. d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION: SnapBuildEndTxn, snapbuild.c:907 d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOG: 00000: logical decoding found consistent point at 0/18ECA30 d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DETAIL: Transaction ID 1822 finished; no more running transactions. d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION: SnapBuildEndTxn, snapbuild.c:907 d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOG: 00000: exported logical decoding snapshot: "00000720-1" with 0 transaction IDs d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION: SnapBuildExportSnapshot, snapbuild.c:604 d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOG: 00000: exported logical decoding snapshot: "0000071F-1" with 1 transaction ID d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION: SnapBuildExportSnapshot, snapbuild.c:604 d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DEBUG: 00000: received replication command: IDENTIFY_SYSTEM d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION: exec_replication_command, walsender.c:1291 d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DEBUG: 00000: received replication command: CREATE_REPLICATION_SLOT "bdr_16908_6147857340923527359_1_16908__" LOGICAL bdr d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION: exec_replication_command, walsender.c:1291 d= p=392 a=DEBUG: 00000: started 0 elections d= p=392 a=LOCATION: bdr_sequencer_start_elections, bdr_seq.c:688 d= p=392 a=DEBUG: 00000: started 0 votes d= p=392 a=LOCATION: bdr_sequencer_vote, bdr_seq.c:629 d= p=392 a=DEBUG: 00000: tallied 0 elections d= p=392 a=LOCATION: bdr_sequencer_tally, bdr_seq.c:755 d= p=392 a=DEBUG: 00000: checked 0 sequences for filling d= p=392 a=LOCATION: bdr_sequencer_fill_sequences, bdr_seq.c:1034 d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOG: 08006: unexpected EOF on client connection with an open transaction d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION: SocketBackend, postgres.c:344 d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOG: 08006: unexpected EOF on client connection with an open transaction d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION: SocketBackend, postgres.c:344 d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DEBUG: 00000: searching for logical decoding starting point, starting at 0/18ECA60 d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION: DecodingContextFindStartpoint, logical.c:469 d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOG: 00000: logical decoding found initial starting point at 0/18ECA60 d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DETAIL: 2 transactions need to finish. d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION: SnapBuildFindSnapshot, snapbuild.c:1338 d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOG: 00000: logical decoding found consistent point at 0/18ECAD0 d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DETAIL: Transaction ID 1823 finished; no more running transactions. d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION: SnapBuildEndTxn, snapbuild.c:907 d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOG: 00000: exported logical decoding snapshot: "00000721-1" with 0 transaction IDs d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION: SnapBuildExportSnapshot, snapbuild.c:604 d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOG: 08006: unexpected EOF on client connection with an open transaction d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION: SocketBackend, postgres.c:344