Here are some review comments for the patch v21-0003 ====== Commit message
1. pg_upgrade fails if the old node has slots which status is 'lost' or they do not consume all WAL records. These are needed for prevent the data loss. ~ Maybe some minor brush-up like: SUGGESTION In order to prevent data loss, pg_upgrade will fail if the old node has slots with the status 'lost', or with unconsumed WAL records. ====== src/bin/pg_upgrade/check.c 2. check_for_confirmed_flush_lsn + /* Check that all logical slots are not in 'lost' state. */ + res = executeQueryOrDie(conn, + "SELECT slot_name FROM pg_catalog.pg_replication_slots " + "WHERE temporary = false AND wal_status = 'lost';"); + + ntups = PQntuples(res); + i_slotname = PQfnumber(res, "slot_name"); + + for (i = 0; i < ntups; i++) + { + is_error = true; + + pg_log(PG_WARNING, + "\nWARNING: logical replication slot \"%s\" is obsolete.", + PQgetvalue(res, i, i_slotname)); + } + + PQclear(res); + + if (is_error) + pg_fatal("logical replication slots not to be in 'lost' state."); + 2a. (GENERAL) The above code for checking lost state seems out of place in this function which is meant for checking confirmed flush lsn. Maybe you jammed both kinds of logic into one function to save on the extra PGconn or something but IMO two separate functions would be better. e.g. - check_for_lost_slots - check_for_confirmed_flush_lsn ~ 2b. + /* Check that all logical slots are not in 'lost' state. */ SUGGESTION /* Check there are no logical replication slots with a 'lost' state. */ ~ 2c. + res = executeQueryOrDie(conn, + "SELECT slot_name FROM pg_catalog.pg_replication_slots " + "WHERE temporary = false AND wal_status = 'lost';"); This SQL fragment is very much like others in previous patches. Be sure to make all the cases and clauses consistent with all those similar SQL fragments. ~ 2d. + is_error = true; That doesn't need to be in the loop. Better to just say: is_error = (ntups > 0); ~ 2e. There is a mix of terms in the WARNING and in the pg_fatal -- e.g. "obsolete" versus "lost". Is it OK? ~ 2f. + pg_fatal("logical replication slots not to be in 'lost' state."); English? And maybe it should be much more verbose... "Upgrade of this installation is not allowed because one or more logical replication slots with a state of 'lost' were detected." ~~~ 3. check_for_confirmed_flush_lsn + /* + * Check that all logical replication slots have reached the latest + * checkpoint position (SHUTDOWN_CHECKPOINT record). This checks cannot be + * done in case of live_check because the server has not been written the + * SHUTDOWN_CHECKPOINT record yet. + */ + if (!live_check) + { + res = executeQueryOrDie(conn, + "SELECT slot_name FROM pg_catalog.pg_replication_slots " + "WHERE confirmed_flush_lsn != '%X/%X' AND temporary = false;", + old_cluster.controldata.chkpnt_latest_upper, + old_cluster.controldata.chkpnt_latest_lower); + + ntups = PQntuples(res); + i_slotname = PQfnumber(res, "slot_name"); + + for (i = 0; i < ntups; i++) + { + is_error = true; + + pg_log(PG_WARNING, + "\nWARNING: logical replication slot \"%s\" has not consumed WALs yet", + PQgetvalue(res, i, i_slotname)); + } + + PQclear(res); + PQfinish(conn); + + if (is_error) + pg_fatal("All logical replication slots consumed all the WALs."); ~ 3a. /This checks/This check/ ~ 3b. I don't think the separation of chkpnt_latest_upper/chkpnt_latest_lower is needed like this. AFAIK there is an LSN_FORMAT_ARGS(lsn) macro designed for handling exactly this kind of parameter substitution. ~ 3c. + is_error = true; That doesn't need to be in the loop. Better to just say: is_error = (ntups > 0); ~ 3d. + pg_fatal("All logical replication slots consumed all the WALs."); The message seems backward. shouldn't it say something like: "Upgrade of this installation is not allowed because one or more logical replication slots still have unconsumed WAL records." ====== src/bin/pg_upgrade/controldata.c 4. get_control_data + /* + * Upper and lower part of LSN must be read and stored + * separately because it is reported as %X/%X format. + */ + cluster->controldata.chkpnt_latest_upper = + strtoul(p, &slash, 16); + cluster->controldata.chkpnt_latest_lower = + strtoul(++slash, NULL, 16); I felt that this field separation code is maybe not necessary. Please refer to other review comments in this post. ====== src/bin/pg_upgrade/pg_upgrade.h 5. ControlData + + uint32 chkpnt_latest_upper; + uint32 chkpnt_latest_lower; } ControlData; ~ Actually, I did not recognise the reason why this cannot be stored properly as a single XLogRecPtr field. Please see other review comments in this post. ====== .../t/003_logical_replication_slots.pl 6. GENERAL Many of the changes to this file are just renaming the 'old_node'/'new_node' to 'old_publisher'/'new_publisher'. This seems a basic change not really associated with this patch 0003. To reduce the code churn, this change should be moved into the earlier patch where this test file (003_logical_replication_slots.pl) was first introduced, ~~~ 7. # Cause a failure at the start of pg_upgrade because slot do not finish # consuming all the WALs ~ Can you give a more detailed explanation in the comment of how this test case achieves what it says? ====== src/test/regress/sql/misc_functions.sql 8. @@ -236,4 +236,4 @@ SELECT * FROM pg_split_walfile_name('invalid'); SELECT segment_number > 0 AS ok_segment_number, timeline_id FROM pg_split_walfile_name('000000010000000100000000'); SELECT segment_number > 0 AS ok_segment_number, timeline_id - FROM pg_split_walfile_name('ffffffFF00000001000000af'); + FROM pg_split_walfile_name('ffffffFF00000001000000af'); \ No newline at end of file ~ What is this change for? It looks like maybe some accidental whitespace change happened. ------ Kind Regards, Peter Smith. Fujitsu Australia