Hi,

On 2020-01-16 14:30:12 +0300, Michail Nikolaev wrote:
> First thing we need to consider - checksums and wal_log_hints are
> widely used these days. So, at any moment master could send FPW page
> with new "killed tuples" hints and overwrite hints set by standby.
> Moreover it is not possible to distinguish hints are set by primary or
> standby.
Note that the FPIs are only going to be sent for the first write to a
page after a checkpoint. I don't think you're suggesting we rely on
them for correctness (this far into the email at least), but it still
seems worthwhile to point out.

> And there is where hot_standby_feedback comes to play. Master node
> considers xmin of hot_standy_feedback replicas (RecentGlobalXmin)
> while setting "killed tuples" bits. So, if hot_standby_feedback is
> enabled on standby for a while - it could safely trust hint bits from
> master.

Well, not easily. There's no guarantee that the node it reports
hot_standby_feedback to is actually the primary. It could be a
cascading replica setup that doesn't report hot_standby_feedback
upstream. Also, hot_standby_feedback only takes effect while a
streaming connection is active; if that is even temporarily
interrupted, the primary will lose all knowledge of the standby's
horizon - unless replication slots are in use, that is.

Additionally, we also need to handle the case where the replica has
restarted and is recovering using local WAL and/or archive based
recovery. In that case the standby could already have sent a *newer*
horizon as feedback upstream, but currently again have an older view.
It is entirely possible that the standby is consistent and queryable
in such a state (if nothing forced disk writes during WAL replay,
minRecoveryPoint will not be set to something too new).

> Also, standby could set own hints using xmin it sends to primary
> during feedback (but without marking page as dirty).

We do something similar for heap hint bits already...

> Of course all is not so easy, there are a few things and corner cases
> to care about
> * Looks like RecentGlobalXmin could be moved backwards in case of new
> replica with lower xmin is connected (or by switching some replica to
> hot_standby_feedback=on). We must ensure RecentGlobalXmin is moved
> strictly forward.

I'm not sure this is a problem. If that happens, we cannot rely on the
different xmin horizon anyway, because action may have been taken on
the old RecentGlobalXmin. Thus we need to be safe against that anyway.

> * hot_standby_feedback could be enabled on the fly. In such a case we
> need distinguish transactions which are safe or unsafe to deal with
> hints. Standby could receive fresh RecentGlobalXmin as response to
> feedback message. All standby transactions with xmin >=
> RecentGlobalXmin are safe to use hints.
> * hot_standby_feedback could be disabled on the fly. In such situation
> standby needs to continue to send feedback while canceling all queries
> with ignore_killed_tuples=true. Once all such queries are canceled -
> feedback are no longer needed and should be disabled.

I don't think we can rely on hot_standby_feedback at all. We can use
it to avoid unnecessary cancellations etc., and even assume it's set
up reasonably for some configurations, but there always needs to be an
independent correctness backstop.

I think it might be more promising to improve the kill logic based on
the WAL logged horizons from the primary. All we need to do, I think,
is to use a more conservatively computed RecentGlobalXmin when
determining whether tuples are dead. We already regularly log an
xl_running_xacts; adding information about the primary's horizon to
that, and stashing it in shared memory on the standby, shouldn't be
too hard. Then we can make correct, albeit likely overly pessimistic,
visibility determinations about tuples, and go on to set LP_DEAD.
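As a rough sketch of what I mean - every name below is made up, the
real record layout and shared memory bookkeeping would obviously need
more thought - something along these lines:

/*
 * Hypothetical sketch only: neither the extra running-xacts field nor
 * these functions exist today; names are invented for illustration.
 */
#include "postgres.h"
#include "access/transam.h"

/* existing horizon maintained by the snapshot code, declared elsewhere */
extern TransactionId RecentGlobalXmin;

/* A hypothetical extension of the record the primary already logs: */
typedef struct xl_running_xacts_horizon
{
	/* ... the existing xl_running_xacts fields would stay as they are ... */
	TransactionId oldestGlobalXmin;	/* primary's horizon at log time */
} xl_running_xacts_horizon;

/*
 * Stand-in for a value that would really live in shared memory on the
 * standby, filled in during redo of the extended record.
 */
static TransactionId primaryGlobalXmin = InvalidTransactionId;

/* Only ever move the stashed horizon forward. */
static void
UpdatePrimaryGlobalXmin(TransactionId xmin)
{
	if (!TransactionIdIsValid(primaryGlobalXmin) ||
		TransactionIdFollows(xmin, primaryGlobalXmin))
		primaryGlobalXmin = xmin;
}

/*
 * Horizon to use on the standby when deciding whether to set LP_DEAD:
 * the logical minimum (i.e. the older) of the local RecentGlobalXmin
 * and the primary's logged horizon.
 */
static TransactionId
GetConservativeKillHorizon(void)
{
	TransactionId horizon = RecentGlobalXmin;

	if (TransactionIdIsValid(primaryGlobalXmin) &&
		TransactionIdPrecedes(primaryGlobalXmin, horizon))
		horizon = primaryGlobalXmin;

	return horizon;
}

Redo of the extended record would call UpdatePrimaryGlobalXmin(), and
the _bt_killitems-style code on the standby would consult
GetConservativeKillHorizon() rather than RecentGlobalXmin directly.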
There are some complexities around how to avoid unnecessary query
cancellations. We'd not want to trigger recovery conflicts based on
the new xl_running_xacts field, as that'd make the conflict rate go
through the roof - but I think we could safely use the logical minimum
of the local RecentGlobalXmin and the primary's. That should allow us
to set additional LP_DEAD bits safely, I believe. We could even rely
on those LP_DEAD bits.

But I'm less clear on how we can make sure that we can *rely* on
LP_DEAD to skip over entries during scans. The bits set as described
above would be safe, but we can also see LP_DEAD set by the primary
(and even by upstream cascading standbys, at least in the case of new
base backups taken from them), due to them not being WAL logged. As we
don't WAL log them, there is no conflict associated with the LP_DEADs
being set. My gut feeling is that it's going to be very hard to get
around this without adding WAL logging to _bt_killitems et al
(including an interface for kill_prior_tuple to report the used
horizon to the index).

I'm wondering if we could recycle BTPageOpaqueData.xact to store the
horizon applying to killed tuples on the page. We don't need to store
the level for leaf pages, because we have BTP_LEAF, so we could make
space for that (potentially signalled by a new BTP flag). Obviously we
have to be careful with storing xids in the index, due to potential
wraparound danger - but I think such a page would have to be vacuumed
anyway, before a potential wraparound. I think we could safely unset
the xid during nbtree single page cleanup, and vacuum, by making sure
no LP_DEAD entries survive, and by including the horizon in the
generated WAL record.

That still doesn't fully allow us to set LP_DEAD on standbys, however
- but it'd allow us to take the primary's LP_DEADs into account on a
standby. I think we'd run into torn page issues if we were to do so
without WAL logging, because we'd rely on the LP_DEAD bits and
BTPageOpaqueData.xact to be in sync. I *think* we might be safe to do
so *iff* the page's LSN indicates that there has been a WAL record
covering it since the last redo location.
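To illustrate that last check, a very rough sketch - the flag is made
up, and whether GetRedoRecPtr() is exactly the right thing to consult
on a standby would need verification:

/*
 * Hypothetical sketch: can a standby trust the LP_DEAD bits on a btree
 * leaf page, given a killed-tuples horizon stored in the recycled
 * BTPageOpaqueData xact field?  BTP_HAS_KILL_HORIZON does not exist; it
 * stands in for the new BTP flag mentioned above.
 */
#include "postgres.h"
#include "access/nbtree.h"
#include "access/xlog.h"
#include "storage/bufpage.h"

#define BTP_HAS_KILL_HORIZON	(1 << 8)	/* hypothetical flag bit */

static bool
StandbyCanTrustKilledItems(Page page)
{
	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);

	/* Only leaf pages could recycle the level/xact field this way. */
	if (!P_ISLEAF(opaque))
		return false;

	/* No stored horizon, so nothing to trust. */
	if ((opaque->btpo_flags & BTP_HAS_KILL_HORIZON) == 0)
		return false;

	/*
	 * If the page has been covered by a WAL record since the last redo
	 * location, a torn write would have been repaired by a full-page
	 * image, so the LP_DEAD bits and the stored horizon should be in
	 * sync with each other.
	 */
	return PageGetLSN(page) > GetRedoRecPtr();
}

I only had my first cup of coffee for the day, so I might also just be
entirely off base here.

Greetings,

Andres Freund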