On Wed, 8 Jul 2020 at 21:35, Amit Kapila <amit.kapil...@gmail.com> wrote: > > On Wed, Jul 8, 2020 at 11:16 AM Masahiko Sawada > <masahiko.saw...@2ndquadrant.com> wrote: > > > > On Tue, 7 Jul 2020 at 15:40, Amit Kapila <amit.kapil...@gmail.com> wrote: > > > > > > > > > Okay, but isn't there some advantage with this approach (manage 2PC at > > > postgres_fdw level) as well which is that any node will be capable of > > > handling global transactions rather than doing them via central > > > coordinator? I mean any node can do writes or reads rather than > > > probably routing them (at least writes) via coordinator node. > > > > The postgres server where the client started the transaction works as > > the coordinator node. I think this is true for both this patch and > > that 2PC patch. From the perspective of atomic commit, any node will > > be capable of handling global transactions in both approaches. > > > > Okay, but then probably we need to ensure that GID has to be unique > even if that gets generated on different nodes? I don't know if that > is ensured.
Yes, if you mean GID is global transaction id specified to PREPARE TRANSACTION, it has to be unique. In that 2PC patch, GID is generated in form of 'fx_<random string>_<server oid>_<user oid>'. I believe it can ensure uniqueness in most cases. In addition, there is FDW API to generate an arbitrary identifier. > > > > Now, I > > > agree that even if this advantage is there in the current approach, we > > > can't lose the crash-safety aspect of other approach. Will you be > > > able to summarize what was the problem w.r.t crash-safety and how your > > > patch has dealt it? > > > > Since this patch proceeds 2PC without any logging, foreign > > transactions prepared on foreign servers are left over without any > > clues if the coordinator crashes during commit. Therefore, after > > restart, the user will need to find and resolve in-doubt foreign > > transactions manually. > > > > Okay, but is it because we can't directly WAL log in postgres_fdw or > there is some other reason for not doing so? Yes, I think it is because we cannot WAL log in postgres_fdw. Maybe I missed the point in your question. Please correct me if I missed something. > > > > > > > > > > Looking at the commit procedure with this patch: > > > > > > > > When starting a new transaction on a foreign server, postgres_fdw > > > > executes pg_global_snapshot_import() to import the global snapshot. > > > > After some work, in pre-commit phase we do: > > > > > > > > 1. generate global transaction id, say 'gid' > > > > 2. execute PREPARE TRANSACTION 'gid' on all participants. > > > > 3. prepare global snapshot locally, if the local node also involves > > > > the transaction > > > > 4. execute pg_global_snapshot_prepare('gid') for all participants > > > > > > > > During step 2 to 4, we calculate the maximum CSN from the CSNs > > > > returned from each pg_global_snapshot_prepare() executions. > > > > > > > > 5. assign global snapshot locally, if the local node also involves the > > > > transaction > > > > 6. execute pg_global_snapshot_assign('gid', max-csn) on all > > > > participants. > > > > > > > > Then, we commit locally (i.g. mark the current transaction as > > > > committed in clog). > > > > > > > > After that, in post-commit phase, execute COMMIT PREPARED 'gid' on all > > > > participants. > > > > > > > > > > As per my current understanding, the overall idea is as follows. For > > > global transactions, pg_global_snapshot_prepare('gid') will set the > > > transaction status as InDoubt and generate CSN (let's call it NodeCSN) > > > at the node where that function is executed, it also returns the > > > NodeCSN to the coordinator. Then the coordinator (the current > > > postgres_fdw node on which write transaction is being executed) > > > computes MaxCSN based on the return value (NodeCSN) of prepare > > > (pg_global_snapshot_prepare) from all nodes. It then assigns MaxCSN > > > to each node. Finally, when Commit Prepared is issued for each node > > > that MaxCSN will be written to each node including the current node. > > > So, with this idea, each node will have the same view of CSN value > > > corresponding to any particular transaction. > > > > > > For Snapshot management, the node which receives the query generates a > > > CSN (CurrentCSN) and follows the simple rule that the tuple having a > > > xid with CSN lesser than CurrentCSN will be visible. Now, it is > > > possible that when we are examining a tuple, the CSN corresponding to > > > xid that has written the tuple has a value as INDOUBT which will > > > indicate that the transaction is yet not committed on all nodes. And > > > we wait till we get the valid CSN value corresponding to xid and then > > > use it to check if the tuple is visible. > > > > > > Now, one thing to note here is that for global transactions we > > > primarily rely on CSN value corresponding to a transaction for its > > > visibility even though we still maintain CLOG for local transaction > > > status. > > > > > > Leaving aside the incomplete parts and or flaws of the current patch, > > > does the above match the top-level idea of this patch? > > > > I'm still studying this patch but your understanding seems right to me. > > > > Cool. While studying, if you can try to think whether this approach is > different from the global coordinator based approach then it would be > great. Here is my initial thought apart from other reasons the global > coordinator based design can help us to do the global transaction > management and snapshots. It can allocate xids for each transaction > and then collect the list of running xacts (or CSN) from each node and > then prepare a global snapshot that can be used to perform any > transaction. OTOH, in the design proposed in this patch, we don't need any > coordinator to manage transactions and snapshots because each node's > current CSN will be sufficient for snapshot and visibility as > explained above. Yeah, my thought is the same as you. Since both approaches have strong points and weak points I cannot mention which is a better approach, but that 2PC patch would go well together with the design proposed in this patch. > Now, sure this assumes that there is no clock skew > on different nodes or somehow we take care of the same (Note that in > the proposed patch the CSN is a timestamp.). As far as I read Clock-SI paper, we take care of the clock skew by putting some waits on the transaction start and reading tuples on the remote node. > > > > I am not sure > > > if my understanding of this patch at this stage is completely correct > > > or whether we want to follow the approach of this patch but I think at > > > least lets first be sure if such a top-level idea can achieve what we > > > want to do here. > > > > > > > Considering how to integrate this global snapshot feature with the 2PC > > > > patch, what the 2PC patch needs to at least change is to allow FDW to > > > > store an FDW-private data that is passed to subsequent FDW transaction > > > > API calls. Currently, in the current 2PC patch, we call Prepare API > > > > for each participant servers one by one, and the core pass only > > > > metadata such as ForeignServer, UserMapping, and global transaction > > > > identifier. So it's not easy to calculate the maximum CSN across > > > > multiple transaction API calls. I think we can change the 2PC patch to > > > > add a void pointer into FdwXactRslvState, struct passed from the core, > > > > in order to store FDW-private data. It's going to be the maximum CSN > > > > in this case. That way, at the first Prepare API calls postgres_fdw > > > > allocates the space and stores CSN to that space. And at subsequent > > > > Prepare API calls it can calculate the maximum of csn, and then is > > > > able to the step 3 to 6 when preparing the transaction on the last > > > > participant. Another idea would be to change 2PC patch so that the > > > > core passes a bunch of participants grouped by FDW. > > > > > > > > > > IIUC with this the coordinator needs the communication with the nodes > > > twice at the prepare stage, once to prepare the transaction in each > > > node and get CSN from each node and then to communicate MaxCSN to each > > > node? > > > > Yes, I think so too. > > > > > Also, we probably need InDoubt CSN status at prepare phase to > > > make snapshots and global visibility work. > > > > I think it depends on how global CSN feature works. > > > > For instance, in that 2PC patch, if the coordinator crashes during > > preparing a foreign transaction, the global transaction manager > > recovers and regards it as "prepared" regardless of the foreign > > transaction actually having been prepared. And it sends ROLLBACK > > PREPARED after recovery completed. With global CSN patch, as you > > mentioned, at prepare phase the coordinator needs to communicate > > participants twice other than sending PREPARE TRANSACTION: > > pg_global_snapshot_prepare() and pg_global_snapshot_assign(). > > > > If global CSN patch needs different cleanup work depending on the CSN > > status, we will need InDoubt CSN status so that the global transaction > > manager can distinguish between a foreign transaction that has > > executed pg_global_snapshot_prepare() and the one that has executed > > pg_global_snapshot_assign(). > > > > On the other hand, if it's enough to just send ROLLBACK or ROLLBACK > > PREPARED in that case, I think we don't need InDoubt CSN status. There > > is no difference between those foreign transactions from the global > > transaction manager perspective. > > > > I think InDoubt status helps in checking visibility in the proposed > patch wherein if we find the status of the transaction as InDoubt, we > wait till we get some valid CSN for it as explained in my previous > email. So whether we use it for Rollback/Rollback Prepared, it is > required for this design. Yes, InDoubt status is required for checking visibility. My comment was it's not necessary from the perspective of atomic commit. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services