On Wed, 8 Jul 2020 at 21:35, Amit Kapila <amit.kapil...@gmail.com> wrote:
>
> On Wed, Jul 8, 2020 at 11:16 AM Masahiko Sawada
> <masahiko.saw...@2ndquadrant.com> wrote:
> >
> > On Tue, 7 Jul 2020 at 15:40, Amit Kapila <amit.kapil...@gmail.com> wrote:
> > >
> > >
> > > Okay, but isn't there some advantage with this approach (manage 2PC at
> > > postgres_fdw level) as well which is that any node will be capable of
> > > handling global transactions rather than doing them via central
> > > coordinator?  I mean any node can do writes or reads rather than
> > > probably routing them (at least writes) via coordinator node.
> >
> > The postgres server where the client started the transaction works as
> > the coordinator node. I think this is true for both this patch and
> > that 2PC patch. From the perspective of atomic commit, any node will
> > be capable of handling global transactions in both approaches.
> >
>
> Okay, but then probably we need to ensure that GID has to be unique
> even if that gets generated on different nodes?  I don't know if that
> is ensured.

Yes, if you mean GID is global transaction id specified to PREPARE
TRANSACTION, it has to be unique. In that 2PC patch, GID is generated
in form of 'fx_<random string>_<server oid>_<user oid>'. I believe it
can ensure uniqueness in most cases. In addition, there is FDW API to
generate an arbitrary identifier.

>
> > >  Now, I
> > > agree that even if this advantage is there in the current approach, we
> > > can't lose the crash-safety aspect of other approach.  Will you be
> > > able to summarize what was the problem w.r.t crash-safety and how your
> > > patch has dealt it?
> >
> > Since this patch proceeds 2PC without any logging, foreign
> > transactions prepared on foreign servers are left over without any
> > clues if the coordinator crashes during commit. Therefore, after
> > restart, the user will need to find and resolve in-doubt foreign
> > transactions manually.
> >
>
> Okay, but is it because we can't directly WAL log in postgres_fdw or
> there is some other reason for not doing so?

Yes, I think it is because we cannot WAL log in postgres_fdw. Maybe I
missed the point in your question. Please correct me if I missed
something.

>
> >
> > >
> > > > Looking at the commit procedure with this patch:
> > > >
> > > > When starting a new transaction on a foreign server, postgres_fdw
> > > > executes pg_global_snapshot_import() to import the global snapshot.
> > > > After some work, in pre-commit phase we do:
> > > >
> > > > 1. generate global transaction id, say 'gid'
> > > > 2. execute PREPARE TRANSACTION 'gid' on all participants.
> > > > 3. prepare global snapshot locally, if the local node also involves
> > > > the transaction
> > > > 4. execute pg_global_snapshot_prepare('gid') for all participants
> > > >
> > > > During step 2 to 4, we calculate the maximum CSN from the CSNs
> > > > returned from each pg_global_snapshot_prepare() executions.
> > > >
> > > > 5. assign global snapshot locally, if the local node also involves the
> > > > transaction
> > > > 6. execute pg_global_snapshot_assign('gid', max-csn) on all 
> > > > participants.
> > > >
> > > > Then, we commit locally (i.g. mark the current transaction as
> > > > committed in clog).
> > > >
> > > > After that, in post-commit phase, execute COMMIT PREPARED 'gid' on all
> > > > participants.
> > > >
> > >
> > > As per my current understanding, the overall idea is as follows.  For
> > > global transactions, pg_global_snapshot_prepare('gid') will set the
> > > transaction status as InDoubt and generate CSN (let's call it NodeCSN)
> > > at the node where that function is executed, it also returns the
> > > NodeCSN to the coordinator.  Then the coordinator (the current
> > > postgres_fdw node on which write transaction is being executed)
> > > computes MaxCSN based on the return value (NodeCSN) of prepare
> > > (pg_global_snapshot_prepare) from all nodes.  It then assigns MaxCSN
> > > to each node.  Finally, when Commit Prepared is issued for each node
> > > that MaxCSN will be written to each node including the current node.
> > > So, with this idea, each node will have the same view of CSN value
> > > corresponding to any particular transaction.
> > >
> > > For Snapshot management, the node which receives the query generates a
> > > CSN (CurrentCSN) and follows the simple rule that the tuple having a
> > > xid with CSN lesser than CurrentCSN will be visible.  Now, it is
> > > possible that when we are examining a tuple, the CSN corresponding to
> > > xid that has written the tuple has a value as INDOUBT which will
> > > indicate that the transaction is yet not committed on all nodes.  And
> > > we wait till we get the valid CSN value corresponding to xid and then
> > > use it to check if the tuple is visible.
> > >
> > > Now, one thing to note here is that for global transactions we
> > > primarily rely on CSN value corresponding to a transaction for its
> > > visibility even though we still maintain CLOG for local transaction
> > > status.
> > >
> > > Leaving aside the incomplete parts and or flaws of the current patch,
> > > does the above match the top-level idea of this patch?
> >
> > I'm still studying this patch but your understanding seems right to me.
> >
>
> Cool. While studying, if you can try to think whether this approach is
> different from the global coordinator based approach then it would be
> great.  Here is my initial thought apart from other reasons the global
> coordinator based design can help us to do the global transaction
> management and snapshots.  It can allocate xids for each transaction
> and then collect the list of running xacts (or CSN) from each node and
> then prepare a global snapshot that can be used to perform any
> transaction. OTOH, in the design proposed in this patch, we don't need any
> coordinator to manage transactions and snapshots because each node's
> current CSN will be sufficient for snapshot and visibility as
> explained above.

Yeah, my thought is the same as you. Since both approaches have strong
points and weak points I cannot mention which is a better approach,
but that 2PC patch would go well together with the design proposed in
this patch.

> Now, sure this assumes that there is no clock skew
> on different nodes or somehow we take care of the same (Note that in
> the proposed patch the CSN is a timestamp.).

As far as I read Clock-SI paper, we take care of the clock skew by
putting some waits on the transaction start and reading tuples on the
remote node.

>
> > > I am not sure
> > > if my understanding of this patch at this stage is completely correct
> > > or whether we want to follow the approach of this patch but I think at
> > > least lets first be sure if such a top-level idea can achieve what we
> > > want to do here.
> > >
> > > > Considering how to integrate this global snapshot feature with the 2PC
> > > > patch, what the 2PC patch needs to at least change is to allow FDW to
> > > > store an FDW-private data that is passed to subsequent FDW transaction
> > > > API calls. Currently, in the current 2PC patch, we call Prepare API
> > > > for each participant servers one by one, and the core pass only
> > > > metadata such as ForeignServer, UserMapping, and global transaction
> > > > identifier. So it's not easy to calculate the maximum CSN across
> > > > multiple transaction API calls. I think we can change the 2PC patch to
> > > > add a void pointer into FdwXactRslvState, struct passed from the core,
> > > > in order to store FDW-private data. It's going to be the maximum CSN
> > > > in this case. That way, at the first Prepare API calls postgres_fdw
> > > > allocates the space and stores CSN to that space. And at subsequent
> > > > Prepare API calls it can calculate the maximum of csn, and then is
> > > > able to the step 3 to 6 when preparing the transaction on the last
> > > > participant. Another idea would be to change 2PC patch so that the
> > > > core passes a bunch of participants grouped by FDW.
> > > >
> > >
> > > IIUC with this the coordinator needs the communication with the nodes
> > > twice at the prepare stage, once to prepare the transaction in each
> > > node and get CSN from each node and then to communicate MaxCSN to each
> > > node?
> >
> > Yes, I think so too.
> >
> > > Also, we probably need InDoubt CSN status at prepare phase to
> > > make snapshots and global visibility work.
> >
> > I think it depends on how global CSN feature works.
> >
> > For instance, in that 2PC patch, if the coordinator crashes during
> > preparing a foreign transaction, the global transaction manager
> > recovers and regards it as "prepared" regardless of the foreign
> > transaction actually having been prepared. And it sends ROLLBACK
> > PREPARED after recovery completed. With global CSN patch, as you
> > mentioned, at prepare phase the coordinator needs to communicate
> > participants twice other than sending PREPARE TRANSACTION:
> > pg_global_snapshot_prepare() and pg_global_snapshot_assign().
> >
> > If global CSN patch needs different cleanup work depending on the CSN
> > status, we will need InDoubt CSN status so that the global transaction
> > manager can distinguish between a foreign transaction that has
> > executed pg_global_snapshot_prepare() and the one that has executed
> > pg_global_snapshot_assign().
> >
> > On the other hand, if it's enough to just send ROLLBACK or ROLLBACK
> > PREPARED in that case, I think we don't need InDoubt CSN status. There
> > is no difference between those foreign transactions from the global
> > transaction manager perspective.
> >
>
> I think InDoubt status helps in checking visibility in the proposed
> patch wherein if we find the status of the transaction as InDoubt, we
> wait till we get some valid CSN for it as explained in my previous
> email.  So whether we use it for Rollback/Rollback Prepared, it is
> required for this design.

Yes, InDoubt status is required for checking visibility. My comment
was it's not necessary from the perspective of atomic commit.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply via email to