Re: [DISCUSS] KIP-664: Provide tooling to detect and abort hanging transactions

Robert Barrett Fri, 28 Aug 2020 08:13:32 -0700

Hi Jason,

Thanks for this KIP, I think this will be a huge operational improvement
and overall it looks great to me.


I'm not sure how much value the MaxActiveTransactionDuration metric adds,
given that we have the --find-hanging option in the tool. As you mention,
instances of these transactions are expected to be rare, and a
partition-level metric, which can generate a lot of data, seems very
heavyweight for such a rare occurrence. I think "alert on
PartitionsWithLateTransactionsCount" followed by "run kafka-transactions
--find-hanging on the relevant broker" is a reasonable process for cluster
operators to follow.

Thanks,
Bob

On Thu, Aug 27, 2020 at 9:23 PM Guozhang Wang <wangg...@gmail.com> wrote:

> Hi Jason,
>
> Thanks for the written KIP. I think this is going to be a very useful tool
> for operational improvements since with eos in its current stage, we cannot
> confidently assert that we are bug-free, and even in the future when we are
> confident this is still going to be leveraged by older versioned brokers.
> Regarding the solution, I've also debated myself whether Kafka should
> "self-heal" automatically when detected in such situations, or should we
> instead build into ecosystem tooling to let operators do it. And I've also
> convinced myself that the latter should be a better solution to keep Kafka
> software itself simpler.
>
> Regarding the KIP itself, I have a few meta comments below:
>
> 1. I'd like to clarify how we can make "--abort" work with old brokers,
> since without the additional field "Partitions" the tool needs to set the
> coordinator epoch correctly instead of "-1"? Arguably that's still doable
> but would require different call paths, and it's not clear whether that's
> worth doing for old versions.
>
> 2. Why do we have to enforce "DescribeProducers" to be sent to only leaders
> while ListTransactions can be sent to any brokers? Or is it really
> "ListTransactions to be sent to coordinators only"? From the workflow
> you've described, based on the results back from DescribeProducers, we
> should just immediately send ListTransactions to the
> corresponding coordinators based on the collected producer ids, instead of
> trying to send to any brokers right?
>
> Also I'm a bit concerned if "ListTransactions" could potentially return too
> much data with "StateFilters" set to all states, including completed ones.
> Do we expect users ever want to know transactions that are not pending? On
> the other hand, maybe we can just require users to specify the "pids[]" in
> this request too to further filter those un-interested transactions. This
> also works well with the workflow: we know exactly from "DescribeProducers"
> which pids are we diagnosing right now, so in the follow-up
> "ListTransactions" we should also only care for those partitions only.
>
> 3. One thing I'm a bit hesitant about is that, is `Describe` permission on
> the associated topic sufficient to allow any users to get all producer
> information writing to the specific topic-partitions including last
> timestamp, txn-start-timestamp etc, which may be considered sensitive?
> Should we require "ClusterAction" to only allow operators only?
>
> Below are more detailed comments:
>
> 4. From the example it seems "TxnStartOffset" should be included in the
> DescribeTransaction response schema? Otherwise the user would not get it in
> the following WriteTxnMarker request.
>
> 5. It is a bit easier for readers to highlight the added fields in the
> existing WriteTxnMarkerRequest (btw I read is that we are only adding
> "Partitions" with the starting offset, right?). Also as for its response it
> seems we do not make any schema changes except adding one more potential
> error code "INVALID_TXN_STATE" to it, right? If that's the case we can just
> state that explicitly.
>
> 6. It is not clear to me for the overloaded function that the following
> option classes are not specified, what should be the default options?
>
> * ListTransactionsOptions
> * DescribeTransactionsOptions
> * DescribeProducersOptions
>
> Also, it seems AbortTransactionOptions would just be empty? If yes do we
> really need this option class for now?
>
> 7. A couple questions from the cmd tool examples:
> 7.1 Is "--broker" a required or optional (in that case I presume we would
> just query all brokers iteratively) in "--find-hanging"?
> 7.2 Seems "list-producers" is not exposed as a standalone feature in the
> cmd but only used in the wrapping "--find-hanging", is that intentional?
> Personally I feel exposing a "--list-producers" may be useful too: if we
> believe the user has the right ACL, it is legitimate to return the producer
> information to her anyways. But that is debatable in the meta point 3)
> above.
> 7.3 "Describing Transactions": we should also explain how that would be
> executed, e.g. at least we should clarify that we would first find the
> coordinator based on the transactional.id and hence users do not need to
> specify one.
> 7.4. In "Aborting Transactions", should we also specify the "--broker" node
> as a required option? Otherwise we would not know which broker to send to.
>
>
> Overall, nice written one, thanks Jason.
>
> Guozhang
>
>
> On Thu, Aug 27, 2020 at 11:44 AM Lucas Bradstreet <lu...@confluent.io>
> wrote:
>
> > >> Would it be worth returning transactional.id.expiration.ms in the
> > DescribeProducersResponse?
> >
> > > That's an interesting thought as well. Are you trying to avoid the need
> > to
> > specify it through the command line? The tool could also query the value
> > with DescribeConfigs I suppose.
> >
> > Basically. I'm not sure how useful this will be in practice, though it
> > might help when debugging.
> >
> > Lucas
> >
> > On Thu, Aug 27, 2020 at 11:00 AM Jason Gustafson <ja...@confluent.io>
> > wrote:
> >
> > > Hey Lucas,
> > >
> > > Thanks for the comments. Responses below:
> > >
> > > > Given that it's possible for replica producer states to diverge from
> > each
> > > other, it would be very useful if DescribeProducers(Request,Response)
> and
> > > tooling is able to query all partition replicas for their producers
> > >
> > > Yes, it makes sense to me to let DescribeProducers work on both
> followers
> > > and leaders. In fact, I'm encouraged that there are use cases for this
> > work
> > > other than detecting hanging transactions. That was indeed the hope,
> but
> > I
> > > didn't have anything specific in mind. I will update the proposal.
> > >
> > > > Would it be worth returning transactional.id.expiration.ms in the
> > > DescribeProducersResponse?
> > >
> > > That's an interesting thought as well. Are you trying to avoid the need
> > to
> > > specify it through the command line? The tool could also query the
> value
> > > with DescribeConfigs I suppose.
> > >
> > > Thanks,
> > > Jason
> > >
> > > On Thu, Aug 27, 2020 at 10:48 AM Lucas Bradstreet <lu...@confluent.io>
> > > wrote:
> > >
> > > > Hi Jason,
> > > >
> > > > This looks like a very useful tool, thanks for writing it up.
> > > >
> > > > Given that it's possible for replica producer states to diverge from
> > each
> > > > other, it would be very useful if DescribeProducers(Request,Response)
> > and
> > > > tooling is able to query all partition replicas for their producers.
> > One
> > > > way I can see this being used immediately is in kafka's system tests,
> > > > especially the ones that inject failures. At the end of the test we
> can
> > > > query all replicas and make sure that their states have not
> diverged. I
> > > can
> > > > also see it being useful when debugging production clusters too.
> > > >
> > > > Would it be worth returning transactional.id.expiration.ms in the
> > > > DescribeProducersResponse?
> > > >
> > > > Cheers,
> > > >
> > > > Lucas
> > > >
> > > >
> > > >
> > > > On Wed, Aug 26, 2020 at 12:12 PM Ron Dagostino <rndg...@gmail.com>
> > > wrote:
> > > >
> > > > > Yes, that definitely sounds reasonable.  Thanks, Jason!
> > > > >
> > > > > Ron
> > > > >
> > > > > On Wed, Aug 26, 2020 at 3:03 PM Jason Gustafson <
> ja...@confluent.io>
> > > > > wrote:
> > > > >
> > > > > > Hey Ron,
> > > > > >
> > > > > > We do not typically backport new APIs to older versions. I think
> we
> > > can
> > > > > > however make the --abort command compatible with older versions.
> It
> > > > would
> > > > > > require a user to do some analysis on their own to identify a
> > hanging
> > > > > > transaction, but then they can use the tool from a new release to
> > > > > recover.
> > > > > > For example, users could detect a hanging transaction through the
> > > > > existing
> > > > > > "LastStableOffsetLag" metric and then collect the needed
> > information
> > > > > from a
> > > > > > dump of the log (or producer snapshot). It's more work, but at
> > least
> > > > it's
> > > > > > possible. Does that sound fair?
> > > > > >
> > > > > > Thanks,
> > > > > > Jason
> > > > > >
> > > > > > On Wed, Aug 26, 2020 at 11:51 AM Ron Dagostino <
> rndg...@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > Hi Jason.  Thanks for the excellently-written KIP.
> > > > > > >
> > > > > > > Will the implementation be backported to prior Kafka versions?
> > The
> > > > > > reason
> > > > > > > I ask is because if it is not backported and similar
> > functionality
> > > is
> > > > > not
> > > > > > > otherwise made available for older versions, then the only
> > recourse
> > > > > > (aside
> > > > > > > from deleting and recreating the topic as you pointed out) may
> be
> > > to
> > > > > > > upgrade to 2.7 (or whatever version ends up getting this
> > > > > functionality).
> > > > > > > Such an upgrade may not be desirable, especially if the number
> of
> > > > > > > intermediate versions is considerable. I understand the mantra
> of
> > > > > "never
> > > > > > > fall too many versions behind" but the reality of it is that it
> > > isn't
> > > > > > > always the case.  Even if the version is relatively recent, an
> > > > upgrade
> > > > > > may
> > > > > > > still not be possible for some time, and a quicker resolution
> may
> > > be
> > > > > > > necessary.
> > > > > > >
> > > > > > > Ron
> > > > > > >
> > > > > > > On Wed, Aug 26, 2020 at 2:33 PM Jason Gustafson <
> > > ja...@confluent.io>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi All,
> > > > > > > >
> > > > > > > > I've added a proposal to handle the problem of hanging
> > > > transactions:
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-664%3A+Provide+tooling+to+detect+and+abort+hanging+transactions
> > > > > > > > .
> > > > > > > > In theory, this should never happen. In practice, we have hit
> > one
> > > > bug
> > > > > > > where
> > > > > > > > it was possible and there are few good options today to
> > recover.
> > > > > Take a
> > > > > > > > look and let me know what you think.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Jason
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
>
> --
> -- Guozhang
>

Re: [DISCUSS] KIP-664: Provide tooling to detect and abort hanging transactions

Reply via email to