Hi Jason, Thanks for this KIP, I think this will be a huge operational improvement and overall it looks great to me.
I'm not sure how much value the MaxActiveTransactionDuration metric adds, given that we have the --find-hanging option in the tool. As you mention, instances of these transactions are expected to be rare, and a partition-level metric, which can generate a lot of data, seems very heavyweight for such a rare occurrence. I think "alert on PartitionsWithLateTransactionsCount" followed by "run kafka-transactions --find-hanging on the relevant broker" is a reasonable process for cluster operators to follow. Thanks, Bob On Thu, Aug 27, 2020 at 9:23 PM Guozhang Wang <wangg...@gmail.com> wrote: > Hi Jason, > > Thanks for the written KIP. I think this is going to be a very useful tool > for operational improvements since with eos in its current stage, we cannot > confidently assert that we are bug-free, and even in the future when we are > confident this is still going to be leveraged by older versioned brokers. > Regarding the solution, I've also debated myself whether Kafka should > "self-heal" automatically when detected in such situations, or should we > instead build into ecosystem tooling to let operators do it. And I've also > convinced myself that the latter should be a better solution to keep Kafka > software itself simpler. > > Regarding the KIP itself, I have a few meta comments below: > > 1. I'd like to clarify how we can make "--abort" work with old brokers, > since without the additional field "Partitions" the tool needs to set the > coordinator epoch correctly instead of "-1"? Arguably that's still doable > but would require different call paths, and it's not clear whether that's > worth doing for old versions. > > 2. Why do we have to enforce "DescribeProducers" to be sent to only leaders > while ListTransactions can be sent to any brokers? Or is it really > "ListTransactions to be sent to coordinators only"? From the workflow > you've described, based on the results back from DescribeProducers, we > should just immediately send ListTransactions to the > corresponding coordinators based on the collected producer ids, instead of > trying to send to any brokers right? > > Also I'm a bit concerned if "ListTransactions" could potentially return too > much data with "StateFilters" set to all states, including completed ones. > Do we expect users ever want to know transactions that are not pending? On > the other hand, maybe we can just require users to specify the "pids[]" in > this request too to further filter those un-interested transactions. This > also works well with the workflow: we know exactly from "DescribeProducers" > which pids are we diagnosing right now, so in the follow-up > "ListTransactions" we should also only care for those partitions only. > > 3. One thing I'm a bit hesitant about is that, is `Describe` permission on > the associated topic sufficient to allow any users to get all producer > information writing to the specific topic-partitions including last > timestamp, txn-start-timestamp etc, which may be considered sensitive? > Should we require "ClusterAction" to only allow operators only? > > Below are more detailed comments: > > 4. From the example it seems "TxnStartOffset" should be included in the > DescribeTransaction response schema? Otherwise the user would not get it in > the following WriteTxnMarker request. > > 5. It is a bit easier for readers to highlight the added fields in the > existing WriteTxnMarkerRequest (btw I read is that we are only adding > "Partitions" with the starting offset, right?). Also as for its response it > seems we do not make any schema changes except adding one more potential > error code "INVALID_TXN_STATE" to it, right? If that's the case we can just > state that explicitly. > > 6. It is not clear to me for the overloaded function that the following > option classes are not specified, what should be the default options? > > * ListTransactionsOptions > * DescribeTransactionsOptions > * DescribeProducersOptions > > Also, it seems AbortTransactionOptions would just be empty? If yes do we > really need this option class for now? > > 7. A couple questions from the cmd tool examples: > 7.1 Is "--broker" a required or optional (in that case I presume we would > just query all brokers iteratively) in "--find-hanging"? > 7.2 Seems "list-producers" is not exposed as a standalone feature in the > cmd but only used in the wrapping "--find-hanging", is that intentional? > Personally I feel exposing a "--list-producers" may be useful too: if we > believe the user has the right ACL, it is legitimate to return the producer > information to her anyways. But that is debatable in the meta point 3) > above. > 7.3 "Describing Transactions": we should also explain how that would be > executed, e.g. at least we should clarify that we would first find the > coordinator based on the transactional.id and hence users do not need to > specify one. > 7.4. In "Aborting Transactions", should we also specify the "--broker" node > as a required option? Otherwise we would not know which broker to send to. > > > Overall, nice written one, thanks Jason. > > Guozhang > > > On Thu, Aug 27, 2020 at 11:44 AM Lucas Bradstreet <lu...@confluent.io> > wrote: > > > >> Would it be worth returning transactional.id.expiration.ms in the > > DescribeProducersResponse? > > > > > That's an interesting thought as well. Are you trying to avoid the need > > to > > specify it through the command line? The tool could also query the value > > with DescribeConfigs I suppose. > > > > Basically. I'm not sure how useful this will be in practice, though it > > might help when debugging. > > > > Lucas > > > > On Thu, Aug 27, 2020 at 11:00 AM Jason Gustafson <ja...@confluent.io> > > wrote: > > > > > Hey Lucas, > > > > > > Thanks for the comments. Responses below: > > > > > > > Given that it's possible for replica producer states to diverge from > > each > > > other, it would be very useful if DescribeProducers(Request,Response) > and > > > tooling is able to query all partition replicas for their producers > > > > > > Yes, it makes sense to me to let DescribeProducers work on both > followers > > > and leaders. In fact, I'm encouraged that there are use cases for this > > work > > > other than detecting hanging transactions. That was indeed the hope, > but > > I > > > didn't have anything specific in mind. I will update the proposal. > > > > > > > Would it be worth returning transactional.id.expiration.ms in the > > > DescribeProducersResponse? > > > > > > That's an interesting thought as well. Are you trying to avoid the need > > to > > > specify it through the command line? The tool could also query the > value > > > with DescribeConfigs I suppose. > > > > > > Thanks, > > > Jason > > > > > > On Thu, Aug 27, 2020 at 10:48 AM Lucas Bradstreet <lu...@confluent.io> > > > wrote: > > > > > > > Hi Jason, > > > > > > > > This looks like a very useful tool, thanks for writing it up. > > > > > > > > Given that it's possible for replica producer states to diverge from > > each > > > > other, it would be very useful if DescribeProducers(Request,Response) > > and > > > > tooling is able to query all partition replicas for their producers. > > One > > > > way I can see this being used immediately is in kafka's system tests, > > > > especially the ones that inject failures. At the end of the test we > can > > > > query all replicas and make sure that their states have not > diverged. I > > > can > > > > also see it being useful when debugging production clusters too. > > > > > > > > Would it be worth returning transactional.id.expiration.ms in the > > > > DescribeProducersResponse? > > > > > > > > Cheers, > > > > > > > > Lucas > > > > > > > > > > > > > > > > On Wed, Aug 26, 2020 at 12:12 PM Ron Dagostino <rndg...@gmail.com> > > > wrote: > > > > > > > > > Yes, that definitely sounds reasonable. Thanks, Jason! > > > > > > > > > > Ron > > > > > > > > > > On Wed, Aug 26, 2020 at 3:03 PM Jason Gustafson < > ja...@confluent.io> > > > > > wrote: > > > > > > > > > > > Hey Ron, > > > > > > > > > > > > We do not typically backport new APIs to older versions. I think > we > > > can > > > > > > however make the --abort command compatible with older versions. > It > > > > would > > > > > > require a user to do some analysis on their own to identify a > > hanging > > > > > > transaction, but then they can use the tool from a new release to > > > > > recover. > > > > > > For example, users could detect a hanging transaction through the > > > > > existing > > > > > > "LastStableOffsetLag" metric and then collect the needed > > information > > > > > from a > > > > > > dump of the log (or producer snapshot). It's more work, but at > > least > > > > it's > > > > > > possible. Does that sound fair? > > > > > > > > > > > > Thanks, > > > > > > Jason > > > > > > > > > > > > On Wed, Aug 26, 2020 at 11:51 AM Ron Dagostino < > rndg...@gmail.com> > > > > > wrote: > > > > > > > > > > > > > Hi Jason. Thanks for the excellently-written KIP. > > > > > > > > > > > > > > Will the implementation be backported to prior Kafka versions? > > The > > > > > > reason > > > > > > > I ask is because if it is not backported and similar > > functionality > > > is > > > > > not > > > > > > > otherwise made available for older versions, then the only > > recourse > > > > > > (aside > > > > > > > from deleting and recreating the topic as you pointed out) may > be > > > to > > > > > > > upgrade to 2.7 (or whatever version ends up getting this > > > > > functionality). > > > > > > > Such an upgrade may not be desirable, especially if the number > of > > > > > > > intermediate versions is considerable. I understand the mantra > of > > > > > "never > > > > > > > fall too many versions behind" but the reality of it is that it > > > isn't > > > > > > > always the case. Even if the version is relatively recent, an > > > > upgrade > > > > > > may > > > > > > > still not be possible for some time, and a quicker resolution > may > > > be > > > > > > > necessary. > > > > > > > > > > > > > > Ron > > > > > > > > > > > > > > On Wed, Aug 26, 2020 at 2:33 PM Jason Gustafson < > > > ja...@confluent.io> > > > > > > > wrote: > > > > > > > > > > > > > > > Hi All, > > > > > > > > > > > > > > > > I've added a proposal to handle the problem of hanging > > > > transactions: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-664%3A+Provide+tooling+to+detect+and+abort+hanging+transactions > > > > > > > > . > > > > > > > > In theory, this should never happen. In practice, we have hit > > one > > > > bug > > > > > > > where > > > > > > > > it was possible and there are few good options today to > > recover. > > > > > Take a > > > > > > > > look and let me know what you think. > > > > > > > > > > > > > > > > Thanks, > > > > > > > > Jason > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > -- Guozhang >