Hi Jason, Thanks for the written KIP. I think this is going to be a very useful tool for operational improvements since with eos in its current stage, we cannot confidently assert that we are bug-free, and even in the future when we are confident this is still going to be leveraged by older versioned brokers. Regarding the solution, I've also debated myself whether Kafka should "self-heal" automatically when detected in such situations, or should we instead build into ecosystem tooling to let operators do it. And I've also convinced myself that the latter should be a better solution to keep Kafka software itself simpler.
Regarding the KIP itself, I have a few meta comments below: 1. I'd like to clarify how we can make "--abort" work with old brokers, since without the additional field "Partitions" the tool needs to set the coordinator epoch correctly instead of "-1"? Arguably that's still doable but would require different call paths, and it's not clear whether that's worth doing for old versions. 2. Why do we have to enforce "DescribeProducers" to be sent to only leaders while ListTransactions can be sent to any brokers? Or is it really "ListTransactions to be sent to coordinators only"? From the workflow you've described, based on the results back from DescribeProducers, we should just immediately send ListTransactions to the corresponding coordinators based on the collected producer ids, instead of trying to send to any brokers right? Also I'm a bit concerned if "ListTransactions" could potentially return too much data with "StateFilters" set to all states, including completed ones. Do we expect users ever want to know transactions that are not pending? On the other hand, maybe we can just require users to specify the "pids[]" in this request too to further filter those un-interested transactions. This also works well with the workflow: we know exactly from "DescribeProducers" which pids are we diagnosing right now, so in the follow-up "ListTransactions" we should also only care for those partitions only. 3. One thing I'm a bit hesitant about is that, is `Describe` permission on the associated topic sufficient to allow any users to get all producer information writing to the specific topic-partitions including last timestamp, txn-start-timestamp etc, which may be considered sensitive? Should we require "ClusterAction" to only allow operators only? Below are more detailed comments: 4. From the example it seems "TxnStartOffset" should be included in the DescribeTransaction response schema? Otherwise the user would not get it in the following WriteTxnMarker request. 5. It is a bit easier for readers to highlight the added fields in the existing WriteTxnMarkerRequest (btw I read is that we are only adding "Partitions" with the starting offset, right?). Also as for its response it seems we do not make any schema changes except adding one more potential error code "INVALID_TXN_STATE" to it, right? If that's the case we can just state that explicitly. 6. It is not clear to me for the overloaded function that the following option classes are not specified, what should be the default options? * ListTransactionsOptions * DescribeTransactionsOptions * DescribeProducersOptions Also, it seems AbortTransactionOptions would just be empty? If yes do we really need this option class for now? 7. A couple questions from the cmd tool examples: 7.1 Is "--broker" a required or optional (in that case I presume we would just query all brokers iteratively) in "--find-hanging"? 7.2 Seems "list-producers" is not exposed as a standalone feature in the cmd but only used in the wrapping "--find-hanging", is that intentional? Personally I feel exposing a "--list-producers" may be useful too: if we believe the user has the right ACL, it is legitimate to return the producer information to her anyways. But that is debatable in the meta point 3) above. 7.3 "Describing Transactions": we should also explain how that would be executed, e.g. at least we should clarify that we would first find the coordinator based on the transactional.id and hence users do not need to specify one. 7.4. In "Aborting Transactions", should we also specify the "--broker" node as a required option? Otherwise we would not know which broker to send to. Overall, nice written one, thanks Jason. Guozhang On Thu, Aug 27, 2020 at 11:44 AM Lucas Bradstreet <lu...@confluent.io> wrote: > >> Would it be worth returning transactional.id.expiration.ms in the > DescribeProducersResponse? > > > That's an interesting thought as well. Are you trying to avoid the need > to > specify it through the command line? The tool could also query the value > with DescribeConfigs I suppose. > > Basically. I'm not sure how useful this will be in practice, though it > might help when debugging. > > Lucas > > On Thu, Aug 27, 2020 at 11:00 AM Jason Gustafson <ja...@confluent.io> > wrote: > > > Hey Lucas, > > > > Thanks for the comments. Responses below: > > > > > Given that it's possible for replica producer states to diverge from > each > > other, it would be very useful if DescribeProducers(Request,Response) and > > tooling is able to query all partition replicas for their producers > > > > Yes, it makes sense to me to let DescribeProducers work on both followers > > and leaders. In fact, I'm encouraged that there are use cases for this > work > > other than detecting hanging transactions. That was indeed the hope, but > I > > didn't have anything specific in mind. I will update the proposal. > > > > > Would it be worth returning transactional.id.expiration.ms in the > > DescribeProducersResponse? > > > > That's an interesting thought as well. Are you trying to avoid the need > to > > specify it through the command line? The tool could also query the value > > with DescribeConfigs I suppose. > > > > Thanks, > > Jason > > > > On Thu, Aug 27, 2020 at 10:48 AM Lucas Bradstreet <lu...@confluent.io> > > wrote: > > > > > Hi Jason, > > > > > > This looks like a very useful tool, thanks for writing it up. > > > > > > Given that it's possible for replica producer states to diverge from > each > > > other, it would be very useful if DescribeProducers(Request,Response) > and > > > tooling is able to query all partition replicas for their producers. > One > > > way I can see this being used immediately is in kafka's system tests, > > > especially the ones that inject failures. At the end of the test we can > > > query all replicas and make sure that their states have not diverged. I > > can > > > also see it being useful when debugging production clusters too. > > > > > > Would it be worth returning transactional.id.expiration.ms in the > > > DescribeProducersResponse? > > > > > > Cheers, > > > > > > Lucas > > > > > > > > > > > > On Wed, Aug 26, 2020 at 12:12 PM Ron Dagostino <rndg...@gmail.com> > > wrote: > > > > > > > Yes, that definitely sounds reasonable. Thanks, Jason! > > > > > > > > Ron > > > > > > > > On Wed, Aug 26, 2020 at 3:03 PM Jason Gustafson <ja...@confluent.io> > > > > wrote: > > > > > > > > > Hey Ron, > > > > > > > > > > We do not typically backport new APIs to older versions. I think we > > can > > > > > however make the --abort command compatible with older versions. It > > > would > > > > > require a user to do some analysis on their own to identify a > hanging > > > > > transaction, but then they can use the tool from a new release to > > > > recover. > > > > > For example, users could detect a hanging transaction through the > > > > existing > > > > > "LastStableOffsetLag" metric and then collect the needed > information > > > > from a > > > > > dump of the log (or producer snapshot). It's more work, but at > least > > > it's > > > > > possible. Does that sound fair? > > > > > > > > > > Thanks, > > > > > Jason > > > > > > > > > > On Wed, Aug 26, 2020 at 11:51 AM Ron Dagostino <rndg...@gmail.com> > > > > wrote: > > > > > > > > > > > Hi Jason. Thanks for the excellently-written KIP. > > > > > > > > > > > > Will the implementation be backported to prior Kafka versions? > The > > > > > reason > > > > > > I ask is because if it is not backported and similar > functionality > > is > > > > not > > > > > > otherwise made available for older versions, then the only > recourse > > > > > (aside > > > > > > from deleting and recreating the topic as you pointed out) may be > > to > > > > > > upgrade to 2.7 (or whatever version ends up getting this > > > > functionality). > > > > > > Such an upgrade may not be desirable, especially if the number of > > > > > > intermediate versions is considerable. I understand the mantra of > > > > "never > > > > > > fall too many versions behind" but the reality of it is that it > > isn't > > > > > > always the case. Even if the version is relatively recent, an > > > upgrade > > > > > may > > > > > > still not be possible for some time, and a quicker resolution may > > be > > > > > > necessary. > > > > > > > > > > > > Ron > > > > > > > > > > > > On Wed, Aug 26, 2020 at 2:33 PM Jason Gustafson < > > ja...@confluent.io> > > > > > > wrote: > > > > > > > > > > > > > Hi All, > > > > > > > > > > > > > > I've added a proposal to handle the problem of hanging > > > transactions: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-664%3A+Provide+tooling+to+detect+and+abort+hanging+transactions > > > > > > > . > > > > > > > In theory, this should never happen. In practice, we have hit > one > > > bug > > > > > > where > > > > > > > it was possible and there are few good options today to > recover. > > > > Take a > > > > > > > look and let me know what you think. > > > > > > > > > > > > > > Thanks, > > > > > > > Jason > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- -- Guozhang