Hi Boyang, Thanks for the comments. Responses below:
> 1. For the analysis section, is there any consistency guarantee for `ListTransactions` and `DescribeTransactions`? Let's say the coordinator receives a DescribeTransactions while the transaction is almost complete at the same time, should we have isolation to avoid returning stale information? If an admin client sends excessive describe requests, would it affect the normal processing on transactions? It's similar to `ListGroups` and `DescribeGroup`. There is no guarantee on the consistency between separate calls. The API is just reflecting the current state. In general, tools need to be careful when trying to take some action based on that state. This is one of the reasons I changed the `WriteTxnMarkers` API to take the start offset of the transaction the user wants to abort. In case we get into some race with the coordinator, at least we won't end up aborting the wrong transaction. > 2. I'm not sure whether `PartitionsWithLateTransactionsCount` is providing much value here. Users have no responsibilities to tight their transaction session size below the max transaction timeout. As long as there is prolonged progress being made from the client side, the root cause would be on some ill-performing producers, which means we should monitor the producer client instead. Not sure I follow. The max transaction timeout is enforced in the `AddPartitionsToTxn` API. The client is not allowed a larger timeout, so it seems fair for the broker to treat transactions which have been open longer than this as "late." The benefit of this metric is that it provides a simple alert criteria (i.e. alert for any positive value). > 3. It would be good to highlight new fields in the `WriteTxnMarkersRequest` schema Ack. Will do. > 4. The response schema in `ListTransactions` section was wrong, which should be `ListTransactionsResponse` Thanks. I'll fix the name, but the schema looks like what I wanted. >5. From the original context of ticket 9144, the reasoning for the hanging transaction is due to an uncensored open transaction on the partition leader. Could we just add the direct admin request support like `findAllSuspiciousTransaction` to detect that, by scanning all partition leaders and transaction coordinators within the cluster, and figure out any open transaction on the partition leader side not known to any coordinator? This is basically what the --find-hanging command is doing. I debated moving this logic either into the admin client or the broker, but in the end, I decided it was better to have good general purpose APIs and let the complexity be in the tooling. > 6. typo: "In this case, we the command..." => "In this case, the command..." Ack. Will fix. Thanks, Jason On Thu, Aug 27, 2020 at 2:46 PM Boyang Chen <reluctanthero...@gmail.com> wrote: > Thanks Jason for the tooling proposal. A couple of comments: > > 1. For the analysis section, is there any consistency guarantee for > `ListTransactions` and `DescribeTransactions`? Let's say the coordinator > receives a DescribeTransactions while the transaction is almost complete at > the same time, should we have isolation to avoid returning stale > information? If an admin client sends excessive describe requests, would it > affect the normal processing on transactions? > > 2. I'm not sure whether `PartitionsWithLateTransactionsCount` is providing > much value here. Users have no responsibilities to tight their transaction > session size below the max transaction timeout. As long as there is > prolonged progress being made from the client side, the root cause would be > on some ill-performing producers, which means we should monitor the > producer client instead. > > 3. It would be good to highlight new fields in the `WriteTxnMarkersRequest` > schema > > 4. The response schema in `ListTransactions` section was wrong, which > should be `ListTransactionsResponse` > > 5. From the original context of ticket 9144, the reasoning for the hanging > transaction is due to an uncensored open transaction on the partition > leader. Could we just add the direct admin request support like > `findAllSuspiciousTransaction` to detect that, by scanning all partition > leaders and transaction coordinators within the cluster, and figure out any > open transaction on the partition leader side not known to any coordinator? > > 6. typo: "In this case, we the command..." => "In this case, the > command..." > > Boyang > > On Thu, Aug 27, 2020 at 11:44 AM Lucas Bradstreet <lu...@confluent.io> > wrote: > > > >> Would it be worth returning transactional.id.expiration.ms in the > > DescribeProducersResponse? > > > > > That's an interesting thought as well. Are you trying to avoid the need > > to > > specify it through the command line? The tool could also query the value > > with DescribeConfigs I suppose. > > > > Basically. I'm not sure how useful this will be in practice, though it > > might help when debugging. > > > > Lucas > > > > On Thu, Aug 27, 2020 at 11:00 AM Jason Gustafson <ja...@confluent.io> > > wrote: > > > > > Hey Lucas, > > > > > > Thanks for the comments. Responses below: > > > > > > > Given that it's possible for replica producer states to diverge from > > each > > > other, it would be very useful if DescribeProducers(Request,Response) > and > > > tooling is able to query all partition replicas for their producers > > > > > > Yes, it makes sense to me to let DescribeProducers work on both > followers > > > and leaders. In fact, I'm encouraged that there are use cases for this > > work > > > other than detecting hanging transactions. That was indeed the hope, > but > > I > > > didn't have anything specific in mind. I will update the proposal. > > > > > > > Would it be worth returning transactional.id.expiration.ms in the > > > DescribeProducersResponse? > > > > > > That's an interesting thought as well. Are you trying to avoid the need > > to > > > specify it through the command line? The tool could also query the > value > > > with DescribeConfigs I suppose. > > > > > > Thanks, > > > Jason > > > > > > On Thu, Aug 27, 2020 at 10:48 AM Lucas Bradstreet <lu...@confluent.io> > > > wrote: > > > > > > > Hi Jason, > > > > > > > > This looks like a very useful tool, thanks for writing it up. > > > > > > > > Given that it's possible for replica producer states to diverge from > > each > > > > other, it would be very useful if DescribeProducers(Request,Response) > > and > > > > tooling is able to query all partition replicas for their producers. > > One > > > > way I can see this being used immediately is in kafka's system tests, > > > > especially the ones that inject failures. At the end of the test we > can > > > > query all replicas and make sure that their states have not > diverged. I > > > can > > > > also see it being useful when debugging production clusters too. > > > > > > > > Would it be worth returning transactional.id.expiration.ms in the > > > > DescribeProducersResponse? > > > > > > > > Cheers, > > > > > > > > Lucas > > > > > > > > > > > > > > > > On Wed, Aug 26, 2020 at 12:12 PM Ron Dagostino <rndg...@gmail.com> > > > wrote: > > > > > > > > > Yes, that definitely sounds reasonable. Thanks, Jason! > > > > > > > > > > Ron > > > > > > > > > > On Wed, Aug 26, 2020 at 3:03 PM Jason Gustafson < > ja...@confluent.io> > > > > > wrote: > > > > > > > > > > > Hey Ron, > > > > > > > > > > > > We do not typically backport new APIs to older versions. I think > we > > > can > > > > > > however make the --abort command compatible with older versions. > It > > > > would > > > > > > require a user to do some analysis on their own to identify a > > hanging > > > > > > transaction, but then they can use the tool from a new release to > > > > > recover. > > > > > > For example, users could detect a hanging transaction through the > > > > > existing > > > > > > "LastStableOffsetLag" metric and then collect the needed > > information > > > > > from a > > > > > > dump of the log (or producer snapshot). It's more work, but at > > least > > > > it's > > > > > > possible. Does that sound fair? > > > > > > > > > > > > Thanks, > > > > > > Jason > > > > > > > > > > > > On Wed, Aug 26, 2020 at 11:51 AM Ron Dagostino < > rndg...@gmail.com> > > > > > wrote: > > > > > > > > > > > > > Hi Jason. Thanks for the excellently-written KIP. > > > > > > > > > > > > > > Will the implementation be backported to prior Kafka versions? > > The > > > > > > reason > > > > > > > I ask is because if it is not backported and similar > > functionality > > > is > > > > > not > > > > > > > otherwise made available for older versions, then the only > > recourse > > > > > > (aside > > > > > > > from deleting and recreating the topic as you pointed out) may > be > > > to > > > > > > > upgrade to 2.7 (or whatever version ends up getting this > > > > > functionality). > > > > > > > Such an upgrade may not be desirable, especially if the number > of > > > > > > > intermediate versions is considerable. I understand the mantra > of > > > > > "never > > > > > > > fall too many versions behind" but the reality of it is that it > > > isn't > > > > > > > always the case. Even if the version is relatively recent, an > > > > upgrade > > > > > > may > > > > > > > still not be possible for some time, and a quicker resolution > may > > > be > > > > > > > necessary. > > > > > > > > > > > > > > Ron > > > > > > > > > > > > > > On Wed, Aug 26, 2020 at 2:33 PM Jason Gustafson < > > > ja...@confluent.io> > > > > > > > wrote: > > > > > > > > > > > > > > > Hi All, > > > > > > > > > > > > > > > > I've added a proposal to handle the problem of hanging > > > > transactions: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-664%3A+Provide+tooling+to+detect+and+abort+hanging+transactions > > > > > > > > . > > > > > > > > In theory, this should never happen. In practice, we have hit > > one > > > > bug > > > > > > > where > > > > > > > > it was possible and there are few good options today to > > recover. > > > > > Take a > > > > > > > > look and let me know what you think. > > > > > > > > > > > > > > > > Thanks, > > > > > > > > Jason > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >