Thanks for the updates Jason. I'm pretty satisfied with the overall motivation and proposed solution, just a couple of more comments.
1. Why do we need to use type string for `StatesFilter` instead of a short value, as we could translate it and save space? 2. I'm wondering whether the requirement for Describe permission on TransactionalId works when we are heading towards https://issues.apache.org/jira/browse/KAFKA-9454, where we could rely on consumer group id instead of defining the transactional id. At a first look, I think it should be ok but just want to raise this point. 3. Could the --find-hanging work with checking all brokers in the cluster, or multiple brokers as a list? 4. Similar to transaction abortion, I guess there is a trade-off for too-specific vs too-general for the required number of arguments. However, supposedly I would like to wipe out all the associated transactions with the given transactional id, or I want to clean up *all *hanging transactions in the cluster, do I need to write the script on my own? Maybe we could discuss a bit on whether we would like to support a more holistic API, or this is good for now. On Thu, Sep 10, 2020 at 7:53 AM Tom Bentley <tbent...@redhat.com> wrote: > Sounds good to me, thanks! > > On Wed, Sep 9, 2020 at 5:30 PM Jason Gustafson <ja...@confluent.io> wrote: > > > Hey Tom, > > > > Yeah, that's fair. I will update the proposal. I was also thinking of > > adding a separate column for duration, just to save users the trouble of > > computing it. > > > > Thanks, > > Jason > > > > On Wed, Sep 9, 2020 at 1:21 AM Tom Bentley <tbent...@redhat.com> wrote: > > > > > Hi Jason, > > > > > > The KIP looks good to me, but I had one question. AFAIU the > LastTimestamp > > > column in the output of --describe-producers and --find-hanging is > there > > so > > > the users of the tool know the txnLastUpdateTimestamp of the > > > TransactionMetadata and from that and the (max) timeout can infer > > something > > > about the likelihood that this really is a stuck transaction. If that's > > the > > > case then what is the benefit in displaying it as a ms offset from the > > unix > > > epoch, rather than an actual date time? > > > > > > Thanks, > > > > > > Tom > > > > > > On Mon, Aug 31, 2020 at 11:28 PM Guozhang Wang <wangg...@gmail.com> > > wrote: > > > > > > > Thanks Jason, I do not have more comments on the KIP then. > > > > > > > > On Mon, Aug 31, 2020 at 3:19 PM Jason Gustafson <ja...@confluent.io> > > > > wrote: > > > > > > > > > > Hmm, but the "TxnStartOffset" is not included in the > > > DescribeProducers > > > > > response either? > > > > > > > > > > Oh, I accidentally called it `CurrentTxnStartTimestamp` in the > > schema. > > > > > Fixed now! > > > > > > > > > > -Jason > > > > > > > > > > On Mon, Aug 31, 2020 at 3:04 PM Guozhang Wang <wangg...@gmail.com> > > > > wrote: > > > > > > > > > > > On Mon, Aug 31, 2020 at 12:28 PM Jason Gustafson < > > ja...@confluent.io > > > > > > > > > > wrote: > > > > > > > > > > > > > Hey Guozhang, > > > > > > > > > > > > > > Thanks for the detailed comments. Responses inline: > > > > > > > > > > > > > > > 1. I'd like to clarify how we can make "--abort" work with > old > > > > > brokers, > > > > > > > since without the additional field "Partitions" the tool needs > to > > > set > > > > > the > > > > > > > coordinator epoch correctly instead of "-1"? Arguably that's > > still > > > > > doable > > > > > > > but would require different call paths, and it's not clear > > whether > > > > > that's > > > > > > > worth doing for old versions. > > > > > > > > > > > > > > That's a good question. What I had in mind was to write the > > marker > > > > > using > > > > > > > the last coordinator epoch that was used by the respective > > > > ProducerId. > > > > > I > > > > > > > realized that I left the coordinator epoch out of the > > > > > `DescribeProducers` > > > > > > > response, so I have updated the KIP to include it. It is > possible > > > > that > > > > > > > there is no coordinator epoch associated with a given > ProducerId > > > > (e.g. > > > > > if > > > > > > > it is the first transaction from that producer), but in this > case > > > we > > > > > can > > > > > > > use 0. > > > > > > > > > > > > > > As for whether this is worth doing, I guess I would be more > > > inclined > > > > to > > > > > > > leave it out if users had a reasonable alternative today to > > address > > > > > this > > > > > > > problem. > > > > > > > > > > > > > > > 2. Why do we have to enforce "DescribeProducers" to be sent > to > > > only > > > > > > > leaders > > > > > > > while ListTransactions can be sent to any brokers? Or is it > > really > > > > > > > "ListTransactions to be sent to coordinators only"? From the > > > workflow > > > > > > > you've described, based on the results back from > > DescribeProducers, > > > > we > > > > > > > should just immediately send ListTransactions to the > > > > > > > corresponding coordinators based on the collected producer ids, > > > > instead > > > > > > of > > > > > > > trying to send to any brokers right? > > > > > > > > > > > > > > I'm going to change `DescribeProducers` so that it can be > handled > > > by > > > > > any > > > > > > > replica of a topic partition. This was suggested by Lucas in > > order > > > to > > > > > > allow > > > > > > > this API to be used for replica consistency testing. As far as > > > > > > > `ListTransactions`, I was treating this similarly to > > `ListGroups`. > > > > > > Although > > > > > > > we know that the coordinators are the leaders of the > > > > > __transaction_state > > > > > > > partitions, this is more of an implementation detail. From an > API > > > > > > > perspective, we say that any broker could be a transaction > > > > coordinator. > > > > > > > > > > > > > > > 3. One thing I'm a bit hesitant about is that, is `Describe` > > > > > permission > > > > > > > on > > > > > > > the associated topic sufficient to allow any users to get all > > > > producer > > > > > > > information writing to the specific topic-partitions including > > last > > > > > > > timestamp, txn-start-timestamp etc, which may be considered > > > > sensitive? > > > > > > > Should we require "ClusterAction" to only allow operators only? > > > > > > > > > > > > > > That's a fair point. Do you think `Read` permission would be > > > > > reasonable? > > > > > > > This is all information that could be obtained by reading the > > > topic. > > > > > > > > > > > > > > Yeah that makes sense. > > > > > > > > > > > > > > > > > > > > 4. From the example it seems "TxnStartOffset" should be > > included > > > in > > > > > the > > > > > > > DescribeTransaction response schema? Otherwise the user would > not > > > get > > > > > it > > > > > > in > > > > > > > the following WriteTxnMarker request. > > > > > > > > > > > > > > The `DescribeTransaction` API is sent to the transaction > > > coordinator, > > > > > > which > > > > > > > does not know the start offset of a transaction in each topic > > > > > partition. > > > > > > > That is why we need `DescribeProducers`. > > > > > > > > > > > > > > > > > > > Hmm, but the "TxnStartOffset" is not included in the > > > DescribeProducers > > > > > > response either? > > > > > > > > > > > > > > > > > > > > > > > > > > > 5. It is a bit easier for readers to highlight the added > fields > > > in > > > > > the > > > > > > > existing WriteTxnMarkerRequest (btw I read is that we are only > > > adding > > > > > > > "Partitions" with the starting offset, right?). Also as for its > > > > > response > > > > > > it > > > > > > > seems we do not make any schema changes except adding one more > > > > > potential > > > > > > > error code "INVALID_TXN_STATE" to it, right? If that's the case > > we > > > > can > > > > > > just > > > > > > > state that explicitly. > > > > > > > > > > > > > > I highlighted the new field in the request. For the response, > the > > > KIP > > > > > > > states the following: "There are no changes to the response > > schema, > > > > but > > > > > > it > > > > > > > will be bumped. Note that we are also enabling flexible version > > > > > support." > > > > > > > > > > > > > > > 6. It is not clear to me for the overloaded function that the > > > > > following > > > > > > > option classes are not specified, what should be the default > > > options? > > > > > > > ... > > > > > > > > > > > > > > I was just trying to stick with existing conventions, but I > will > > > add > > > > > some > > > > > > > more detail here. I think we should probably still include > > > > > > > `AbortTransactionOptions`. The `Options` classes are how users > > > > override > > > > > > > timeouts. > > > > > > > > > > > > > > > 7.1 Is "--broker" a required or optional (in that case I > > presume > > > we > > > > > > would > > > > > > > just query all brokers iteratively) in "--find-hanging"? > > > > > > > > > > > > > > I think it should be required as a reasonable way to limit the > > > scope > > > > of > > > > > > the > > > > > > > search. This is meant to be guided by metrics after all. If we > do > > > not > > > > > > limit > > > > > > > the scope to a single broker, then the behavior might get worse > > as > > > > the > > > > > > > cluster grows. I will clarify this. > > > > > > > > > > > > > > > 7.2 Seems "list-producers" is not exposed as a standalone > > feature > > > > in > > > > > > the > > > > > > > cmd but only used in the wrapping "--find-hanging", is that > > > > > intentional? > > > > > > > Personally I feel exposing a "--list-producers" may be useful > > too: > > > if > > > > > we > > > > > > > believe the user has the right ACL, it is legitimate to return > > the > > > > > > producer > > > > > > > information to her anyways. But that is debatable in the meta > > point > > > > 3) > > > > > > > above. > > > > > > > > > > > > > > Yeah, I was planning to add this to support the use case that > > Lucas > > > > > > > mentioned. There is some awkwardness since it is a little > > difficult > > > > to > > > > > > > convey different sources of information through the same > > command. I > > > > > guess > > > > > > > we can do `--list producers` and `--list transactions` and > > explain > > > in > > > > > the > > > > > > > documentation. Maybe that is good enough. > > > > > > > > > > > > > > > 7.3 "Describing Transactions": we should also explain how > that > > > > would > > > > > be > > > > > > > executed, e.g. at least we should clarify that we would first > > find > > > > the > > > > > > > coordinator based on the transactional.id and hence users do > not > > > > need > > > > > to > > > > > > > specify one. > > > > > > > > > > > > > > Sure, makes sense. > > > > > > > > > > > > > > > 7.4. In "Aborting Transactions", should we also specify the > > > > > "--broker" > > > > > > > node > > > > > > > as a required option? Otherwise we would not know which broker > to > > > > send > > > > > > to. > > > > > > > > > > > > > > The --topic and --partition arguments are required, so the > target > > > is > > > > > > always > > > > > > > the leader of that partition. > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > Jason > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Fri, Aug 28, 2020 at 8:13 AM Robert Barrett < > > > > > bob.barr...@confluent.io > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > Hi Jason, > > > > > > > > > > > > > > > > Thanks for this KIP, I think this will be a huge operational > > > > > > improvement > > > > > > > > and overall it looks great to me. > > > > > > > > > > > > > > > > I'm not sure how much value the MaxActiveTransactionDuration > > > metric > > > > > > adds, > > > > > > > > given that we have the --find-hanging option in the tool. As > > you > > > > > > mention, > > > > > > > > instances of these transactions are expected to be rare, and > a > > > > > > > > partition-level metric, which can generate a lot of data, > seems > > > > very > > > > > > > > heavyweight for such a rare occurrence. I think "alert on > > > > > > > > PartitionsWithLateTransactionsCount" followed by "run > > > > > > kafka-transactions > > > > > > > > --find-hanging on the relevant broker" is a reasonable > process > > > for > > > > > > > cluster > > > > > > > > operators to follow. > > > > > > > > > > > > > > > > Thanks, > > > > > > > > Bob > > > > > > > > > > > > > > > > On Thu, Aug 27, 2020 at 9:23 PM Guozhang Wang < > > > wangg...@gmail.com> > > > > > > > wrote: > > > > > > > > > > > > > > > > > Hi Jason, > > > > > > > > > > > > > > > > > > Thanks for the written KIP. I think this is going to be a > > very > > > > > useful > > > > > > > > tool > > > > > > > > > for operational improvements since with eos in its current > > > stage, > > > > > we > > > > > > > > cannot > > > > > > > > > confidently assert that we are bug-free, and even in the > > future > > > > > when > > > > > > we > > > > > > > > are > > > > > > > > > confident this is still going to be leveraged by older > > > versioned > > > > > > > brokers. > > > > > > > > > Regarding the solution, I've also debated myself whether > > Kafka > > > > > should > > > > > > > > > "self-heal" automatically when detected in such situations, > > or > > > > > should > > > > > > > we > > > > > > > > > instead build into ecosystem tooling to let operators do > it. > > > And > > > > > I've > > > > > > > > also > > > > > > > > > convinced myself that the latter should be a better > solution > > to > > > > > keep > > > > > > > > Kafka > > > > > > > > > software itself simpler. > > > > > > > > > > > > > > > > > > Regarding the KIP itself, I have a few meta comments below: > > > > > > > > > > > > > > > > > > 1. I'd like to clarify how we can make "--abort" work with > > old > > > > > > brokers, > > > > > > > > > since without the additional field "Partitions" the tool > > needs > > > to > > > > > set > > > > > > > the > > > > > > > > > coordinator epoch correctly instead of "-1"? Arguably > that's > > > > still > > > > > > > doable > > > > > > > > > but would require different call paths, and it's not clear > > > > whether > > > > > > > that's > > > > > > > > > worth doing for old versions. > > > > > > > > > > > > > > > > > > 2. Why do we have to enforce "DescribeProducers" to be sent > > to > > > > only > > > > > > > > leaders > > > > > > > > > while ListTransactions can be sent to any brokers? Or is it > > > > really > > > > > > > > > "ListTransactions to be sent to coordinators only"? From > the > > > > > workflow > > > > > > > > > you've described, based on the results back from > > > > DescribeProducers, > > > > > > we > > > > > > > > > should just immediately send ListTransactions to the > > > > > > > > > corresponding coordinators based on the collected producer > > ids, > > > > > > instead > > > > > > > > of > > > > > > > > > trying to send to any brokers right? > > > > > > > > > > > > > > > > > > Also I'm a bit concerned if "ListTransactions" could > > > potentially > > > > > > return > > > > > > > > too > > > > > > > > > much data with "StateFilters" set to all states, including > > > > > completed > > > > > > > > ones. > > > > > > > > > Do we expect users ever want to know transactions that are > > not > > > > > > pending? > > > > > > > > On > > > > > > > > > the other hand, maybe we can just require users to specify > > the > > > > > > "pids[]" > > > > > > > > in > > > > > > > > > this request too to further filter those un-interested > > > > > transactions. > > > > > > > This > > > > > > > > > also works well with the workflow: we know exactly from > > > > > > > > "DescribeProducers" > > > > > > > > > which pids are we diagnosing right now, so in the follow-up > > > > > > > > > "ListTransactions" we should also only care for those > > > partitions > > > > > > only. > > > > > > > > > > > > > > > > > > 3. One thing I'm a bit hesitant about is that, is > `Describe` > > > > > > permission > > > > > > > > on > > > > > > > > > the associated topic sufficient to allow any users to get > all > > > > > > producer > > > > > > > > > information writing to the specific topic-partitions > > including > > > > last > > > > > > > > > timestamp, txn-start-timestamp etc, which may be considered > > > > > > sensitive? > > > > > > > > > Should we require "ClusterAction" to only allow operators > > only? > > > > > > > > > > > > > > > > > > Below are more detailed comments: > > > > > > > > > > > > > > > > > > 4. From the example it seems "TxnStartOffset" should be > > > included > > > > in > > > > > > the > > > > > > > > > DescribeTransaction response schema? Otherwise the user > would > > > not > > > > > get > > > > > > > it > > > > > > > > in > > > > > > > > > the following WriteTxnMarker request. > > > > > > > > > > > > > > > > > > 5. It is a bit easier for readers to highlight the added > > fields > > > > in > > > > > > the > > > > > > > > > existing WriteTxnMarkerRequest (btw I read is that we are > > only > > > > > adding > > > > > > > > > "Partitions" with the starting offset, right?). Also as for > > its > > > > > > > response > > > > > > > > it > > > > > > > > > seems we do not make any schema changes except adding one > > more > > > > > > > potential > > > > > > > > > error code "INVALID_TXN_STATE" to it, right? If that's the > > case > > > > we > > > > > > can > > > > > > > > just > > > > > > > > > state that explicitly. > > > > > > > > > > > > > > > > > > 6. It is not clear to me for the overloaded function that > the > > > > > > following > > > > > > > > > option classes are not specified, what should be the > default > > > > > options? > > > > > > > > > > > > > > > > > > * ListTransactionsOptions > > > > > > > > > * DescribeTransactionsOptions > > > > > > > > > * DescribeProducersOptions > > > > > > > > > > > > > > > > > > Also, it seems AbortTransactionOptions would just be empty? > > If > > > > yes > > > > > do > > > > > > > we > > > > > > > > > really need this option class for now? > > > > > > > > > > > > > > > > > > 7. A couple questions from the cmd tool examples: > > > > > > > > > 7.1 Is "--broker" a required or optional (in that case I > > > presume > > > > we > > > > > > > would > > > > > > > > > just query all brokers iteratively) in "--find-hanging"? > > > > > > > > > 7.2 Seems "list-producers" is not exposed as a standalone > > > feature > > > > > in > > > > > > > the > > > > > > > > > cmd but only used in the wrapping "--find-hanging", is that > > > > > > > intentional? > > > > > > > > > Personally I feel exposing a "--list-producers" may be > useful > > > > too: > > > > > if > > > > > > > we > > > > > > > > > believe the user has the right ACL, it is legitimate to > > return > > > > the > > > > > > > > producer > > > > > > > > > information to her anyways. But that is debatable in the > meta > > > > point > > > > > > 3) > > > > > > > > > above. > > > > > > > > > 7.3 "Describing Transactions": we should also explain how > > that > > > > > would > > > > > > be > > > > > > > > > executed, e.g. at least we should clarify that we would > first > > > > find > > > > > > the > > > > > > > > > coordinator based on the transactional.id and hence users > do > > > not > > > > > > need > > > > > > > to > > > > > > > > > specify one. > > > > > > > > > 7.4. In "Aborting Transactions", should we also specify the > > > > > > "--broker" > > > > > > > > node > > > > > > > > > as a required option? Otherwise we would not know which > > broker > > > to > > > > > > send > > > > > > > > to. > > > > > > > > > > > > > > > > > > > > > > > > > > > Overall, nice written one, thanks Jason. > > > > > > > > > > > > > > > > > > Guozhang > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Aug 27, 2020 at 11:44 AM Lucas Bradstreet < > > > > > > lu...@confluent.io> > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > >> Would it be worth returning > > > transactional.id.expiration.ms > > > > in > > > > > > the > > > > > > > > > > DescribeProducersResponse? > > > > > > > > > > > > > > > > > > > > > That's an interesting thought as well. Are you trying > to > > > > avoid > > > > > > the > > > > > > > > need > > > > > > > > > > to > > > > > > > > > > specify it through the command line? The tool could also > > > query > > > > > the > > > > > > > > value > > > > > > > > > > with DescribeConfigs I suppose. > > > > > > > > > > > > > > > > > > > > Basically. I'm not sure how useful this will be in > > practice, > > > > > though > > > > > > > it > > > > > > > > > > might help when debugging. > > > > > > > > > > > > > > > > > > > > Lucas > > > > > > > > > > > > > > > > > > > > On Thu, Aug 27, 2020 at 11:00 AM Jason Gustafson < > > > > > > ja...@confluent.io > > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > Hey Lucas, > > > > > > > > > > > > > > > > > > > > > > Thanks for the comments. Responses below: > > > > > > > > > > > > > > > > > > > > > > > Given that it's possible for replica producer states > to > > > > > diverge > > > > > > > > from > > > > > > > > > > each > > > > > > > > > > > other, it would be very useful if > > > > > > > DescribeProducers(Request,Response) > > > > > > > > > and > > > > > > > > > > > tooling is able to query all partition replicas for > their > > > > > > producers > > > > > > > > > > > > > > > > > > > > > > Yes, it makes sense to me to let DescribeProducers work > > on > > > > both > > > > > > > > > followers > > > > > > > > > > > and leaders. In fact, I'm encouraged that there are use > > > cases > > > > > for > > > > > > > > this > > > > > > > > > > work > > > > > > > > > > > other than detecting hanging transactions. That was > > indeed > > > > the > > > > > > > hope, > > > > > > > > > but > > > > > > > > > > I > > > > > > > > > > > didn't have anything specific in mind. I will update > the > > > > > > proposal. > > > > > > > > > > > > > > > > > > > > > > > Would it be worth returning > > > transactional.id.expiration.ms > > > > > in > > > > > > > the > > > > > > > > > > > DescribeProducersResponse? > > > > > > > > > > > > > > > > > > > > > > That's an interesting thought as well. Are you trying > to > > > > avoid > > > > > > the > > > > > > > > need > > > > > > > > > > to > > > > > > > > > > > specify it through the command line? The tool could > also > > > > query > > > > > > the > > > > > > > > > value > > > > > > > > > > > with DescribeConfigs I suppose. > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > Jason > > > > > > > > > > > > > > > > > > > > > > On Thu, Aug 27, 2020 at 10:48 AM Lucas Bradstreet < > > > > > > > > lu...@confluent.io> > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > Hi Jason, > > > > > > > > > > > > > > > > > > > > > > > > This looks like a very useful tool, thanks for > writing > > it > > > > up. > > > > > > > > > > > > > > > > > > > > > > > > Given that it's possible for replica producer states > to > > > > > diverge > > > > > > > > from > > > > > > > > > > each > > > > > > > > > > > > other, it would be very useful if > > > > > > > > DescribeProducers(Request,Response) > > > > > > > > > > and > > > > > > > > > > > > tooling is able to query all partition replicas for > > their > > > > > > > > producers. > > > > > > > > > > One > > > > > > > > > > > > way I can see this being used immediately is in > kafka's > > > > > system > > > > > > > > tests, > > > > > > > > > > > > especially the ones that inject failures. At the end > of > > > the > > > > > > test > > > > > > > we > > > > > > > > > can > > > > > > > > > > > > query all replicas and make sure that their states > have > > > not > > > > > > > > > diverged. I > > > > > > > > > > > can > > > > > > > > > > > > also see it being useful when debugging production > > > clusters > > > > > > too. > > > > > > > > > > > > > > > > > > > > > > > > Would it be worth returning > > > transactional.id.expiration.ms > > > > > in > > > > > > > the > > > > > > > > > > > > DescribeProducersResponse? > > > > > > > > > > > > > > > > > > > > > > > > Cheers, > > > > > > > > > > > > > > > > > > > > > > > > Lucas > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Aug 26, 2020 at 12:12 PM Ron Dagostino < > > > > > > > rndg...@gmail.com> > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > Yes, that definitely sounds reasonable. Thanks, > > Jason! > > > > > > > > > > > > > > > > > > > > > > > > > > Ron > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Aug 26, 2020 at 3:03 PM Jason Gustafson < > > > > > > > > > ja...@confluent.io> > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > Hey Ron, > > > > > > > > > > > > > > > > > > > > > > > > > > > > We do not typically backport new APIs to older > > > > versions. > > > > > I > > > > > > > > think > > > > > > > > > we > > > > > > > > > > > can > > > > > > > > > > > > > > however make the --abort command compatible with > > > older > > > > > > > > versions. > > > > > > > > > It > > > > > > > > > > > > would > > > > > > > > > > > > > > require a user to do some analysis on their own > to > > > > > > identify a > > > > > > > > > > hanging > > > > > > > > > > > > > > transaction, but then they can use the tool from > a > > > new > > > > > > > release > > > > > > > > to > > > > > > > > > > > > > recover. > > > > > > > > > > > > > > For example, users could detect a hanging > > transaction > > > > > > through > > > > > > > > the > > > > > > > > > > > > > existing > > > > > > > > > > > > > > "LastStableOffsetLag" metric and then collect the > > > > needed > > > > > > > > > > information > > > > > > > > > > > > > from a > > > > > > > > > > > > > > dump of the log (or producer snapshot). It's more > > > work, > > > > > but > > > > > > > at > > > > > > > > > > least > > > > > > > > > > > > it's > > > > > > > > > > > > > > possible. Does that sound fair? > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > Jason > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Aug 26, 2020 at 11:51 AM Ron Dagostino < > > > > > > > > > rndg...@gmail.com> > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hi Jason. Thanks for the excellently-written > > KIP. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Will the implementation be backported to prior > > > Kafka > > > > > > > > versions? > > > > > > > > > > The > > > > > > > > > > > > > > reason > > > > > > > > > > > > > > > I ask is because if it is not backported and > > > similar > > > > > > > > > > functionality > > > > > > > > > > > is > > > > > > > > > > > > > not > > > > > > > > > > > > > > > otherwise made available for older versions, > then > > > the > > > > > > only > > > > > > > > > > recourse > > > > > > > > > > > > > > (aside > > > > > > > > > > > > > > > from deleting and recreating the topic as you > > > pointed > > > > > > out) > > > > > > > > may > > > > > > > > > be > > > > > > > > > > > to > > > > > > > > > > > > > > > upgrade to 2.7 (or whatever version ends up > > getting > > > > > this > > > > > > > > > > > > > functionality). > > > > > > > > > > > > > > > Such an upgrade may not be desirable, > especially > > if > > > > the > > > > > > > > number > > > > > > > > > of > > > > > > > > > > > > > > > intermediate versions is considerable. I > > understand > > > > the > > > > > > > > mantra > > > > > > > > > of > > > > > > > > > > > > > "never > > > > > > > > > > > > > > > fall too many versions behind" but the reality > of > > > it > > > > is > > > > > > > that > > > > > > > > it > > > > > > > > > > > isn't > > > > > > > > > > > > > > > always the case. Even if the version is > > relatively > > > > > > recent, > > > > > > > > an > > > > > > > > > > > > upgrade > > > > > > > > > > > > > > may > > > > > > > > > > > > > > > still not be possible for some time, and a > > quicker > > > > > > > resolution > > > > > > > > > may > > > > > > > > > > > be > > > > > > > > > > > > > > > necessary. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Ron > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Aug 26, 2020 at 2:33 PM Jason > Gustafson < > > > > > > > > > > > ja...@confluent.io> > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hi All, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I've added a proposal to handle the problem > of > > > > > hanging > > > > > > > > > > > > transactions: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-664%3A+Provide+tooling+to+detect+and+abort+hanging+transactions > > > > > > > > > > > > > > > > . > > > > > > > > > > > > > > > > In theory, this should never happen. In > > practice, > > > > we > > > > > > have > > > > > > > > hit > > > > > > > > > > one > > > > > > > > > > > > bug > > > > > > > > > > > > > > > where > > > > > > > > > > > > > > > > it was possible and there are few good > options > > > > today > > > > > to > > > > > > > > > > recover. > > > > > > > > > > > > > Take a > > > > > > > > > > > > > > > > look and let me know what you think. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > > > Jason > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > -- Guozhang > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > -- Guozhang > > > > > > > > > > > > > > > > > > > > > > > -- > > > > -- Guozhang > > > > > > > > > >