Hi Jason,

Thanks for the written KIP. I think this is going to be a very useful tool
for operational improvements since with eos in its current stage, we cannot
confidently assert that we are bug-free, and even in the future when we are
confident this is still going to be leveraged by older versioned brokers.
Regarding the solution, I've also debated myself whether Kafka should
"self-heal" automatically when detected in such situations, or should we
instead build into ecosystem tooling to let operators do it. And I've also
convinced myself that the latter should be a better solution to keep Kafka
software itself simpler.

Regarding the KIP itself, I have a few meta comments below:

1. I'd like to clarify how we can make "--abort" work with old brokers,
since without the additional field "Partitions" the tool needs to set the
coordinator epoch correctly instead of "-1"? Arguably that's still doable
but would require different call paths, and it's not clear whether that's
worth doing for old versions.

2. Why do we have to enforce "DescribeProducers" to be sent to only leaders
while ListTransactions can be sent to any brokers? Or is it really
"ListTransactions to be sent to coordinators only"? From the workflow
you've described, based on the results back from DescribeProducers, we
should just immediately send ListTransactions to the
corresponding coordinators based on the collected producer ids, instead of
trying to send to any brokers right?

Also I'm a bit concerned if "ListTransactions" could potentially return too
much data with "StateFilters" set to all states, including completed ones.
Do we expect users ever want to know transactions that are not pending? On
the other hand, maybe we can just require users to specify the "pids[]" in
this request too to further filter those un-interested transactions. This
also works well with the workflow: we know exactly from "DescribeProducers"
which pids are we diagnosing right now, so in the follow-up
"ListTransactions" we should also only care for those partitions only.

3. One thing I'm a bit hesitant about is that, is `Describe` permission on
the associated topic sufficient to allow any users to get all producer
information writing to the specific topic-partitions including last
timestamp, txn-start-timestamp etc, which may be considered sensitive?
Should we require "ClusterAction" to only allow operators only?

Below are more detailed comments:

4. From the example it seems "TxnStartOffset" should be included in the
DescribeTransaction response schema? Otherwise the user would not get it in
the following WriteTxnMarker request.

5. It is a bit easier for readers to highlight the added fields in the
existing WriteTxnMarkerRequest (btw I read is that we are only adding
"Partitions" with the starting offset, right?). Also as for its response it
seems we do not make any schema changes except adding one more potential
error code "INVALID_TXN_STATE" to it, right? If that's the case we can just
state that explicitly.

6. It is not clear to me for the overloaded function that the following
option classes are not specified, what should be the default options?

* ListTransactionsOptions
* DescribeTransactionsOptions
* DescribeProducersOptions

Also, it seems AbortTransactionOptions would just be empty? If yes do we
really need this option class for now?

7. A couple questions from the cmd tool examples:
7.1 Is "--broker" a required or optional (in that case I presume we would
just query all brokers iteratively) in "--find-hanging"?
7.2 Seems "list-producers" is not exposed as a standalone feature in the
cmd but only used in the wrapping "--find-hanging", is that intentional?
Personally I feel exposing a "--list-producers" may be useful too: if we
believe the user has the right ACL, it is legitimate to return the producer
information to her anyways. But that is debatable in the meta point 3)
above.
7.3 "Describing Transactions": we should also explain how that would be
executed, e.g. at least we should clarify that we would first find the
coordinator based on the transactional.id and hence users do not need to
specify one.
7.4. In "Aborting Transactions", should we also specify the "--broker" node
as a required option? Otherwise we would not know which broker to send to.


Overall, nice written one, thanks Jason.

Guozhang


On Thu, Aug 27, 2020 at 11:44 AM Lucas Bradstreet <lu...@confluent.io>
wrote:

> >> Would it be worth returning transactional.id.expiration.ms in the
> DescribeProducersResponse?
>
> > That's an interesting thought as well. Are you trying to avoid the need
> to
> specify it through the command line? The tool could also query the value
> with DescribeConfigs I suppose.
>
> Basically. I'm not sure how useful this will be in practice, though it
> might help when debugging.
>
> Lucas
>
> On Thu, Aug 27, 2020 at 11:00 AM Jason Gustafson <ja...@confluent.io>
> wrote:
>
> > Hey Lucas,
> >
> > Thanks for the comments. Responses below:
> >
> > > Given that it's possible for replica producer states to diverge from
> each
> > other, it would be very useful if DescribeProducers(Request,Response) and
> > tooling is able to query all partition replicas for their producers
> >
> > Yes, it makes sense to me to let DescribeProducers work on both followers
> > and leaders. In fact, I'm encouraged that there are use cases for this
> work
> > other than detecting hanging transactions. That was indeed the hope, but
> I
> > didn't have anything specific in mind. I will update the proposal.
> >
> > > Would it be worth returning transactional.id.expiration.ms in the
> > DescribeProducersResponse?
> >
> > That's an interesting thought as well. Are you trying to avoid the need
> to
> > specify it through the command line? The tool could also query the value
> > with DescribeConfigs I suppose.
> >
> > Thanks,
> > Jason
> >
> > On Thu, Aug 27, 2020 at 10:48 AM Lucas Bradstreet <lu...@confluent.io>
> > wrote:
> >
> > > Hi Jason,
> > >
> > > This looks like a very useful tool, thanks for writing it up.
> > >
> > > Given that it's possible for replica producer states to diverge from
> each
> > > other, it would be very useful if DescribeProducers(Request,Response)
> and
> > > tooling is able to query all partition replicas for their producers.
> One
> > > way I can see this being used immediately is in kafka's system tests,
> > > especially the ones that inject failures. At the end of the test we can
> > > query all replicas and make sure that their states have not diverged. I
> > can
> > > also see it being useful when debugging production clusters too.
> > >
> > > Would it be worth returning transactional.id.expiration.ms in the
> > > DescribeProducersResponse?
> > >
> > > Cheers,
> > >
> > > Lucas
> > >
> > >
> > >
> > > On Wed, Aug 26, 2020 at 12:12 PM Ron Dagostino <rndg...@gmail.com>
> > wrote:
> > >
> > > > Yes, that definitely sounds reasonable.  Thanks, Jason!
> > > >
> > > > Ron
> > > >
> > > > On Wed, Aug 26, 2020 at 3:03 PM Jason Gustafson <ja...@confluent.io>
> > > > wrote:
> > > >
> > > > > Hey Ron,
> > > > >
> > > > > We do not typically backport new APIs to older versions. I think we
> > can
> > > > > however make the --abort command compatible with older versions. It
> > > would
> > > > > require a user to do some analysis on their own to identify a
> hanging
> > > > > transaction, but then they can use the tool from a new release to
> > > > recover.
> > > > > For example, users could detect a hanging transaction through the
> > > > existing
> > > > > "LastStableOffsetLag" metric and then collect the needed
> information
> > > > from a
> > > > > dump of the log (or producer snapshot). It's more work, but at
> least
> > > it's
> > > > > possible. Does that sound fair?
> > > > >
> > > > > Thanks,
> > > > > Jason
> > > > >
> > > > > On Wed, Aug 26, 2020 at 11:51 AM Ron Dagostino <rndg...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > Hi Jason.  Thanks for the excellently-written KIP.
> > > > > >
> > > > > > Will the implementation be backported to prior Kafka versions?
> The
> > > > > reason
> > > > > > I ask is because if it is not backported and similar
> functionality
> > is
> > > > not
> > > > > > otherwise made available for older versions, then the only
> recourse
> > > > > (aside
> > > > > > from deleting and recreating the topic as you pointed out) may be
> > to
> > > > > > upgrade to 2.7 (or whatever version ends up getting this
> > > > functionality).
> > > > > > Such an upgrade may not be desirable, especially if the number of
> > > > > > intermediate versions is considerable. I understand the mantra of
> > > > "never
> > > > > > fall too many versions behind" but the reality of it is that it
> > isn't
> > > > > > always the case.  Even if the version is relatively recent, an
> > > upgrade
> > > > > may
> > > > > > still not be possible for some time, and a quicker resolution may
> > be
> > > > > > necessary.
> > > > > >
> > > > > > Ron
> > > > > >
> > > > > > On Wed, Aug 26, 2020 at 2:33 PM Jason Gustafson <
> > ja...@confluent.io>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi All,
> > > > > > >
> > > > > > > I've added a proposal to handle the problem of hanging
> > > transactions:
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-664%3A+Provide+tooling+to+detect+and+abort+hanging+transactions
> > > > > > > .
> > > > > > > In theory, this should never happen. In practice, we have hit
> one
> > > bug
> > > > > > where
> > > > > > > it was possible and there are few good options today to
> recover.
> > > > Take a
> > > > > > > look and let me know what you think.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Jason
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>


-- 
-- Guozhang

Reply via email to