I agree with Jason's and Apurva's comments. We can wait some more time and
enable it in the 1.1.0 release. In general, we are in agreement with the
change.
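For context, the change under discussion amounts to flipping the stock Java
producer's defaults roughly as follows. This is only a sketch based on the
thread (the bootstrap server and serializers are placeholders), not the final
shipped defaults:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;

    public class IdempotentDefaultsSketch {
        public static Producer<String, String> newProducer() {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder
            props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

            // The proposed defaults under discussion in this thread:
            props.put("enable.idempotence", "true"); // broker de-duplicates retries
            props.put("acks", "all");                // wait on all in-sync replicas
            props.put("max.in.flight.requests.per.connection", "5"); // proposed cap

            return new KafkaProducer<>(props);
        }
    }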
On Fri, Aug 18, 2017 at 10:32 AM, Apurva Mehta <apu...@confluent.io> wrote:

> Hi Jason,
>
> You make some good points.
>
> First, I agree that we could have false positives which result in an
> OutOfOrderSequence exception for topics with low retention where messages
> are removed intentionally. I 100% agree that this should be tightened up.
>
> One solution is to return the log start offset in each produce response
> and also introduce an 'UnknownProducer' error code for when the broker
> receives a non-zero sequence number but has no record of the producer.
> These two pieces of information would enable us to solve the false
> positive problem: if the producer receives an 'UnknownProducer' error and
> the log start offset is ahead of the last acknowledged offset, then it is
> a false positive.
>
> Whether we should explore solutions to this problem before making this
> the default is a good question. I will look into solutions and try to
> estimate the level of effort to fix it. If we can make this in the 1.0.0
> release, that would be ideal.
>
> If it can't make the 1.0.0 release (feature freeze is only a month away,
> so it will be tight), we should recall that the false positive would only
> occur with fairly inactive producers writing to topics with low
> retention, so in practice we won't commonly hit it. Additionally, this is
> already something people may hit if they enable idempotence. From this
> point of view, it may make sense to make the leap, ship Kafka with
> stronger default semantics in the 1.0.0 release, and fix forward.
>
> It seems that the community so far is in agreement that making
> idempotence the default is generally a good idea.
>
> The question now is around timing. Is it too soon? Should we let the
> feature bake for one more release cycle, fix known holes, etc., and then
> enable it in the 1.1.0 release? At this point I am leaning toward waiting
> until 1.1.0 if the false positive problem is not solved in the 1.0.0
> timeframe: I would like the feature to be as solid as possible so that
> people build trust around the guarantees.
>
> What does the rest of the community think?
>
> Thanks,
> Apurva
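A rough sketch of the check Apurva describes; every name in it is
hypothetical, since the produce-response field and error code are only
proposed at this point in the thread:

    // Hypothetical sketch of the proposed false-positive check. All names are
    // illustrative; the response field and error code do not exist yet.
    public class FalsePositiveCheckSketch {
        static boolean isRetentionFalsePositive(boolean unknownProducerError,
                                                long logStartOffset,
                                                long lastAckedOffset) {
            // The broker has no state for this producer, but everything we
            // last acked has already been deleted by retention: nothing was
            // actually written out of order, so the producer could safely
            // reset its sequence and continue instead of failing.
            return unknownProducerError && logStartOffset > lastAckedOffset;
        }
    }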
> On Thu, Aug 17, 2017 at 11:40 AM, Jason Gustafson <ja...@confluent.io>
> wrote:
>
> > Some additional points for discussion:
> >
> > 1. The idempotent producer is still a new feature and I doubt it has
> > gotten much use yet (in particular since it depends on a message format
> > upgrade). Do we feel it has reached a level of stability where we are
> > comfortable making it the default?
> >
> > 2. I'm still a little uncomfortable with our handling of
> > OutOfOrderSequence. I'm honestly not sure what a user should do to
> > handle this error. If they care about delivery, should they "rewind" to
> > some previous point and start again? One of the problems is that this
> > error may be a false positive if the last sequence from the producer
> > was removed from the log to enforce the retention policy. It would be
> > nice if we could tighten this up so that we only send
> > OutOfOrderSequence for actual data loss, and, in that case, tell the
> > user (say) which offset was last successfully written so that they know
> > what was lost. We discussed a few ideas offline to address this, but do
> > you think fixing it should be a prerequisite for making idempotence the
> > default?
> >
> > 3. Also, in the current handling, when we receive OutOfOrderSequence,
> > there is no way for the user to retry and preserve order because we do
> > not fail all of the queued batches for the same partition. My
> > understanding is that you are looking to change this behavior in your
> > patch to support more in-flight requests. Is that correct?
> >
> > Overall I'm definitely supportive of making idempotence the default
> > eventually, but I think it might be a tad premature now.
> >
> > Thanks,
> > Jason
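For context on Jason's second and third points, the error surfaces in the
producer's send callback (or via the returned future), and there is currently
no obvious recovery. A minimal sketch of where an application would see it,
assuming a producer configured as in the earlier snippet; the topic and
record contents are made up:

    import java.util.concurrent.atomic.AtomicBoolean;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.errors.OutOfOrderSequenceException;

    public class OutOfOrderHandlingSketch {
        private final AtomicBoolean fatal = new AtomicBoolean(false);

        void send(Producer<String, String> producer) {
            producer.send(new ProducerRecord<>("my-topic", "key", "value"),
                (metadata, exception) -> {
                    if (exception instanceof OutOfOrderSequenceException) {
                        // Possible data loss, or (today) a false positive
                        // after retention deleted the producer's last
                        // sequence. Jason's point: there is no clear way to
                        // "rewind and retry", so all we can do is stop
                        // sending and decide at the application level.
                        fatal.set(true);
                    } else if (exception != null) {
                        // Other failures; retriable errors were already
                        // retried internally by the producer.
                        exception.printStackTrace();
                    }
                });
        }
    }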
> > On Wed, Aug 16, 2017 at 8:58 PM, Apurva Mehta <apu...@confluent.io>
> > wrote:
> >
> > > Thanks for the follow-up, Becket. It sounds like we are in agreement
> > > on the scope of this KIP, and the discussion has definitely clarified
> > > a lot of the subtle points.
> > >
> > > Apurva
> > >
> > > On Tue, Aug 15, 2017 at 10:49 PM, Becket Qin <becket....@gmail.com>
> > > wrote:
> > >
> > > > Hi Apurva,
> > > >
> > > > Thanks for the clarification of the definitions. The definitions
> > > > are clear and helpful.
> > > >
> > > > It seems the scope of this KIP is just the producer-side
> > > > configuration change, not an attempt to achieve exactly-once
> > > > semantics with all default settings out of the box. The broker
> > > > still needs to be configured appropriately to achieve exactly-once
> > > > semantics. If so, the current proposal sounds reasonable to me.
> > > > Apologies if I misunderstood the goal of this KIP.
> > > >
> > > > Regarding max.in.flight.requests.per.connection, I don't think we
> > > > have to support an infinite number of in-flight requests. But
> > > > admittedly there are use cases where people would want reasonably
> > > > high in-flight requests. Given that we need to make code changes to
> > > > support idempotence with in.flight.requests > 1, it would be nice
> > > > to see if we can cover those use cases now instead of doing so
> > > > later. We can discuss this in a separate thread.
> > > >
> > > > Thanks,
> > > >
> > > > Jiangjie (Becket) Qin
> > > >
> > > > On Tue, Aug 15, 2017 at 1:46 PM, Guozhang Wang <wangg...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi Jay,
> > > > >
> > > > > I chatted with Apurva offline, and we think the key question in
> > > > > the discussion, as summarized in the updated KIP wiki, is whether
> > > > > we should consider replication a necessary condition of
> > > > > at-least-once, and of course also of exactly-once. Originally I
> > > > > thought replication was not a necessary condition for
> > > > > at-least-once, since the scope of failures we should be covering
> > > > > is different in my definition. But if we claim that "even for
> > > > > at-least-once you should have a replication factor larger than
> > > > > 2, let alone exactly-once", then I agree that having acks=all on
> > > > > the client side should also be a necessary condition for
> > > > > at-least-once, and for exactly-once as well. In that case this
> > > > > KIP would provide necessary but not sufficient conditions: the
> > > > > client-side configs to achieve EOS, while you also need the
> > > > > broker-side configs together to really support it.
> > > > >
> > > > > Guozhang
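To make concrete the point that the broker side must also be configured
appropriately: acks=all only buys durability if the topic actually has
replicas to wait for. A minimal sketch using the Java AdminClient, with a
made-up topic name and sizes; min.insync.replicas=2 is what gives acks=all
its teeth:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    public class DurableTopicSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder
            try (AdminClient admin = AdminClient.create(props)) {
                // Replication factor 3: acked data survives two lost disks.
                // min.insync.replicas=2: acks=all waits for at least 2 copies.
                NewTopic topic = new NewTopic("payments", 6, (short) 3)
                    .configs(Collections.singletonMap("min.insync.replicas", "2"));
                admin.createTopics(Collections.singleton(topic)).all().get();
            }
        }
    }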
> > > > > On Tue, Aug 15, 2017 at 1:15 PM, Jay Kreps <j...@confluent.io>
> > > > > wrote:
> > > > >
> > > > > > Hey Guozhang,
> > > > > >
> > > > > > I think the argument is that with acks=1 the message could be
> > > > > > lost, and hence you aren't guaranteeing exactly-once delivery.
> > > > > >
> > > > > > -Jay
> > > > > >
> > > > > > On Mon, Aug 14, 2017 at 1:36 PM, Guozhang Wang
> > > > > > <wangg...@gmail.com> wrote:
> > > > > >
> > > > > > > Just want to clarify that regarding 1), I'm fine with
> > > > > > > changing it to `all`, but I just wanted to argue that it
> > > > > > > does not necessarily correlate with exactly-once semantics,
> > > > > > > but rather with the persistence vs. availability trade-off,
> > > > > > > so I'd like to discuss them separately.
> > > > > > >
> > > > > > > Regarding 2), one minor concern I had is that the
> > > > > > > enforcement is on the client side while the parts it affects
> > > > > > > are on the broker side. I.e. the broker code would assume at
> > > > > > > most 5 in-flight requests when idempotence is turned on, but
> > > > > > > this is not enforced at the broker; it relies on the client
> > > > > > > side's sanity. So other client implementations that do not
> > > > > > > obey this may break the broker code. If we do enforce this,
> > > > > > > we'd better enforce it at the broker side. Also, I'm
> > > > > > > wondering whether we have considered the approach of having
> > > > > > > brokers read the logs to get the starting offset when it is
> > > > > > > not in their snapshot, and whether that would be worthwhile
> > > > > > > if we assume such issues are very rare.
> > > > > > >
> > > > > > > Guozhang
> > > > > > >
> > > > > > > On Mon, Aug 14, 2017 at 11:01 AM, Apurva Mehta
> > > > > > > <apu...@confluent.io> wrote:
> > > > > > >
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > I just want to summarize where we are in this discussion.
> > > > > > > >
> > > > > > > > There are two major points of contention: should we have
> > > > > > > > acks=1 or acks=all by default, and how should we cap
> > > > > > > > max.in.flight.requests.per.connection?
> > > > > > > >
> > > > > > > > 1) acks=1 vs. acks=all
> > > > > > > >
> > > > > > > > Here are the tradeoffs of each:
> > > > > > > >
> > > > > > > > If you have replication-factor=N, your data is resilient
> > > > > > > > to N-1 disk failures. For N>1, here is the tradeoff
> > > > > > > > between acks=1 and acks=all.
> > > > > > > >
> > > > > > > > With the proposed defaults and acks=all, the stock Kafka
> > > > > > > > producer and the default broker settings would guarantee
> > > > > > > > that ack'd messages would be in the log exactly once.
> > > > > > > >
> > > > > > > > With the proposed defaults and acks=1, the stock Kafka
> > > > > > > > producer and the default broker settings would guarantee
> > > > > > > > that retained ack'd messages would be in the log exactly
> > > > > > > > once, but not all ack'd messages may be retained.
> > > > > > > >
> > > > > > > > If you leave replication-factor=1, acks=1 and acks=all
> > > > > > > > have identical semantics and performance, but you are
> > > > > > > > resilient to 0 disk failures.
> > > > > > > >
> > > > > > > > I think the measured cost (again, the performance details
> > > > > > > > are in the wiki) of acks=all is well worth the much
> > > > > > > > clearer semantics. What does the rest of the community
> > > > > > > > think?
> > > > > > > >
> > > > > > > > 2) Capping max.in.flight at 5 when idempotence is enabled.
> > > > > > > >
> > > > > > > > We need to limit max.in.flight for the broker to
> > > > > > > > de-duplicate messages properly. The limitation would only
> > > > > > > > apply when idempotence is enabled. The shared numbers show
> > > > > > > > that when the client-broker latency is low, there is no
> > > > > > > > performance gain for max.in.flight > 2.
> > > > > > > >
> > > > > > > > Further, it is highly debatable that max.in.flight=500 is
> > > > > > > > significantly better than max.in.flight=5 for a really
> > > > > > > > high-latency client-broker link, and so far there are no
> > > > > > > > hard numbers one way or another. However, assuming that
> > > > > > > > max.in.flight=500 is significantly better than
> > > > > > > > max.in.flight=5 in some niche use case, the user would
> > > > > > > > have to sacrifice idempotence for throughput. In this
> > > > > > > > extreme corner case, I think it is an acceptable tradeoff.
> > > > > > > >
> > > > > > > > What does the community think?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Apurva
> > > > > > >
> > > > > > > --
> > > > > > > -- Guozhang
> > > > >
> > > > > --
> > > > > -- Guozhang
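Finally, on point (2): one way to read the proposed cap is that an idempotent
producer asking for more than five in-flight requests should be rejected up
front rather than risk reordering that the broker cannot de-duplicate. A
sketch under that assumption; the exact exception type and failure point are
illustrative, not confirmed by this thread:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class InflightCapSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder
            props.put("enable.idempotence", "true");
            props.put("max.in.flight.requests.per.connection", "500"); // over cap
            // Under the proposal, this would fail fast at construction time
            // (e.g. with a ConfigException) instead of silently allowing 500
            // in-flight idempotent requests.
            new KafkaProducer<>(props, new StringSerializer(), new StringSerializer());
        }
    }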