Please see the replies inline.

> If we are going to have a separate configuration for expiry, I prefer my
> proposal of max.message.delivery.wait.ms and its semantics.
>
OK. I hope others will voice their preference too.


>
> However, one thing which has not come out of the JIRA discussion is the
> actual use cases for batch expiry.

There are two use cases I can think of for a batch expiry mechanism,
irrespective of how we try to bound the time (batch.expiry.ms or
max.message.delivery.wait.ms). Let's call the bound X.

1. A real-time app (e.g., a periodic health-check producer or a temperature
sensor producer) has a soft upper bound on both message delivery and failure
notification of message delivery. In both cases, it wants to know promptly.
Such an app does not close the producer on the first error reported (due to
batch expiry) because there's data lined up right behind. It's OK to lose a
few samples of temperature measurement (IoT scenario), so it simply drops the
expired batch and moves on. Maybe when the drop rate reaches something like
70% it would close the producer. Such an app may use acks=0. In this case, X
will have some value in single-digit minutes, but X=MAX_LONG is not suitable.
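
To make case 1 concrete, here is a rough sketch of the drop-and-move-on
pattern (assuming the proposed batch.expiry.ms name; the broker address,
topic, sampling loop, and 70% threshold are all made up for illustration, and
an expired batch surfaces as a TimeoutException in the send callback today):

    import java.util.Properties;
    import java.util.concurrent.atomic.AtomicLong;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.errors.TimeoutException;

    public class TemperatureProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker:9092");   // made-up address
            props.put("acks", "0");                          // fire-and-forget
            props.put("batch.expiry.ms", "120000");          // proposed config; X = 2 minutes
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");

            AtomicLong sent = new AtomicLong(), dropped = new AtomicLong();
            try (Producer<String, String> producer = new KafkaProducer<>(props)) {
                while (true) {
                    String sample = String.valueOf(Math.random() * 40);  // stand-in reading
                    sent.incrementAndGet();
                    producer.send(new ProducerRecord<>("temperature", sample), (md, e) -> {
                        if (e instanceof TimeoutException)   // batch expired
                            dropped.incrementAndGet();       // drop the sample, move on
                    });
                    if (sent.get() > 100 && dropped.get() > 0.7 * sent.get())
                        break;                               // give up at ~70% drop rate
                    Thread.sleep(1000);                      // one sample per second
                }
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
            }
        }
    }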

2. Today we run KMM at LinkedIn as if batch.expiry == MAX_LONG. We expire
under the condition: (!muted.contains(tp) && (isMetadataStale ||
cluster.leaderFor(tp) == null)). In essence, as long as the partition is
making progress (even if it's a trickle), the producer keeps going. We have
other internal systems to detect whether a pipeline is making *sufficient*
progress or not; we're not dependent on the producer to tell us that it's not
making progress on a certain partition.
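
Paraphrased as code, that check looks roughly like the following (a sketch;
muted and isMetadataStale are names from our internal fork, not upstream
producer APIs; TopicPartition and Cluster are from org.apache.kafka.common):

    // Paraphrase of our internal expiry predicate; not upstream code.
    static boolean shouldExpire(TopicPartition tp, Cluster cluster,
                                Set<TopicPartition> muted, boolean isMetadataStale) {
        // Don't expire a partition with an in-flight request (muted);
        // otherwise expire only when there is no current route to a leader.
        return !muted.contains(tp)
               && (isMetadataStale || cluster.leaderFor(tp) == null);
    }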

This is less than ideal, though. We would be happy to configure
batch.expiry.ms to 1,800,000 (30 minutes) or so and, upon notification of
expiry, restart the process and so on. It could also tell us which specific
partitions of a specific topic are falling behind. Today we achieve a similar
effect via alternative mechanisms.

In the absence of a general out-of-band mechanism for discovering slowness
(or non-progress), KIP-91 is an attempt to allow the producer itself to
report non-progress without relying on request.timeout.ms. Hence
batch.expiry.ms.


> Also, the KIP document states the
> following:
>
> "The per message timeout is easy to compute - linger.ms + (retries + 1) *
> request.timeout.ms". This is false.
>

> Why is the statement false? Doesn't that provide an accurate upper bound on
> the timeout for a produce request today?
>
The KIP-91 write-up describes the reasons why. To reiterate: "the condition
that if the metadata for a partition is known then we do not expire its
batches even if they are ready". Do you not agree with the explanation? If
not, with which part?
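
For a concrete (illustrative) example: with linger.ms = 100, retries = 2, and
request.timeout.ms = 30000, the formula gives 100 + (2 + 1) * 30000 =
90,100 ms. But because of the condition above, a ready batch whose partition
has known metadata can sit in the accumulator well beyond that (e.g., while
the partition is muted by an in-flight request), so the sum is not a true
upper bound.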

>
> Another point: the kip document keeps mentioning that the current timeouts
> are not intuitive, but for whom? In general, batch expiry as a notion is
> not intuitive and I am not sure the new settings change that fact.
>
Yeah, that's subjective.


>
> In this spirit, it might make sense to clarify the use case that motivates
> this additional setting. For instance, with this new configuration, how
> would your existing application handle a batch expired exception?

Again, a real-time app would just move on. KMM would halt. Any
order-sensitive app that needs to provide durability guarantees would halt.
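
For instance, the order-sensitive case might look like this sketch (assuming
expiry continues to surface as a TimeoutException in the send callback;
`producer` and `record` are set up elsewhere):

    // Sketch: halt on the first expiry rather than risk gaps or reordering.
    final AtomicBoolean halted = new AtomicBoolean(false);

    producer.send(record, (metadata, exception) -> {
        if (exception instanceof TimeoutException)
            halted.set(true);      // flag only; don't close() in the callback
    });

    if (halted.get()) {
        producer.close();          // drain and close from the app thread
        // ... then alert an operator / restart the pipeline ...
    }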


> How is it
> different from the way it handles the exception today?

It's not about how existing applications will change their behavior. It's
about controlling *when* they get the expiry notification.


> Is the expiry
> exception a proxy for another piece of information like 'partition X is
> unavailable'?
>
Intriguing thought. If BatchExpiredException extended TimeoutException and
included some context, such as the TopicPartition and broker id, an app could
provide differentiated service based on the topic name or the availability
zone of the broker. KIP-91 does not propose anything like that, though; it's
a very niche use case.
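
If someone were to pursue it, the exception might look like this (purely
hypothetical shape; KIP-91 does not propose this class):

    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.errors.TimeoutException;

    // Hypothetical only; not part of KIP-91.
    public class BatchExpiredException extends TimeoutException {
        private final TopicPartition topicPartition;
        private final int leaderBrokerId;  // -1 if no leader was known

        public BatchExpiredException(String message, TopicPartition tp,
                                     int leaderBrokerId) {
            super(message);
            this.topicPartition = tp;
            this.leaderBrokerId = leaderBrokerId;
        }

        public TopicPartition topicPartition() { return topicPartition; }
        public int leaderBrokerId() { return leaderBrokerId; }
    }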

Regards,
Sumant

> On Thu, Aug 3, 2017 at 4:35 PM, Sumant Tambe <suta...@gmail.com> wrote:
>
> > I don't want to list the alternatives in the JIRA as rejected just yet
> > because they are still being discussed. I would encourage the respective
> > proposers to do that. It's a wiki after all.
> >
> > As per my current understanding, there are two alternatives being
> > proposed: the original KIP-91 approach (#1), and #2 from Apurva. Apurva,
> > correct me if I'm wrong.
> >
> > #1. The batch.expiry.ms proposal: In this proposal the config is meant to
> > control ONLY the accumulator timeout. See the second diagram in KIP-91.
> > The question "would the clock for batch expiry be reset every time the
> > batch is requeued after failure?" does not arise here. There's no
> > automatic re-enqueue. An application calls send again if it needs to in
> > response to an expired-batch notification.
> >
> > #2. The max.message.delivery.wait.ms proposal: From Apurva's comment:
> > "...  if `T + max.message.delivery.wait.ms` has elapsed and the message
> > has still not been successfully acknowledged..." This seems to suggest
> > that the config is meant to span time in the accumulator AND time spent
> > during network-level retries (if any). KIP-91 calls this approach the
> > "end-to-end timeout model" and includes it as rejected for the reasons
> > explained.
> >
> > There are small variations proposed further back in the JIRA discussion.
> > I'll let the respective proposers decide whether those options are
> > relevant at this point.
> >
> > -Sumant
> >
> > On 3 August 2017 at 15:26, Jason Gustafson <ja...@confluent.io> wrote:
> >
> > > Thanks for the KIP. Just a quick comment. Can you list the alternatives
> > > mentioned in the JIRA discussion in the rejected alternatives section?
> > >
> > > -Jason
> > >
> > > On Thu, Aug 3, 2017 at 3:09 PM, Sumant Tambe <suta...@gmail.com> wrote:
> > >
> > > > Hi all,
> > > >
> > > > KIP-91 [1] is another attempt to get better control on producer side
> > > > timeouts. In essence we're proposing a new config named
> > > > batch.expiry.ms that will cause batches in the accumulator to expire
> > > > after the configured timeout.
> > > >
> > > > Recently, the discussion on KAFKA-5621 [2] has shed new light on the
> > > > proposal and some alternatives.
> > > >
> > > > Please share your thoughts here on the mailing list.
> > > >
> > > > Regards,
> > > > Sumant
> > > >
> > > > [1]
> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-91+Provide+Intuitive+User+Timeouts+in+The+Producer
> > > > [2] https://issues.apache.org/jira/browse/KAFKA-5621
> > > >
> > >
> >
>
