Thanks a lot Jan, I will read it.
Zahari

On Thu, Oct 18, 2018 at 9:31 AM Jan Filipiak <jan.filip...@trivago.com> wrote:

> especially my suggestions ;)
>
> On 18.10.2018 08:30, Jan Filipiak wrote:
> > Hi Zahari,
> >
> > would you be willing to scan through the KIP-349 discussion a little?
> > I think it has suggestions that could be interesting for you
> >
> > Best Jan
> >
> > On 16.10.2018 09:29, Zahari Dichev wrote:
> >> Hi there Kafka developers,
> >>
> >> I am currently trying to find a solution to an issue that has been
> >> manifesting itself in the Akka Streams implementation of the Kafka
> >> connector. When it comes to consuming messages, the implementation
> >> relies heavily on the fact that we can pause and resume partitions.
> >> When a single consumer instance is shared among several streams, we
> >> might end up frequently pausing and resuming a set of topic
> >> partitions, which is the main facility that allows us to implement
> >> back pressure. This, however, has certain disadvantages, especially
> >> when two streams differ in processing speed.
> >>
> >> To articulate the issue more clearly, imagine that a consumer holds
> >> assignments for two topic partitions, *TP1* and *TP2*, and is shared
> >> by two streams, *S1* and *S2*. When we have demand from only one of
> >> the streams, *S1*, we pause the other topic partition, *TP2*, and
> >> call *poll()* on the consumer to retrieve records only for the
> >> demanded topic partition, *TP1*. The result is that all the records
> >> that have been prefetched for *TP2* are now thrown away by the
> >> fetcher ("*Not returning fetched records for assigned partition TP2
> >> since it is no longer fetchable*"). If we extrapolate that to
> >> multiple streams sharing the same consumer, we might quickly end up
> >> in a situation where we throw away prefetched data quite often. This
> >> does not seem like the most efficient approach, and in fact it
> >> produces quite a lot of overlapping fetch requests, as illustrated
> >> in the following issue:
> >>
> >> https://github.com/akka/alpakka-kafka/issues/549
> >>
> >> I am writing this email to get some initial opinions on a KIP I was
> >> thinking about. What if we gave the clients of the Consumer API a
> >> bit more control over what to do with this prefetched data? Two
> >> options I am wondering about:
> >>
> >> 1. Introduce a configuration setting, such as
> >> *return-prefetched-data-for-paused-topic-partitions* (have to think
> >> of a better name), which when set to *true* will return what is
> >> prefetched instead of throwing it away on calling *poll()*. Since
> >> the amount of data is bounded by the maximum size of the prefetch,
> >> we can control the maximum number of records returned. The client of
> >> the Consumer API is then responsible for keeping that data around
> >> and using it when appropriate (i.e. when demand is present).
> >>
> >> 2. Introduce a facility to pass in a buffer into which the
> >> prefetched records are drained when *poll()* is called and paused
> >> partitions have prefetched records.
> >>
> >> Any opinions on the matter are welcome. Thanks a lot!
> >>
> >> Zahari Dichev
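For concreteness, here is a toy Python simulation of option 2 from the quoted proposal. This is not real Kafka client code; the `PrefetchingConsumer` class and its internals are invented purely to illustrate the intended semantics: on `poll()`, records prefetched for a paused partition are drained into a caller-supplied buffer instead of being discarded by the fetcher.

```python
from collections import defaultdict


class PrefetchingConsumer:
    """Toy stand-in (hypothetical, for illustration only) for a consumer
    whose fetcher holds prefetched records per topic partition."""

    def __init__(self):
        self.prefetched = defaultdict(list)  # partition -> prefetched records
        self.paused = set()

    def pause(self, tp):
        self.paused.add(tp)

    def resume(self, tp):
        self.paused.discard(tp)

    def poll(self, paused_buffer=None):
        """Return records for fetchable (non-paused) partitions.

        Today's behavior: prefetched records for paused partitions are
        dropped, forcing a later re-fetch. With a `paused_buffer`
        (option 2 in the proposal), they are drained into it instead.
        """
        out = {}
        for tp in list(self.prefetched):
            records = self.prefetched.pop(tp)
            if tp in self.paused:
                if paused_buffer is not None:
                    # Option 2: hand the records to the caller for later use.
                    paused_buffer[tp].extend(records)
                # else: records are silently discarded, as the fetcher
                # does today ("no longer fetchable").
            else:
                out[tp] = records
        return out
```

A caller would keep the buffer around and feed those records to the relevant stream once demand returns, avoiding the redundant fetch. The amount of data retained is bounded by the prefetch size, as noted for option 1.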