Hi all,

Just bumping this KIP. Would be great if we got some discussion.
On Sun, Dec 30, 2018 at 5:13 PM Richard Yu <yohan.richard...@gmail.com> wrote:

> Hi all,
>
> I made some recent changes to the KIP. It should be more relevant to the issue now (it involves the Processor API in detail).
> It would be great if you could comment.
>
> Thanks,
> Richard
>
> On Wed, Dec 26, 2018 at 10:01 PM Richard Yu <yohan.richard...@gmail.com> wrote:
>
>> Hi all,
>>
>> Just changing the title of the KIP. Discovered it wasn't right.
>> That's about it. :)
>>
>> On Mon, Dec 24, 2018 at 7:57 PM Richard Yu <yohan.richard...@gmail.com> wrote:
>>
>>> Sorry, just making a correction.
>>>
>>> Even if we are processing records out of order, we will still have to checkpoint offset ranges.
>>> So it doesn't really change anything even if we are doing in-order processing.
>>>
>>> Thinking this over, I'm leaning slightly towards maintaining the ordering guarantee.
>>> Although when implementing this change, there might be some kinks that we have not thought about which could throw a monkey wrench into the works.
>>>
>>> But definitely worth trying out,
>>> Richard
>>>
>>> On Mon, Dec 24, 2018 at 6:51 PM Richard Yu <yohan.richard...@gmail.com> wrote:
>>>
>>>> Hi Boyang,
>>>>
>>>> I could see where you are going with this. Well, I suppose I should have added this to alternatives, but I might as well mention it now.
>>>>
>>>> It had crossed my mind that we consider returning records in order even if there are multiple threads processing on the same task. But for this to happen, we must block for the in-between offsets which have not been processed yet. For example, offsets 1-50 are being processed by thread1, while offsets 51-100 are being processed by thread2. We will have to wait for thread1 to finish processing its offsets first before we return the records processed by thread2. So in other words, once thread1 is done, thread2's work up to that point will be returned in one go, but not before that.
>>>>
>>>> I suppose this could work, but the client will have to wait some time before the advantages of multithreaded processing can be seen (i.e. the first thread has to finish processing its segment of the records before any others are returned, to guarantee ordering). Another point I would like to make is that the threads are *asynchronous*. So for us to know when a thread is done processing a certain segment, we will probably have a similar policy to how getMetadataAsync() works (i.e. have a parent thread be notified when the child threads are done).
>>>> [image: image.png]
>>>> Just pulling this from the KIP. But instead, we would apply this to metadata segments instead of just a callback.
>>>> I don't know whether or not the tradeoffs are acceptable to the client. Ordering could be guaranteed, but it would be hard to do. For example, if there was a crash, we might lose track of which offset numbers and ranges we are processing for each child thread, so somehow we need to find a way to checkpoint those as well (like committing them to a Kafka topic).
>>>>
>>>> Let me know your thoughts on this approach. It would work, but the implementation details could be a mess.
>>>>
>>>> Cheers,
>>>> Richard
>>>>
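[Editor's note: to make the gating described in the email above concrete, here is a rough Java sketch of how completed offset ranges could be buffered and only released downstream once every earlier range has finished. All class and method names are hypothetical and nothing below is taken from the KIP.]

import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical helper: worker threads complete offset ranges in any order,
// but results are only handed downstream once every earlier range is done.
public class InOrderResultBuffer<R> {

    private static class CompletedRange<R> {
        final long endOffsetExclusive;
        final List<R> results;
        CompletedRange(long endOffsetExclusive, List<R> results) {
            this.endOffsetExclusive = endOffsetExclusive;
            this.results = results;
        }
    }

    // Completed ranges keyed by their start offset, ordered by offset.
    private final NavigableMap<Long, CompletedRange<R>> completed = new TreeMap<>();
    private long nextOffsetToRelease;

    public InOrderResultBuffer(long firstOffset) {
        this.nextOffsetToRelease = firstOffset;
    }

    // Called by a worker thread when it finishes its range, e.g. offsets 51-100.
    // Returns whatever is now releasable in offset order (possibly nothing).
    public synchronized List<R> complete(long startOffset, long endOffsetExclusive, List<R> results) {
        completed.put(startOffset, new CompletedRange<>(endOffsetExclusive, results));
        List<R> releasable = new ArrayList<>();
        // thread2's results for 51-100 stay buffered here until thread1 completes 1-50,
        // then both ranges are released in one go.
        while (!completed.isEmpty() && completed.firstKey() == nextOffsetToRelease) {
            CompletedRange<R> next = completed.pollFirstEntry().getValue();
            releasable.addAll(next.results);
            nextOffsetToRelease = next.endOffsetExclusive;
        }
        return releasable;
    }

    // Offset up to which results have been released.
    public synchronized long nextOffsetToRelease() {
        return nextOffsetToRelease;
    }
}

[In practice, complete() would be invoked from the per-thread completion callback (the getMetadataAsync()-style notification mentioned above), and nextOffsetToRelease plus the set of in-flight ranges is exactly the state that would need to be checkpointed, e.g. committed to a Kafka topic, to survive a crash.]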
>>>> On Mon, Dec 24, 2018 at 4:59 PM Boyang Chen <bche...@outlook.com> wrote:
>>>>
>>>>> Hey Richard,
>>>>>
>>>>> Thanks for the explanation! After some thinking, I do understand more about this KIP. The motivation is to increase throughput and move heavy-lifting RPC calls or IO operations to the background. While I feel ordering is hard to guarantee for an async task, it would be better to make it configurable for end users.
>>>>>
>>>>> An example use case I could think of is: for every 500 records processed, we need an RPC to external storage that takes non-trivial time, and until it finishes, the 499 records before it shouldn't be visible to the end user. In such a case, we need fine-grained control over visibility to the downstream consumer, so that our async task plants a barrier while still letting the 499 records be processed without blocking and sent downstream. Eventually, when the heavy RPC is done, we commit this record to remove the barrier and make all 500 records available for downstream. So here we still need to guarantee ordering within the 500 records, while at the same time the consumer semantics don't change.
>>>>>
>>>>> Am I making the point clear here? Just want to have more discussion on the ordering guarantee, since I feel it wouldn't be a good idea to break the consumer ordering guarantee by default.
>>>>>
>>>>> Best,
>>>>> Boyang
>>>>>
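[Editor's note: one way to read the barrier idea above, as a hypothetical Java sketch rather than anything from the KIP: the record that needs the heavy RPC plants a barrier, records processed while the barrier is outstanding are held back, and everything behind the barrier becomes visible in one go when the RPC commits. Class and method names are made up for illustration.]

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch: processing never blocks on the heavy RPC, but visibility
// to downstream does not advance past the barrier until the RPC completes.
public class BarrieringForwarder<R> {

    private final List<R> heldBack = new ArrayList<>();  // processed, not yet visible
    private boolean barrierPending = false;

    // Called on the processing thread for every record once it has been processed.
    public synchronized void onProcessed(R result, boolean needsHeavyRpc) {
        if (needsHeavyRpc) {
            // Plant the barrier and run the RPC in the background;
            // processing of the remaining records is not blocked.
            barrierPending = true;
            heldBack.add(result);
            heavyRpc(result).thenRun(this::commitBarrier);
        } else if (barrierPending) {
            // Processed without blocking, but invisible until the barrier commits.
            heldBack.add(result);
        } else {
            // No barrier outstanding: visible to downstream immediately.
            forwardDownstream(Collections.singletonList(result));
        }
    }

    // When the RPC completes, all records held behind the barrier become visible in one go.
    private synchronized void commitBarrier() {
        barrierPending = false;
        List<R> release = new ArrayList<>(heldBack);
        heldBack.clear();
        forwardDownstream(release);
    }

    private CompletableFuture<Void> heavyRpc(R barrierRecord) {
        return CompletableFuture.runAsync(() -> { /* non-trivial call to external storage */ });
    }

    private void forwardDownstream(List<R> results) { /* e.g. context.forward(...) */ }
}

[Since heldBack preserves arrival order, ordering within the batch is kept even though the RPC itself runs asynchronously, which seems to match the "commit removes the barrier" semantics described above.]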
>>>>> ________________________________
>>>>> From: Richard Yu <yohan.richard...@gmail.com>
>>>>> Sent: Saturday, December 22, 2018 9:08 AM
>>>>> To: dev@kafka.apache.org
>>>>> Subject: Re: KIP-408: Add Asynchronous Processing to Kafka Streams
>>>>>
>>>>> Hi Boyang,
>>>>>
>>>>> Thanks for pointing out the possibility of skipping bad records (never crossed my mind). I suppose we could make it an option for the user to skip a bad record. Deciding whether or not to do that was never the intention of this KIP, though. I could log a JIRA on such an issue, but I think this is out of the KIP's scope.
>>>>>
>>>>> As for the ordering guarantees: if you are using the standard Kafka design of one thread per task, then everything will pretty much remain the same. However, if we are talking about using multiple threads per task (which is something that this KIP proposes), then we should probably expect the behavior to be somewhat similar to Samza's Async Task, as stated in the JIRA for this KIP (second-last comment). Ordering would no longer be possible (so yeah, basically no guarantee at all).
>>>>>
>>>>> And how the user handles out-of-order messages is not something I'm well versed in. I guess they can try to put the messages back in order some time later on. But I honestly don't know what they will do. It would be good if you could give me some insight into this.
>>>>>
>>>>> Cheers,
>>>>> Richard
>>>>>
>>>>> On Fri, Dec 21, 2018 at 4:24 PM Boyang Chen <bche...@outlook.com> wrote:
>>>>>
>>>>> > Thanks Richard for proposing this feature! We have also encountered a similar feature request, where we want to define a generic async processing API<https://issues.apache.org/jira/browse/KAFKA-7566>.
>>>>> >
>>>>> > However, I guess the motivation here is that we should skip big records during normal processing, or let a separate task handle those records that take P99 processing time. My feeling is that if some edge case happens, could we just skip the bad record and continue processing the next record?
>>>>> >
>>>>> > Also, I want to understand what kind of ordering guarantee we are going to provide with this new API, or whether there is no ordering guarantee at all. Could we discuss any potential issues if the consumer needs to process out-of-order messages?
>>>>> >
>>>>> > Best,
>>>>> > Boyang
>>>>> > ________________________________
>>>>> > From: Richard Yu <yohan.richard...@gmail.com>
>>>>> > Sent: Saturday, December 22, 2018 2:00 AM
>>>>> > To: dev@kafka.apache.org
>>>>> > Subject: KIP-408: Add Asynchronous Processing to Kafka Streams
>>>>> >
>>>>> > Hi all,
>>>>> >
>>>>> > Lately, there has been considerable interest in adding asynchronous processing to Kafka Streams.
>>>>> > Here is the KIP for such an addition:
>>>>> >
>>>>> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-408%3A+Add+Asynchronous+Processing+To+Kafka+Streams
>>>>> >
>>>>> > I wish to discuss the best ways to approach this problem.
>>>>> >
>>>>> > Thanks,
>>>>> > Richard Yu