Re: [DISCUSSION] KIP-15 close(timeout) for producer

Jiangjie Qin Mon, 16 Mar 2015 14:30:21 -0700

Thanks Guozhang. It wouldn’t be as thoroughly considered without
discussing with you :)


Jiangjie (Becket) Qin

On 3/16/15, 1:07 PM, "Guozhang Wang" <wangg...@gmail.com> wrote:

>Thanks Jiangjie,
>
>After talking to you offline on this, I have been convinced and changed my
>preference to blocking. The immediate shutdown approach does have some
>unsafeness in some cases.
>
>Guozhang
>
>On Mon, Mar 16, 2015 at 11:50 AM, Jiangjie Qin <j...@linkedin.com.invalid>
>wrote:
>
>> It looks that the problem we want to solve and the purpose we want to
>> achieve is:
>> If user uses close() in callback, we want to let user be aware that they
>> should use close(0) instead of close() in the callback.
>>
>> We have agreed that we will have an error log to inform user about this
>> mis-usage. The options differ in the way how we can force user to take a
>> look at that error log.
>> There are two scenarios:
>> 1. User does not expect the program to exit.
>> 2. User expect the program to exit.
>>
>> For scenario 1), blocking will probably delay the discovery of the
>> problem. Calling close(0) exposes the problem quicker. In this scenario
>> producer just encounter a send failure when running normally.
>> For scenario 2), blocking will expose the problem quick. Calling
>>close(-1)
>> might hide the problem. This scenario might include: a) Unit test for a
>> send failure. b) Message sending during a close() call from a user
>>thread.
>>
>> So as a summary table:
>>
>>                   Scenario 1)                         Scenario 2)
>>
>> Blocking      Delay problem discovery         Guaranteed problem
>>discovery
>>
>> Close(-1)     Immediate problem discovery     Problem might be hidden
>>
>>
>> Personally I prefer blocking because it seems providing more guarantees
>> and safer.
>>
>> Thanks.
>>
>> Jiangjie (Becket) Qin
>>
>>
>> On 3/16/15, 10:11 AM, "Guozhang Wang" <wangg...@gmail.com> wrote:
>>
>> >HI Jiangjie,
>> >
>> >As far as I understand calling close() in the ioThread is not common,
>>as
>> >it
>> >may only trigger when we saw some non-retriable error. Hence when user
>>run
>> >their program it is unlikely that close() will be triggered and problem
>> >will be detected. So it seems to me that from the error detection
>>aspect
>> >these two options seems to be the same as people will usually detect it
>> >from the producer metrics all dropping to 0.
>> >
>> >Guozhang
>> >
>> >On Mon, Mar 16, 2015 at 9:52 AM, Jiangjie Qin
>><j...@linkedin.com.invalid>
>> >wrote:
>> >
>> >> It seems there are two options we can choose from when close() is
>>called
>> >> from sender thread (callback):
>> >> 1. Log an error and close the producer using close(-1)
>> >> 2. Log an error and block.
>> >> (Throwing an exception will not work because we catch all the
>>exception
>> >> thrown from user callback. It will just lead to an error log.)
>> >>
>> >> My concern for the first option is that the producer will be closed
>>even
>> >> if we logged and error. I am wondering if some user would not even
>>take
>> >>a
>> >> look at the log if producer is closed normally. Because from the
>> >>programs
>> >> behavior, everything looks good. If that is the case, the error
>>message
>> >>we
>> >> logged probably will just be ignored until some day when people check
>> >>the
>> >> log and see it.
>> >>
>> >> As for the second option, because producer does not close but blocks.
>> >>User
>> >> will notice this the first time they run the program. They probably
>>will
>> >> look at the log to see why producer could not be closed and they will
>> >>see
>> >> the error log we put there. So they will get informed about this
>> >>mis-usage
>> >> of close() in sender thread the first time they run the code instead
>>of
>> >> some time later.
>> >>
>> >> Personally I prefer the second one because it is more obvious that
>> >> something was wrong.
>> >>
>> >> Jiangjie (Becket) Qin
>> >>
>> >> On 3/15/15, 4:27 PM, "Guozhang Wang" <wangg...@gmail.com> wrote:
>> >>
>> >> >Yeah I agree we should not silently change the behavior of the
>>function
>> >> >with the given parameters; and I would prefer
>> >>error-logging-and-shutdown
>> >> >over blocking when close(>0) is used, since as Neha suggested
>>blocking
>> >> >would also not proceed with sending any data, bu will just let
>>users to
>> >> >realize the issue later than sooner.
>> >> >
>> >> >On Sun, Mar 15, 2015 at 3:25 PM, Neha Narkhede <n...@confluent.io>
>> >>wrote:
>> >> >
>> >> >> >
>> >> >> > And I also agree it is better if we can make producer block when
>> >> >> > close() is called from sender thread so user will notice
>>something
>> >> >>went
>> >> >> > wrong.
>> >> >>
>> >> >>
>> >> >> This isn't a great experience either. Why can't we just throw an
>> >> >>exception
>> >> >> for a behavior we know is incorrect and we'd like the user to
>>know.
>> >> >> Blocking as a means of doing that seems wrong and annoying.
>> >> >>
>> >> >> On Sun, Mar 15, 2015 at 11:56 AM, Jay Kreps <jay.kr...@gmail.com>
>> >> wrote:
>> >> >>
>> >> >> > Cool.
>> >> >> >
>> >> >> > I think blocking is good or alternately throwing an exception
>> >>directly
>> >> >> from
>> >> >> > close(). Basically I would just worry about subtly doing
>>something
>> >> >> slightly
>> >> >> > different from what the user asked for as it will be hard to
>>notice
>> >> >>that
>> >> >> > behavior difference.
>> >> >> >
>> >> >> > -Jay
>> >> >> >
>> >> >> > On Sat, Mar 14, 2015 at 5:48 PM, Jiangjie Qin
>> >> >><j...@linkedin.com.invalid
>> >> >> >
>> >> >> > wrote:
>> >> >> >
>> >> >> > > Hi Jay,
>> >> >> > >
>> >> >> > > I have modified the KIP as you suggested. I thinks as long as
>>we
>> >> >>have
>> >> >> > > consistent define for timeout across Kafka interface, there
>>would
>> >> >>be no
>> >> >> > > problem. And I also agree it is better if we can make producer
>> >>block
>> >> >> when
>> >> >> > > close() is called from sender thread so user will notice
>> >>something
>> >> >>went
>> >> >> > > wrong.
>> >> >> > >
>> >> >> > > Thanks.
>> >> >> > >
>> >> >> > > Jiangjie (Becket) Qin
>> >> >> > >
>> >> >> > > On 3/14/15, 11:37 AM, "Jay Kreps" <jay.kr...@gmail.com> wrote:
>> >> >> > >
>> >> >> > > >Hey Jiangjie,
>> >> >> > > >
>> >> >> > > >I think this is going to be very confusing that
>> >> >> > > >  close(0) waits indefinitely and
>> >> >> > > >  close(-1) waits for 0.
>> >> >> > > >I understand this appears in other apis, but it is a constant
>> >> >>cause of
>> >> >> > > >bugs. Let's not repeat that mistake.
>> >> >> > > >
>> >> >> > > >Let's make close(0) wait for 0. We don't need a way to wait
>> >> >> indefinitely
>> >> >> > > >as
>> >> >> > > >we already have close() so having a magical constant for
>>that is
>> >> >> > > >redundant.
>> >> >> > > >
>> >> >> > > >Calling close() from the I/O thread was already possible and
>> >>would
>> >> >> block
>> >> >> > > >indefinitely. I think trying to silently change the behavior
>>is
>> >> >> probably
>> >> >> > > >not right. I.e. if the user calls close() in the callback
>>there
>> >>is
>> >> >> > > >actually
>> >> >> > > >some misunderstanding and they need to think more, silently
>> >>making
>> >> >> this
>> >> >> > > >not
>> >> >> > > >block will hide the problem from them which is the opposite
>>of
>> >> >>what we
>> >> >> > > >want.
>> >> >> > > >
>> >> >> > > >-Jay
>> >> >> > > >
>> >> >> > > >On Thu, Mar 12, 2015 at 1:49 AM, Jiangjie Qin
>> >> >> <j...@linkedin.com.invalid
>> >> >> > >
>> >> >> > > >wrote:
>> >> >> > > >
>> >> >> > > >> Hey Joe & Jay,
>> >> >> > > >>
>> >> >> > > >> Thanks for the comments on the voting thread. Since it
>>seems
>> >>we
>> >> >> > probably
>> >> >> > > >> will have more discussion on this, I am just replying from
>>the
>> >> >> > > >>discussion
>> >> >> > > >> thread here.
>> >> >> > > >> I’ve updated the KIP page to make it less like half-baked,
>> >> >>apologize
>> >> >> > for
>> >> >> > > >> the rush...
>> >> >> > > >>
>> >> >> > > >> The contract in current KIP is:
>> >> >> > > >>   1. close() - wait until all requests either are sent or
>> >>reach
>> >> >> > request
>> >> >> > > >> timeout.
>> >> >> > > >>   2. close(-1, TimeUnit.MILLISECONDS) - close immediately
>> >> >> > > >>   3. close(0, TimeUnit.MILLISECONDS) - equivalent to
>>close(),
>> >> >>i.e.
>> >> >> > Wait
>> >> >> > > >> until all requests are sent or reach request timeout
>> >> >> > > >>   4. close(5, TimeUnit.MILLISECONDS) - try the best to
>>finish
>> >> >> sending
>> >> >> > > >>in 5
>> >> >> > > >> milliseconds, if something went wrong, just shutdown the
>> >>producer
>> >> >> > > >>anyway,
>> >> >> > > >> my callback will handle the failures.
>> >> >> > > >>
>> >> >> > > >> About how we define what timeout value stands for, I
>>actually
>> >> >> > struggled
>> >> >> > > >>a
>> >> >> > > >> little bit when wrote the patch. Intuitively, close(0)
>>should
>> >> >>mean
>> >> >> > > >> immediately, however it seems that all the existing java
>>class
>> >> >>have
>> >> >> > this
>> >> >> > > >> convention of timeout=0 means no timeout or never timeout
>> >> >> > > >>(Thread.join(0),
>> >> >> > > >> Object.wait(0), etc.) So here the dilemma is either we
>>follow
>> >>the
>> >> >> > > >> intuition or we follow the convention. What I chose is to
>> >>follow
>> >> >>the
>> >> >> > > >> convention but document the interface to let user be aware
>>of
>> >>the
>> >> >> > usage.
>> >> >> > > >> The reason is that I think producer.close() is a public
>> >> >>interface so
>> >> >> > it
>> >> >> > > >> might be better to follow java convention. Whereas
>>selector is
>> >> >>not a
>> >> >> > > >> public interface that used by user, so as long as it makes
>> >>sense
>> >> >>to
>> >> >> > us,
>> >> >> > > >>it
>> >> >> > > >> is less a problem to be different from java convention.
>>That
>> >>said
>> >> >> > since
>> >> >> > > >> consumer.poll(timeout) is also a public interface, I think
>>it
>> >> >>also
>> >> >> > makes
>> >> >> > > >> sense to make producer.close() to have the same definition
>>of
>> >> >> > > >> consumer.poll(timeout).
>> >> >> > > >>
>> >> >> > > >> The main argument for keeping a timeout in close would be
>> >> >>separating
>> >> >> > the
>> >> >> > > >> close timeout from request timeout, which probably makes
>> >>sense. I
>> >> >> > would
>> >> >> > > >> guess typically the request timeout would be long (e.g. 60
>> >> >>seconds)
>> >> >> > > >> because we might want to consider retries with back off
>>time.
>> >>If
>> >> >>we
>> >> >> > have
>> >> >> > > >> multiple batches in accumulator, in worst case that could
>>take
>> >> >>up to
>> >> >> > > >> several minutes to complete all the requests. But when we
>> >>close a
>> >> >> > > >> producer, we might not want to wait for that long as it
>>might
>> >> >>cause
>> >> >> > some
>> >> >> > > >> other problem like deployment tool timeout.
>> >> >> > > >>
>> >> >> > > >> There is also a subtle difference between close(timeout)
>>and
>> >> >> > > >> flush(timeout). The only purpose for flush() is to write
>>data
>> >>to
>> >> >>the
>> >> >> > > >> broker, so it makes perfect sense to wait until request
>> >>timeout.
>> >> >>I
>> >> >> > think
>> >> >> > > >> that is why flush(timeout) looks strange. On the other
>>hand,
>> >>the
>> >> >>top
>> >> >> > > >> priority for close() is to close the producer rather than
>> >>flush()
>> >> >> > data,
>> >> >> > > >>so
>> >> >> > > >> close(timeout) gives guarantee on bounded waiting for its
>>main
>> >> >>job.
>> >> >> > > >>
>> >> >> > > >> Sorry for the confusion about forceClose flag. It is not a
>> >>public
>> >> >> > > >> interface. I mentioned it in Proposed Changes section
>>which I
>> >> >> thought
>> >> >> > > >>was
>> >> >> > > >> supposed to provide implementation details.
>> >> >> > > >>
>> >> >> > > >> Thanks again for all the comments and suggestions!
>> >> >> > > >>
>> >> >> > > >> Jiangjie (Becket) Qin
>> >> >> > > >>
>> >> >> > > >> On 3/10/15, 8:57 PM, "Jiangjie Qin" <j...@linkedin.com>
>> wrote:
>> >> >> > > >>
>> >> >> > > >> >The KIP page has been updated per Jay¹s comments.
>> >> >> > > >> >I¹d like to initiate the voting process if no further
>> >>comments
>> >> >>are
>> >> >> > > >> >received by tomorrow.
>> >> >> > > >> >
>> >> >> > > >> >Jiangjie (Becket) Qin
>> >> >> > > >> >
>> >> >> > > >> >On 3/8/15, 9:45 AM, "Jay Kreps" <jay.kr...@gmail.com>
>>wrote:
>> >> >> > > >> >
>> >> >> > > >> >>Hey Jiangjie,
>> >> >> > > >> >>
>> >> >> > > >> >>Can you capture the full motivation and use cases for the
>> >> >>feature?
>> >> >> > > >>This
>> >> >> > > >> >>mentions your interest in having a way of aborting from
>> >>inside
>> >> >>the
>> >> >> > > >> >>Callback. But it doesn't really explain that usage or why
>> >>other
>> >> >> > people
>> >> >> > > >> >>would want to do that. It also doesn't list the primary
>>use
>> >> >>case
>> >> >> for
>> >> >> > > >> >>having
>> >> >> > > >> >>close with a bounded timeout which was to avoid blocking
>>too
>> >> >>long
>> >> >> on
>> >> >> > > >> >>shutdown.
>> >> >> > > >> >>
>> >> >> > > >> >>-Jay
>> >> >> > > >> >>
>> >> >> > > >> >>
>> >> >> > > >> >>
>> >> >> > > >> >>On Sat, Mar 7, 2015 at 12:25 PM, Jiangjie Qin
>> >> >> > > >><j...@linkedin.com.invalid
>> >> >> > > >> >
>> >> >> > > >> >>wrote:
>> >> >> > > >> >>
>> >> >> > > >> >>> Hi,
>> >> >> > > >> >>>
>> >> >> > > >> >>> I just created a KIP for adding a close(timeout) to new
>> >> >> producer.
>> >> >> > > >>Most
>> >> >> > > >> >>>of
>> >> >> > > >> >>> the previous discussions are in KAFKA-1660 where Parth
>> >> >> Brahmbhatt
>> >> >> > > >>has
>> >> >> > > >> >>> already done a lot of work.
>> >> >> > > >> >>> Since this is an interface change so we are going
>>through
>> >>the
>> >> >> KIP
>> >> >> > > >> >>>process.
>> >> >> > > >> >>> Here is the KIP link:
>> >> >> > > >> >>>
>> >> >> > > >> >>>
>> >> >> > > >>
>> >> >> > >
>> >> >>
>> >>
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=5373978
>> >> >> > > >> >>>2
>> >> >> > > >> >>>
>> >> >> > > >> >>> Thanks.
>> >> >> > > >> >>>
>> >> >> > > >> >>> Jiangjie (Becket) Qin
>> >> >> > > >> >>>
>> >> >> > > >> >
>> >> >> > > >>
>> >> >> > > >>
>> >> >> > >
>> >> >> > >
>> >> >> >
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Thanks,
>> >> >> Neha
>> >> >>
>> >> >
>> >> >
>> >> >
>> >> >--
>> >> >-- Guozhang
>> >>
>> >>
>> >
>> >
>> >--
>> >-- Guozhang
>>
>>
>
>
>-- 
>-- Guozhang

Re: [DISCUSSION] KIP-15 close(timeout) for producer

Reply via email to