Re: [DISCUSSION] KIP-15 close(timeout) for producer

Guozhang Wang Mon, 16 Mar 2015 13:09:11 -0700

Thanks Jiangjie,

After talking to you offline on this, I have been convinced and changed my
preference to blocking. The immediate shutdown approach does have some
unsafeness in some cases.


Guozhang

On Mon, Mar 16, 2015 at 11:50 AM, Jiangjie Qin <[email protected]>
wrote:

> It looks that the problem we want to solve and the purpose we want to
> achieve is:
> If user uses close() in callback, we want to let user be aware that they
> should use close(0) instead of close() in the callback.
>
> We have agreed that we will have an error log to inform user about this
> mis-usage. The options differ in the way how we can force user to take a
> look at that error log.
> There are two scenarios:
> 1. User does not expect the program to exit.
> 2. User expect the program to exit.
>
> For scenario 1), blocking will probably delay the discovery of the
> problem. Calling close(0) exposes the problem quicker. In this scenario
> producer just encounter a send failure when running normally.
> For scenario 2), blocking will expose the problem quick. Calling close(-1)
> might hide the problem. This scenario might include: a) Unit test for a
> send failure. b) Message sending during a close() call from a user thread.
>
> So as a summary table:
>
>                   Scenario 1)                         Scenario 2)
>
> Blocking      Delay problem discovery         Guaranteed problem discovery
>
> Close(-1)     Immediate problem discovery     Problem might be hidden
>
>
> Personally I prefer blocking because it seems providing more guarantees
> and safer.
>
> Thanks.
>
> Jiangjie (Becket) Qin
>
>
> On 3/16/15, 10:11 AM, "Guozhang Wang" <[email protected]> wrote:
>
> >HI Jiangjie,
> >
> >As far as I understand calling close() in the ioThread is not common, as
> >it
> >may only trigger when we saw some non-retriable error. Hence when user run
> >their program it is unlikely that close() will be triggered and problem
> >will be detected. So it seems to me that from the error detection aspect
> >these two options seems to be the same as people will usually detect it
> >from the producer metrics all dropping to 0.
> >
> >Guozhang
> >
> >On Mon, Mar 16, 2015 at 9:52 AM, Jiangjie Qin <[email protected]>
> >wrote:
> >
> >> It seems there are two options we can choose from when close() is called
> >> from sender thread (callback):
> >> 1. Log an error and close the producer using close(-1)
> >> 2. Log an error and block.
> >> (Throwing an exception will not work because we catch all the exception
> >> thrown from user callback. It will just lead to an error log.)
> >>
> >> My concern for the first option is that the producer will be closed even
> >> if we logged and error. I am wondering if some user would not even take
> >>a
> >> look at the log if producer is closed normally. Because from the
> >>programs
> >> behavior, everything looks good. If that is the case, the error message
> >>we
> >> logged probably will just be ignored until some day when people check
> >>the
> >> log and see it.
> >>
> >> As for the second option, because producer does not close but blocks.
> >>User
> >> will notice this the first time they run the program. They probably will
> >> look at the log to see why producer could not be closed and they will
> >>see
> >> the error log we put there. So they will get informed about this
> >>mis-usage
> >> of close() in sender thread the first time they run the code instead of
> >> some time later.
> >>
> >> Personally I prefer the second one because it is more obvious that
> >> something was wrong.
> >>
> >> Jiangjie (Becket) Qin
> >>
> >> On 3/15/15, 4:27 PM, "Guozhang Wang" <[email protected]> wrote:
> >>
> >> >Yeah I agree we should not silently change the behavior of the function
> >> >with the given parameters; and I would prefer
> >>error-logging-and-shutdown
> >> >over blocking when close(>0) is used, since as Neha suggested blocking
> >> >would also not proceed with sending any data, bu will just let users to
> >> >realize the issue later than sooner.
> >> >
> >> >On Sun, Mar 15, 2015 at 3:25 PM, Neha Narkhede <[email protected]>
> >>wrote:
> >> >
> >> >> >
> >> >> > And I also agree it is better if we can make producer block when
> >> >> > close() is called from sender thread so user will notice something
> >> >>went
> >> >> > wrong.
> >> >>
> >> >>
> >> >> This isn't a great experience either. Why can't we just throw an
> >> >>exception
> >> >> for a behavior we know is incorrect and we'd like the user to know.
> >> >> Blocking as a means of doing that seems wrong and annoying.
> >> >>
> >> >> On Sun, Mar 15, 2015 at 11:56 AM, Jay Kreps <[email protected]>
> >> wrote:
> >> >>
> >> >> > Cool.
> >> >> >
> >> >> > I think blocking is good or alternately throwing an exception
> >>directly
> >> >> from
> >> >> > close(). Basically I would just worry about subtly doing something
> >> >> slightly
> >> >> > different from what the user asked for as it will be hard to notice
> >> >>that
> >> >> > behavior difference.
> >> >> >
> >> >> > -Jay
> >> >> >
> >> >> > On Sat, Mar 14, 2015 at 5:48 PM, Jiangjie Qin
> >> >><[email protected]
> >> >> >
> >> >> > wrote:
> >> >> >
> >> >> > > Hi Jay,
> >> >> > >
> >> >> > > I have modified the KIP as you suggested. I thinks as long as we
> >> >>have
> >> >> > > consistent define for timeout across Kafka interface, there would
> >> >>be no
> >> >> > > problem. And I also agree it is better if we can make producer
> >>block
> >> >> when
> >> >> > > close() is called from sender thread so user will notice
> >>something
> >> >>went
> >> >> > > wrong.
> >> >> > >
> >> >> > > Thanks.
> >> >> > >
> >> >> > > Jiangjie (Becket) Qin
> >> >> > >
> >> >> > > On 3/14/15, 11:37 AM, "Jay Kreps" <[email protected]> wrote:
> >> >> > >
> >> >> > > >Hey Jiangjie,
> >> >> > > >
> >> >> > > >I think this is going to be very confusing that
> >> >> > > >  close(0) waits indefinitely and
> >> >> > > >  close(-1) waits for 0.
> >> >> > > >I understand this appears in other apis, but it is a constant
> >> >>cause of
> >> >> > > >bugs. Let's not repeat that mistake.
> >> >> > > >
> >> >> > > >Let's make close(0) wait for 0. We don't need a way to wait
> >> >> indefinitely
> >> >> > > >as
> >> >> > > >we already have close() so having a magical constant for that is
> >> >> > > >redundant.
> >> >> > > >
> >> >> > > >Calling close() from the I/O thread was already possible and
> >>would
> >> >> block
> >> >> > > >indefinitely. I think trying to silently change the behavior is
> >> >> probably
> >> >> > > >not right. I.e. if the user calls close() in the callback there
> >>is
> >> >> > > >actually
> >> >> > > >some misunderstanding and they need to think more, silently
> >>making
> >> >> this
> >> >> > > >not
> >> >> > > >block will hide the problem from them which is the opposite of
> >> >>what we
> >> >> > > >want.
> >> >> > > >
> >> >> > > >-Jay
> >> >> > > >
> >> >> > > >On Thu, Mar 12, 2015 at 1:49 AM, Jiangjie Qin
> >> >> <[email protected]
> >> >> > >
> >> >> > > >wrote:
> >> >> > > >
> >> >> > > >> Hey Joe & Jay,
> >> >> > > >>
> >> >> > > >> Thanks for the comments on the voting thread. Since it seems
> >>we
> >> >> > probably
> >> >> > > >> will have more discussion on this, I am just replying from the
> >> >> > > >>discussion
> >> >> > > >> thread here.
> >> >> > > >> I’ve updated the KIP page to make it less like half-baked,
> >> >>apologize
> >> >> > for
> >> >> > > >> the rush...
> >> >> > > >>
> >> >> > > >> The contract in current KIP is:
> >> >> > > >>   1. close() - wait until all requests either are sent or
> >>reach
> >> >> > request
> >> >> > > >> timeout.
> >> >> > > >>   2. close(-1, TimeUnit.MILLISECONDS) - close immediately
> >> >> > > >>   3. close(0, TimeUnit.MILLISECONDS) - equivalent to close(),
> >> >>i.e.
> >> >> > Wait
> >> >> > > >> until all requests are sent or reach request timeout
> >> >> > > >>   4. close(5, TimeUnit.MILLISECONDS) - try the best to finish
> >> >> sending
> >> >> > > >>in 5
> >> >> > > >> milliseconds, if something went wrong, just shutdown the
> >>producer
> >> >> > > >>anyway,
> >> >> > > >> my callback will handle the failures.
> >> >> > > >>
> >> >> > > >> About how we define what timeout value stands for, I actually
> >> >> > struggled
> >> >> > > >>a
> >> >> > > >> little bit when wrote the patch. Intuitively, close(0) should
> >> >>mean
> >> >> > > >> immediately, however it seems that all the existing java class
> >> >>have
> >> >> > this
> >> >> > > >> convention of timeout=0 means no timeout or never timeout
> >> >> > > >>(Thread.join(0),
> >> >> > > >> Object.wait(0), etc.) So here the dilemma is either we follow
> >>the
> >> >> > > >> intuition or we follow the convention. What I chose is to
> >>follow
> >> >>the
> >> >> > > >> convention but document the interface to let user be aware of
> >>the
> >> >> > usage.
> >> >> > > >> The reason is that I think producer.close() is a public
> >> >>interface so
> >> >> > it
> >> >> > > >> might be better to follow java convention. Whereas selector is
> >> >>not a
> >> >> > > >> public interface that used by user, so as long as it makes
> >>sense
> >> >>to
> >> >> > us,
> >> >> > > >>it
> >> >> > > >> is less a problem to be different from java convention. That
> >>said
> >> >> > since
> >> >> > > >> consumer.poll(timeout) is also a public interface, I think it
> >> >>also
> >> >> > makes
> >> >> > > >> sense to make producer.close() to have the same definition of
> >> >> > > >> consumer.poll(timeout).
> >> >> > > >>
> >> >> > > >> The main argument for keeping a timeout in close would be
> >> >>separating
> >> >> > the
> >> >> > > >> close timeout from request timeout, which probably makes
> >>sense. I
> >> >> > would
> >> >> > > >> guess typically the request timeout would be long (e.g. 60
> >> >>seconds)
> >> >> > > >> because we might want to consider retries with back off time.
> >>If
> >> >>we
> >> >> > have
> >> >> > > >> multiple batches in accumulator, in worst case that could take
> >> >>up to
> >> >> > > >> several minutes to complete all the requests. But when we
> >>close a
> >> >> > > >> producer, we might not want to wait for that long as it might
> >> >>cause
> >> >> > some
> >> >> > > >> other problem like deployment tool timeout.
> >> >> > > >>
> >> >> > > >> There is also a subtle difference between close(timeout) and
> >> >> > > >> flush(timeout). The only purpose for flush() is to write data
> >>to
> >> >>the
> >> >> > > >> broker, so it makes perfect sense to wait until request
> >>timeout.
> >> >>I
> >> >> > think
> >> >> > > >> that is why flush(timeout) looks strange. On the other hand,
> >>the
> >> >>top
> >> >> > > >> priority for close() is to close the producer rather than
> >>flush()
> >> >> > data,
> >> >> > > >>so
> >> >> > > >> close(timeout) gives guarantee on bounded waiting for its main
> >> >>job.
> >> >> > > >>
> >> >> > > >> Sorry for the confusion about forceClose flag. It is not a
> >>public
> >> >> > > >> interface. I mentioned it in Proposed Changes section which I
> >> >> thought
> >> >> > > >>was
> >> >> > > >> supposed to provide implementation details.
> >> >> > > >>
> >> >> > > >> Thanks again for all the comments and suggestions!
> >> >> > > >>
> >> >> > > >> Jiangjie (Becket) Qin
> >> >> > > >>
> >> >> > > >> On 3/10/15, 8:57 PM, "Jiangjie Qin" <[email protected]>
> wrote:
> >> >> > > >>
> >> >> > > >> >The KIP page has been updated per Jay¹s comments.
> >> >> > > >> >I¹d like to initiate the voting process if no further
> >>comments
> >> >>are
> >> >> > > >> >received by tomorrow.
> >> >> > > >> >
> >> >> > > >> >Jiangjie (Becket) Qin
> >> >> > > >> >
> >> >> > > >> >On 3/8/15, 9:45 AM, "Jay Kreps" <[email protected]> wrote:
> >> >> > > >> >
> >> >> > > >> >>Hey Jiangjie,
> >> >> > > >> >>
> >> >> > > >> >>Can you capture the full motivation and use cases for the
> >> >>feature?
> >> >> > > >>This
> >> >> > > >> >>mentions your interest in having a way of aborting from
> >>inside
> >> >>the
> >> >> > > >> >>Callback. But it doesn't really explain that usage or why
> >>other
> >> >> > people
> >> >> > > >> >>would want to do that. It also doesn't list the primary use
> >> >>case
> >> >> for
> >> >> > > >> >>having
> >> >> > > >> >>close with a bounded timeout which was to avoid blocking too
> >> >>long
> >> >> on
> >> >> > > >> >>shutdown.
> >> >> > > >> >>
> >> >> > > >> >>-Jay
> >> >> > > >> >>
> >> >> > > >> >>
> >> >> > > >> >>
> >> >> > > >> >>On Sat, Mar 7, 2015 at 12:25 PM, Jiangjie Qin
> >> >> > > >><[email protected]
> >> >> > > >> >
> >> >> > > >> >>wrote:
> >> >> > > >> >>
> >> >> > > >> >>> Hi,
> >> >> > > >> >>>
> >> >> > > >> >>> I just created a KIP for adding a close(timeout) to new
> >> >> producer.
> >> >> > > >>Most
> >> >> > > >> >>>of
> >> >> > > >> >>> the previous discussions are in KAFKA-1660 where Parth
> >> >> Brahmbhatt
> >> >> > > >>has
> >> >> > > >> >>> already done a lot of work.
> >> >> > > >> >>> Since this is an interface change so we are going through
> >>the
> >> >> KIP
> >> >> > > >> >>>process.
> >> >> > > >> >>> Here is the KIP link:
> >> >> > > >> >>>
> >> >> > > >> >>>
> >> >> > > >>
> >> >> > >
> >> >>
> >>
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=5373978
> >> >> > > >> >>>2
> >> >> > > >> >>>
> >> >> > > >> >>> Thanks.
> >> >> > > >> >>>
> >> >> > > >> >>> Jiangjie (Becket) Qin
> >> >> > > >> >>>
> >> >> > > >> >
> >> >> > > >>
> >> >> > > >>
> >> >> > >
> >> >> > >
> >> >> >
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Thanks,
> >> >> Neha
> >> >>
> >> >
> >> >
> >> >
> >> >--
> >> >-- Guozhang
> >>
> >>
> >
> >
> >--
> >-- Guozhang
>
>


-- 
-- Guozhang

Re: [DISCUSSION] KIP-15 close(timeout) for producer

Reply via email to