Re: [DISCUSSION] KIP-15 close(timeout) for producer

Jun Rao Mon, 16 Mar 2015 14:54:02 -0700

How does close(0) work if it's called from the sender thread? If close(0)
needs to wait for the sender thread to join, wouldn't this cause a deadlock?


Thanks,

Jun

On Mon, Mar 16, 2015 at 2:26 PM, Jiangjie Qin <[email protected]>
wrote:

> Thanks Guozhang. It wouldn’t be as thoroughly considered without
> discussing with you :)
>
> Jiangjie (Becket) Qin
>
> On 3/16/15, 1:07 PM, "Guozhang Wang" <[email protected]> wrote:
>
> >Thanks Jiangjie,
> >
> >After talking to you offline on this, I have been convinced and changed my
> >preference to blocking. The immediate shutdown approach does have some
> >unsafeness in some cases.
> >
> >Guozhang
> >
> >On Mon, Mar 16, 2015 at 11:50 AM, Jiangjie Qin <[email protected]
> >
> >wrote:
> >
> >> It looks that the problem we want to solve and the purpose we want to
> >> achieve is:
> >> If user uses close() in callback, we want to let user be aware that they
> >> should use close(0) instead of close() in the callback.
> >>
> >> We have agreed that we will have an error log to inform user about this
> >> mis-usage. The options differ in the way how we can force user to take a
> >> look at that error log.
> >> There are two scenarios:
> >> 1. User does not expect the program to exit.
> >> 2. User expect the program to exit.
> >>
> >> For scenario 1), blocking will probably delay the discovery of the
> >> problem. Calling close(0) exposes the problem quicker. In this scenario
> >> producer just encounter a send failure when running normally.
> >> For scenario 2), blocking will expose the problem quick. Calling
> >>close(-1)
> >> might hide the problem. This scenario might include: a) Unit test for a
> >> send failure. b) Message sending during a close() call from a user
> >>thread.
> >>
> >> So as a summary table:
> >>
> >>                   Scenario 1)                         Scenario 2)
> >>
> >> Blocking      Delay problem discovery         Guaranteed problem
> >>discovery
> >>
> >> Close(-1)     Immediate problem discovery     Problem might be hidden
> >>
> >>
> >> Personally I prefer blocking because it seems providing more guarantees
> >> and safer.
> >>
> >> Thanks.
> >>
> >> Jiangjie (Becket) Qin
> >>
> >>
> >> On 3/16/15, 10:11 AM, "Guozhang Wang" <[email protected]> wrote:
> >>
> >> >HI Jiangjie,
> >> >
> >> >As far as I understand calling close() in the ioThread is not common,
> >>as
> >> >it
> >> >may only trigger when we saw some non-retriable error. Hence when user
> >>run
> >> >their program it is unlikely that close() will be triggered and problem
> >> >will be detected. So it seems to me that from the error detection
> >>aspect
> >> >these two options seems to be the same as people will usually detect it
> >> >from the producer metrics all dropping to 0.
> >> >
> >> >Guozhang
> >> >
> >> >On Mon, Mar 16, 2015 at 9:52 AM, Jiangjie Qin
> >><[email protected]>
> >> >wrote:
> >> >
> >> >> It seems there are two options we can choose from when close() is
> >>called
> >> >> from sender thread (callback):
> >> >> 1. Log an error and close the producer using close(-1)
> >> >> 2. Log an error and block.
> >> >> (Throwing an exception will not work because we catch all the
> >>exception
> >> >> thrown from user callback. It will just lead to an error log.)
> >> >>
> >> >> My concern for the first option is that the producer will be closed
> >>even
> >> >> if we logged and error. I am wondering if some user would not even
> >>take
> >> >>a
> >> >> look at the log if producer is closed normally. Because from the
> >> >>programs
> >> >> behavior, everything looks good. If that is the case, the error
> >>message
> >> >>we
> >> >> logged probably will just be ignored until some day when people check
> >> >>the
> >> >> log and see it.
> >> >>
> >> >> As for the second option, because producer does not close but blocks.
> >> >>User
> >> >> will notice this the first time they run the program. They probably
> >>will
> >> >> look at the log to see why producer could not be closed and they will
> >> >>see
> >> >> the error log we put there. So they will get informed about this
> >> >>mis-usage
> >> >> of close() in sender thread the first time they run the code instead
> >>of
> >> >> some time later.
> >> >>
> >> >> Personally I prefer the second one because it is more obvious that
> >> >> something was wrong.
> >> >>
> >> >> Jiangjie (Becket) Qin
> >> >>
> >> >> On 3/15/15, 4:27 PM, "Guozhang Wang" <[email protected]> wrote:
> >> >>
> >> >> >Yeah I agree we should not silently change the behavior of the
> >>function
> >> >> >with the given parameters; and I would prefer
> >> >>error-logging-and-shutdown
> >> >> >over blocking when close(>0) is used, since as Neha suggested
> >>blocking
> >> >> >would also not proceed with sending any data, bu will just let
> >>users to
> >> >> >realize the issue later than sooner.
> >> >> >
> >> >> >On Sun, Mar 15, 2015 at 3:25 PM, Neha Narkhede <[email protected]>
> >> >>wrote:
> >> >> >
> >> >> >> >
> >> >> >> > And I also agree it is better if we can make producer block when
> >> >> >> > close() is called from sender thread so user will notice
> >>something
> >> >> >>went
> >> >> >> > wrong.
> >> >> >>
> >> >> >>
> >> >> >> This isn't a great experience either. Why can't we just throw an
> >> >> >>exception
> >> >> >> for a behavior we know is incorrect and we'd like the user to
> >>know.
> >> >> >> Blocking as a means of doing that seems wrong and annoying.
> >> >> >>
> >> >> >> On Sun, Mar 15, 2015 at 11:56 AM, Jay Kreps <[email protected]>
> >> >> wrote:
> >> >> >>
> >> >> >> > Cool.
> >> >> >> >
> >> >> >> > I think blocking is good or alternately throwing an exception
> >> >>directly
> >> >> >> from
> >> >> >> > close(). Basically I would just worry about subtly doing
> >>something
> >> >> >> slightly
> >> >> >> > different from what the user asked for as it will be hard to
> >>notice
> >> >> >>that
> >> >> >> > behavior difference.
> >> >> >> >
> >> >> >> > -Jay
> >> >> >> >
> >> >> >> > On Sat, Mar 14, 2015 at 5:48 PM, Jiangjie Qin
> >> >> >><[email protected]
> >> >> >> >
> >> >> >> > wrote:
> >> >> >> >
> >> >> >> > > Hi Jay,
> >> >> >> > >
> >> >> >> > > I have modified the KIP as you suggested. I thinks as long as
> >>we
> >> >> >>have
> >> >> >> > > consistent define for timeout across Kafka interface, there
> >>would
> >> >> >>be no
> >> >> >> > > problem. And I also agree it is better if we can make producer
> >> >>block
> >> >> >> when
> >> >> >> > > close() is called from sender thread so user will notice
> >> >>something
> >> >> >>went
> >> >> >> > > wrong.
> >> >> >> > >
> >> >> >> > > Thanks.
> >> >> >> > >
> >> >> >> > > Jiangjie (Becket) Qin
> >> >> >> > >
> >> >> >> > > On 3/14/15, 11:37 AM, "Jay Kreps" <[email protected]>
> wrote:
> >> >> >> > >
> >> >> >> > > >Hey Jiangjie,
> >> >> >> > > >
> >> >> >> > > >I think this is going to be very confusing that
> >> >> >> > > >  close(0) waits indefinitely and
> >> >> >> > > >  close(-1) waits for 0.
> >> >> >> > > >I understand this appears in other apis, but it is a constant
> >> >> >>cause of
> >> >> >> > > >bugs. Let's not repeat that mistake.
> >> >> >> > > >
> >> >> >> > > >Let's make close(0) wait for 0. We don't need a way to wait
> >> >> >> indefinitely
> >> >> >> > > >as
> >> >> >> > > >we already have close() so having a magical constant for
> >>that is
> >> >> >> > > >redundant.
> >> >> >> > > >
> >> >> >> > > >Calling close() from the I/O thread was already possible and
> >> >>would
> >> >> >> block
> >> >> >> > > >indefinitely. I think trying to silently change the behavior
> >>is
> >> >> >> probably
> >> >> >> > > >not right. I.e. if the user calls close() in the callback
> >>there
> >> >>is
> >> >> >> > > >actually
> >> >> >> > > >some misunderstanding and they need to think more, silently
> >> >>making
> >> >> >> this
> >> >> >> > > >not
> >> >> >> > > >block will hide the problem from them which is the opposite
> >>of
> >> >> >>what we
> >> >> >> > > >want.
> >> >> >> > > >
> >> >> >> > > >-Jay
> >> >> >> > > >
> >> >> >> > > >On Thu, Mar 12, 2015 at 1:49 AM, Jiangjie Qin
> >> >> >> <[email protected]
> >> >> >> > >
> >> >> >> > > >wrote:
> >> >> >> > > >
> >> >> >> > > >> Hey Joe & Jay,
> >> >> >> > > >>
> >> >> >> > > >> Thanks for the comments on the voting thread. Since it
> >>seems
> >> >>we
> >> >> >> > probably
> >> >> >> > > >> will have more discussion on this, I am just replying from
> >>the
> >> >> >> > > >>discussion
> >> >> >> > > >> thread here.
> >> >> >> > > >> I’ve updated the KIP page to make it less like half-baked,
> >> >> >>apologize
> >> >> >> > for
> >> >> >> > > >> the rush...
> >> >> >> > > >>
> >> >> >> > > >> The contract in current KIP is:
> >> >> >> > > >>   1. close() - wait until all requests either are sent or
> >> >>reach
> >> >> >> > request
> >> >> >> > > >> timeout.
> >> >> >> > > >>   2. close(-1, TimeUnit.MILLISECONDS) - close immediately
> >> >> >> > > >>   3. close(0, TimeUnit.MILLISECONDS) - equivalent to
> >>close(),
> >> >> >>i.e.
> >> >> >> > Wait
> >> >> >> > > >> until all requests are sent or reach request timeout
> >> >> >> > > >>   4. close(5, TimeUnit.MILLISECONDS) - try the best to
> >>finish
> >> >> >> sending
> >> >> >> > > >>in 5
> >> >> >> > > >> milliseconds, if something went wrong, just shutdown the
> >> >>producer
> >> >> >> > > >>anyway,
> >> >> >> > > >> my callback will handle the failures.
> >> >> >> > > >>
> >> >> >> > > >> About how we define what timeout value stands for, I
> >>actually
> >> >> >> > struggled
> >> >> >> > > >>a
> >> >> >> > > >> little bit when wrote the patch. Intuitively, close(0)
> >>should
> >> >> >>mean
> >> >> >> > > >> immediately, however it seems that all the existing java
> >>class
> >> >> >>have
> >> >> >> > this
> >> >> >> > > >> convention of timeout=0 means no timeout or never timeout
> >> >> >> > > >>(Thread.join(0),
> >> >> >> > > >> Object.wait(0), etc.) So here the dilemma is either we
> >>follow
> >> >>the
> >> >> >> > > >> intuition or we follow the convention. What I chose is to
> >> >>follow
> >> >> >>the
> >> >> >> > > >> convention but document the interface to let user be aware
> >>of
> >> >>the
> >> >> >> > usage.
> >> >> >> > > >> The reason is that I think producer.close() is a public
> >> >> >>interface so
> >> >> >> > it
> >> >> >> > > >> might be better to follow java convention. Whereas
> >>selector is
> >> >> >>not a
> >> >> >> > > >> public interface that used by user, so as long as it makes
> >> >>sense
> >> >> >>to
> >> >> >> > us,
> >> >> >> > > >>it
> >> >> >> > > >> is less a problem to be different from java convention.
> >>That
> >> >>said
> >> >> >> > since
> >> >> >> > > >> consumer.poll(timeout) is also a public interface, I think
> >>it
> >> >> >>also
> >> >> >> > makes
> >> >> >> > > >> sense to make producer.close() to have the same definition
> >>of
> >> >> >> > > >> consumer.poll(timeout).
> >> >> >> > > >>
> >> >> >> > > >> The main argument for keeping a timeout in close would be
> >> >> >>separating
> >> >> >> > the
> >> >> >> > > >> close timeout from request timeout, which probably makes
> >> >>sense. I
> >> >> >> > would
> >> >> >> > > >> guess typically the request timeout would be long (e.g. 60
> >> >> >>seconds)
> >> >> >> > > >> because we might want to consider retries with back off
> >>time.
> >> >>If
> >> >> >>we
> >> >> >> > have
> >> >> >> > > >> multiple batches in accumulator, in worst case that could
> >>take
> >> >> >>up to
> >> >> >> > > >> several minutes to complete all the requests. But when we
> >> >>close a
> >> >> >> > > >> producer, we might not want to wait for that long as it
> >>might
> >> >> >>cause
> >> >> >> > some
> >> >> >> > > >> other problem like deployment tool timeout.
> >> >> >> > > >>
> >> >> >> > > >> There is also a subtle difference between close(timeout)
> >>and
> >> >> >> > > >> flush(timeout). The only purpose for flush() is to write
> >>data
> >> >>to
> >> >> >>the
> >> >> >> > > >> broker, so it makes perfect sense to wait until request
> >> >>timeout.
> >> >> >>I
> >> >> >> > think
> >> >> >> > > >> that is why flush(timeout) looks strange. On the other
> >>hand,
> >> >>the
> >> >> >>top
> >> >> >> > > >> priority for close() is to close the producer rather than
> >> >>flush()
> >> >> >> > data,
> >> >> >> > > >>so
> >> >> >> > > >> close(timeout) gives guarantee on bounded waiting for its
> >>main
> >> >> >>job.
> >> >> >> > > >>
> >> >> >> > > >> Sorry for the confusion about forceClose flag. It is not a
> >> >>public
> >> >> >> > > >> interface. I mentioned it in Proposed Changes section
> >>which I
> >> >> >> thought
> >> >> >> > > >>was
> >> >> >> > > >> supposed to provide implementation details.
> >> >> >> > > >>
> >> >> >> > > >> Thanks again for all the comments and suggestions!
> >> >> >> > > >>
> >> >> >> > > >> Jiangjie (Becket) Qin
> >> >> >> > > >>
> >> >> >> > > >> On 3/10/15, 8:57 PM, "Jiangjie Qin" <[email protected]>
> >> wrote:
> >> >> >> > > >>
> >> >> >> > > >> >The KIP page has been updated per Jay¹s comments.
> >> >> >> > > >> >I¹d like to initiate the voting process if no further
> >> >>comments
> >> >> >>are
> >> >> >> > > >> >received by tomorrow.
> >> >> >> > > >> >
> >> >> >> > > >> >Jiangjie (Becket) Qin
> >> >> >> > > >> >
> >> >> >> > > >> >On 3/8/15, 9:45 AM, "Jay Kreps" <[email protected]>
> >>wrote:
> >> >> >> > > >> >
> >> >> >> > > >> >>Hey Jiangjie,
> >> >> >> > > >> >>
> >> >> >> > > >> >>Can you capture the full motivation and use cases for the
> >> >> >>feature?
> >> >> >> > > >>This
> >> >> >> > > >> >>mentions your interest in having a way of aborting from
> >> >>inside
> >> >> >>the
> >> >> >> > > >> >>Callback. But it doesn't really explain that usage or why
> >> >>other
> >> >> >> > people
> >> >> >> > > >> >>would want to do that. It also doesn't list the primary
> >>use
> >> >> >>case
> >> >> >> for
> >> >> >> > > >> >>having
> >> >> >> > > >> >>close with a bounded timeout which was to avoid blocking
> >>too
> >> >> >>long
> >> >> >> on
> >> >> >> > > >> >>shutdown.
> >> >> >> > > >> >>
> >> >> >> > > >> >>-Jay
> >> >> >> > > >> >>
> >> >> >> > > >> >>
> >> >> >> > > >> >>
> >> >> >> > > >> >>On Sat, Mar 7, 2015 at 12:25 PM, Jiangjie Qin
> >> >> >> > > >><[email protected]
> >> >> >> > > >> >
> >> >> >> > > >> >>wrote:
> >> >> >> > > >> >>
> >> >> >> > > >> >>> Hi,
> >> >> >> > > >> >>>
> >> >> >> > > >> >>> I just created a KIP for adding a close(timeout) to new
> >> >> >> producer.
> >> >> >> > > >>Most
> >> >> >> > > >> >>>of
> >> >> >> > > >> >>> the previous discussions are in KAFKA-1660 where Parth
> >> >> >> Brahmbhatt
> >> >> >> > > >>has
> >> >> >> > > >> >>> already done a lot of work.
> >> >> >> > > >> >>> Since this is an interface change so we are going
> >>through
> >> >>the
> >> >> >> KIP
> >> >> >> > > >> >>>process.
> >> >> >> > > >> >>> Here is the KIP link:
> >> >> >> > > >> >>>
> >> >> >> > > >> >>>
> >> >> >> > > >>
> >> >> >> > >
> >> >> >>
> >> >>
> >>
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=5373978
> >> >> >> > > >> >>>2
> >> >> >> > > >> >>>
> >> >> >> > > >> >>> Thanks.
> >> >> >> > > >> >>>
> >> >> >> > > >> >>> Jiangjie (Becket) Qin
> >> >> >> > > >> >>>
> >> >> >> > > >> >
> >> >> >> > > >>
> >> >> >> > > >>
> >> >> >> > >
> >> >> >> > >
> >> >> >> >
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> --
> >> >> >> Thanks,
> >> >> >> Neha
> >> >> >>
> >> >> >
> >> >> >
> >> >> >
> >> >> >--
> >> >> >-- Guozhang
> >> >>
> >> >>
> >> >
> >> >
> >> >--
> >> >-- Guozhang
> >>
> >>
> >
> >
> >--
> >-- Guozhang
>
>

Re: [DISCUSSION] KIP-15 close(timeout) for producer

Reply via email to