+1 -------- Original message --------From: Ismael Juma <ism...@juma.me.uk> Date: 6/11/18 5:43 PM (GMT-08:00) To: dev <dev@kafka.apache.org> Subject: Re: [VOTE] KIP-266: Add TimeoutException for KafkaConsumer#position Sounds good to me.
Ismael On Mon, Jun 11, 2018 at 5:38 PM Jason Gustafson <ja...@confluent.io> wrote: > Hey All, > > I wanted to get to some closure on this issue before the release. I think > the config `default.api.timeout.ms` sounds good to me. Guozhang and I had > actually discussed using this name before we saw Colin's comment. > > As for the default value, the main reason that the current ` > request.timeout.ms` is so high for the consumer is that we have to handle > the special case of the JoinGroup, which can currently sit in purgatory for > as long as `max.poll.interval.ms`. I'd like to propose making that a > special case so that the default value can be changed to something more > reasonable. In other words, the timeout for JoinGroup will be tied to ` > max.poll.interval.ms` direction and we will use `request.timeout.ms` for > everything else. For the default values, I would suggest 30s for ` > request.timeout.ms` and 60s for `default.api.timeout.ms`. > > How does that sound? > > Thanks, > Jason > > > > On Fri, Jun 8, 2018 at 10:39 AM, Colin McCabe <cmcc...@apache.org> wrote: > > > On Wed, Jun 6, 2018, at 13:10, Guozhang Wang wrote: > > > The reason that I'm hesitant to use the term "timeout" is that it's > being > > > over-used for multiple semantics: request RPC timeout, consumer session > > > heartbeat liveness "timeout", and API blocking timeout. We can argue > that > > > in English both of them are indeed called a "timeout" value, but > > personally > > > I'm afraid for normal users having the same word `timeout` would be > > > confusing, and hence I'm proposing for using a slight different term. > > > > Hmm. I can see why you want to have a different-sounding name from the > > existing timeouts. However, I think it could be less clear to omit the > > word timeout. If your operation times out, and you get a > TimeoutException, > > what configuration do you raise? The timeout. If the configuration name > > doesn't tell you that it's a timeout, it's harder to understand what > needs > > to be changed. > > > > For example, if "group.min.session.timeout.ms" was called something > like " > > group.min.session.block.ms" or "group.min.session.heartbeat.ms", it > would > > not be as clear. > > > > > Comparing with adding a new config, I'm actually more concerned about > > > leveraging the request.timeout value for a default blocking timeout, > > since > > > the default value is hard to decide, since for different blocking > calls, > > it > > > may have different rpc round trips behind the scene, so simply setting > it > > > as request.timeout + a delta may not be always good enough. > > > > Yes, I agree that we need a new configuration key. I don't think we > > should try to hard-code this. > > > > I think we should just bite the bullet and create a new configuration key > > like "default.api.timeout.ms" that sets the default timeout for API > > calls. The hard part is adding the new configuration in a way that > doesn't > > disrupt existing configurations. > > > > There are at least a few cases to worry about: > > > > 1. Someone uses the default (pretty long) timeouts for everything. > > 2. Someone has configured a short request.timeout.ms, in an effort to > see > > failures more quickly > > 3. Someone has configured a very long (or maybe infinite) > > request.timeout.ms > > > > Case #2 is probably the one which is hardest to support well. We could > > probably do it with logic like this: > > > > A. If default.api.timeout.ms is explicitly set, we use that value. > > otherwise... > > B. If request.timeout.ms is longer than 2 minutes, we set > > default.api.timeout.ms to request.timeout.ms + 1500. otherwise... > > we set default.api.timeout.ms to request.timeout.ms > > > > best, > > Colin > > > > > > > > > > > Guozhang > > > > > > > > > On Tue, Jun 5, 2018 at 5:18 PM, Ted Yu <yuzhih...@gmail.com> wrote: > > > > > > > I see where the 0.5 in your previous response came about. > > > > > > > > The reason I wrote 'request.timeout.ms + 15000' was that I treat > this > > > > value > > > > in place of the default.block.ms > > > > According to your earlier description: > > > > > > > > bq. request.timeout.ms controls something different: the amount of > > time > > > > we're willing to wait for an RPC to complete. > > > > > > > > Basically we're in agreement. It is just that figuring out good > > default is > > > > non-trivial. > > > > > > > > On Tue, Jun 5, 2018 at 4:44 PM, Colin McCabe <cmcc...@apache.org> > > wrote: > > > > > > > > > On Tue, Jun 5, 2018, at 16:35, Ted Yu wrote: > > > > > > bq. could probably come up with a good default > > > > > > > > > > > > That's the difficult part. > > > > > > > > > > > > bq. max(1000, 0.5 * request.timeout.ms) > > > > > > > > > > > > Looking at some existing samples: > > > > > > In tests/kafkatest/tests/connect/templates/connect-distributed. > > > > properties > > > > > , > > > > > > we have: > > > > > > request.timeout.ms=30000 > > > > > > > > > > > > Isn't the above formula putting an upper bound 15000 for the RPC > > > > timeout > > > > > ? > > > > > > > > > > Correct. It would put a 15 second default on the RPC timeout in > this > > > > > case. If that's too short, the user has the option to change it. > > > > > > > > > > If we feel that 15 seconds is too short, we could put a floor of 30 > > > > > seconds or whatever on the RPC timeout, instead of 1 second. > > > > > > > > > > > > > > > > > By fixed duration, I meant something like > > > > > > request.timeout.ms + 15000 > > > > > > > > > > The RPC timeout should be shorter than the request timeout, so that > > we > > > > get > > > > > multiple tries if the RPC hangs due to network issues. It should > > not be > > > > > longer. > > > > > > > > > > best, > > > > > Colin > > > > > > > > > > > > > > > > > Cheers > > > > > > > > > > > > On Tue, Jun 5, 2018 at 4:27 PM, Colin McCabe <cmcc...@apache.org > > > > > > wrote: > > > > > > > > > > > > > I don't think it can be fixed. The RPC duration is something > > that > > > > you > > > > > > > might reasonably want to tune. For example, if it's too low, > you > > > > > might not > > > > > > > be able to make progress at all on a heavily loaded server. > > > > > > > > > > > > > > We could probably come up with a good default, however. > > > > > rpc.timeout.ms > > > > > > > could be set to something like > > > > > > > max(1000, 0.5 * request.timeout.ms) > > > > > > > > > > > > > > best, > > > > > > > Colin > > > > > > > > > > > > > > > > > > > > > On Tue, Jun 5, 2018, at 16:21, Ted Yu wrote: > > > > > > > > bq. we must make the API timeout longer than the RPC timeout > > > > > > > > > > > > > > > > I agree with the above. > > > > > > > > > > > > > > > > How about adding a fixed duration on top of > request.timeout.ms > > ? > > > > > > > > > > > > > > > > Thanks > > > > > > > > > > > > > > > > On Tue, Jun 5, 2018 at 4:18 PM, Colin McCabe < > > cmcc...@apache.org> > > > > > wrote: > > > > > > > > > > > > > > > > > As Jason noted earlier, request.timeout.ms controls > > something > > > > > > > different: > > > > > > > > > the amount of time we're willing to wait for an RPC to > > complete. > > > > > > > > > > > > > > > > > > Empirically, RPCs sometimes hang for a long time. If the > API > > > > > timeout > > > > > > > is > > > > > > > > > the same as the RPC timeout, we are not robust against this > > > > failure > > > > > > > > > condition. The whole call fails rather than trying another > > > > server > > > > > or > > > > > > > a new > > > > > > > > > socket. > > > > > > > > > > > > > > > > > > In order to fix this, we must make the API timeout longer > > than > > > > the > > > > > RPC > > > > > > > > > timeout. > > > > > > > > > > > > > > > > > > However, we have a lot of users who have been trained to > use > > > > > > > > > request.timeout.ms to make their clients time out earlier. > > If > > > > we > > > > > > > > > introduce a new config to do this instead, it's kind of a > > > > breaking > > > > > > > change > > > > > > > > > for them. Perhaps we should go the other direction and > > create a > > > > > new > > > > > > > > > configuration like rpc.timeout.ms to do what > > request.timeout.ms > > > > is > > > > > > > doing > > > > > > > > > now. Then request.timeout.ms can become what users > already > > > > think > > > > > it > > > > > > > is: > > > > > > > > > the timeout for their API calls. > > > > > > > > > > > > > > > > > > best, > > > > > > > > > Colin > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Jun 5, 2018, at 15:29, Ted Yu wrote: > > > > > > > > > > bq. we were already doing with request.timeout.ms > > > > > > > > > > > > > > > > > > > > I would vote for using existing config. > > > > > > > > > > > > > > > > > > > > Any new config parameter needs to go thru long process of > > > > > digestion: > > > > > > > > > > documentation, etc in order for users to understand and > > > > > familiarize. > > > > > > > > > > > > > > > > > > > > The existing config would have lower mismatch of > impedance. > > > > > > > > > > > > > > > > > > > > Cheers > > > > > > > > > > > > > > > > > > > > On Tue, Jun 5, 2018 at 3:17 PM, Jason Gustafson < > > > > > ja...@confluent.io> > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > Thanks for the comments. I'm not sure I understand why > we > > > > want > > > > > to > > > > > > > > > avoid the > > > > > > > > > > > term "timeout." Semantically, that's what it is. If we > > don't > > > > > want > > > > > > > > > another > > > > > > > > > > > timeout config, we could avoid it and hard-code a > > reasonable > > > > > > > default > > > > > > > > > or try > > > > > > > > > > > to wrap the behavior into one of the other timeouts > > (which is > > > > > sort > > > > > > > of > > > > > > > > > what > > > > > > > > > > > we were already doing with request.timeout.ms). But > I'm > > not > > > > > too > > > > > > > sure > > > > > > > > > > > calling it something else addresses the problem. > > > > > > > > > > > > > > > > > > > > > > -Jason > > > > > > > > > > > > > > > > > > > > > > On Tue, Jun 5, 2018 at 2:59 PM, Dhruvil Shah < > > > > > dhru...@confluent.io > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > I agree that using `default.timeout.ms` could cause > > > > > confusion > > > > > > > since > > > > > > > > > we > > > > > > > > > > > > already have other timeout configurations in the > > consumer. > > > > > > > > > > > > > > > > > > > > > > > > +1 for using `default.block.ms`. > > > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > Dhruvil > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Jun 5, 2018 at 11:48 AM, Bill Bejeck < > > > > > bbej...@gmail.com> > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > Hi Jason, > > > > > > > > > > > > > > > > > > > > > > > > > > At first, I thought the same name between the > > producer > > > > and > > > > > the > > > > > > > > > consumer > > > > > > > > > > > > was > > > > > > > > > > > > > ideal. > > > > > > > > > > > > > > > > > > > > > > > > > > But your comment makes me realize consistent names > > with > > > > > > > different > > > > > > > > > > > > semantics > > > > > > > > > > > > > is even more confusing. > > > > > > > > > > > > > > > > > > > > > > > > > > I'm +1 for not using `max.block.ms`. I like > > Guozhang's > > > > > > > > > suggestion of > > > > > > > > > > > ` > > > > > > > > > > > > > default.block.ms` for the name. > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > Bill > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Jun 5, 2018 at 1:33 PM, Guozhang Wang < > > > > > > > wangg...@gmail.com> > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > Hi Jason, > > > > > > > > > > > > > > > > > > > > > > > > > > > > Yeah I agree that "max.block.ms" makes people > > thinking > > > > > of > > > > > > > the > > > > > > > > > > > > producer's > > > > > > > > > > > > > > config with the same name, but their semantics > are > > > > > different. > > > > > > > > > > > > > > > > > > > > > > > > > > > > On the other hand, I'm a bit concerned with the > > reusing > > > > > of > > > > > > > the > > > > > > > > > term > > > > > > > > > > > > > > `timeout` as we already have `session.timeout.ms > ` > > and > > > > ` > > > > > > > > > > > > > request.timeout.ms` > > > > > > > > > > > > > > in ConsumerConfig.. How about using the name ` > > > > > > > > > default.api.block.ms` > > > > > > > > > > > or > > > > > > > > > > > > > > simply `default.block.ms`? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Guozhang > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Jun 5, 2018 at 8:57 AM, Jason Gustafson < > > > > > > > > > ja...@confluent.io> > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hey All, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > One more minor follow-up. As I was reviewing > the > > > > change > > > > > > > > > mentioned > > > > > > > > > > > > > above, > > > > > > > > > > > > > > I > > > > > > > > > > > > > > > felt the name `max.block.ms` was a little bit > > > > > misleading > > > > > > > > > since it > > > > > > > > > > > > only > > > > > > > > > > > > > > > applies to methods which do not have an > explicit > > > > > timeout. A > > > > > > > > > clearer > > > > > > > > > > > > > name > > > > > > > > > > > > > > > given its usage might be `default.timeout.ms`. > > It is > > > > > the > > > > > > > > > default > > > > > > > > > > > > > timeout > > > > > > > > > > > > > > > for any blocking API which does not have a > > timeout. > > > > I'm > > > > > > > leaning > > > > > > > > > > > > toward > > > > > > > > > > > > > > > using this name since the current one seems > > likely to > > > > > cause > > > > > > > > > > > > confusion. > > > > > > > > > > > > > > Any > > > > > > > > > > > > > > > thoughts? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > > Jason > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, May 31, 2018 at 6:09 PM, Dong Lin < > > > > > > > lindon...@gmail.com > > > > > > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks for the KIP! I am in favor of the > > option 1. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > +1 as well. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, May 31, 2018 at 6:00 PM, Jason > > Gustafson < > > > > > > > > > > > > ja...@confluent.io > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks everyone for the feedback. I've > > updated > > > > the > > > > > KIP > > > > > > > and > > > > > > > > > > > added > > > > > > > > > > > > > > > > > KAFKA-6979. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -Jason > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, May 30, 2018 at 3:50 PM, Guozhang > > Wang < > > > > > > > > > > > > wangg...@gmail.com > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks Jason. I'm in favor of option 1 as > > well. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, May 30, 2018 at 1:37 PM, Bill > > Bejeck < > > > > > > > > > > > > bbej...@gmail.com> > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > For what it's worth I'm +1 on Option 1 > > and > > > > the > > > > > > > default > > > > > > > > > > > value > > > > > > > > > > > > > for > > > > > > > > > > > > > > > the > > > > > > > > > > > > > > > > > > > timeout. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > In addition to reasons outlined above > by > > > > > Jason, I > > > > > > > > > think it > > > > > > > > > > > > will > > > > > > > > > > > > > > > help > > > > > > > > > > > > > > > > to > > > > > > > > > > > > > > > > > > > reason about consumer behavior (with > > respect > > > > to > > > > > > > > > blocking) > > > > > > > > > > > > > having > > > > > > > > > > > > > > > the > > > > > > > > > > > > > > > > > > > configuration and default value aligned > > with > > > > > the > > > > > > > > > producer. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -Bill > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, May 30, 2018 at 3:43 PM, Ismael > > Juma > > > > < > > > > > > > > > > > > > ism...@juma.me.uk> > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Sounds good to me, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, May 30, 2018 at 12:40 PM > Jason > > > > > Gustafson > > > > > > > < > > > > > > > > > > > > > > > > ja...@confluent.io > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Perhaps one minute? That is the > > default > > > > > used > > > > > > > by the > > > > > > > > > > > > > producer. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -Jason > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, May 30, 2018 at 9:50 AM, > > Ismael > > > > > Juma < > > > > > > > > > > > > > > > ism...@juma.me.uk> > > > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Option 1 sounds good to me > provided > > > > that > > > > > we > > > > > > > can > > > > > > > > > come > > > > > > > > > > > up > > > > > > > > > > > > > > with > > > > > > > > > > > > > > > a > > > > > > > > > > > > > > > > > good > > > > > > > > > > > > > > > > > > > > > > default. What would you suggest? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Ismael > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, May 30, 2018 at 9:41 AM > > Jason > > > > > > > Gustafson < > > > > > > > > > > > > > > > > > > ja...@confluent.io> > > > > > > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hi Everyone, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > There remains some > inconsistency > > in > > > > the > > > > > > > timeout > > > > > > > > > > > > > behavior > > > > > > > > > > > > > > of > > > > > > > > > > > > > > > > the > > > > > > > > > > > > > > > > > > > > > consumer > > > > > > > > > > > > > > > > > > > > > > > APIs which do not accept a > > timeout. > > > > > Some of > > > > > > > > > them > > > > > > > > > > > > block > > > > > > > > > > > > > > > > forever > > > > > > > > > > > > > > > > > > > (e.g. > > > > > > > > > > > > > > > > > > > > > > > position()) and some of them > use > > > > > > > > > > > request.timeout.ms > > > > > > > > > > > > > > (e.g. > > > > > > > > > > > > > > > > > > > > > > > parititonsFor()). > > > > > > > > > > > > > > > > > > > > > > > I think we'd probably all agree > > that > > > > > > > blocking > > > > > > > > > > > forever > > > > > > > > > > > > > is > > > > > > > > > > > > > > > not > > > > > > > > > > > > > > > > > > useful > > > > > > > > > > > > > > > > > > > > > > > behavior and using > > > > request.timeout.ms > > > > > has > > > > > > > > > always > > > > > > > > > > > > been > > >