Hey Jay, Thanks for the comments. I'd be sorely tempted to default to an infinite value for max.poll.interval.ms if we offered a separate rebalance timeout, but it seems a little dangerous as long as we use the same setting for both. In the worst case, a live-locked process could indefinitely keep the entire group from rebalancing. I agree with Ewen as well that users really ought to be thinking about liveness. But I think it would be reasonable to increase the default I've suggested (one minute) up to 2-3 minutes. That combined with a sensible default for max.poll.records (which is also part of this KIP) should provide reasonable out-of-the-box behavior for most users.
Thanks, Jason On Sun, Jun 19, 2016 at 7:28 PM, Ewen Cheslack-Postava <e...@confluent.io> wrote: > +1 on the KIP. > > On Sat, Jun 18, 2016 at 11:52 AM, Ismael Juma <ism...@juma.me.uk> wrote: > > > If we do that, shouldn't `max.poll.records` remain with the current > default > > of `Integer.MAX_VALUE`? > > > > On Sat, Jun 18, 2016 at 5:18 PM, Jay Kreps <j...@confluent.io> wrote: > > > > > +1 > > > > > > One small but important thing I think we should consider changing: I > > think > > > we should consider setting the default for max.poll.interval to > infinite. > > > Previously our definition of alive was "polls within session timeout". > > Now > > > our definition of alive is "pings from b/g thread w/in session > timeout". > > > The reality is that anything we set here as a default is going to be > too > > > low for some people and those people will be confused. Previously that > > was > > > okay because the definition of liveness required you to understand and > > > think about this setting, so getting an error meant you needed to think > > > harder. But now this is really an optional setting for people who care > to > > > enforce some bound, so introducing a possible failure/instability mode > by > > > guess that bound seems wrong. Instead I think users that know and care > > > about this should set a thoughtful max.poll.interval, but we should not > > try > > > to guess for those who don't. > > > > > > This seems minor but I think we've found that defaults matter more than > > > configs so it's worth being thoughtful. > > > > This is very, very not minor. This is a really important default and I > strongly disagree that setting it to infinite is a good idea. Liveness > isn't something you can ignore. People like to ignore it because it makes > writing code easier, but that just means they write broken code. We can't > avoid giving them the rope to hang themselves (they can always override the > setting to a very large value), but we shouldn't encourage them to do it. > > Looking at various connectors (beyond just Kafka itself, as I was looking > at other frameworks), there was quite a bit of code structured roughly as: > > while(true) { > try { > conn = open(url); > records = consumer.poll(); > for (ConsumerRecord record : records) { > sendRecord(conn, record); > } > } catch (ConnectionException e) { > //ignore and retry > } > } > > In other words, code which will never make progress under a configuration > or deployment error. Handling errors by ignoring them isn't rare, which > means you'll probably get no real indication of liveness unless you make it > explicit (i.e. you *must* call something periodically). Of course, users > can always continue to screw themselves over by also ignoring errors *we* > produce and still keeping their app alive, but at that point we've done all > we can and at least we'll have attempted to shift work to another instance. > > I think setting a *larger* default timeout could make sense -- there's no > reasonable way to set a default since the message size, number of messages, > and message processing time will all vary widely by application. But really > the important distinction is (reasonable) finite timeout vs infinite. We > shouldn't default to something that never lets you figure out that > something is wrong. (If we could come up with a finite maximum value, I > would absolutely want to add that as a strict maximum, but there's no such > value.) If someone exceeds what we choose as a default maximum, it is > *really* in their interest to understand why they need such a large timeout > and whether there are better solutions to their problem. > > I'm sure it comes across as condescending, but we actually do know better > than many application developers. There are implications of just dropping > these timeouts that don't end well for the app. Ignoring those error cases > and liveness issues works fine 99% of the time, but it's important if you > really care about resiliency of your app and when things break they'll ask > why we set a default value that leads to such bad results. I staunchly > believe that it's better to explain why there is complexity and how to deal > with it than to try to hide it when it can't really be hidden. > > -Ewen > > P.S. I would invoke checked exceptions as another case where framework devs > try to encourage app developers to avoid hanging themselves yet app devs > are given enough rope and end up hanging themselves, but the Kafka codebase > eschews checked exceptions, so.... > > > > > > > > > -Jay > > > > > > On Thu, Jun 16, 2016 at 11:44 AM, Jason Gustafson <ja...@confluent.io> > > > wrote: > > > > > > > Hi All, > > > > > > > > I'd like to open the vote for KIP-62. This proposal attempts to > address > > > one > > > > of the recurring usability problems that users of the new consumer > have > > > > faced with as little impact as possible. You can read the full > details > > > > here: > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-62%3A+Allow+consumer+to+send+heartbeats+from+a+background+thread > > > > . > > > > > > > > After some discussion on this list, I think we were in agreement that > > > this > > > > change addresses a major part of the problem and we've left the door > > open > > > > for further improvements, such as adding a heartbeat() API or a > > > separately > > > > configured rebalance timeout. Thanks in advance to everyone who > helped > > > > review the proposal. > > > > > > > > -Jason > > > > > > > > > > > > > -- > Thanks, > Ewen >