Thanks, that gives us some more to look at. Unfortunately, that is only a small section of the log file. When we hit this problem (which is not every time), it will continue like that for hours.
We also still have developers creating topics semi-regularly, which it seems can cause the high level consumer to disconnect?

On Fri, Sep 25, 2015 at 6:16 PM Todd Palino <tpal...@gmail.com> wrote:

> That rebalance cycle doesn't look endless. I see that you started 23
> consumers, and I see 23 rebalances finishing successfully, which is
> correct. You will see rebalance messages from all of the consumers you
> started. It all happens within about 2 seconds, which is fine. I agree that
> there are a lot of log messages, but I'm not seeing anything that is
> particularly a problem here. After the segment of log you provided, your
> consumers will be running properly. Now, given you have a topic with 16
> partitions, and you're running 23 consumers, 7 of those consumer threads
> are going to be idle because they do not own partitions.
>
> -Todd
>
>
> On Fri, Sep 25, 2015 at 3:27 PM, noah <iamn...@gmail.com> wrote:
>
>> We're seeing this the most on developer machines that are starting up
>> multiple high level consumers on the same topic+group as part of service
>> startup. The consumers do not seem to get a chance to consume anything
>> before they disconnect.
>>
>> These are developer topics, so it is possible/likely that there isn't
>> anything for them to consume in the topic, but the same service will start
>> producing, so I would expect them to not be idle for long.
>>
>> Could it be the way we are bringing up multiple consumers at the same time
>> is hitting some sort of endless rebalance cycle? And/or the resulting
>> thrashing is causing them to time out, rebalance, etc.?
>>
>> I've tried attaching the logs again. Thanks!
>>
>> On Fri, Sep 25, 2015 at 3:33 PM Todd Palino <tpal...@gmail.com> wrote:
>>
>>> I don't see the logs attached, but what does the GC look like in your
>>> applications? A lot of times this is caused (at least on the consumer side)
>>> by the Zookeeper session expiring due to excessive GC activity, which
>>> causes the consumers to go into a rebalance and change up their
>>> connections.
>>>
>>> -Todd
>>>
>>>
>>> On Fri, Sep 25, 2015 at 1:25 PM, Gwen Shapira <g...@confluent.io> wrote:
>>>
>>> > How busy are the clients?
>>> >
>>> > The brokers occasionally close idle connections, this is normal and
>>> > typically not something to worry about.
>>> > However, this shouldn't happen to consumers that are actively reading data.
>>> >
>>> > I'm wondering if the "consumers not making any progress" could be due to a
>>> > different issue, and because they are idle, the connection closes (vs the
>>> > other way around).
>>> >
>>> > On Thu, Sep 24, 2015 at 2:32 PM, noah <iamn...@gmail.com> wrote:
>>> >
>>> > > We are having issues with producers and consumers frequently fully
>>> > > disconnecting (from both the brokers and ZK) and reconnecting without any
>>> > > apparent cause. On our production systems it can happen anywhere from every
>>> > > 10-15 seconds to 15-20 minutes. On our less beefy test systems and
>>> > > developer laptops, it can happen almost constantly.
>>> > >
>>> > > We see no errors in the logs (sample attached), just a message for each of
>>> > > our consumers and producers disconnecting, then reconnecting. On the
>>> > > systems where it happens constantly, the consumers are not making any
>>> > > progress.
>>> > >
>>> > > The logs on the brokers are equally unhelpful, they show only frequent
>>> > > connects and reconnects, without any apparent cause.
>>> > >
>>> > > What could be causing this behavior?
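
As a rough illustration of Todd's point about idle consumer threads, here is a minimal Python sketch of a range-style split of 16 partitions across 23 consumer threads. The function name `assign_partitions` and the consumer IDs are made up for the example; it only models the arithmetic and is not Kafka's actual rebalance code.

```python
def assign_partitions(num_partitions, consumer_ids):
    """Split partitions across consumers in contiguous ranges.

    Hypothetical helper: consumers are sorted, each gets
    num_partitions // n partitions, and the first
    num_partitions % n consumers get one extra.
    """
    consumers = sorted(consumer_ids)
    per_consumer, extra = divmod(num_partitions, len(consumers))
    assignment = {}
    start = 0
    for index, consumer in enumerate(consumers):
        count = per_consumer + (1 if index < extra else 0)
        assignment[consumer] = list(range(start, start + count))
        start += count
    return assignment


# Numbers taken from the thread: 16 partitions, 23 consumer threads.
consumers = [f"consumer-{i:02d}" for i in range(23)]
assignment = assign_partitions(16, consumers)

# Consumers that ended up with no partitions stay idle.
idle = [c for c, parts in assignment.items() if not parts]
print(f"{len(idle)} of {len(consumers)} consumer threads own no partitions")
```

Under these assumptions, 16 threads each own one partition and the remaining 7 own none, which matches the count Todd gives.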