Hi Sai,

For your second note on rebalancing taking a long time, we have just improved the situation in trunk after fixing this JIRA: https://issues.apache.org/jira/browse/KAFKA-3559. Feel free to give it a go if rebalancing time continues to be a problem.
Thanks,
Eno

> On 31 Oct 2016, at 19:44, saiprasad mishra <saiprasadmis...@gmail.com> wrote:
>
> Hey Guys
>
> I have noticed similar issues when the network goes down during startup of
> Kafka Streams apps, especially when a store has initialized but task
> initialization is not complete. When the network comes back, the rebalance
> fails with the above error and I had to restart, as I run many partitions
> and have many tasks being initialized.
>
> Otherwise, if the Kafka Streams app has started successfully, it does
> recover from network issues, as far as I have seen so far, and the stores
> remain available.
>
> This means some of these initialization exceptions could be categorized as
> recoverable and should always be retried.
>
> I think task 0_0 in your case was not initialized properly in the first
> place; then a rebalance happened because of network connectivity, and that
> resulted in the above exception.
>
> On a separate note, rebalancing takes a long time as I have some
> intermediary topics, and I suspect it might be worse if the network is
> slow. I was thinking of something like making the store available for
> querying quickly, without waiting for the full initialization of tasks.
>
> Regards
> Sai
>
> On Mon, Oct 31, 2016 at 3:51 AM, Damian Guy <damian....@gmail.com> wrote:
>
>> Hi Frank,
>>
>> This usually means that another StreamThread has the lock for the state
>> directory. So it would seem that one of the StreamThreads hasn't shut down
>> cleanly. If it happens again, can you please take a thread dump so we can
>> see what is happening?
>>
>> Thanks,
>> Damian
>>
>> On Sun, 30 Oct 2016 at 10:52 Frank Lyaruu <flya...@gmail.com> wrote:
>>
>>> I have a remote Kafka cluster, to which I connect using a VPN and a
>>> not-so-great WiFi network. That means the Kafka client sometimes
>>> briefly loses connectivity.
>>> When it regains a connection after a while, I see:
>>>
>>> org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be
>>> completed since the group has already rebalanced and assigned the
>>> partitions to another member. This means that the time between subsequent
>>> calls to poll() was longer than the configured max.poll.interval.ms, which
>>> typically implies that the poll loop is spending too much time message
>>> processing. You can address this either by increasing the session timeout
>>> or by reducing the maximum size of batches returned in poll() with
>>> max.poll.records.
>>>
>>> ...
>>>
>>> Which makes sense, I suppose, but this shouldn't be fatal.
>>>
>>> But then I see:
>>>
>>> [StreamThread-1] ERROR
>>> org.apache.kafka.streams.processor.internals.StreamThread - stream-thread
>>> [StreamThread-1] Failed to create an active task %s:
>>> org.apache.kafka.streams.errors.ProcessorStateException: task [0_0] Error
>>> while creating the state manager
>>>   at org.apache.kafka.streams.processor.internals.AbstractTask.<init>(AbstractTask.java:72)
>>>   at org.apache.kafka.streams.processor.internals.StreamTask.<init>(StreamTask.java:89)
>>>   at org.apache.kafka.streams.processor.internals.StreamThread.createStreamTask(StreamThread.java:633)
>>>   at org.apache.kafka.streams.processor.internals.StreamThread.addStreamTasks(StreamThread.java:660)
>>>   at org.apache.kafka.streams.processor.internals.StreamThread.access$100(StreamThread.java:69)
>>>   at org.apache.kafka.streams.processor.internals.StreamThread$1.onPartitionsAssigned(StreamThread.java:124)
>>>   at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinComplete(ConsumerCoordinator.java:228)
>>>   at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:313)
>>>   at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:277)
>>>   at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:259)
>>>   at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1013)
>>>   at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:979)
>>>   at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:407)
>>>   at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:242)
>>> Caused by: java.io.IOException: task [0_0] Failed to lock the state
>>> directory:
>>> /Users/frank/git/dexels.repository/com.dexels.kafka.streams/kafka-streams/develop3-person/0_0
>>>   at org.apache.kafka.streams.processor.internals.ProcessorStateManager.<init>(ProcessorStateManager.java:101)
>>>   at org.apache.kafka.streams.processor.internals.AbstractTask.<init>(AbstractTask.java:69)
>>>   ... 13 more
>>>
>>> And my stream application is dead.
>>>
>>> So I'm guessing that either the store wasn't closed properly, or some
>>> things happen out of order.
>>>
>>> Any ideas?
>>>
>>> I'm using the trunk of Kafka 0.10.2.0-SNAPSHOT, Java 1.8.0_66 on MacOS
>>> 10.11.6.
>>>
>>> Regards, Frank
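[Editor's note] For anyone hitting the same CommitFailedException: the exception message itself names the two knobs to try (max.poll.interval.ms and max.poll.records, both real consumer config names). Below is a minimal sketch of those settings as plain Properties; the values are illustrative assumptions, not recommendations, and in a real Kafka Streams app you would pass these via StreamsConfig rather than raw strings.

```java
import java.util.Properties;

public class StreamsTuning {

    // Illustrative consumer settings for slow/flaky processing environments.
    public static Properties tunedConfig() {
        Properties props = new Properties();
        // Allow more time between poll() calls before the coordinator evicts
        // the member and rebalances (default 300000 ms; value here is a guess).
        props.put("max.poll.interval.ms", "600000");
        // Return smaller batches from poll() so each loop iteration finishes
        // sooner (default 500).
        props.put("max.poll.records", "100");
        return props;
    }

    public static void main(String[] args) {
        Properties props = tunedConfig();
        System.out.println("max.poll.interval.ms = " + props.getProperty("max.poll.interval.ms"));
        System.out.println("max.poll.records = " + props.getProperty("max.poll.records"));
    }
}
```

For the "Failed to lock the state directory" part, Damian's diagnosis above suggests an unclean StreamThread shutdown; a common mitigation is registering a JVM shutdown hook that calls KafkaStreams#close() so the state directory lock is released before the process exits.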