Hi Sai,

For your second note on rebalancing taking a long time, we have just improved the situation in trunk after fixing this JIRA: https://issues.apache.org/jira/browse/KAFKA-3559. Feel free to give it a go if rebalancing time continues to be a problem.
Thanks,
Eno

> On 31 Oct 2016, at 19:44, saiprasad mishra <saiprasadmis...@gmail.com> wrote:
>
> Hey Guys
>
> I have noticed similar issues when the network goes down during startup of
> Kafka Streams apps, especially when a store has initialized but task
> initialization is not complete. When the network comes back, the rebalance
> fails with the above error and I had to restart, as I run many partitions
> and have many tasks being initialized.
>
> Otherwise, if the Kafka Streams app has started successfully, it does
> recover from network issues, as far as I have seen so far, and the stores
> remain available.
>
> This means some of these initialization exceptions could be categorized as
> recoverable and should always be retried.
>
> I think task 0_0 in your case was not initialized properly in the first
> place; then a rebalance happened because of network connectivity, and that
> resulted in the above exception.
>
> On a separate note, rebalancing takes a long time as I have some
> intermediary topics, and I suspect it might be worse if the network is
> slow. I was thinking of something like making the store available for
> querying quickly, without waiting for the full initialization of tasks.
>
> Regards
> Sai
>
> On Mon, Oct 31, 2016 at 3:51 AM, Damian Guy <damian....@gmail.com> wrote:
>
>> Hi Frank,
>>
>> This usually means that another StreamThread has the lock for the state
>> directory. So it would seem that one of the StreamThreads hasn't shut down
>> cleanly. If it happens again, can you please take a thread dump so we can
>> see what is happening?
>>
>> Thanks,
>> Damian
>>
>> On Sun, 30 Oct 2016 at 10:52 Frank Lyaruu <flya...@gmail.com> wrote:
>>
>>> I have a remote Kafka cluster, to which I connect using a VPN and a
>>> not-so-great WiFi network. That means the Kafka client sometimes
>>> briefly loses connectivity.
>>> When it regains a connection after a while, I see:
>>>
>>> org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be
>>> completed since the group has already rebalanced and assigned the
>>> partitions to another member. This means that the time between subsequent
>>> calls to poll() was longer than the configured max.poll.interval.ms, which
>>> typically implies that the poll loop is spending too much time message
>>> processing. You can address this either by increasing the session timeout
>>> or by reducing the maximum size of batches returned in poll() with
>>> max.poll.records.
>>>
>>> ...
>>>
>>> Which makes sense, I suppose, but this shouldn't be fatal.
>>>
>>> But then I see:
>>>
>>> [StreamThread-1] ERROR
>>> org.apache.kafka.streams.processor.internals.StreamThread - stream-thread
>>> [StreamThread-1] Failed to create an active task %s:
>>> org.apache.kafka.streams.errors.ProcessorStateException: task [0_0] Error
>>> while creating the state manager
>>>   at org.apache.kafka.streams.processor.internals.AbstractTask.<init>(AbstractTask.java:72)
>>>   at org.apache.kafka.streams.processor.internals.StreamTask.<init>(StreamTask.java:89)
>>>   at org.apache.kafka.streams.processor.internals.StreamThread.createStreamTask(StreamThread.java:633)
>>>   at org.apache.kafka.streams.processor.internals.StreamThread.addStreamTasks(StreamThread.java:660)
>>>   at org.apache.kafka.streams.processor.internals.StreamThread.access$100(StreamThread.java:69)
>>>   at org.apache.kafka.streams.processor.internals.StreamThread$1.onPartitionsAssigned(StreamThread.java:124)
>>>   at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinComplete(ConsumerCoordinator.java:228)
>>>   at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:313)
>>>   at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:277)
>>>   at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:259)
>>>   at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1013)
>>>   at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:979)
>>>   at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:407)
>>>   at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:242)
>>> Caused by: java.io.IOException: task [0_0] Failed to lock the state
>>> directory:
>>> /Users/frank/git/dexels.repository/com.dexels.kafka.streams/kafka-streams/develop3-person/0_0
>>>   at org.apache.kafka.streams.processor.internals.ProcessorStateManager.<init>(ProcessorStateManager.java:101)
>>>   at org.apache.kafka.streams.processor.internals.AbstractTask.<init>(AbstractTask.java:69)
>>>   ... 13 more
>>>
>>> And my stream application is dead.
>>>
>>> So I'm guessing that either the store wasn't closed properly, or some
>>> things happen out of order.
>>>
>>> Any ideas?
>>>
>>> I'm using the trunk of Kafka 0.10.2.0-SNAPSHOT, Java 1.8.0_66 on MacOS
>>> 10.11.6.
>>>
>>> Regards, Frank
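[Editor's note] For anyone hitting the same CommitFailedException: the exception message itself names the two knobs to try (max.poll.interval.ms and max.poll.records, both real consumer config names). Below is a minimal sketch of those settings as plain Properties; the values are illustrative assumptions, not recommendations, and in a real Kafka Streams app you would pass these via StreamsConfig rather than raw strings.

```java
import java.util.Properties;

public class StreamsTuning {

    // Illustrative consumer settings for slow/flaky processing environments.
    public static Properties tunedConfig() {
        Properties props = new Properties();
        // Allow more time between poll() calls before the coordinator evicts
        // the member and rebalances (default 300000 ms; value here is a guess).
        props.put("max.poll.interval.ms", "600000");
        // Return smaller batches from poll() so each loop iteration finishes
        // sooner (default 500).
        props.put("max.poll.records", "100");
        return props;
    }

    public static void main(String[] args) {
        Properties props = tunedConfig();
        System.out.println("max.poll.interval.ms = " + props.getProperty("max.poll.interval.ms"));
        System.out.println("max.poll.records = " + props.getProperty("max.poll.records"));
    }
}
```

For the "Failed to lock the state directory" part, Damian's diagnosis above suggests an unclean StreamThread shutdown; a common mitigation is registering a JVM shutdown hook that calls KafkaStreams#close() so the state directory lock is released before the process exits.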