Re: 3 machines 12 threads all fail within 2 hours of starting the streams application

Eno Thereska Mon, 27 Mar 2017 14:04:48 -0700

Also just a heads up that a PR that increases resiliency is being currently 
reviewed and should hopefully hit trunk soon: 
https://github.com/apache/kafka/pull/2719 
<https://github.com/apache/kafka/pull/2719>. This covers certain broker failure 
scenarios as well as a (hopefully last) case when state locking fails. It also 
includes the retries config I had mentioned earlier.


Thanks
Eno


> On 25 Mar 2017, at 18:14, Sachin Mittal <sjmit...@gmail.com> wrote:
> 
> Hi,
> The broker is a three machine cluster. The replication factor for input and
> also internal topics is 3.
> Brokers don't seem to fail. I always see their instances running.
> 
> Also note that when an identical streams application with single thread on
> a single instance is pulling data from some other non partitioned identical
> topic, the application never fails. Note there too replication factor is 3
> for input and internal topics.
> 
> Please let us know if you have something for other errors. Also what ways
> we can make the steams resilient. I do feel we need hooks to start new
> stream threads just in case some thread shuts down due to unhandled
> exception, or streams application itself doing a better job in handling
> such and not shutting down the threads.
> 
> Thanks
> Sachin
> 
> 
> On Sat, Mar 25, 2017 at 11:03 PM, Eno Thereska <eno.there...@gmail.com>
> wrote:
> 
>> Hi Sachin,
>> 
>> See my previous email on the NotLeaderForPartitionException error.
>> 
>> What is your Kafka configuration, how many brokers are you using? Also
>> could you share the replication level (if different from 1) of your streams
>> topics? Are there brokers failing while Streams is running?
>> 
>> Thanks
>> Eno
>> 
>> On 25/03/2017, 11:00, "Sachin Mittal" <sjmit...@gmail.com> wrote:
>> 
>>    Hi All,
>>    I am revisiting the ongoing issue of getting a multi instance multi
>>    threaded kafka streams cluster to work.
>> 
>>    Scenario is that we have a 12 partition source topic. (note our server
>>    cluster replication factor is 3).
>>    We have a 3 machines client cluster with one instance on each. Each
>>    instances uses 4 thread.
>>    Streams version is 0.10.2 with latest deadlock fix and rocks db
>>    optimization from trunk.
>> 
>>    We also have an identical single partition topic and another single
>>    threaded instance doing identical processing as the above one. This
>> uses
>>    version 0.10.1.1
>>    This streams application never goes down.
>> 
>>    The above application used to go down frequently with high cpu wait
>> time
>>    and also we used to get frequent deadlock issues. However since
>> including
>>    the fixes we see very little cpu wait time and now application does not
>>    enter into deadlock. The threads simply get uncaught exception thrown
>> from
>>    the streams application and they die one by one eventually shutting
>> down
>>    the entire client cluster.
>>    So we now need to understand what could be causing these exceptions
>> and how
>>    we can fix those.
>> 
>>    Here is the summary
>>    instance 84
>>    All four thread die due to
>>    org.apache.kafka.common.errors.NotLeaderForPartitionException: This
>> server
>>    is not the leader for that topic-partition.
>> 
>>    So is this something we can handle at streams level and not get it
>> thrown
>>    all the way to the thread.
>> 
>> 
>>    instance 85
>>    two again dies due to
>>    org.apache.kafka.common.errors.NotLeaderForPartitionException: This
>> server
>>    is not the leader for that topic-partition.
>> 
>>    other two die due to
>>    Caused by: org.rocksdb.RocksDBException: ~
>>    I know this is some known rocksdb issue. Is there a way we can handle
>> it at
>>    stream side. What do you suggest to avoid this or what can be causing
>> it.
>> 
>> 
>>    instance 87
>>    two again die due to
>>    org.apache.kafka.common.errors.NotLeaderForPartitionException: This
>> server
>>    is not the leader for that topic-partition.
>> 
>>    one dies due to
>>    org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s)
>> for
>>    new-part-advice-key-table-changelog-11: 30015 ms has passed since last
>>    append
>> 
>>    I have really not understood what this means and any idea what could
>> be the
>>    issue here?
>> 
>>    last one dies due to
>>    Caused by: java.lang.IllegalStateException: task [0_9] Log end offset
>> of
>>    new-part-advice-key-table-changelog-9 should not change while
>> restoring:
>>    old end offset 647352, current offset 647632
>> 
>>    I feel this should not be thrown to the stream thread too and handled
>> at
>>    streams level.
>> 
>>    The complete logs can be found at:
>>    https://www.dropbox.com/s/2t4ysfdqbtmcusq/complete_84_
>> 85_87_log.zip?dl=0
>> 
>>    So I feel basically the streams application should be more resilient
>> and
>>    should not fail due to exceptions but should have a way to handle them.
>>    or provide programmers the hooks that even in case a stream thread is
>> shut
>>    down there is a way to start a new thread so that we have a running
>> streams
>>    application.
>> 
>>    The popular reason seems to me
>>    org.apache.kafka.common.errors.NotLeaderForPartitionException, and
>> this
>>    along with few others should get handled.
>> 
>>    Let us know what are your thoughts.
>> 
>> 
>>    Thanks
>>    Sachin
>> 
>> 
>> 
>>

Re: 3 machines 12 threads all fail within 2 hours of starting the streams application

Reply via email to