Also, just a heads-up that a PR that increases resiliency is currently under review and should hopefully hit trunk soon: https://github.com/apache/kafka/pull/2719. It covers certain broker failure scenarios as well as a (hopefully final) case where state locking fails. It also includes the retries config I mentioned earlier.
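In the meantime, if you want to experiment before that PR lands, below is a minimal sketch of the kind of settings involved, assuming (as in 0.10.2) that producer-level properties set on the streams configuration are forwarded to the internal producers. The application id, broker list and concrete values are placeholders, and the exact name and semantics of the new retries config may differ once the PR is merged, so treat this as illustrative only:

    import java.util.Properties;

    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.streams.StreamsConfig;

    public class StreamsResilienceConfigSketch {
        public static Properties streamsProperties() {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "new-part-advice");  // placeholder id
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG,
                      "broker1:9092,broker2:9092,broker3:9092");                // placeholder brokers

            // Replicate the internal changelog/repartition topics across all 3 brokers,
            // matching the replication factor of the input topics.
            props.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 3);

            // Forwarded to the internal producers: retry transient errors such as
            // NotLeaderForPartitionException instead of surfacing them immediately.
            props.put(ProducerConfig.RETRIES_CONFIG, 10);

            // Give batches more time before they are expired, which may reduce the
            // "Expiring 1 record(s)" TimeoutExceptions seen during leader changes.
            props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 60000);

            return props;
        }
    }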
Thanks
Eno

> On 25 Mar 2017, at 18:14, Sachin Mittal <sjmit...@gmail.com> wrote:
>
> Hi,
> The broker is a three-machine cluster. The replication factor for the input
> and internal topics is 3.
> The brokers don't seem to fail; I always see their instances running.
>
> Also note that an identical streams application, running single-threaded on
> a single instance and pulling data from another non-partitioned but otherwise
> identical topic, never fails. There too the replication factor is 3 for the
> input and internal topics.
>
> Please let us know if you have anything on the other errors, and in what
> ways we can make the streams application resilient. I do feel we need hooks
> to start new stream threads in case a thread shuts down due to an unhandled
> exception, or the streams application itself should do a better job of
> handling such exceptions instead of shutting down its threads.
>
> Thanks
> Sachin
>
>
> On Sat, Mar 25, 2017 at 11:03 PM, Eno Thereska <eno.there...@gmail.com> wrote:
>
>> Hi Sachin,
>>
>> See my previous email on the NotLeaderForPartitionException error.
>>
>> What is your Kafka configuration, and how many brokers are you using? Also,
>> could you share the replication level (if different from 1) of your streams
>> topics? Are there brokers failing while Streams is running?
>>
>> Thanks
>> Eno
>>
>> On 25/03/2017, 11:00, "Sachin Mittal" <sjmit...@gmail.com> wrote:
>>
>> Hi All,
>> I am revisiting the ongoing issue of getting a multi-instance,
>> multi-threaded Kafka Streams cluster to work.
>>
>> The scenario is that we have a 12-partition source topic (note our broker
>> cluster replication factor is 3).
>> We have a 3-machine client cluster with one instance on each machine, and
>> each instance uses 4 threads.
>> The Streams version is 0.10.2 with the latest deadlock fix and RocksDB
>> optimization from trunk.
>>
>> We also have an identical single-partition topic and another
>> single-threaded instance doing identical processing to the above. This one
>> uses version 0.10.1.1 and never goes down.
>>
>> The application above used to go down frequently with high CPU wait time,
>> and we also used to hit frequent deadlocks. Since including the fixes we
>> see very little CPU wait time and the application no longer deadlocks.
>> Instead, the threads get uncaught exceptions thrown from the streams
>> application and die one by one, eventually shutting down the entire client
>> cluster.
>> So we now need to understand what could be causing these exceptions and
>> how we can fix them.
>>
>> Here is the summary:
>>
>> instance 84
>> All four threads die due to
>> org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
>> is not the leader for that topic-partition.
>> Is this something that can be handled at the streams level rather than
>> being thrown all the way up to the thread?
>>
>> instance 85
>> Two threads again die due to
>> org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
>> is not the leader for that topic-partition.
>> The other two die due to
>> Caused by: org.rocksdb.RocksDBException: ~
>> I know this is some known RocksDB issue. Is there a way we can handle it
>> on the streams side? What do you suggest to avoid it, and what could be
>> causing it?
>>
>> instance 87
>> Two threads again die due to
>> org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
>> is not the leader for that topic-partition.
>> One dies due to
>> org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for
>> new-part-advice-key-table-changelog-11: 30015 ms has passed since last
>> append
>> I have really not understood what this means; any idea what the issue
>> could be here?
>> The last one dies due to
>> Caused by: java.lang.IllegalStateException: task [0_9] Log end offset of
>> new-part-advice-key-table-changelog-9 should not change while restoring:
>> old end offset 647352, current offset 647632
>> I feel this too should not be thrown to the stream thread and should be
>> handled at the streams level.
>>
>> The complete logs can be found at:
>> https://www.dropbox.com/s/2t4ysfdqbtmcusq/complete_84_85_87_log.zip?dl=0
>>
>> So basically I feel the streams application should be more resilient and
>> should not fail on such exceptions but have a way to handle them, or it
>> should give programmers hooks so that even if a stream thread is shut down
>> there is a way to start a new one and keep the streams application running.
>>
>> The most common cause seems to be
>> org.apache.kafka.common.errors.NotLeaderForPartitionException, and this,
>> along with a few others, should get handled.
>>
>> Let us know your thoughts.
>>
>> Thanks
>> Sachin
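On the hooks for dying threads: in the current API there is no way to restart an individual stream thread, but KafkaStreams does let you register an uncaught exception handler, and one pattern is to close and recreate the whole instance from that handler. Below is a rough sketch of that idea, not a recommendation; the topology-building method is a placeholder, a real implementation would add backoff and a retry limit, and the restart is submitted to a separate thread because calling close() from the dying stream thread itself can block:

    import java.util.Properties;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.kstream.KStreamBuilder;

    public class RestartingStreamsRunner {

        private final Properties props;
        private final ExecutorService restarter = Executors.newSingleThreadExecutor();
        private volatile KafkaStreams streams;

        public RestartingStreamsRunner(Properties props) {
            this.props = props;
        }

        // Placeholder: build the application's topology fresh for each (re)start.
        private KStreamBuilder buildTopology() {
            KStreamBuilder builder = new KStreamBuilder();
            // builder.stream("advice-stream")... actual processing goes here
            return builder;
        }

        public synchronized void start() {
            streams = new KafkaStreams(buildTopology(), props);
            streams.setUncaughtExceptionHandler((thread, throwable) -> {
                System.err.println("Stream thread " + thread.getName() + " died: " + throwable);
                // Close and recreate the instance from a different thread; a real
                // implementation would also add backoff and a maximum retry count.
                restarter.submit(() -> {
                    streams.close();
                    start();
                });
            });
            streams.start();
        }
    }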