I forgot to mention that the Kafka version we're using is built from the trunk branch as of August, which has SSL support.
Thanks again,
Qi

On Mon, Nov 23, 2015 at 2:29 PM, Qi Xu <shkir...@gmail.com> wrote:

> Looping in another person from our team.
>
> On Mon, Nov 23, 2015 at 2:26 PM, Qi Xu <shkir...@gmail.com> wrote:
>
>> Hi folks,
>> We have a 10-node cluster with several topics. Each topic has about
>> 256 partitions with a replication factor of 3. We have now run into an
>> issue where, in some topics, a few partitions (< 10) have a leader of -1,
>> and each of them has only one in-sync replica.
>>
>> From the Kafka manager, here's the snapshot:
>> [image: Inline image 2]
>>
>> [image: Inline image 1]
>>
>> Here's the state log:
>> [2015-11-23 21:57:58,598] ERROR Controller 1 epoch 435499 initiated state
>> change for partition [userlogs,84] from OnlinePartition to OnlinePartition
>> failed (state.change.logger)
>> kafka.common.StateChangeFailedException: encountered error while electing
>> leader for partition [userlogs,84] due to: Preferred replica 0 for
>> partition [userlogs,84] is either not alive or not in the isr. Current
>> leader and ISR: [{"leader":-1,"leader_epoch":203,"isr":[1]}].
>> Caused by: kafka.common.StateChangeFailedException: Preferred replica 0
>> for partition [userlogs,84] is either not alive or not in the isr. Current
>> leader and ISR: [{"leader":-1,"leader_epoch":203,"isr":[1]}]
>>
>> My questions are:
>> 1) How could this happen, and how can I fix it or work around it?
>> 2) Is 256 partitions too many? We have about 200+ cores for the Spark
>> Streaming job.
>>
>> Thanks,
>> Qi
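
As a side note on question 1, here is a minimal sketch (an illustration only, not part of Kafka's tooling) of how the affected partitions could be pulled out of a state-change log like the one quoted above. The function name `find_leaderless` and the regular expression are assumptions made for this example, and the log text is assumed to have been unwrapped from the mail quoting first. It only reports entries whose leader is -1; it does not repair anything.

```python
import json
import re

# Assumed pattern: a "partition [topic,N]" mention followed later by
# "Current leader and ISR: [{...}]", as in the quoted state-change log.
STATE_RE = re.compile(
    r"partition \[(?P<topic>[^,\]]+),(?P<partition>\d+)\]"
    r".*?Current leader and ISR: \[(?P<state>\{.*?\})\]",
    re.DOTALL,
)


def find_leaderless(log_text):
    """Return {(topic, partition): isr} for log entries whose leader is -1."""
    leaderless = {}
    for match in STATE_RE.finditer(log_text):
        state = json.loads(match.group("state"))
        if state["leader"] == -1:
            key = (match.group("topic"), int(match.group("partition")))
            leaderless[key] = state["isr"]
    return leaderless


if __name__ == "__main__":
    sample = (
        "Preferred replica 0 for partition [userlogs,84] is either not alive "
        "or not in the isr. Current leader and ISR: "
        '[{"leader":-1,"leader_epoch":203,"isr":[1]}]'
    )
    for (topic, partition), isr in find_leaderless(sample).items():
        print(topic, partition, "leader=-1, isr =", isr)
```

Run against the full log, this would list each leaderless partition together with the single broker left in its ISR, which is the broker to check first.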