Firstly, 1.0.1 is out and I'd strongly advise you to use that as the upgrade path over 1.0.0 if you can, because it contains a lot of bugfixes, some of them critical.
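For reference, the config side of the rolling upgrade for that path looks roughly like this -- a sketch assuming you are going straight from 0.10.0.x to 1.0.x, so double-check the exact values against the upgrade doc linked further down before relying on it:

  # In server.properties on every broker *before* swapping in the 1.0.x binaries:
  inter.broker.protocol.version=0.10.0
  log.message.format.version=0.10.0

  # After the whole cluster is running the new binaries, bump this and do another
  # rolling restart:
  inter.broker.protocol.version=1.0

  # Once the clients have been upgraded as well, bump this and roll one more time:
  log.message.format.version=1.0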
With unclean leader elections it should have resolved itself when the affected broker came back online and all partitions were available, so probably there was an issue there.

Personally I had a lot of struggles upgrading off of 0.10 with bugged, large consumer offset partitions (10s and 100s of GBs) that had stopped compacting and should have been in the MBs. The largest ones took 45 minutes to compact, which stretched out the rolling upgrade significantly. Also, occasionally even with a clean shutdown there was corruption detected on broker start, and the repair took time -- a /lot/ of time. In both cases it was easy to see in the logs, and in significantly increased disk IO metrics on boot (with FD-use metrics gradually returning to previous levels).

Was it all on the one broker, or across multiple? Did you follow the rolling upgrade procedure? At what point in the rolling process did the first issue appear?

https://kafka.apache.org/10/documentation/#upgrade (that's for 1.0.x)

On Mon, Apr 23, 2018 at 4:04 PM, Mika Linnanoja <mika.linnan...@rovio.com> wrote:

> Hello,
>
> Last week I upgraded one relatively large Kafka (EC2, 10 brokers, ~30 TB
> data, 100-300 Mbps in/out per instance) 0.10.0.1 cluster to 1.0, and saw
> some issues.
>
> Out of ~100 topics with 2..20 partitions each, 9 partitions in 8 topics
> became "unavailable" across 3 brokers. The leader was shown as -1 and the
> ISR was empty. A Java service using 0.10.0.1 clients was unable to send
> any data to these partitions, so the data got dropped.
>
> The partitions were shown in the `kafka/bin/kafka-topics.sh --zookeeper
> <zk's> --unavailable-partitions --describe` output. Nothing special about
> these partitions; among them were big ones (hundreds of gigs) and tiny
> ones (megabytes).
>
> The fix was to enable unclean leader elections and restart one of the
> affected brokers for each partition: `kafka/bin/kafka-configs.sh
> --zookeeper <zk's> --entity-type topics --entity-name <topicname>
> --add-config unclean.leader.election.enable=true --alter`.
>
> Has anyone seen something like this, and is there a way to avoid it when
> next upgrading? Maybe it would be better if said cluster got no traffic
> during the upgrade, but we cannot have a maintenance break as everything
> is up 24/7. The cluster is for analytics data, some of which is consumed
> in real-time applications, mostly by secor.
>
> BR,
> Mika
>
> --
> *Mika Linnanoja*
> Senior Cloud Engineer
> Games Technology
> Rovio Entertainment Corp
> Keilaranta 7, FIN - 02150 Espoo, Finland
> mika.linnan...@rovio.com
> www.rovio.com

--
Brett Rann

Senior DevOps Engineer

Zendesk International Ltd
395 Collins Street, Melbourne VIC 3000 Australia

Mobile: +61 (0) 418 826 017