Two of our Kafka brokers (1 and 5) were rebooted when their hypervisor
went down, and I think we hit an issue similar to the one discussed in
the thread "Problem with node after restart no partitions?". The
resulting JIRA <https://issues.apache.org/jira/browse/KAFKA-2108> was
closed without a conclusion or recovery steps.
Brokers 5 and 1 also host ZooKeeper nodes for our cluster (along with
broker 2). We are running Kafka version 0.8.2.1.
After several controlled restarts of all brokers, the cluster seems OK
now.
However, some topics still have replicas that are out of sync with their
leaders. Partition 2 below has leader 5, so the replica order should be
5,1, and broker 1 is missing from the ISR:
Topic:2015-01-12  PartitionCount:3  ReplicationFactor:2  Configs:
    Topic: 2015-01-12  Partition: 0  Leader: 4  Replicas: 4,3  Isr: 3,4
    Topic: 2015-01-12  Partition: 1  Leader: 0  Replicas: 0,4  Isr: 0,4
    Topic: 2015-01-12  Partition: 2  Leader: 5  Replicas: 1,5  Isr: 5
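
To find every partition in this state across many topics, the describe output can be scanned for partitions whose ISR is missing an assigned replica. The following is a minimal sketch, not an official tool: it assumes the text layout shown above, and the names `PARTITION_RE` and `under_replicated` are made up for illustration.

```python
import re

# Sample text copied from the describe output above.
DESCRIBE_OUTPUT = """\
Topic:2015-01-12 PartitionCount:3 ReplicationFactor:2 Configs:
Topic: 2015-01-12 Partition: 0 Leader: 4 Replicas: 4,3 Isr: 3,4
Topic: 2015-01-12 Partition: 1 Leader: 0 Replicas: 0,4 Isr: 0,4
Topic: 2015-01-12 Partition: 2 Leader: 5 Replicas: 1,5 Isr: 5
"""

# Matches one per-partition record; \s+ also tolerates records that a
# mail client has wrapped across two lines. The topic summary line does
# not match because "PartitionCount:" is not "Partition:".
PARTITION_RE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>-?\d+)\s+Replicas:\s*(?P<replicas>[\d,]+)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def under_replicated(describe_text):
    """Return (topic, partition, leader, missing_replicas) tuples for
    every partition whose ISR lacks at least one assigned replica."""
    results = []
    for match in PARTITION_RE.finditer(describe_text):
        replicas = [int(x) for x in match.group("replicas").split(",")]
        isr = {int(x) for x in match.group("isr").split(",") if x}
        missing = [r for r in replicas if r not in isr]
        if missing:
            results.append((
                match.group("topic"),
                int(match.group("partition")),
                int(match.group("leader")),
                missing,
            ))
    return results


if __name__ == "__main__":
    for topic, partition, leader, missing in under_replicated(DESCRIBE_OUTPUT):
        print(f"{topic}-{partition}: leader={leader}, not in ISR={missing}")
```

For the sample above this flags only partition 2, where broker 1 is assigned but not in sync.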
I tried reassigning the replicas of partition 2 to broker 5 (the leader)
and broker 0. That reassignment has now been stuck for more than a day:
%) /usr/local/kafka/bin/kafka-reassign-partitions.sh \
     --zookeeper kafka-trgt05:2182 \
     --reassignment-json-file 2015-01-12_2.json --verify
Status of partition reassignment:
Reassignment of partition [2015-01-12,2] is still in progress
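
The contents of 2015-01-12_2.json are not shown here. As a rough sketch only, a file describing the reassignment above (partition 2 of topic 2015-01-12 moved to brokers 5 and 0) could be generated as follows, assuming the version-1 JSON layout that kafka-reassign-partitions.sh reads; `build_reassignment` is a hypothetical helper name.

```python
import json


def build_reassignment(topic, partition, replicas):
    """Build a reassignment plan for a single partition.

    The first broker in `replicas` is the preferred leader.
    """
    return {
        "version": 1,
        "partitions": [
            {
                "topic": topic,
                "partition": partition,
                "replicas": list(replicas),
            }
        ],
    }


# Assumed target from the description above: leader 5 plus broker 0.
plan = build_reassignment("2015-01-12", 2, [5, 0])

if __name__ == "__main__":
    print(json.dumps(plan, indent=2))
```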
And in ZooKeeper, /admin/reassign_partitions is empty:
[zk: kafka-trgt05:2182(CONNECTED) 2] ls /admin/reassign_partitions
[]
Any thoughts on how to recover from this scenario?
Cheers,
/Manish
Our server.properties:
broker.id=0
port=9192
num.network.threads=12
num.io.threads=12
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576
socket.request.max.bytes=104857600
queued.max.requests=16
auto.leader.rebalance.enable=true
controlled.shutdown.enable=true
controlled.shutdown.retry.backoff.ms=30000
fetch.purgatory.purge.interval.requests=100
producer.purgatory.purge.interval.requests=100
controller.socket.timeout.ms=30000
controller.message.queue.size=10000
log.dirs=/opt/kafka/data/logs
num.partitions=5
default.replication.factor=2
delete.topic.enable=true
num.replica.fetchers=8
replica.fetch.max.bytes=1048576
replica.fetch.wait.max.ms=5000
replica.socket.timeout.ms=30000
replica.socket.receive.buffer.bytes=1048576
replica.lag.time.max.ms=10000
replica.lag.max.messages=4000
replica.fetch.min.bytes=10240
log.flush.interval.messages=10000
log.flush.interval.ms=1000
log.retention.hours=72
log.segment.bytes=536870912
log.retention.check.interval.ms=60000
log.cleaner.enable=true
zookeeper.connect=kafka-trgt05:2182,kafka-trgt01:2182,kafka-trgt02:2182
zookeeper.connection.timeout.ms=1000000
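
When reviewing the timeouts above it can help to see the `.ms` settings in seconds. This is a minimal sketch that assumes simple `key=value` lines like the ones in this file; `parse_properties` and `ms_settings_in_seconds` are illustrative names, not part of Kafka.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def ms_settings_in_seconds(props):
    """Convert every setting whose name ends in '.ms' to seconds."""
    return {k: int(v) / 1000.0 for k, v in props.items() if k.endswith(".ms")}


# A few lines copied from the server.properties above.
SAMPLE = """\
replica.lag.time.max.ms=10000
replica.lag.max.messages=4000
zookeeper.connect=kafka-trgt05:2182,kafka-trgt01:2182,kafka-trgt02:2182
zookeeper.connection.timeout.ms=1000000
"""

seconds = ms_settings_in_seconds(parse_properties(SAMPLE))

if __name__ == "__main__":
    for name, value in sorted(seconds.items()):
        print(f"{name} = {value:g} s")
```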