Re: Migrating a cluster from 0.8.0 to 0.8.1

Drew Goya Mon, 23 Dec 2013 15:22:47 -0800

Hey all, another thing to report for my 0.8.1 migration.  I am seeing these
errors occasionally right after I run a leader election.  This looks to
be related to KAFKA-860, as it is the same exception.  I see that issue was
closed a while ago, though, and I should be running a commit with the fix in.
I'm on trunk/87efda.


I also see there is a more recent issue with replica threads dying
while becoming followers (KAFKA-1178), but I'm not seeing that exception.
I'm going to roll updates through the cluster, bring my brokers up to
trunk/b23cf1, and see how that goes.

[2013-12-23 22:54:38,389] ERROR [ReplicaFetcherThread-0-11], Error due to (kafka.server.ReplicaFetcherThread)
kafka.common.KafkaException: error processing data for partition [Events2,113] offset 1077499310
at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$1$$anonfun$apply$mcV$sp$2.apply(AbstractFetcherThread.scala:139)
at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$1$$anonfun$apply$mcV$sp$2.apply(AbstractFetcherThread.scala:111)
at scala.collection.immutable.Map$Map1.foreach(Map.scala:105)
at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$1.apply$mcV$sp(AbstractFetcherThread.scala:111)
at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$1.apply(AbstractFetcherThread.scala:111)
at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$1.apply(AbstractFetcherThread.scala:111)
at kafka.utils.Utils$.inLock(Utils.scala:538)
at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:110)
at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:88)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:51)
Caused by: java.lang.RuntimeException: Offset mismatch: fetched offset = 1077499310, log end offset = 1077499313.
at kafka.server.ReplicaFetcherThread.processPartitionData(ReplicaFetcherThread.scala:49)
at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$1$$anonfun$apply$mcV$sp$2.apply(AbstractFetcherThread.scala:130)
... 9 more
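
In case it helps anyone triaging the same error, here is a minimal Python
sketch for pulling the "Offset mismatch" pairs out of a broker log and
computing how far the follower's log end is ahead of the fetched offset.
The function name and the regex are my own assumptions, written against
the message text in the trace above; they are not part of Kafka. The regex
tolerates a line wrap after the "=" sign, since mail clients and log
viewers sometimes introduce one.

```python
import re

# Assumed pattern, based only on the message text in the trace above.
# \s* lets the number sit on the next line if the log line was wrapped.
MISMATCH = re.compile(
    r"Offset mismatch: fetched offset =\s*(\d+), log end offset =\s*(\d+)"
)


def find_offset_mismatches(log_text):
    """Return (fetched_offset, log_end_offset, gap) tuples found in log_text."""
    results = []
    for match in MISMATCH.finditer(log_text):
        fetched = int(match.group(1))
        log_end = int(match.group(2))
        # gap > 0 means the local log end is ahead of the offset that was fetched
        results.append((fetched, log_end, log_end - fetched))
    return results


sample = (
    "Caused by: java.lang.RuntimeException: Offset mismatch: fetched offset =\n"
    "1077499310, log end offset = 1077499313.\n"
)
print(find_offset_mismatches(sample))
```

For the trace above the gap is 3, i.e. the follower's log end offset is
three messages past the offset it fetched.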


On Mon, Dec 23, 2013 at 2:50 PM, Drew Goya <d...@gradientx.com> wrote:

> We are running on an Amazon Linux AMI, this is our specific version:
>
> Linux version 2.6.32-220.23.1.el6.centos.plus.x86_64 (
> mockbu...@c6b5.bsys.dev.centos.org) (gcc version 4.4.6 20110731 (Red Hat
> 4.4.6-3) (GCC) ) #1 SMP Tue Jun 19 04:14:37 BST 2012
>
>
> On Mon, Dec 23, 2013 at 11:24 AM, Guozhang Wang <wangg...@gmail.com> wrote:
>
>> Hi Drew,
>>
>> I tried the kafka-server-stop script and it worked for me. Wondering which
>> OS are you using?
>>
>> Guozhang
>>
>>
>> On Mon, Dec 23, 2013 at 10:57 AM, Drew Goya <d...@gradientx.com> wrote:
>>
>> > Occasionally I do have to hard kill brokers; the kafka-server-stop.sh
>> > script stopped working for me a few months ago.  I saw another thread on
>> > the mailing list mentioning the issue too.  I'll change the signal back
>> > to SIGTERM and run that way for a while; hopefully the problem goes away.
>> >
>> > This is the commit where it changed:
>> >
>> > https://github.com/apache/kafka/commit/51de7c55d2b3107b79953f401fc8c9530bd0eea0
>> >
>> >
>> > On Mon, Dec 23, 2013 at 10:09 AM, Neha Narkhede <neha.narkh...@gmail.com> wrote:
>> >
>> > > Are you hard killing the brokers? And is this issue reproducible?
>> > >
>> > >
>> > > On Sat, Dec 21, 2013 at 11:39 AM, Drew Goya <d...@gradientx.com> wrote:
>> > >
>> > > > Hey guys, another small issue to report for 0.8.1.  After a couple of
>> > > > days, 3 of my brokers had fallen off the ISR list for 2-3 of their
>> > > > partitions.
>> > > >
>> > > > I didn't see anything unusual in the log, so I just restarted one.  It
>> > > > came up fine, but as it loaded its logs these messages showed up:
>> > > >
>> > > > [2013-12-21 19:25:19,968] WARN [ReplicaFetcherThread-0-2], Replica 1 for
>> > > > partition [Events2,58] reset its fetch offset to current leader 2's start
>> > > > offset 1042738519 (kafka.server.ReplicaFetcherThread)
>> > > > [2013-12-21 19:25:19,969] WARN [ReplicaFetcherThread-0-14], Replica 1 for
>> > > > partition [Events2,28] reset its fetch offset to current leader 14's start
>> > > > offset 1043415514 (kafka.server.ReplicaFetcherThread)
>> > > > [2013-12-21 19:25:20,012] WARN [ReplicaFetcherThread-0-2], Current offset
>> > > > 1011209589 for partition [Events2,58] out of range; reset offset to
>> > > > 1042738519 (kafka.server.ReplicaFetcherThread)
>> > > > [2013-12-21 19:25:20,013] WARN [ReplicaFetcherThread-0-14], Current offset
>> > > > 1010086751 for partition [Events2,28] out of range; reset offset to
>> > > > 1043415514 (kafka.server.ReplicaFetcherThread)
>> > > > [2013-12-21 19:25:20,036] WARN [ReplicaFetcherThread-0-14], Replica 1 for
>> > > > partition [Events2,71] reset its fetch offset to current leader 14's start
>> > > > offset 1026871415 (kafka.server.ReplicaFetcherThread)
>> > > > [2013-12-21 19:25:20,036] WARN [ReplicaFetcherThread-0-2], Replica 1 for
>> > > > partition [Events2,44] reset its fetch offset to current leader 2's start
>> > > > offset 1052372907 (kafka.server.ReplicaFetcherThread)
>> > > > [2013-12-21 19:25:20,036] WARN [ReplicaFetcherThread-0-14], Current offset
>> > > > 993879706 for partition [Events2,71] out of range; reset offset to
>> > > > 1026871415 (kafka.server.ReplicaFetcherThread)
>> > > > [2013-12-21 19:25:20,036] WARN [ReplicaFetcherThread-0-2], Current offset
>> > > > 1020715056 for partition [Events2,44] out of range; reset offset to
>> > > > 1052372907 (kafka.server.ReplicaFetcherThread)
>> > > >
>> > > > Judging by the network traffic and disk usage changes after the
>> > > > reboot (both jumped up), a couple of the partition replicas had fallen
>> > > > behind and are now catching up.
>> > > >
>> > > >
>> > > > On Thu, Dec 19, 2013 at 4:37 PM, Neha Narkhede <neha.narkh...@gmail.com> wrote:
>> > > >
>> > > > > Hi Drew,
>> > > > >
>> > > > > That problem will be fixed by
>> > > > > https://issues.apache.org/jira/browse/KAFKA-1074. I think we are close
>> > > > > to checking that in to trunk.
>> > > > >
>> > > > > Thanks,
>> > > > > Neha
>> > > > >
>> > > > >
>> > > > > On Wed, Dec 18, 2013 at 9:02 AM, Drew Goya <d...@gradientx.com> wrote:
>> > > > >
>> > > > > > Thanks Neha, I rolled upgrades and completed a rebalance!
>> > > > > >
>> > > > > > I ran into a few small issues I figured I would share.
>> > > > > >
>> > > > > > On a few brokers, there were some log directories left over from some
>> > > > > > failed rebalances which prevented the 0.8.1 brokers from starting once I
>> > > > > > completed the upgrade.  These directories contained an index file and a
>> > > > > > zero-size log file; once I cleaned those out, the brokers were able to
>> > > > > > start up fine.  If anyone else runs into the same problem, and is running
>> > > > > > RHEL, this is the bash script I used to clean them out:
>> > > > > >
>> > > > > > du --max-depth=1 -h /data/kafka/logs | grep K | sed 's/.*K.//' | xargs sudo rm -r
>> > > > > >
>> > > > > >
>> > > > > > On Tue, Dec 17, 2013 at 10:42 AM, Neha Narkhede <neha.narkh...@gmail.com> wrote:
>> > > > > >
>> > > > > > > There are no compatibility issues. You can roll upgrades through the
>> > > > > > > cluster one node at a time.
>> > > > > > >
>> > > > > > > Thanks
>> > > > > > > Neha
>> > > > > > >
>> > > > > > >
>> > > > > > > On Tue, Dec 17, 2013 at 9:15 AM, Drew Goya <d...@gradientx.com> wrote:
>> > > > > > >
>> > > > > > > > So I'm going to be going through the process of upgrading a cluster
>> > > > > > > > from 0.8.0 to the trunk (0.8.1).
>> > > > > > > >
>> > > > > > > > I'm going to be expanding this cluster several times and the problems
>> > > > > > > > with reassigning partitions in 0.8.0 mean I have to move to
>> > > > > > > > trunk (0.8.1) asap.
>> > > > > > > >
>> > > > > > > > Will it be safe to roll upgrades through the cluster one by one?
>> > > > > > > >
>> > > > > > > > Also are there any client compatibility issues I need to worry about?
>> > > > > > > > Am I going to need to pause/upgrade all my consumers/producers at once
>> > > > > > > > or can I roll upgrades through the cluster and then upgrade my clients
>> > > > > > > > one by one?
>> > > > > > > >
>> > > > > > > > Thanks in advance!
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>>
>>
>> --
>> -- Guozhang
>>
>
>
