We’ve seen something similar in our setup and, as you noticed, it happens 
infrequently. Based on my debugging, a few things might be causing this 
issue. Two of them are these settings:
1. replica.lag.time.max.ms, set to 10 secs by default
2. replica.socket.timeout.ms, set to 30 secs by default

In situations where the broker is busy with lots of clients, a follower's 
replica fetch request can take longer than usual or time out, i.e. it waits 
up to 30 secs without getting any response. The ReplicaManager thread calls 
maybeShrinkISR and shrinks the ISR if there has been no call from a follower 
within replica.lag.time.max.ms, which is possible under heavy load. Given 
that the socket timeout alone is 30 secs, longer than the 10 sec lag limit, 
the follower can be marked as not in the ISR.
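
To make the timing concrete, here is a rough Python sketch of the check 
described above. It is a simplified model, not the actual broker code: the 
Follower class, the out_of_sync function and the timestamps are all made up 
for illustration. Only the two default values come from the settings listed 
earlier.

```python
from dataclasses import dataclass

# Defaults quoted above, in milliseconds.
REPLICA_LAG_TIME_MAX_MS = 10_000
REPLICA_SOCKET_TIMEOUT_MS = 30_000


@dataclass
class Follower:
    """Hypothetical stand-in for a follower replica."""
    broker_id: int
    last_caught_up_ms: int  # last time the follower was caught up with the leader


def out_of_sync(followers, now_ms, max_lag_ms=REPLICA_LAG_TIME_MAX_MS):
    """Return broker ids that have not caught up within max_lag_ms."""
    return [f.broker_id for f in followers
            if now_ms - f.last_caught_up_ms > max_lag_ms]


# Follower 1 sent a fetch at t=0 that hung until the socket timeout.
# Follower 2 caught up 2 secs before the check.
followers = [Follower(1, 0), Follower(2, 28_000)]
print(out_of_sync(followers, now_ms=REPLICA_SOCKET_TIMEOUT_MS))
```

With the defaults, a single fetch that hangs for the full socket timeout is 
already three times past the lag limit, so follower 1 is the one dropped.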

What we’ve seen is shrinkISR and expandISR happening back to back, i.e. one 
call times out and the subsequent call makes the follower part of the ISR 
again. One option to try is to lower the socket timeout and increase 
replica.lag.time.max.ms.
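
As a quick sanity check on that suggestion, the comparison can be sketched in 
Python. The function name and the "tuned" values below are hypothetical, 
chosen only to illustrate the relationship between the two settings, and are 
not recommendations. The model assumes the worst case, where one fetch hangs 
for the whole socket timeout.

```python
def stalled_fetch_shrinks_isr(socket_timeout_ms, lag_time_max_ms):
    """True if one fetch that hangs for the full socket timeout would
    push the follower past the lag limit (simplified model)."""
    return socket_timeout_ms > lag_time_max_ms


# Defaults: 30 sec socket timeout vs 10 sec lag limit.
print(stalled_fetch_shrinks_isr(30_000, 10_000))

# Hypothetical tuning: socket timeout lowered, lag limit raised.
print(stalled_fetch_shrinks_isr(10_000, 30_000))
```

The idea is simply that a timed-out request should fail and be retried 
before the leader gives up on the follower.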

Thanks,
Harsha
On Jan 27, 2019, 8:48 AM -0800, Ashish Karalkar 
<ashish_karal...@yahoo.com.INVALID>, wrote:
> Hi Harsha,
> Thanks for the reply.
> The issue is resolved as of now and the root cause was a runaway application 
> spawning many instances of kafkacat and hammering the Kafka brokers. I am 
> still wondering what could be the reason for the shrink and expand if a 
> client hammers a broker.
> --Ashish
> On Thursday, January 24, 2019, 8:53:10 AM PST, Harsha Chintalapani 
> <ka...@harsha.io> wrote:
>
> Hi Ashish,
> What's your replica.lag.time.max.ms set to, and do you see any network 
> issues between brokers?
> -Harsha
>
>
>
> On Jan 22, 2019, 10:09 PM -0800, Ashish Karalkar 
> <ashish_karal...@yahoo.com.INVALID>, wrote:
> > Hi All,
> > We just upgraded from 0.10.x to 1.1 and enabled rack awareness on an 
> > existing cluster which has about 20 nodes in 4 racks. After this we see 
> > that a few brokers go into a continuous cycle of expanding and shrinking 
> > the ISR to itself, which is also causing high latency for serving 
> > metadata requests.
> > What is the impact of enabling rack awareness on an existing cluster, 
> > assuming the replication factor is 3 and the existing replicas may or may 
> > not have been in different racks when rack awareness was enabled, after 
> > which a rolling bounce was done?
> > The symptoms we are having are replica lag and slow metadata requests. 
> > Also, in the broker logs we continuously see disconnections from the 
> > broker where it is trying to expand.
> > Thanks for helping
> > --A
