Thanks for your answer, Neha. Currently we don't save the GC log. I will add that option and keep monitoring the issue.

Regards,
Libo

-----Original Message-----
From: Neha Narkhede [mailto:neha.narkh...@gmail.com]
Sent: Wednesday, August 28, 2013 4:25 PM
To: users@kafka.apache.org
Subject: Re: zookeeper session time out

Ah, you may be hitting the GC-due-to-IO issue. You can confirm whether this is really the case by looking at the gc.log on the broker and checking for a GC entry with a small user and sys time but a high real time.

We saw a similar problem of IO causing GC pauses when compressing our request log4j files, which happens every hour or so. Since these files are large and the gzip process hogs the IO bandwidth, the Linux box hits the dirty_ratio threshold and the kernel stops all threads doing I/O until all the dirty pages are flushed to disk. We have seen GC pauses of up to 15-20 seconds when this happens. A workaround is to increase your zookeeper session timeout to prevent the session expiration and the leader re-elections that follow.

As for your file deletion issue, we have seen that if you configure a Kafka broker with time-based expiration, it ends up deleting possibly 100s of large segment files all at the same time. This puts pressure on file system journaling (we are using ext4 in data=ordered mode) and it slows down writes on the Kafka side. Kafka should throttle time-based rolling as well as time-based expiration to prevent this situation.

With that said, we have never really seen this cause a GC pause like the one you described. So it will be good to investigate the root cause of your GC pause anyway. Could you check your gc.log and send back the relevant part of the log that shows the pause?

Thanks,
Neha

On Wed, Aug 28, 2013 at 1:09 PM, Yu, Libo <libo...@citi.com> wrote:

> Hi team,
>
> We notice that when the incoming throughput is very high, the broker has to
> delete old log files to free up disk space. That causes some kind of
> blocking (latency), and frequently the broker's zookeeper session times out.
> Currently our zookeeper timeout threshold is 4s. We can increase it. But if
> this threshold is too large, what is the consequence? Thanks.
>
> Libo
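
The check Neha describes (a GC entry with small user and sys time but high real time) could be automated along the lines below. This is only a rough sketch: it assumes the broker JVM is a HotSpot JVM writing GC details (for example with -XX:+PrintGCDetails), so that entries end in a "[Times: user=... sys=..., real=... secs]" summary. The function name, the thresholds and the sample lines are made up for illustration and are not taken from anyone's actual broker log.

```python
import re

# HotSpot appends a summary such as
#   [Times: user=0.05 sys=0.01, real=15.30 secs]
# to GC entries when GC details are being logged.
TIMES_RE = re.compile(
    r"\[Times: user=(?P<user>\d+\.\d+) sys=(?P<sys>\d+\.\d+), "
    r"real=(?P<real>\d+\.\d+) secs\]"
)


def find_io_stalled_pauses(lines, min_real_secs=1.0, max_cpu_fraction=0.1):
    """Return (line, user, sys, real) for pauses where wall-clock time is
    long but CPU time (user + sys) is only a small fraction of it.

    Both thresholds are arbitrary illustrative defaults, not recommendations.
    """
    suspects = []
    for line in lines:
        match = TIMES_RE.search(line)
        if match is None:
            continue  # not a GC entry with a timing summary
        user = float(match.group("user"))
        sys_time = float(match.group("sys"))
        real = float(match.group("real"))
        if real >= min_real_secs and (user + sys_time) <= max_cpu_fraction * real:
            suspects.append((line.strip(), user, sys_time, real))
    return suspects


# Hypothetical sample entries, written by hand to exercise the parser.
SAMPLE = [
    "1234.567: [GC [ParNew: 153344K->17024K(153344K), 15.3012 secs] "
    "[Times: user=0.05 sys=0.01, real=15.30 secs]",
    "1300.100: [GC [ParNew: 153344K->16000K(153344K), 0.0512 secs] "
    "[Times: user=0.20 sys=0.01, real=0.05 secs]",
    "1400.200: [Full GC 900000K->400000K(1048576K), 3.0021 secs] "
    "[Times: user=2.50 sys=0.10, real=3.00 secs]",
]

suspects = find_io_stalled_pauses(SAMPLE)
for text, user, sys_time, real in suspects:
    print(f"real={real:.2f}s user={user:.2f}s sys={sys_time:.2f}s :: {text}")
```

To run it against a real log, pass an open file object instead of SAMPLE. Only the first sample entry is flagged: the second is a short pause, and the third is long but CPU-bound, which points at ordinary collection work rather than threads blocked on I/O.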