Per my email to the list in September, when I reviewed the GC logs at that time I didn't see anything out of the ordinary. ( http://mail-archives.apache.org/mod_mbox/kafka-users/201609.mbox/%3CCABQB-gS7h4Nuq3TKgHoAVeRHPWnBNs2B0Tz0kCjmdB9c0SDcLQ%40mail.gmail.com%3E )
Reviewing the GC logs from this morning around the timestamp in my previous email, I see GC running roughly every 10-12 seconds, with total times similar to the following:

2017-01-12T07:16:46.867-0500: 46891.844: Total time for which application threads were stopped: 0.0141281 seconds, Stopping threads took: 0.0002171 seconds

Here is a link to a GCEasy report:
http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTcvMDEvMTIvLS10b3RhbEdDLWthZmthMS00LmxvZy5nei0tMTUtMzQtNTk=

We're currently using the G1 GC with the following settings:

-Xmx12G -Xms12G -server -XX:MaxPermSize=48M
-verbose:gc -Xloggc:/var/log/kafka/gc.log
-XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintTenuringDistribution
-XX:+PrintGCApplicationStoppedTime -XX:+PrintTLAB -XX:+DisableExplicitGC
-XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M
-XX:+UseCompressedOops -XX:+AlwaysPreTouch -XX:+UseG1GC -XX:MaxGCPauseMillis=20
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/kafka/heapDump.log
-Xloggc:/opt/kafka/current/bin/../logs/kafkaServer-gc.log -verbose:gc
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps
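In case it helps anyone double-check my reading of the logs, here is the throwaway scan I run to flag long safepoint pauses. This is just a sketch: it assumes the -XX:+PrintGCApplicationStoppedTime line format shown above, and the object name and one-second threshold are mine.

import scala.io.Source

object StoppedTimeScan {
  // Matches stopped-time lines like the example above:
  // "2017-01-12T07:16:46.867-0500: 46891.844: Total time for which
  //  application threads were stopped: 0.0141281 seconds, ..."
  private val Stopped =
    """(\S+): [\d.]+: Total time for which application threads were stopped: ([\d.]+) seconds.*""".r

  def main(args: Array[String]): Unit = {
    val threshold = 1.0 // seconds; arbitrary cut-off for "worth a look"
    for (line <- Source.fromFile(args(0)).getLines()) line match {
      case Stopped(timestamp, secs) if secs.toDouble >= threshold =>
        println(s"$timestamp stopped for $secs s")
      case _ => // not a stopped-time line, or below the threshold
    }
  }
}

Pointed at one of the rotated gc.log files, it prints the timestamp of every pause over the threshold; as the sample line above suggests, our pauses sit in the low tens of milliseconds, nowhere near the roll times discussed below.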
On Thu, Jan 12, 2017 at 9:55 AM, Tauzell, Dave <dave.tauz...@surescripts.com> wrote:

> Can you collect garbage collection stats and verify there isn't a long GC
> happening at the same time?
>
> -Dave
>
> -----Original Message-----
> From: Stephen Powis [mailto:spo...@salesforce.com]
> Sent: Thursday, January 12, 2017 8:34 AM
> To: users@kafka.apache.org
> Subject: Re: Taking a long time to roll a new log segment (~1 min)
>
> So per the Kafka docs I upped our FD limit to 100k, and we are no longer
> seeing the process die, which is good.
>
> Unfortunately we're still seeing very high log segment roll times, and I'm
> unsure if this is considered 'normal', as it tends to block producers
> during this period.
>
> We are running Kafka 0.10.0.1, but I patched in some additional timing
> statements into the kafka.log.Log roll() method to narrow down exactly
> which part of that method is taking so long.
>
> Again, typically the process to roll a new log file takes only 1-2 ms tops,
> but several times a day it takes 30-60+ seconds, across all of our
> brokers. I've narrowed it down to this bit of code causing the issue:
> https://github.com/apache/kafka/blob/0.10.0/core/src/main/scala/kafka/log/Log.scala#L652-L658
>
> Here's an example of output w/ my additional timing log statements:
>
> > [2017-01-12 07:17:58,199] INFO Rolled new log segment for 'MyTopic-4' in
> > 28028 ms. (kafka.log.Log)
> > [2017-01-12 07:17:58,200] INFO LastEntry match time 'EventStream-4' in
> > 28027 ms. (kafka.log.Log)
>
> Any thoughts? Help? Is this considered normal?
>
> Thanks
> Stephen
>
> On Tue, Jan 10, 2017 at 2:37 AM, Ewen Cheslack-Postava <e...@confluent.io>
> wrote:
>
> > I can't speak to the exact details of why fds would be kept open
> > longer in that specific case, but are you aware that the
> > recommendation for production clusters for open fd limits is much
> > higher? It's been suggested to be 100,000 as a starting point for quite
> > a while: http://kafka.apache.org/documentation.html#os
> >
> > -Ewen
> >
> > On Mon, Jan 9, 2017 at 12:45 PM, Stephen Powis <spo...@salesforce.com>
> > wrote:
> >
> > > Hey!
> > >
> > > I've run into something concerning in our production cluster.... I
> > > believe I've posted this question to the mailing list previously
> > > (http://mail-archives.apache.org/mod_mbox/kafka-users/201609.mbox/browser)
> > > but the problem has become considerably more serious.
> > >
> > > We've been fighting issues where Kafka 0.10.0.1 hits its max file
> > > descriptor limit. Our limit is set to ~16k, and under normal operation
> > > it holds steady around 4k open files.
> > >
> > > But occasionally Kafka will roll a new log segment, which typically
> > > takes on the order of a few milliseconds. However... sometimes it will
> > > take a considerable amount of time, anywhere from 40 seconds up to
> > > over a minute. When this happens, it seems like connections are not
> > > released by Kafka, and we end up with thousands of client connections
> > > stuck in CLOSE_WAIT, which pile up and exceed our max file descriptor
> > > limit. This happens all in the span of about a minute.
> > >
> > > Our logs look like this:
> > >
> > > > [2017-01-08 01:10:17,117] INFO Rolled new log segment for 'MyTopic-8'
> > > > in 41122 ms. (kafka.log.Log)
> > > > [2017-01-08 01:10:32,550] INFO Rolled new log segment for 'MyTopic-4'
> > > > in 1 ms. (kafka.log.Log)
> > > > [2017-01-08 01:11:10,039] INFO [Group Metadata Manager on Broker 4]:
> > > > Removed 0 expired offsets in 0 milliseconds.
> > > > (kafka.coordinator.GroupMetadataManager)
> > > > [2017-01-08 01:19:02,877] ERROR Error while accepting connection
> > > > (kafka.network.Acceptor)
> > > > java.io.IOException: Too many open files
> > > >   at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
> > > >   at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
> > > >   at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
> > > >   at kafka.network.Acceptor.accept(SocketServer.scala:323)
> > > >   at kafka.network.Acceptor.run(SocketServer.scala:268)
> > > >   at java.lang.Thread.run(Thread.java:745)
> > > > [2017-01-08 01:19:02,877] ERROR Error while accepting connection
> > > > (kafka.network.Acceptor)
> > > > java.io.IOException: Too many open files
> > > >   at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
> > > >   at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
> > > >   at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
> > > >   at kafka.network.Acceptor.accept(SocketServer.scala:323)
> > > >   at kafka.network.Acceptor.run(SocketServer.scala:268)
> > > >   at java.lang.Thread.run(Thread.java:745)
> > > > .....
> > >
> > > And then Kafka crashes.
> > >
> > > Has anyone seen this behavior of slow log segments being rolled?
> > > Any ideas of how to track down what could be causing this?
> > >
> > > Thanks!
> > > Stephen
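PS: For anyone who wants to reproduce the timing instrumentation mentioned above, what I patched into the kafka.log.Log roll() method was essentially this kind of wrapper. It's a sketch rather than the exact diff: the helper name is made up, and the println stands in for Kafka's own logging.

object RollTiming {
  // Runs a block and logs how long it took, mimicking the format of the
  // "Rolled new log segment ..." lines quoted earlier.
  def timed[T](label: String, topicPartition: String)(block: => T): T = {
    val start = System.nanoTime()
    try block
    finally {
      val elapsedMs = (System.nanoTime() - start) / 1000000
      println(s"INFO $label for '$topicPartition' in $elapsedMs ms. (kafka.log.Log)")
    }
  }
}

Each statement in roll() gets wrapped individually, e.g. (hypothetical call site):

RollTiming.timed("LastEntry match time", "EventStream-4") {
  // the index work linked above (Log.scala lines 652-658)
}

That's how the "LastEntry match time" line in my earlier output was produced, and it's what points the finger at that block rather than at GC.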