To add more information: we found that the Linux (CentOS) box running Kafka has a very high context switch rate. At first we thought it was caused by corrupted message data, but I deleted all data on one node and restarted Kafka, and the CPU usage is still high.
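One way to narrow this down further would be to attribute the switches to individual broker threads with pidstat from the sysstat package (a sketch, assuming pidstat is available on the host; 16781 is the broker PID from the 'top' output quoted below):

[root@ ~]# pidstat -w -t -p 16781 1 5

That reports voluntary (cswch/s) and involuntary (nvcswch/s) context switches per thread, once per second for five samples; the hottest TIDs can then be matched against the nid values in a jstack dump of the same process to get the Kafka thread names.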
Following are the outputs from vmstat and sar:

[root@ ~]# vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd    free   buff    cache   si   so    bi    bo    in      cs us sy id wa st
17  0      0 3189824 372180 10568836    0    0     0     3     0       0  4  4 92  0  0
17  0      0 3189816 372180 10568836    0    0     0    16 16702 4772818 28 25 47  0  0
16  0      0 3190224 372180 10568836    0    0     0     0 18956 5164116 31 29 40  0  0
16  0      0 3190404 372180 10568836    0    0     0    60 25567 4976697 29 28 43  0  0
15  0      0 3190424 372180 10568836    0    0     0     0 15033 4883211 29 28 43  0  0
16  0      0 3190672 372180 10568836    0    0     0     0 15340 4891077 29 29 42  0  0
14  0      0 3190796 372180 10568836    0    0     0     8 14408 4649112 27 26 47  0  0
16  0      0 3190832 372180 10568836    0    0     0    32 16415 5097257 30 31 39  0  0

[root@ ~]# sar -w 1
02:54:55 PM    proc/s   cswch/s
02:54:56 PM      0.00 6380752.63
02:54:57 PM      0.00 6295443.04
02:54:58 PM      0.00 6107574.36
02:54:59 PM      0.00 6512588.89
02:55:00 PM      0.00 5686784.52
02:55:01 PM      0.00 6081308.64
02:55:02 PM      0.00 6099097.53
02:55:03 PM      0.00 6085111.25
02:55:04 PM      0.00 6206551.90
02:55:05 PM      0.00 6026546.84
Average:         0.00 6145005.26

On Sat, Mar 19, 2016 at 11:32 PM, li jinyu <disorder...@gmail.com> wrote:

> thanks, the new Kafka cluster is stable now; I want to know the root cause.
>
> On Sat, Mar 19, 2016 at 3:34 PM, Stevo Slavić <ssla...@gmail.com> wrote:
>
>> Consider upgrading to 0.9.0.1 - our cluster is much more stable since the
>> upgrade. Clients can be upgraded later.
>>
>> On Fri, Mar 18, 2016, 23:22 li jinyu <disorder...@gmail.com> wrote:
>>
>> > I want to add more information.
>> >
>> > we've already set up a completely new Kafka cluster with 0.8.2.2, and the
>> > old cluster is still there with no consumers or producers; it's really
>> > weird that the CPU usage is still high.
>> >
>> > also changed fetch.purgatory.purge.interval.requests and
>> > producer.purgatory.purge.interval.requests to 100, as I saw the high CPU
>> > usage in FetchRequestPurgatory.expire(), but no luck.
>> >
>> > On Fri, Mar 18, 2016 at 7:07 PM, li jinyu <disorder...@gmail.com> wrote:
>> >
>> > > hi all,
>> > >
>> > > we encountered a problem with 0.8.1.1: a lot of BadVersion errors in
>> > > the log. we believe it's caused by issue
>> > > https://issues.apache.org/jira/browse/KAFKA-1382, so we upgraded Kafka
>> > > to 0.8.2.2.
>> > >
>> > > but we found that Kafka is using high CPU.
>> > > following is the result of the 'top' command:
>> > >
>> > >   PID USER PR NI  VIRT  RES SHR S  %CPU %MEM     TIME+ COMMAND
>> > > 16781 root 20  0 9072m 874m 17m S 975.5  5.5 471:27.56 java
>> > >
>> > > I tried to use HPROF to measure the CPU usage and got this result:
>> > >
>> > > CPU SAMPLES BEGIN (total = 139166340) Fri Mar 18 16:30:28 2016
>> > > rank   self  accum    count  trace method
>> > >    1 37.63% 37.63% 52372145 325245 sun.nio.ch.EPollArrayWrapper.epollWait
>> > >    2 33.37% 71.00% 46441166 319805 sun.nio.ch.EPollArrayWrapper.epollWait
>> > >    3  8.49% 79.50% 11819885 302339 java.net.PlainSocketImpl.socketAccept
>> > >    4  4.25% 83.74%  5909221 302910 java.net.PlainSocketImpl.socketAccept
>> > >    5  4.23% 87.98%  5891650 310256 sun.nio.ch.EPollArrayWrapper.epollWait
>> > >    6  4.21% 92.19%  5859357 319905 sun.nio.ch.EPollArrayWrapper.epollWait
>> > >    7  4.16% 96.34%  5782672 336990 java.net.SocketInputStream.socketRead0
>> > >    8  1.99% 98.33%  2762683 320651 sun.misc.Unsafe.park
>> > >    9  0.27% 98.60%   375233 319447 sun.misc.Unsafe.park
>> > >   10  0.25% 98.85%   351701 320536 sun.misc.Unsafe.park
>> > >   11  0.24% 99.09%   329232 329274 sun.misc.Unsafe.park
>> > >   12  0.08% 99.17%   109428 325405 sun.nio.ch.EPollArrayWrapper.epollCtl
>> > >   13  0.05% 99.22%    69762 329549 sun.nio.ch.EPollArrayWrapper.interrupt
>> > >   14  0.03% 99.25%    45197 328439 kafka.network.RequestChannel$Request.updateRequestMetrics
>> > >   15  0.03% 99.28%    38496 328333 sun.nio.ch.FileDispatcherImpl.write0
>> > >   16  0.02% 99.30%    33307 337909 java.lang.System.identityHashCode
>> > >   17  0.02% 99.32%    31783 328763 sun.nio.ch.NativeThread.current
>> > >   18  0.02% 99.34%    30917 322537 sun.misc.Unsafe.unpark
>> > >   19  0.02% 99.37%    29943 325979 sun.nio.ch.EPollArrayWrapper.epollWait
>> > >   20  0.02% 99.39%    27788 328474 sun.nio.ch.FileDispatcherImpl.write0
>> > >   21  0.02% 99.40%    26448 328397 sun.nio.ch.FileDispatcherImpl.write0
>> > >   22  0.02% 99.42%    25177 329364 scala.collection.mutable.HashTable$class.$init$
>> > >   23  0.02% 99.44%    24230 328786 java.lang.Thread.currentThread
>> > >   24  0.02% 99.46%    22683 328795 java.lang.Thread.currentThread
>> > >   25  0.01% 99.47%    16264 328092 sun.nio.ch.IOUtil.drain
>> > >
>> > > since epollWait shouldn't normally be this expensive, this is probably
>> > > an artifact of how the samples are collected.
>> > > I then tried YourKit and found that the heaviest method is
>> > > FetchRequestPurgatory.expire(); I uploaded a screenshot from YourKit:
>> > > http://imgur.com/2tXjhBI
>> > >
>> > > does anyone have the same problem?
>> > >
>> > > --
>> > > Don't schedule every day, make them disorder.
>> >
>> >
>> > --
>> > Don't schedule every day, make them disorder.
>>
>
>
> --
> Don't schedule every day, make them disorder.


--
Don't schedule every day, make them disorder.