Hi Lisheng,

Here are the answers to your questions.
do you set sun.security.jgss.native = true?
- No.

1. GC, but you say gc is not problem
- I have verified GC multiple times and I don't see it being an issue.

2. if you suspect network thread, how many thread did you set?
- Currently there are 3 network threads and 8 I/O threads per broker.

3. if you enable compression
- No, compression is not enabled.

4. did you change the value of batch.size at producer side?
- No, there haven't been any recent changes on the producer side.

5. do you think you can increase "fetch.min.bytes" at consumer side and
"replica.fetch.min.bytes" at broker to test if cpu usage can be down?
- We haven't tried that yet. If it is likely to bring the CPU down we can
give it a try; a sketch of the consumer-side change I have in mind is below.

6. you can check some metrics from jmx, e.g.
"kafka.network:type=RequestMetrics,name=RequestsPerSec,request={Produce|FetchConsumer|FetchFollower}"
- I don't see the RequestsPerSec metric in 2.3, but I do have
"kafka.network:type=RequestMetrics,name=TotalTimeMs":
  ProduceTotalTimeMs       - 1.25 ms
  FetchFollowerTotalTimeMs - 2.53 ms
  FetchConsumerTotalTimeMs - 12.5 ms
  (The JMX snippet after this list shows how I plan to query these.)
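For item 5, so we are testing the same thing, here is a minimal sketch of the
consumer-side change I have in mind. The class name, bootstrap server, group id
and the 64 KB trial value are placeholders rather than our real settings; on the
broker side the analogous knobs would be replica.fetch.min.bytes and
replica.fetch.wait.max.ms in server.properties.

    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class FetchTuningSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Placeholder endpoint and group id, not our production values.
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "cpu-usage-test");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                    StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                    StringDeserializer.class.getName());
            // Ask the broker to hold each fetch response until at least 64 KB is
            // available (the default is 1 byte), capped by fetch.max.wait.ms.
            props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 64 * 1024); // trial value
            props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 500);     // default, bounds added latency
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                // subscribe() and the usual poll loop would go here.
            }
        }
    }

My understanding is that a larger fetch.min.bytes makes the broker answer fewer,
larger fetch requests, which should take some load off the network threads at the
cost of a little extra latency bounded by fetch.max.wait.ms.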
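And for item 6, this is roughly how I plan to read those request meters over JMX
once I can see them. It assumes the brokers run with JMX enabled on port 9999 and
that the meters expose the usual OneMinuteRate attribute; the wildcard pattern is
there because, if I understand correctly, newer brokers add an extra version tag
to the RequestsPerSec beans, which may be why I didn't spot them at first.

    import java.util.Set;
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class BrokerRequestMetrics {
        public static void main(String[] args) throws Exception {
            // Assumes the broker JVM was started with JMX enabled on port 9999.
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                // Wildcard so beans with extra tags (e.g. a request version) still match.
                ObjectName pattern = new ObjectName(
                        "kafka.network:type=RequestMetrics,name=RequestsPerSec,*");
                Set<ObjectName> names = mbs.queryNames(pattern, null);
                for (ObjectName name : names) {
                    // Print the one-minute request rate for each Produce/Fetch meter found.
                    Object rate = mbs.getAttribute(name, "OneMinuteRate");
                    System.out.println(name + " OneMinuteRate=" + rate);
                }
            }
        }
    }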
Thanks.

On Wed, Jan 8, 2020 at 1:29 AM Lisheng Wang <wanglishen...@gmail.com> wrote:

> Hi Navneeth
>
> like the bug you said above, do you set sun.security.jgss.native = true?
>
> if not, there are some items that need to be checked.
>
> 1. GC, but you say gc is not problem
> 2. if you suspect network thread, how many thread did you set?
> 3. if you enable compression
> 4. did you change the value of batch.size at producer side?
> 5. do you think you can increase "fetch.min.bytes" at consumer side and
> "replica.fetch.min.bytes" at broker to test if cpu usage can be down?
> 6. you can check some metrics from jmx to analyze, e.g. checking
> "kafka.network:type=RequestMetrics,
> name=RequestsPerSec,request={Produce|FetchConsumer|FetchFollower}", if
> the value is high, that means cpu will be busy.
>
> Best,
> Lisheng
>
>
> Navneeth Krishnan <reachnavnee...@gmail.com> wrote on Wed, Jan 8, 2020 at 3:39 PM:
>
> > Hi All,
> >
> > Any suggestions, we are running into this issue in production and any
> > help would be greatly appreciated.
> >
> > Thanks
> >
> > On Mon, Jan 6, 2020 at 9:26 PM Navneeth Krishnan <reachnavnee...@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > Thanks for the response. We were using version 0.11 previously and all
> > > our producers/consumers have been upgraded to either 1.0 or to the
> > > latest 2.3.
> > >
> > > Is it normal for the network thread to consume more cpu? If you look at
> > > it, the network thread consumes 50% of the overall cpu.
> > >
> > > Regards
> > >
> > > On Mon, Jan 6, 2020 at 7:04 PM Thunder Stumpges <
> > > thunder.stump...@gmail.com> wrote:
> > >
> > >> Not sure what version your producers/consumers are, or if you upgraded
> > >> from a previous version that used to work, or what, but maybe you're
> > >> hitting this?
> > >>
> > >> https://kafka.apache.org/23/documentation.html#upgrade_10_performance_impact
> > >>
> > >> On Mon, Jan 6, 2020 at 12:48 PM Navneeth Krishnan <
> > >> reachnavnee...@gmail.com> wrote:
> > >>
> > >> > Hi All,
> > >> >
> > >> > Any idea on what can be done? Not sure if we are running into the
> > >> > bug below.
> > >> >
> > >> > https://issues.apache.org/jira/browse/KAFKA-7925
> > >> >
> > >> > Thanks
> > >> >
> > >> > On Thu, Jan 2, 2020 at 4:18 PM Navneeth Krishnan <
> > >> > reachnavnee...@gmail.com> wrote:
> > >> >
> > >> >> Hi All,
> > >> >>
> > >> >> We have a kafka cluster with 12 nodes and we are pretty much seeing
> > >> >> 90% cpu usage on all the nodes. Here is all the information. Need
> > >> >> some help on figuring out what the problem is and how to overcome
> > >> >> this issue.
> > >> >>
> > >> >> *Cluster:*
> > >> >> Kafka version: 2.3.0
> > >> >> Number of brokers in cluster: 12
> > >> >> Node type: 4 vCores, 32 GB mem
> > >> >> Network In: 10 Mbps per broker
> > >> >> Network Out: 16 Mbps per broker
> > >> >> Topics: 10 (approximately)
> > >> >> Partitions: 20 (max), some topics have fewer partitions
> > >> >> Replication Factor: 3
> > >> >>
> > >> >> *CPU Usage:*
> > >> >> [image: image.png]
> > >> >>
> > >> >> *VMStat*
> > >> >>
> > >> >> [root]# vmstat 1 10
> > >> >> procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
> > >> >>  r  b swpd   free   buff    cache   si  so   bi    bo    in    cs us sy id wa st
> > >> >>  8  0    0 234444  19064 24046980   0   0   17  2026     1     3 38 33 28  0  1
> > >> >>  7  0    0 256444  19036 24023880   0   0  768     0 64027 22708 44 40 16  0  1
> > >> >>  7  0    0 245356  19052 24034560   0   0  256   472 63509 23276 44 39 17  0  1
> > >> >>  7  0    0 235096  19052 24046616   0   0    0     0 62277 22516 46 38 15  0  1
> > >> >>  8  0    0 260548  19036 24020084   0   0  516 49888 62364 22894 43 38 18  0  1
> > >> >>  5  0    0 249232  19036 24030924   0   0  512     0 61022 24589 41 39 20  0  1
> > >> >>  6  0    0 238072  19036 24042512   0   0 1024     0 63358 23063 44 38 17  0  0
> > >> >>  5  0    0 262904  19052 24017972   0   0    0   440 63078 23499 46 37 17  0  1
> > >> >>  7  0    0 250324  19052 24030008   0   0    0     0 64615 22617 48 38 14  0  1
> > >> >>  6  0    0 237920  19052 24042372   0   0 1024 48900 63223 23029 42 40 18  0  1
> > >> >>
> > >> >> *IO Stat:*
> > >> >>
> > >> >> [root]# iostat -m
> > >> >> Linux 4.14.72-73.55.amzn2.x86_64 (loc-kafka11.internal.dnaspaces.io)  01/02/2020  _x86_64_  (4 CPU)
> > >> >>
> > >> >> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> > >> >>           38.11    0.00   33.09    0.11    0.61   28.08
> > >> >>
> > >> >> Device:     tps   MB_read/s   MB_wrtn/s   MB_read    MB_wrtn
> > >> >> xvda       2.36        0.01        0.01     26760      43360
> > >> >> nvme0n1    0.00        0.00        0.00         2          0
> > >> >> xvdf      70.95        0.06        7.67    185908   25205338
> > >> >>
> > >> >> *Top Kafka broker threads:*
> > >> >> [image: image.png]
> > >> >>
> > >> >> *Top 3:*
> > >> >>
> > >> >> "data-plane-kafka-network-thread-10-ListenerName(PLAINTEXT)-PLAINTEXT-0"
> > >> >> #60 prio=5 os_prio=0 tid=0x00007f8b1ab56000 nid=0x581f runnable [0x00007f8a886ce000]
> > >> >>
> > >> >> "data-plane-kafka-network-thread-10-ListenerName(PLAINTEXT)-PLAINTEXT-2"
> > >> >> #62 prio=5 os_prio=0 tid=0x00007f8b1ab59000 nid=0x5821 runnable [0x00007f8a6aefd000]
> > >> >>
> > >> >> "data-plane-kafka-network-thread-10-ListenerName(PLAINTEXT)-PLAINTEXT-1"
> > >> >> #61 prio=5 os_prio=0 tid=0x00007f8b1ab57800 nid=0x5820 runnable [0x00007f8a885cd000]
> > >> >>
> > >> >> It doesn't look like GC or I/O is the problem.
> > >> >>
> > >> >> Thanks