dan norwood created KAFKA-8414: ---------------------------------- Summary: org.apache.kafka.common.metrics.MetricsTest.testConcurrentReadUpdateReport hang Key: KAFKA-8414 URL: https://issues.apache.org/jira/browse/KAFKA-8414 Project: Kafka Issue Type: Bug Reporter: dan norwood
caveat: this only happens on AMD Epyc machines with >=48 cpus. i have below a bunch of machine info from various `*a.*` aws instance sizes i ran against. i noticed what seems like a deadlock when running `org.apache.kafka.common.metrics.MetricsTest` on an aws instance with 96vCPUs (specifically a m5a.24xlarge). after some debugging it seems like the offending issue is [https://github.com/apache/kafka/blob/trunk/clients/src/test/java/org/apache/kafka/common/metrics/MetricsTest.java#L776-L778] {code:java} public void run() { try { while (alive.get()) { op.run(); } } catch (Throwable t) { log.error("Metric {} failed with exception", opName, t); } } {code} since the `op.run()` methods are all synchronized we end up nonstop hammering it. after adding some logging i saw steadily increasing wait times for entry in to each synchronized block. so this is not *really* a deadlock or hang, but a progressive slowdown that makes the test unrunnable. the offending op seems to be [https://github.com/apache/kafka/blob/trunk/clients/src/test/java/org/apache/kafka/common/metrics/MetricsTest.java#L747] {code:java} Future<?> reportFuture = executorService.submit(new ConcurrentMetricOperation(alive, "report", () -> reporter.processMetrics())); {code} possible fix: adding a `Thread.sleep(0, 1)` inside the runloop for `ConcurrentMetricOperation` seems to allow the test to pass. but i'm not sure that it wouldn't mask an issue that the test is meant to detect Good: t3a.large ``` Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 2 On-line CPU(s) list: 0,1 Thread(s) per core: 2 Core(s) per socket: 1 Socket(s): 1 NUMA node(s): 1 Vendor ID: AuthenticAMD CPU family: 23 Model: 1 Model name: AMD EPYC 7571 Stepping: 2 CPU MHz: 2200.116 BogoMIPS: 4400.23 Hypervisor vendor: KVM Virtualization type: full L1d cache: 32K L1i cache: 64K L2 cache: 512K L3 cache: 8192K NUMA node0 CPU(s): 0,1 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext cpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr arat npt nrip_save ``` t3a.2xlarge ``` Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 8 On-line CPU(s) list: 0-7 Thread(s) per core: 2 Core(s) per socket: 4 Socket(s): 1 NUMA node(s): 1 Vendor ID: AuthenticAMD CPU family: 23 Model: 1 Model name: AMD EPYC 7571 Stepping: 2 CPU MHz: 2199.916 BogoMIPS: 4399.83 Hypervisor vendor: KVM Virtualization type: full L1d cache: 32K L1i cache: 64K L2 cache: 512K L3 cache: 8192K NUMA node0 CPU(s): 0-7 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext cpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr arat npt nrip_save ``` m5a.4xlarge ``` Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 16 On-line CPU(s) list: 0-15 Thread(s) per core: 2 Core(s) per socket: 8 Socket(s): 1 NUMA node(s): 1 Vendor ID: AuthenticAMD CPU family: 23 Model: 1 Model name: AMD EPYC 7571 Stepping: 2 CPU MHz: 2585.550 BogoMIPS: 4399.98 Hypervisor vendor: KVM Virtualization type: full L1d cache: 32K L1i cache: 64K L2 cache: 512K L3 cache: 8192K NUMA node0 CPU(s): 0-15 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext perfctr_core cpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr arat npt nrip_save ``` bad: m5a.4xlarge ``` Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 48 On-line CPU(s) list: 0-47 Thread(s) per core: 2 Core(s) per socket: 24 Socket(s): 1 NUMA node(s): 3 Vendor ID: AuthenticAMD CPU family: 23 Model: 1 Model name: AMD EPYC 7571 Stepping: 2 CPU MHz: 2315.397 BogoMIPS: 4400.13 Hypervisor vendor: KVM Virtualization type: full L1d cache: 32K L1i cache: 64K L2 cache: 512K L3 cache: 8192K NUMA node0 CPU(s): 0-7,24-31 NUMA node1 CPU(s): 8-15,32-39 NUMA node2 CPU(s): 16-23,40-47 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext perfctr_core cpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr arat npt nrip_save ``` m5a.24xlarge ``` Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 96 On-line CPU(s) list: 0-95 Thread(s) per core: 2 Core(s) per socket: 24 Socket(s): 2 NUMA node(s): 6 Vendor ID: AuthenticAMD CPU family: 23 Model: 1 Model name: AMD EPYC 7571 Stepping: 2 CPU MHz: 2421.512 BogoMIPS: 4399.19 Hypervisor vendor: KVM Virtualization type: full L1d cache: 32K L1i cache: 64K L2 cache: 512K L3 cache: 8192K NUMA node0 CPU(s): 0-7,48-55 NUMA node1 CPU(s): 8-15,56-63 NUMA node2 CPU(s): 16-23,64-71 NUMA node3 CPU(s): 24-31,72-79 NUMA node4 CPU(s): 32-39,80-87 NUMA node5 CPU(s): 40-47,88-95 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext perfctr_core cpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr arat npt nrip_save ``` -- This message was sent by Atlassian JIRA (v7.6.3#76005)