swathi mocharla created KAFKA-14092:
---------------------------------------
             Summary: Unsafe memory access operation exception leading to Error processing append operation on partition
                 Key: KAFKA-14092
                 URL: https://issues.apache.org/jira/browse/KAFKA-14092
             Project: Kafka
          Issue Type: Bug
    Affects Versions: 3.0.0
         Environment: K8S
            Reporter: swathi mocharla

Hi,

We are frequently seeing an "Unsafe memory access operation" exception leading to an "Error processing append operation on partition" error on a certain partition. Here are some logs:

{code:java}
{"process":"kafka.Kafka","service":"eventqueue","host":"hubudgen10-ncs-1-workerbm-2","container":"7efcf75afb20","system":"vtas250","neid":"123456","timezone":"UTC","type":"log","level":"info","time":"2022-07-20T11:15:10.624000000Z","log":{"msg":"[ProducerStateManager partition=cdr-group-000-ccf-000-mgmt-0] Wrote producer snapshot at offset 3114 with 0 producer ids in 47 ms.","class":"kafka.log.ProducerStateManager"}}
{"process":"kafka.Kafka","service":"eventqueue","host":"hubudgen10-ncs-1-workerbm-2","container":"7efcf75afb20","system":"vtas250","neid":"123456","timezone":"UTC","type":"log","level":"error","time":"2022-07-20T11:15:10.628000000Z","log":{"msg":"[ReplicaManager broker=1] Error processing append operation on partition cdr-group-000-ccf-000-mgmt-0java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code\njava.io.UnixFileSystem.getBooleanAttributes0(Native Method)\njava.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:242)\njava.io.File.exists(File.java:830)\nkafka.log.TransactionIndex.<init>(TransactionIndex.scala:50)\nkafka.log.LogSegment$.open(LogSegment.scala:663)\nkafka.log.Log.$anonfun$roll$2(Log.scala:1692)\nkafka.log.Log.roll(Log.scala:2487)\nkafka.log.Log.maybeRoll(Log.scala:1632)\nkafka.log.Log.append(Log.scala:900)\nkafka.log.Log.appendAsLeader(Log.scala:741)\nkafka.cluster.Partition.$anonfun$appendRecordsToLeader$1(Partition.scala:1042)\nkafka.cluster.Partition.appendRecordsToLeader(Partition.scala:1030)\nkafka.server.ReplicaManager.$anonfun$appendToLocalLog$6(ReplicaManager.scala:931)\nscala.collection.StrictOptimizedMapOps.map(StrictOptimizedMapOps.scala:28)\nscala.collection.StrictOptimizedMapOps.map$(StrictOptimizedMapOps.scala:27)\nscala.collection.mutable.HashMap.map(HashMap.scala:35)\nkafka.server.ReplicaManager.appendToLocalLog(ReplicaManager.scala:919)\nkafka.server.ReplicaManager.appendRecords(ReplicaManager.scala:591)\nkafka.server.KafkaApis.handleProduceRequest(KafkaApis.scala:658)\nkafka.server.KafkaApis.handle(KafkaApis.scala:169)\nkafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:75)\njava.lang.Thread.run(Thread.java:750)","class":"kafka.server.ReplicaManager"}}
{code}
{code:java}
{"process":"kafka.Kafka","service":"eventqueue","host":"hubudgen10-ncs-1-workerbm-2","container":"7efcf75afb20","system":"vtas250","neid":"123456","timezone":"UTC","type":"log","level":"warning","time":"2022-07-20T11:15:11.632000000Z","log":{"msg":"[Log partition=cdr-group-000-ccf-000-mgmt-0, dir=/data] Newly rolled segment file /data/cdr-group-000-ccf-000-mgmt-0/00000000000000003114.log already exists; deleting it first","class":"kafka.log.Log"}}
{"process":"kafka.Kafka","service":"eventqueue","host":"hubudgen10-ncs-1-workerbm-2","container":"7efcf75afb20","system":"vtas250","neid":"123456","timezone":"UTC","type":"log","level":"error","time":"2022-07-20T11:15:11.633000000Z","log":{"msg":"[ReplicaManager broker=1] Error processing append operation on partition cdr-group-000-ccf-000-mgmt-0java.lang.IllegalStateException: Attempt to append a timestamp (1658315708834) to slot 1 no larger than the last timestamp appended (8694993961132949504) to /data/cdr-group-000-ccf-000-mgmt-0/00000000000000003108.timeindex.\nkafka.log.TimeIndex.$anonfun$maybeAppend$1(TimeIndex.scala:128)\nkafka.log.TimeIndex.maybeAppend(TimeIndex.scala:114)\nkafka.log.LogSegment.onBecomeInactiveSegment(LogSegment.scala:503)\nkafka.log.Log.$anonfun$roll$8(Log.scala:1681)\nkafka.log.Log.$anonfun$roll$8$adapted(Log.scala:1681)\nscala.Option.foreach(Option.scala:437)\nkafka.log.Log.$anonfun$roll$2(Log.scala:1681)\nkafka.log.Log.roll(Log.scala:2487)\nkafka.log.Log.maybeRoll(Log.scala:1632)\nkafka.log.Log.append(Log.scala:900)\nkafka.log.Log.appendAsLeader(Log.scala:741)\nkafka.cluster.Partition.$anonfun$appendRecordsToLeader$1(Partition.scala:1042)\nkafka.cluster.Partition.appendRecordsToLeader(Partition.scala:1030)\nkafka.server.ReplicaManager.$anonfun$appendToLocalLog$6(ReplicaManager.scala:931)\nscala.collection.StrictOptimizedMapOps.map(StrictOptimizedMapOps.scala:28)\nscala.collection.StrictOptimizedMapOps.map$(StrictOptimizedMapOps.scala:27)\nscala.collection.mutable.HashMap.map(HashMap.scala:35)\nkafka.server.ReplicaManager.appendToLocalLog(ReplicaManager.scala:919)\nkafka.server.ReplicaManager.appendRecords(ReplicaManager.scala:591)\nkafka.server.KafkaApis.handleProduceRequest(KafkaApis.scala:658)\nkafka.server.KafkaApis.handle(KafkaApis.scala:169)\nkafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:75)\njava.lang.Thread.run(Thread.java:750)","class":"kafka.server.ReplicaManager"}}
{code}
Notice that Kafka picks up a last-appended timestamp that is completely unrelated to the current timestamps, or to anything near them:
{code:java}
Attempt to append a timestamp (1658315708834) to slot 1 no larger than the last timestamp appended (8694993961132949504)
{code}
These are the files that were generated, along with the dumps of a few of them:
{code:java}
bash-4.2$ ls -altr
total 8
-rw-r--r--. 1 logging logging 43 Jul 20 06:34 partition.metadata
-rw-r--r--. 1 logging logging 10 Jul 20 07:01 00000000000000000001.snapshot
-rw-r--r--. 1 logging logging 10 Jul 20 11:14 00000000000000003096.snapshot
-rw-r--r--. 1 logging logging 978 Jul 20 11:14 00000000000000003096.log
-rw-r--r--. 1 logging logging 0 Jul 20 11:14 00000000000000003096.index
-rw-r--r--. 1 logging logging 12 Jul 20 11:14 00000000000000003096.timeindex
-rw-r--r--. 1 logging logging 10 Jul 20 11:14 00000000000000003102.snapshot
-rw-r--r--. 1 logging logging 978 Jul 20 11:14 00000000000000003102.log
-rw-r--r--. 1 logging logging 0 Jul 20 11:15 00000000000000003102.index
-rw-r--r--. 1 logging logging 12 Jul 20 11:15 00000000000000003102.timeindex
-rw-r--r--. 1 logging logging 10 Jul 20 11:15 00000000000000003108.snapshot
-rw-r--r--. 1 logging logging 978 Jul 20 11:15 00000000000000003108.log
-rw-r--r--. 1 logging logging 0 Jul 20 11:15 00000000000000003108.index
-rw-r--r--. 1 logging logging 12 Jul 20 11:15 00000000000000003108.timeindex
-rw-r--r--. 1 logging logging 10 Jul 20 11:15 00000000000000003114.snapshot
-rw-r--r--. 1 logging logging 11 Jul 20 11:15 leader-epoch-checkpoint
drwxr-xr-x. 2 logging logging 16 Jul 20 11:15 .
drwxrwxrwx. 63 root root 68 Jul 21 04:49 ..
bash-4.2$ ls -la --time-style=full-iso 00000000000000003114.snapshot
-rw-r--r--. 1 logging logging 10 2022-07-20 11:15:10.580041566 +0000 00000000000000003114.snapshot
bash-4.2$ ls -la --time-style=full-iso 00000000000000003108.timeindex
-rw-r--r--. 1 logging logging 12 2022-07-20 11:15:10.522041021 +0000 00000000000000003108.timeindex
bash-4.2$ kafka-run-class.sh kafka.tools.DumpLogSegments --deep-iteration --print-data-log --files 00000000000000003108.timeindex
Dumping 00000000000000003108.timeindex
timestamp: 1658315708834 offset: 3113
{code}
I've checked the DumpLogSegments output of the .log files as well, and they look clean. I tried deleting the files, but the issue re-occurs almost every time. Could this be related to KAFKA-9156?

cc: [~ijuma], [~junrao]

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
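For context on why the second error fires on every subsequent produce: the time index enforces that appended timestamps never decrease, so once a garbage last timestamp is recorded, every real timestamp after it is rejected. The sketch below is an illustrative model of that invariant, not Kafka's actual `TimeIndex` code; it also shows that the reported last timestamp (8694993961132949504) is far outside any plausible epoch-millis range, which points to corruption rather than a genuinely late record.

```java
import java.time.Instant;

// Illustrative sketch of a monotonic time index, loosely modeled on the
// check in kafka.log.TimeIndex.maybeAppend. NOT Kafka's actual code.
public class TimeIndexSketch {
    private long lastTimestamp = Long.MIN_VALUE;

    // A new entry must not be older than the last one appended.
    public void maybeAppend(long timestamp, long offset) {
        if (timestamp < lastTimestamp) {
            throw new IllegalStateException(
                "Attempt to append a timestamp (" + timestamp +
                ") no larger than the last timestamp appended (" + lastTimestamp + ")");
        }
        lastTimestamp = timestamp;
    }

    public static void main(String[] args) {
        // The producer's timestamp from the log: 2022-07-20T11:15:08.834Z, plausible.
        long real = 1658315708834L;
        System.out.println(Instant.ofEpochMilli(real));

        // The corrupt value from the log: as epoch millis this is roughly
        // 275 million years in the future, so it cannot be a real record timestamp.
        long corrupt = 8694993961132949504L;

        TimeIndexSketch idx = new TimeIndexSketch();
        idx.maybeAppend(corrupt, 3108); // garbage value gets recorded first
        try {
            idx.maybeAppend(real, 3113); // every later real timestamp now fails
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Under this model, deleting the index files only helps until the next corrupt value is written, which would match the observation that the issue re-occurs after cleanup.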