Yaodong Yang created KAFKA-7572:
-----------------------------------

Summary: Producer should not send requests with negative partition id
Key: KAFKA-7572
URL: https://issues.apache.org/jira/browse/KAFKA-7572
Project: Kafka
Issue Type: Bug
Components: clients
Affects Versions: 1.0.1
Reporter: Yaodong Yang

h3. Issue:

In a Kafka producer log from one of our users, we found the following unusual entry:

timestamp="2018-10-09T17:37:41,237-0700",level="ERROR", Message="Write to Kafka failed with: ",exception="java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for topicName--2: 30042 ms has passed since batch creation plus linger time
 at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.valueOrError(FutureRecordMetadata.java:94)
 at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:64)
 at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:29)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
 at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for topicName--2: 30042 ms has passed since batch creation plus linger time"

After a few hours of debugging, we finally understood the root cause of this issue:
# The producer used a buggy custom Partitioner, which sometimes generated negative partition ids for new records.
# The corresponding produce requests were rejected by the brokers, because a partition with a negative id is illegal.
# The client kept refreshing its local cluster metadata, but could not send the produce requests successfully.

The clue was the suspicious string "topicName--2" in the log above:
# According to the source code, the format of this string in the log is TopicName+"-"+PartitionId.
# It is easy to overlook that the string contains 2 consecutive dashes.
# Eventually, we realized that the second dash was a negative sign. Therefore, the partition id is -2, rather than 2.
# The bug was in the custom Partitioner.

h3. Proposal:
# Producer code should check the partition id before sending requests to brokers.
# If the partition id is negative, it should just throw an IllegalStateException.
# Such a quick check can save a lot of time for people debugging their producer code.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
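The proposed check could look roughly like the following. This is only a minimal sketch in plain Java with no Kafka dependency: the class {{PartitionIdCheck}} and its methods are hypothetical names used for illustration, not existing producer code, and where such a check would actually live in the producer is left open. The {{describe}} method only reproduces the TopicName+"-"+PartitionId format described above, to show how a partition id of -2 ends up as "topicName--2" in the log.

```java
// Hypothetical helper for illustration only; not part of the Kafka client API.
final class PartitionIdCheck {
    private PartitionIdCheck() {}

    // Reproduces the log format described in this issue:
    // topic name + "-" + partition id.
    static String describe(String topic, int partition) {
        return topic + "-" + partition;
    }

    // Fails fast on a negative partition id, as suggested in the proposal,
    // instead of letting the batch expire with a TimeoutException later.
    static int requireNonNegative(String topic, int partition) {
        if (partition < 0) {
            throw new IllegalStateException(
                "Negative partition id " + partition + " for topic " + topic);
        }
        return partition;
    }
}
```

With such a check, a call like {{PartitionIdCheck.requireNonNegative("topicName", -2)}} would fail immediately with a message that names the negative id explicitly, rather than leaving the user to spot the double dash in "topicName--2".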