[ https://issues.apache.org/jira/browse/KAFKA-6098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16928844#comment-16928844 ]
Brian Byrne commented on KAFKA-6098:
------------------------------------

I've been investigating this issue and have collected some thoughts on it. Since I'm relatively new to Kafka, I'll be verbose in my explanation so that my understanding may be validated/corrected.

The criterion for a successful client topic deletion is that the server persists the intent to delete the topic by creating the ZK node /admin/delete_topics/<topic>. From that point, in-memory data structures are modified to reflect the ongoing destruction of the topic; eventually the topic's ZK node is removed, followed by the deletion intent node*. The operation is performed asynchronously because a topic may be ineligible for deletion for an indefinite amount of time, e.g. during partition reassignment or broker instability.

Topic listing and creation appear to be at odds with each other, further complicated by the race-prone ZK update sequence: the deletion intent node must be removed after the topic's node for obvious recovery-consistency reasons, but this also means there's a window where the deletion intent exists while the topic node doesn't. In that window, a racing topic re-creation is prone to unexpected and undesirable behavior, since the old topic may still be undergoing deletion (note that topic creation doesn't check for a deletion intent).

The 'list topics' request uses a different source of truth than the creation path: topics are gathered by looking at the states of their outstanding partitions. Those partitions may already be removed while the deletion is still outstanding, which is why the topic's ZK node may still exist at creation time, as [~guozhang] noted.

A possible fix would be to have 'list topics' be more conservative and also return topics that are undergoing deletion. This might require changes to how metadata snapshots are handled, which seems a bit excessive for resolving this issue, although I'm not familiar with that component.

The "easy fix" would have the create-topic path check the metadata cache for the topic's existence: if the topic doesn't exist but its deletion intent does, return a transient error asking the client to back off and retry (see the sketch below). This ensures that all state for the previous topic has been eliminated before the new one is created. The only downside is the window where no partitions for the topic exist (i.e. it doesn't appear in 'list topics') but the topic deletion cannot be completed; that window should be relatively small and is likely due to ZK inaccessibility, which would prevent the creation from completing anyway. Does this sound reasonable?

[*] There's actually a deletion of the topic's configuration in between, which may be missed in this case; that may be Peter's issue: [https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/controller/KafkaController.scala#L1621-L1627]
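To make the "easy fix" concrete, here's a rough sketch of the ordering check I have in mind. It's illustrative only: raw ZooKeeper calls and the topic znode stand in for the controller's internal zk utilities and metadata cache, and the names (CreateTopicGuard, the precheck outcomes) are made up for the sketch.

{code}
import org.apache.zookeeper.{WatchedEvent, Watcher, ZooKeeper}

// Sketch only: raw ZooKeeper calls stand in for the controller's zk utilities,
// and the outcome type stands in for the real error codes.
sealed trait CreateTopicPrecheck
case object ProceedWithCreate extends CreateTopicPrecheck
case object TopicAlreadyExists extends CreateTopicPrecheck
case object DeletionInProgressRetryLater extends CreateTopicPrecheck

object CreateTopicGuard {
  private val noopWatcher = new Watcher {
    override def process(event: WatchedEvent): Unit = ()
  }

  def precheck(zkConnect: String, topic: String): CreateTopicPrecheck = {
    val zk = new ZooKeeper(zkConnect, 30000, noopWatcher)
    try {
      val topicExists = zk.exists(s"/brokers/topics/$topic", false) != null
      val deletionPending = zk.exists(s"/admin/delete_topics/$topic", false) != null
      if (topicExists) TopicAlreadyExists
      else if (deletionPending) DeletionInProgressRetryLater // transient: back off and retry
      else ProceedWithCreate
    } finally zk.close()
  }
}
{code}

The important part is only the pair of existence checks: if the deletion intent node is still present for a topic that otherwise looks gone, the create should come back as retriable rather than proceeding or reporting TOPIC_ALREADY_EXISTS.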
> Delete and Re-create topic operation could result in race condition
> -------------------------------------------------------------------
>
>                 Key: KAFKA-6098
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6098
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Guozhang Wang
>            Priority: Major
>              Labels: reliability
>
> Here is the process to reproduce this issue:
> 1. Delete a topic using the delete topic request.
> 2. Confirm the topic is deleted using the list topics request.
> 3. Create the topic using the create topic request.
>
> In step 3) a race condition can occur where the response returns a
> {{TOPIC_ALREADY_EXISTS}} error code, indicating the topic already exists.
>
> The root cause of the above issue is in the {{TopicDeletionManager}} class:
> {code}
> controller.partitionStateMachine.handleStateChanges(partitionsForDeletedTopic.toSeq, OfflinePartition)
> controller.partitionStateMachine.handleStateChanges(partitionsForDeletedTopic.toSeq, NonExistentPartition)
> topicsToBeDeleted -= topic
> partitionsToBeDeleted.retain(_.topic != topic)
> kafkaControllerZkUtils.deleteTopicZNode(topic)
> kafkaControllerZkUtils.deleteTopicConfigs(Seq(topic))
> kafkaControllerZkUtils.deleteTopicDeletions(Seq(topic))
> controllerContext.removeTopic(topic)
> {code}
> I.e. it first updates the brokers' metadata caches through the ISR and metadata
> update requests, then deletes the topic zk path, and then deletes the
> topic-deletion zk path. However, when handling a create topic request, the
> broker simply tries to write to the topic zk path directly. Hence there is
> a race window between the brokers updating their metadata caches (so the list
> topics request no longer returns this topic) and the topic's zk path being
> deleted (so the create topic request succeeds).
>
> The reason this problem gets exposed is the current handling logic for the
> create topic response, most of which takes {{TOPIC_ALREADY_EXISTS}} as
> "OK" and moves on, while the zk path is deleted later, leaving the
> topic not created at all:
> https://github.com/apache/kafka/blob/249e398bf84cdd475af6529e163e78486b43c570/streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamsKafkaClient.java#L221
> https://github.com/apache/kafka/blob/1a653c813c842c0b67f26fb119d7727e272cf834/connect/runtime/src/main/java/org/apache/kafka/connect/util/TopicAdmin.java#L232
>
> Looking at the code history, it seems this race condition has always existed, but
> when testing on trunk / 1.0 with the above steps it is more likely to happen than
> before. I wonder if the ZK async calls have an effect here. cc [~junrao]
> [~onurkaraman]
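For reference, a minimal client-side sketch of the safer re-create handling implied above, where {{TOPIC_ALREADY_EXISTS}} during re-creation is treated as retriable rather than as success. It assumes the Java AdminClient; the RecreateTopic object, method name, and retry/backoff parameters are hypothetical.

{code}
import java.util.{Collections, Properties}
import java.util.concurrent.ExecutionException
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}
import org.apache.kafka.common.errors.TopicExistsException

// Hypothetical client-side workaround: when re-creating a just-deleted topic,
// treat TopicExistsException as retriable rather than as success, since the
// old topic's zk node may still be pending deletion.
object RecreateTopic {
  def recreateWithRetry(bootstrapServers: String, topic: String, partitions: Int,
                        replicationFactor: Short, maxAttempts: Int = 10,
                        backoffMs: Long = 1000L): Unit = {
    val props = new Properties()
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers)
    val admin = AdminClient.create(props)
    try {
      var attempt = 0
      var created = false
      while (!created && attempt < maxAttempts) {
        attempt += 1
        try {
          admin.createTopics(Collections.singleton(
            new NewTopic(topic, partitions, replicationFactor))).all().get()
          created = true
        } catch {
          case e: ExecutionException if e.getCause.isInstanceOf[TopicExistsException] =>
            // The old topic may still be mid-deletion; back off and retry
            // instead of assuming the existing topic is usable.
            Thread.sleep(backoffMs)
        }
      }
      if (!created)
        throw new IllegalStateException(
          s"Topic $topic still reported as existing after $maxAttempts attempts")
    } finally admin.close()
  }
}
{code}

This only narrows the window from the client side; the server-side check discussed in the comment above is still needed to rule the race out entirely.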