[ https://issues.apache.org/jira/browse/KAFKA-6098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16928844#comment-16928844 ]
Brian Byrne commented on KAFKA-6098:
------------------------------------

I've been investigating this issue and have collected some thoughts on it. Since I'm relatively new to Kafka, I'll be verbose in my explanation so that my understanding may be validated/corrected.

The criterion for a successful client topic deletion is that the server persists the intent to delete the topic by creating the ZK node /admin/delete_topics/<topic>. From that point, in-memory data structures are modified to reflect the ongoing destruction of the topic; eventually the topic's ZK node is removed, followed by the deletion intent node*. The operation is performed asynchronously because a topic may be ineligible for deletion for an indefinite amount of time, e.g. during partition reassignment or broker instability.

Topic listing and creation appear to be at odds with each other, further complicated by the race-prone ZK update sequence: the deletion intent node must be removed after the topic's node for obvious recovery-consistency reasons, but this also means there's a window where the deletion intent exists while the topic node doesn't. In that window, a racing topic re-creation is prone to unexpected and undesirable behavior, since the old topic may still be undergoing deletion (note that topic creation doesn't check for a deletion intent).

The 'list topics' request uses a different source of truth than the creation path: topics are gathered by looking at the states of their outstanding partitions. Those partitions may already be removed while the deletion is still outstanding, which is why the topic's ZK node may still exist at creation time, as [~guozhang] noted.

A possible fix would be to have 'list topics' be more conservative and also return topics that are undergoing deletion. This might require changes to how metadata snapshots are handled, which seems a bit excessive for resolving this issue, although I'm not familiar with that component.

The "easy fix" would have the create-topic path check the metadata cache for the topic's existence: if the topic doesn't exist but its deletion intent does, return a transient error asking the client to back off and retry (see the sketch below). This ensures that all state for the previous topic has been eliminated before the new one is created. The only downside is the window where no partitions for the topic exist (i.e. it doesn't appear in 'list topics') but the topic deletion cannot be completed; that window should be relatively small and is likely due to ZK inaccessibility, which would prevent the creation from completing anyway. Does this sound reasonable?

[*] There's actually a deletion of the topic's configuration in between, which may be missed in this case; that may be Peter's issue: [https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/controller/KafkaController.scala#L1621-L1627]
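To make the "easy fix" concrete, here's a rough sketch of the ordering check I have in mind. It's illustrative only: raw ZooKeeper calls and the topic znode stand in for the controller's internal zk utilities and metadata cache, and the names (CreateTopicGuard, the precheck outcomes) are made up for the sketch.

{code}
import org.apache.zookeeper.{WatchedEvent, Watcher, ZooKeeper}

// Sketch only: raw ZooKeeper calls stand in for the controller's zk utilities,
// and the outcome type stands in for the real error codes.
sealed trait CreateTopicPrecheck
case object ProceedWithCreate extends CreateTopicPrecheck
case object TopicAlreadyExists extends CreateTopicPrecheck
case object DeletionInProgressRetryLater extends CreateTopicPrecheck

object CreateTopicGuard {
  private val noopWatcher = new Watcher {
    override def process(event: WatchedEvent): Unit = ()
  }

  def precheck(zkConnect: String, topic: String): CreateTopicPrecheck = {
    val zk = new ZooKeeper(zkConnect, 30000, noopWatcher)
    try {
      val topicExists = zk.exists(s"/brokers/topics/$topic", false) != null
      val deletionPending = zk.exists(s"/admin/delete_topics/$topic", false) != null
      if (topicExists) TopicAlreadyExists
      else if (deletionPending) DeletionInProgressRetryLater // transient: back off and retry
      else ProceedWithCreate
    } finally zk.close()
  }
}
{code}

The important part is only the pair of existence checks: if the deletion intent node is still present for a topic that otherwise looks gone, the create should come back as retriable rather than proceeding or reporting TOPIC_ALREADY_EXISTS.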
> Delete and Re-create topic operation could result in race condition
> -------------------------------------------------------------------
>
>                 Key: KAFKA-6098
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6098
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Guozhang Wang
>            Priority: Major
>              Labels: reliability
>
> Here is the process to reproduce this issue:
> 1. Delete a topic using the delete topic request.
> 2. Confirm the topic is deleted using the list topics request.
> 3. Create the topic using the create topic request.
>
> In step 3) a race condition can occur where the response returns a
> {{TOPIC_ALREADY_EXISTS}} error code, indicating the topic already exists.
>
> The root cause of the above issue is in the {{TopicDeletionManager}} class:
> {code}
> controller.partitionStateMachine.handleStateChanges(partitionsForDeletedTopic.toSeq, OfflinePartition)
> controller.partitionStateMachine.handleStateChanges(partitionsForDeletedTopic.toSeq, NonExistentPartition)
> topicsToBeDeleted -= topic
> partitionsToBeDeleted.retain(_.topic != topic)
> kafkaControllerZkUtils.deleteTopicZNode(topic)
> kafkaControllerZkUtils.deleteTopicConfigs(Seq(topic))
> kafkaControllerZkUtils.deleteTopicDeletions(Seq(topic))
> controllerContext.removeTopic(topic)
> {code}
> I.e. it first updates the brokers' metadata caches through the ISR and metadata
> update requests, then deletes the topic zk path, and then deletes the
> topic-deletion zk path. However, when handling a create topic request, the
> broker simply tries to write to the topic zk path directly. Hence there is
> a race window between the brokers updating their metadata caches (so the list
> topics request no longer returns this topic) and the topic's zk path being
> deleted (so the create topic request succeeds).
>
> The reason this problem gets exposed is the current handling logic for the
> create topic response, most of which takes {{TOPIC_ALREADY_EXISTS}} as
> "OK" and moves on, while the zk path is deleted later, leaving the
> topic not created at all:
> https://github.com/apache/kafka/blob/249e398bf84cdd475af6529e163e78486b43c570/streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamsKafkaClient.java#L221
> https://github.com/apache/kafka/blob/1a653c813c842c0b67f26fb119d7727e272cf834/connect/runtime/src/main/java/org/apache/kafka/connect/util/TopicAdmin.java#L232
>
> Looking at the code history, it seems this race condition has always existed, but
> when testing on trunk / 1.0 with the above steps it is more likely to happen than
> before. I wonder if the ZK async calls have an effect here. cc [~junrao]
> [~onurkaraman]
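For reference, a minimal client-side sketch of the safer re-create handling implied above, where {{TOPIC_ALREADY_EXISTS}} during re-creation is treated as retriable rather than as success. It assumes the Java AdminClient; the RecreateTopic object, method name, and retry/backoff parameters are hypothetical.

{code}
import java.util.{Collections, Properties}
import java.util.concurrent.ExecutionException
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}
import org.apache.kafka.common.errors.TopicExistsException

// Hypothetical client-side workaround: when re-creating a just-deleted topic,
// treat TopicExistsException as retriable rather than as success, since the
// old topic's zk node may still be pending deletion.
object RecreateTopic {
  def recreateWithRetry(bootstrapServers: String, topic: String, partitions: Int,
                        replicationFactor: Short, maxAttempts: Int = 10,
                        backoffMs: Long = 1000L): Unit = {
    val props = new Properties()
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers)
    val admin = AdminClient.create(props)
    try {
      var attempt = 0
      var created = false
      while (!created && attempt < maxAttempts) {
        attempt += 1
        try {
          admin.createTopics(Collections.singleton(
            new NewTopic(topic, partitions, replicationFactor))).all().get()
          created = true
        } catch {
          case e: ExecutionException if e.getCause.isInstanceOf[TopicExistsException] =>
            // The old topic may still be mid-deletion; back off and retry
            // instead of assuming the existing topic is usable.
            Thread.sleep(backoffMs)
        }
      }
      if (!created)
        throw new IllegalStateException(
          s"Topic $topic still reported as existing after $maxAttempts attempts")
    } finally admin.close()
  }
}
{code}

This only narrows the window from the client side; the server-side check discussed in the comment above is still needed to rule the race out entirely.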