[ https://issues.apache.org/jira/browse/KAFKA-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15999253#comment-15999253 ]
ASF GitHub Bot commented on KAFKA-5099:
---------------------------------------

GitHub user onurkaraman opened a pull request:

    https://github.com/apache/kafka/pull/2986

    KAFKA-5099: Replica Deletion Regression from KIP-101

    Replica deletion regressed from KIP-101. Replica deletion happens when a broker receives a StopReplicaRequest with delete=true. Ever since KAFKA-1911, replica deletion has been async, meaning the broker responds with a StopReplicaResponse simply after marking the replica directory as staged for deletion. This marking happens by moving a data log directory and its contents, such as /tmp/kafka-logs1/t1-0, to a marked directory like /tmp/kafka-logs1/t1-0.8c9c4c0c61c44cc59ebeb00075a2a07f-delete, acting as a soft delete. A scheduled thread later actually deletes the data.

    The regression occurs while the scheduled thread is actually trying to delete the data. Because deletion is async, the controller has by that point already considered operations such as partition reassignment and topic deletion complete, but if you look at the log4j logs and data logs, you'll find that the soft-deleted data logs never actually get deleted. The bug is that upon log deletion, we attempt to flush the LeaderEpochFileCache to the original file location instead of the moved file location. Restarting the broker does allow the soft-deleted directories to get deleted.

    This patch avoids the issue by simply not flushing the LeaderEpochFileCache upon log deletion.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/onurkaraman/kafka KAFKA-5099

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/kafka/pull/2986.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #2986

----

commit 081e09be262dea261f7faf9b5d81b50995600f68
Author: Onur Karaman <okara...@linkedin.com>
Date:   2017-05-06T00:59:36Z

    KAFKA-5099: Replica Deletion Regression from KIP-101
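For illustration only, here is a minimal, self-contained Scala sketch of the failure mode described above (hypothetical names, not Kafka code): once the partition directory has been renamed for the soft delete, any checkpoint write that still goes through the original directory path fails with a FileNotFoundException, so the scheduled hard delete never succeeds.

{code}
import java.io.{File, FileNotFoundException, FileOutputStream}
import java.nio.file.Files

// Standalone repro sketch (hypothetical, not Kafka code): rename a partition
// directory to mark it for deletion, then try to write a checkpoint through
// the original, now-nonexistent path.
object StaleCheckpointPathRepro {
  def main(args: Array[String]): Unit = {
    // Stand-in for a partition directory like /tmp/kafka-logs1/t1-0.
    val logDir = Files.createTempDirectory("t1-0").toFile
    new File(logDir, "leader-epoch-checkpoint").createNewFile()

    // Soft delete: the whole partition directory is renamed to <dir>.<uuid>-delete.
    val renamed = new File(logDir.getParentFile, logDir.getName + ".some-uuid-delete")
    logDir.renameTo(renamed)

    // A later flush that still holds the original path fails, so the scheduled
    // deletion keeps getting requeued and the directory is never removed.
    val staleTmp = new File(logDir, "leader-epoch-checkpoint.tmp")
    try new FileOutputStream(staleTmp).close()
    catch { case e: FileNotFoundException => println(s"flush failed: ${e.getMessage}") }
  }
}
{code}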
----

Replica Deletion Regression from KIP-101
----------------------------------------

                Key: KAFKA-5099
                URL: https://issues.apache.org/jira/browse/KAFKA-5099
            Project: Kafka
         Issue Type: Bug
           Reporter: Onur Karaman
           Assignee: Onur Karaman
           Priority: Blocker
            Fix For: 0.11.0.0

It appears that replica deletion regressed from KIP-101. Replica deletion happens when a broker receives a StopReplicaRequest with delete=true. Ever since KAFKA-1911, replica deletion has been async, meaning the broker responds with a StopReplicaResponse simply after marking the replica directory as staged for deletion. This marking happens by moving a data log directory and its contents, such as /tmp/kafka-logs1/t1-0, to a marked directory like /tmp/kafka-logs1/t1-0.8c9c4c0c61c44cc59ebeb00075a2a07f-delete, acting as a soft delete. A scheduled thread later actually deletes the data.

It appears that the regression occurs while the scheduled thread is actually trying to delete the data. Because deletion is async, the controller already considers operations such as partition reassignment and topic deletion complete by then, but if you look at the log4j logs and data logs, you'll find that the soft-deleted data logs haven't actually been deleted. It seems that restarting the broker does allow the soft-deleted directories to get deleted.

Here's the setup:
{code}
> ./bin/zookeeper-server-start.sh config/zookeeper.properties
> export LOG_DIR=logs0 && ./bin/kafka-server-start.sh config/server0.properties
> export LOG_DIR=logs1 && ./bin/kafka-server-start.sh config/server1.properties
> ./bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic t0 --replica-assignment 1:0
> ./bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic t1 --replica-assignment 1:0
> ./bin/kafka-topics.sh --zookeeper localhost:2181 --delete --topic t0
> cat p.txt
{"partitions":
 [
  {"topic": "t1", "partition": 0, "replicas": [0] }
 ],
 "version":1
}
> ./bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 --reassignment-json-file p.txt --execute
{code}

Here are sample logs:
{code}
[2017-04-20 17:46:54,801] INFO [ReplicaFetcherManager on broker 1] Removed fetcher for partitions t0-0 (kafka.server.ReplicaFetcherManager)
[2017-04-20 17:46:54,814] INFO Log for partition t0-0 is renamed to /tmp/kafka-logs1/t0-0.bbc8fa126e3e4ff787f6b68d158ab771-delete and is scheduled for deletion (kafka.log.LogManager)
[2017-04-20 17:47:27,585] INFO Deleting index /tmp/kafka-logs1/t0-0.bbc8fa126e3e4ff787f6b68d158ab771-delete/00000000000000000000.index (kafka.log.OffsetIndex)
[2017-04-20 17:47:27,586] INFO Deleting index /tmp/kafka-logs1/t0-0/00000000000000000000.timeindex (kafka.log.TimeIndex)
[2017-04-20 17:47:27,587] ERROR Exception in deleting Log(/tmp/kafka-logs1/t0-0.bbc8fa126e3e4ff787f6b68d158ab771-delete). Moving it to the end of the queue. (kafka.log.LogManager)
java.io.FileNotFoundException: /tmp/kafka-logs1/t0-0/leader-epoch-checkpoint.tmp (No such file or directory)
	at java.io.FileOutputStream.open0(Native Method)
	at java.io.FileOutputStream.open(FileOutputStream.java:270)
	at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
	at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
	at kafka.server.checkpoints.CheckpointFile.write(CheckpointFile.scala:41)
	at kafka.server.checkpoints.LeaderEpochCheckpointFile.write(LeaderEpochCheckpointFile.scala:61)
	at kafka.server.epoch.LeaderEpochFileCache.kafka$server$epoch$LeaderEpochFileCache$$flush(LeaderEpochFileCache.scala:178)
	at kafka.server.epoch.LeaderEpochFileCache$$anonfun$clear$1.apply$mcV$sp(LeaderEpochFileCache.scala:161)
	at kafka.server.epoch.LeaderEpochFileCache$$anonfun$clear$1.apply(LeaderEpochFileCache.scala:159)
	at kafka.server.epoch.LeaderEpochFileCache$$anonfun$clear$1.apply(LeaderEpochFileCache.scala:159)
	at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:213)
	at kafka.utils.CoreUtils$.inWriteLock(CoreUtils.scala:221)
	at kafka.server.epoch.LeaderEpochFileCache.clear(LeaderEpochFileCache.scala:159)
	at kafka.log.Log.delete(Log.scala:1051)
	at kafka.log.LogManager.kafka$log$LogManager$$deleteLogs(LogManager.scala:442)
	at kafka.log.LogManager$$anonfun$startup$5.apply$mcV$sp(LogManager.scala:241)
	at kafka.utils.KafkaScheduler$$anonfun$1.apply$mcV$sp(KafkaScheduler.scala:110)
	at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:57)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)
{code}

The same appears on broker 1 for t1-0 after partition reassignment moves the replica off broker 1.
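The stack trace shows the flush being reached from Log.delete via LeaderEpochFileCache.clear, using the pre-rename path. The patch in the comment above simply skips that flush when the log is being deleted. As a rough, hypothetical sketch of that approach, assuming a simplified stand-in cache (not the real kafka.server.epoch.LeaderEpochFileCache):

{code}
import java.io.{File, PrintWriter}
import scala.collection.mutable.ListBuffer

// Simplified, hypothetical epoch cache; it only illustrates the shape of the
// fix: clear in-memory state without writing the checkpoint file when the log
// is being deleted.
class SimpleEpochCache(checkpointFile: File) {
  private val epochs = ListBuffer.empty[(Int, Long)] // (leaderEpoch, startOffset)

  def assign(epoch: Int, startOffset: Long): Unit = {
    epochs += ((epoch, startOffset))
    flush() // normal updates still checkpoint to disk
  }

  // Called on log deletion: the directory may already have been renamed for
  // the soft delete, so the (now stale) checkpoint path must not be written.
  def clearWithoutFlush(): Unit = epochs.clear()

  private def flush(): Unit = {
    val writer = new PrintWriter(checkpointFile)
    try epochs.foreach { case (epoch, offset) => writer.println(s"$epoch $offset") }
    finally writer.close()
  }
}
{code}

Normal epoch assignments still checkpoint to disk; only the deletion path avoids touching a file that may already have been renamed away.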
Here's what the data logs show:
{code}
> l /tmp/kafka-logs*/t*
/tmp/kafka-logs0/t0-0.166137ca8c28475d8c9fbf6f423a937e-delete:
total 0
-rw-r--r-- 1 okaraman wheel 0 Apr 20 17:46 00000000000000000000.timeindex
-rw-r--r-- 1 okaraman wheel 0 Apr 20 17:46 leader-epoch-checkpoint
drwxr-xr-x 2 okaraman wheel 68 Apr 20 17:46 pid-mapping

/tmp/kafka-logs0/t1-0:
total 40960
-rw-r--r-- 1 okaraman wheel 10485760 Apr 20 17:46 00000000000000000000.index
-rw-r--r-- 1 okaraman wheel 0 Apr 20 17:46 00000000000000000000.log
-rw-r--r-- 1 okaraman wheel 10485756 Apr 20 17:46 00000000000000000000.timeindex
-rw-r--r-- 1 okaraman wheel 0 Apr 20 17:46 leader-epoch-checkpoint
drwxr-xr-x 2 okaraman wheel 68 Apr 20 17:46 pid-mapping

/tmp/kafka-logs1/t0-0.bbc8fa126e3e4ff787f6b68d158ab771-delete:
total 0
-rw-r--r-- 1 okaraman wheel 0 Apr 20 17:46 00000000000000000000.timeindex
-rw-r--r-- 1 okaraman wheel 0 Apr 20 17:46 leader-epoch-checkpoint
drwxr-xr-x 2 okaraman wheel 68 Apr 20 17:46 pid-mapping

/tmp/kafka-logs1/t1-0.8c9c4c0c61c44cc59ebeb00075a2a07f-delete:
total 0
-rw-r--r-- 1 okaraman wheel 0 Apr 20 17:48 00000000000000000000.timeindex
-rw-r--r-- 1 okaraman wheel 0 Apr 20 17:46 leader-epoch-checkpoint
drwxr-xr-x 2 okaraman wheel 68 Apr 20 17:46 pid-mapping
{code}

I ran the above test on the checkin preceding KIP-101 and it worked just fine.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)