[ https://issues.apache.org/jira/browse/KUDU-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17795324#comment-17795324 ]
Zoltan Martonka commented on KUDU-3528:
---------------------------------------

Idea from [~alexey]:

"There is a way to first move the .data and .metadata files being deleted into an intermediate "to-delete" sub-directory (and that is more robust, assuming the FS error happening doesn't affect all the FS metadata and inode blocks), and only deleting those files when they have been moved in that sub-directory. It's not a big deal from the IO perspective since that's just some inode-related ops on the same FS. Now, even if there is an IO error when trying to remove the .data file, the LogBlockManager isn't going to be affected by that since it's not loading the data from the "to-delete" sub-directory."

> non-empty .data file left without .metadata file
> ------------------------------------------------
>
>            Key: KUDU-3528
>            URL: https://issues.apache.org/jira/browse/KUDU-3528
>        Project: Kudu
>     Issue Type: Bug
>       Reporter: Zoltan Martonka
>       Priority: Major
>
> The following might happen if there is a filesystem error:
> # We delete the last block in a full LogBlockContainer.
> # We do not hole punch the last block, because we will delete the whole file
> anyway (see LogBlockDeletionTransaction::~LogBlockDeletionTransaction).
> # We try to delete the ".data" and ".metadata" files, but only manage to
> delete the ".metadata" file.
> # Having a non-zero size ".data" file without a ".metadata" file will prevent
> the LogBlockManager from restarting.
>
> // ----------------------------- Repro steps --------------------
> 1. In block_manager-test.cc, in
> TYPED_TEST(BlockManagerTest, TestMetadataOkayDespiteFailure),
> change FLAGS_log_container_max_size so that the container gets filled
> (=\{block size} * kNumAppends). On a normal 4k x86_64 system, just change it
> to 16*1024 (from 256 * 1024).
> 2. Set FLAGS_log_block_manager_delete_dead_container=true.
> 3. Run the tests a couple of times. They have a good chance of failing.
> 4. Around the end of LogBlockContainerNativeMeta::CheckContainerFiles(...),
> change the
> {code:java}
> if (s_meta.IsNotFound()) RETURN_NOT_OK_CONTAINER_DISK_FAILURE(s_meta);
> {code}
> line to a multi-line form and put a breakpoint in it.
> 5. Run under a debugger. When the breakpoint is hit, you can see the
> container name 3 calls up the stack.
> 6. While stopped at the breakpoint, go to your 'data' dir. You will see
> \{container_name}.data, whose size is not 0, and no matching .metadata file.
> This is the problem.
> // ----------------------- end of repro ---------
>
> If you run the test with the original FLAGS_log_container_max_size, and stop
> it before
> {code:java}
> ASSERT_OK(this->ReopenBlockManager(scoped_refptr<MetricEntity>(),
>                                    shared_ptr<MemTracker>(),
>                                    { GetTestDataDirectory() },
>                                    false /* create */));
> {code}
> you can still see .data files without .metadata, but their size will always
> be 0.