[ https://issues.apache.org/jira/browse/KUDU-3527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17893652#comment-17893652 ]
ASF subversion and git services commented on KUDU-3527:
-------------------------------------------------------

Commit 4843ecfe72c4453e990e6256cd4d25f7f9ebcf6d in kudu's branch refs/heads/branch-1.17.x from Zoltan Martonka
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=4843ecfe7 ]

KUDU-3527 Fix block manager test when using 64k container block alignment

BlockManagerTest.TestMetadataOkayDespiteFailure might fail on systems where we use 64k alignment for data blocks.

Root cause:
Currently tablets fail to load if the ".metadata" file is missing but there is still a non-empty ".data" file. If FLAGS_env_inject_eio is set to greater than zero, then there is a chance that, when we delete a container, we delete only the ".metadata" file but leave the ".data" file. So deleting containers with injected IO errors is expected to sometimes prevent the block manager from restarting properly. However, container deletion almost never occurred in this test until we ran it on the new RHEL 8.8 ARM with 64K page size.

Why is it stable on x86_64:
On x86_64 we usually use 4k block alignment. We write 6080 bytes of data into a block, which is padded to 8k. So in the current test settings we have 32 blocks in a container when it becomes full (FLAGS_log_container_max_size = 256k). Later we delete exactly half of the 500 blocks we wrote. The chance of deleting all 32 blocks in a container is very small, and even if that happens, the container has only around a 0.09 chance of becoming corrupted. The test is a bit flaky, but it would fail less than once in a billion runs. If you dramatically decrease the FLAGS_log_container_max_size flag, the test starts to occasionally fail on a traditional x86_64 machine too.

Why is it unstable with 64k alignment:
With 64k alignment (currently used on ARM RHEL 8.8 with 64k page size), there are 4 blocks in a full container file. We write 500 blocks, so we expect to have nearly 125 full files. If we delete exactly half of the blocks, we will make many (full) container files empty.
Some of them might fail to be deleted properly, leaving a lonely non-empty ".data" file without a ".metadata" file. On my RHEL machine the test fails 97-98% of the time for this exact reason.

Solution:
The test TestMetadataOkayDespiteFailure was supposed to test reloading the block manager with containers having deleted blocks, with some previous failed deletes. It (probably) never tested the case when container deletion occurs.
+ Disabled container deletion, so the test scope remains the same as it was with smaller block alignments.
+ Added a new (currently disabled) test to see how the block manager handles the situation described above. Filed a JIRA issue to track it: KUDU-3528. The original issue is not ARM specific, is far from trivial to solve, and was always in the system.

Change-Id: I7e325bde502b7d7f39dd17fa84cb7eb42a3d7c20
Reviewed-on: http://gerrit.cloudera.org:8080/20725
Reviewed-by: Ashwani Raina <ara...@cloudera.com>
Reviewed-by: Alexey Serbin <ale...@apache.org>
Tested-by: Alexey Serbin <ale...@apache.org>
(cherry picked from commit 366596389b12df977a93408eebdc7dc9e30be3b3)
Reviewed-on: http://gerrit.cloudera.org:8080/21984
Tested-by: Abhishek Chennaka <achenn...@cloudera.com>

> Fix BlockManagerTest.TestMetadataOkayDespiteFailure on rhel 8.8 graviton
> -------------------------------------------------------------------------
>
>                 Key: KUDU-3527
>                 URL: https://issues.apache.org/jira/browse/KUDU-3527
>             Project: Kudu
>          Issue Type: Bug
>            Reporter: Zoltan Martonka
>            Assignee: Zoltan Martonka
>            Priority: Major
>             Fix For: 1.18.0
>
>
> BlockManagerTest.TestMetadataOkayDespiteFailure might fail on systems where
> fs_block_size=64k.
> *Cause:*
> Currently tablets fail to load if a ".metadata" file is missing but there is
> still a non-empty ".data" file. If FLAGS_env_inject_eio is not zero, then
> there is a chance that, when we delete a container, we delete only the
> ".metadata" file but leave the ".data" file.
> In the current test on systems with fs_block_size=4k, deletion never occurs.
> Changing to kNumAppends=64 will cause the test to randomly fail on x86
> systems too, although only with a 2-3% chance (at least on my ubuntu20
> machine).
> *Solution:*
> This test was not intended to test the file deletion itself (as it does not
> do it on x86_64 or 4k ARM kernels). Deletion only occurs because
> _FLAGS_log_container_max_size = 256 * 1024;_ is not "large enough".
> We should just set _FLAGS_log_block_manager_delete_dead_container = false;_
> to restore the original scope of the test.
> There is a separate issue for the root cause (which is not ARM specific at
> all):
> https://issues.apache.org/jira/browse/KUDU-3528

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
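
The container arithmetic in the commit message above can be reproduced with a short sketch. This is not Kudu code: the helper names are hypothetical, and the probability model assumes that the 250 deleted blocks are chosen uniformly at random from the 500 written ones, which the message does not state explicitly. The block size, the alignments, and the container size are taken from the message.

```python
from math import comb


def round_up(n, alignment):
    """Round n up to the next multiple of alignment."""
    return -(-n // alignment) * alignment


def blocks_per_container(block_bytes, alignment, container_max_bytes):
    """Number of padded blocks that fit before the container is full."""
    return container_max_bytes // round_up(block_bytes, alignment)


def p_container_emptied(blocks_in_container, total_blocks, deleted_blocks):
    """Probability that every block of one given container is deleted.

    Assumption: the deleted blocks form a uniformly random subset of the
    written blocks (hypergeometric model).
    """
    if blocks_in_container > deleted_blocks:
        return 0.0
    return (comb(total_blocks - blocks_in_container,
                 deleted_blocks - blocks_in_container)
            / comb(total_blocks, deleted_blocks))


BLOCK_BYTES = 6080             # data written per block in the test
CONTAINER_MAX = 256 * 1024     # FLAGS_log_container_max_size in the test
TOTAL_BLOCKS = 500
DELETED_BLOCKS = TOTAL_BLOCKS // 2

for alignment in (4 * 1024, 64 * 1024):
    per_container = blocks_per_container(BLOCK_BYTES, alignment, CONTAINER_MAX)
    full_containers = TOTAL_BLOCKS // per_container
    p = p_container_emptied(per_container, TOTAL_BLOCKS, DELETED_BLOCKS)
    print(f"alignment={alignment}: {per_container} blocks per container, "
          f"{full_containers} full containers, "
          f"expected emptied containers ~ {full_containers * p:.3g}")
```

With 4k alignment the sketch gives 32 blocks per container and with 64k alignment it gives 4, matching the figures in the commit message. Under the stated assumption, a full container is emptied with a probability of roughly 6% in the 64k case, so several of the 125 full containers are expected to be deleted in each run, while in the 4k case the probability is below one in a billion.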