Wellington Chevreuil created HDFS-11821:
-------------------------------------------

             Summary: BlockManager.getMissingReplOneBlocksCount() does not 
report correct value if corrupt file with replication factor of 1 gets deleted
                 Key: HDFS-11821
                 URL: https://issues.apache.org/jira/browse/HDFS-11821
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: hdfs
    Affects Versions: 3.0.0-alpha2
            Reporter: Wellington Chevreuil
            Assignee: Wellington Chevreuil
            Priority: Minor


*BlockManager* keeps a separate metric for the number of missing blocks with 
replication factor of 1. It is currently returned by the 
*BlockManager.getMissingReplOneBlocksCount()* method, and it is what backs the 
attribute shown below in *dfsadmin -report* (in the example below, there is one 
corrupt block belonging to a file with replication factor of 1):

{noformat}
...
Missing blocks (with replication factor 1): 1
...
{noformat}

However, if the related file gets deleted (for instance, via the *hdfs fsck 
-delete* option), this metric never gets updated, and *dfsadmin -report* will 
keep reporting a missing block even though the file does not exist anymore. 
The only workaround available is to restart the NN, so that the metric gets 
cleared.

This can be easily reproduced by forcing corruption of a file with replication 
factor 1, as follows:

1) Put a file into HDFS with replication factor 1:

{noformat}
$ hdfs dfs -Ddfs.replication=1 -put test_corrupt /
$ hdfs dfs -ls /

-rw-r--r--   1 hdfs     supergroup         19 2017-05-10 09:21 /test_corrupt

{noformat}

2) Find the block for the file and delete it from the DN:

{noformat}
$ hdfs fsck /test_corrupt -files -blocks -locations

...
/test_corrupt 19 bytes, 1 block(s):  OK
0. BP-782213640-172.31.113.82-1494420317936:blk_1073742742_1918 len=19 
Live_repl=1 
[DatanodeInfoWithStorage[172.31.112.178:20002,DS-a0dc0b30-a323-4087-8c36-26ffdfe44f46,DISK]]

Status: HEALTHY
...

$ find /dfs/dn/ -name blk_1073742742*

/dfs/dn/current/BP-782213640-172.31.113.82-1494420317936/current/finalized/subdir0/subdir3/blk_1073742742
/dfs/dn/current/BP-782213640-172.31.113.82-1494420317936/current/finalized/subdir0/subdir3/blk_1073742742_1918.meta

$ rm -rf 
/dfs/dn/current/BP-782213640-172.31.113.82-1494420317936/current/finalized/subdir0/subdir3/blk_1073742742
$ rm -rf 
/dfs/dn/current/BP-782213640-172.31.113.82-1494420317936/current/finalized/subdir0/subdir3/blk_1073742742_1918.meta

{noformat}

3) Running fsck will report the corruption as expected:

{noformat}
$ hdfs fsck /test_corrupt -files -blocks -locations

...
/test_corrupt 19 bytes, 1 block(s): 
/test_corrupt: CORRUPT blockpool BP-782213640-172.31.113.82-1494420317936 block 
blk_1073742742
 MISSING 1 blocks of total size 19 B
...
Total blocks (validated):       1 (avg. block size 19 B)
  ********************************
  UNDER MIN REPL'D BLOCKS:      1 (100.0 %)
  dfs.namenode.replication.min: 1
  CORRUPT FILES:        1
  MISSING BLOCKS:       1
  MISSING SIZE:         19 B
  CORRUPT BLOCKS:       1
...
{noformat}

4) The same goes for *dfsadmin -report*:

{noformat}
$ hdfs dfsadmin -report
...
Under replicated blocks: 1
Blocks with corrupt replicas: 0
Missing blocks: 1
Missing blocks (with replication factor 1): 1
...
{noformat}

5) Running fsck with the *-delete* option does make fsck report the correct 
information about the corrupt block, but *dfsadmin -report* still shows the 
missing replication-factor-1 block:

{noformat}

$ hdfs fsck /test_corrupt -delete
...
$ hdfs fsck /
...
The filesystem under path '/' is HEALTHY
...

$ hdfs dfsadmin -report
...
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 1
...
{noformat}

The problem seems to be in the *BlockManager.removeBlock()* method, which in 
turn uses the util class *LowRedundancyBlocks*, which classifies blocks 
according to their current replication level, including blocks currently 
marked as corrupt. 

The metric shown by *dfsadmin -report* for corrupt blocks with replication 
factor 1 is tracked inside this *LowRedundancyBlocks*. Whenever a block with 
replication factor 1 is marked as corrupt, the metric is incremented. When the 
block is removed, though, *BlockManager.removeBlock()* calls 
*LowRedundancyBlocks.remove(BlockInfo block, int priLevel)*, which does not 
check whether the given block was previously marked as corrupt with 
replication factor 1, which is required in order to update the metric.

I will shortly be proposing a patch that seems to fix this by making 
*BlockManager.removeBlock()* call *LowRedundancyBlocks.remove(BlockInfo block, 
int oldReplicas, int oldReadOnlyReplicas, int outOfServiceReplicas, int 
oldExpectedReplicas)* instead, which does update the metric properly.
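
To make the mechanics concrete, below is a minimal, self-contained model in 
plain Java. It is *not* the actual HDFS code: the class name 
*SimplifiedLowRedundancyBlocks* and the trimmed-down overload signatures are 
invented here purely for illustration, and the real *LowRedundancyBlocks* 
keeps several priority queues and also tracks read-only and out-of-service 
replicas. It only sketches why removal by priority level alone leaves the 
repl-1 counter stale, while removal with the old replica counts lets the 
counter be decremented.

{code:java}
import java.util.HashSet;
import java.util.Set;

class SimplifiedLowRedundancyBlocks {
  // Index of the corrupt-block queue in this simplified model.
  static final int QUEUE_WITH_CORRUPT_BLOCKS = 4;

  private final Set<Long> corruptBlocks = new HashSet<>();
  // Counter standing in for what getMissingReplOneBlocksCount() reports.
  private long corruptReplOneBlocks = 0;

  // Marking a block corrupt: bump the repl-1 counter when the file's
  // expected replication is 1.
  void addCorrupt(long blockId, int expectedReplicas) {
    if (corruptBlocks.add(blockId) && expectedReplicas == 1) {
      corruptReplOneBlocks++;
    }
  }

  // Removal path described above for BlockManager.removeBlock(): only the
  // priority level is known, so the repl-1 counter is never touched.
  boolean remove(long blockId, int priLevel) {
    return corruptBlocks.remove(blockId);
  }

  // Removal path the proposed patch would use: the caller passes the old
  // replica counts, so a corrupt replication-factor-1 block can be detected
  // and the counter decremented. (The real overload also takes read-only and
  // out-of-service replica counts.)
  boolean remove(long blockId, int oldReplicas, int oldExpectedReplicas) {
    boolean removed = corruptBlocks.remove(blockId);
    if (removed && oldReplicas == 0 && oldExpectedReplicas == 1) {
      corruptReplOneBlocks--;
    }
    return removed;
  }

  long getMissingReplOneBlocksCount() {
    return corruptReplOneBlocks;
  }

  public static void main(String[] args) {
    // Current behaviour: remove by priority level only -> counter goes stale.
    SimplifiedLowRedundancyBlocks current = new SimplifiedLowRedundancyBlocks();
    current.addCorrupt(1073742742L, 1);
    current.remove(1073742742L, QUEUE_WITH_CORRUPT_BLOCKS);
    System.out.println("remove(block, priLevel):       "
        + current.getMissingReplOneBlocksCount()); // prints 1 (stale)

    // Proposed behaviour: remove with the old replica counts -> counter cleared.
    SimplifiedLowRedundancyBlocks proposed = new SimplifiedLowRedundancyBlocks();
    proposed.addCorrupt(1073742742L, 1);
    proposed.remove(1073742742L, 0, 1);
    System.out.println("remove(block, oldReplicas...): "
        + proposed.getMissingReplOneBlocksCount()); // prints 0
  }
}
{code}

In this model, the priority-level removal reproduces the stale "Missing blocks 
(with replication factor 1)" value seen in *dfsadmin -report* after 
*fsck -delete*, while the replica-count removal clears it.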



