Apache Hadoop qbt Report: trunk+JDK8 on Linux/x86_64

2022-07-16 Thread Apache Jenkins Server
For more details, see 
https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java8-linux-x86_64/923/

[Jul 15, 2022 8:52:12 PM] (noreply) HDFS-16566 Erasure Coding: Recovery may 
cause excess replicas when a busy DN exists (#4252)
[Jul 15, 2022 9:18:46 PM] (noreply) HADOOP-13144. Enhancing IPC client 
throughput via multiple connections per user (#4542)


[Error replacing 'FILE' - Workspace is not accessible]


Apache Hadoop qbt Report: branch-2.10+JDK7 on Linux/x86_64

2022-07-16 Thread Apache Jenkins Server
For more details, see 
https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/724/

No changes


[Error replacing 'FILE' - Workspace is not accessible]


[jira] [Created] (HDFS-16663) Allow block reconstruction pending timeout to be refreshable

2022-07-16 Thread caozhiqiang (Jira)
caozhiqiang created HDFS-16663:
--

 Summary: Allow block reconstruction pending timeout to be 
refreshable
 Key: HDFS-16663
 URL: https://issues.apache.org/jira/browse/HDFS-16663
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: ec, namenode
Affects Versions: 3.4.0
Reporter: caozhiqiang
Assignee: caozhiqiang


In [HDFS-16613|https://issues.apache.org/jira/browse/HDFS-16613], increasing the 
value of dfs.namenode.replication.max-streams-hard-limit maximizes the IO 
performance of a decommissioning DN that has a lot of EC blocks. Besides this, we 
also need to decrease the value of dfs.namenode.reconstruction.pending.timeout-sec 
(default 5 minutes) to shorten the interval at which pendingReconstructions is 
checked; otherwise the decommissioning node would sit idle waiting for copy tasks 
for much of those 5 minutes.

During the decommission process, we may need to reconfigure these two parameters 
several times. Since [HDFS-14560|https://issues.apache.org/jira/browse/HDFS-14560], 
dfs.namenode.replication.max-streams-hard-limit can already be reconfigured 
dynamically without a namenode restart. The 
dfs.namenode.reconstruction.pending.timeout-sec parameter also needs to be 
dynamically reconfigurable, as in the example below.
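Once both properties are reconfigurable, refreshing them would follow the usual 
dfsadmin reconfiguration flow. The commands below are only a usage sketch, not 
part of this issue's patch; <nn_host:ipc_port> is a placeholder for the actual 
NameNode RPC address, and hdfs-site.xml is assumed to have been updated on the 
NameNode host first:
{quote}> hdfs dfsadmin -reconfig namenode <nn_host:ipc_port> start
> hdfs dfsadmin -reconfig namenode <nn_host:ipc_port> status
> hdfs dfsadmin -reconfig namenode <nn_host:ipc_port> properties
{quote}
The "properties" subcommand lists which keys the running NameNode accepts for 
dynamic reconfiguration, so it can be used to confirm the new parameter is 
actually refreshable.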

 






[jira] [Created] (HDFS-16664) Use correct GenerationStamp when invalidating corrupt block replica

2022-07-16 Thread Kevin Wikant (Jira)
Kevin Wikant created HDFS-16664:
---

 Summary: Use correct GenerationStamp when invalidating corrupt 
block replica
 Key: HDFS-16664
 URL: https://issues.apache.org/jira/browse/HDFS-16664
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Kevin Wikant


While trying to backport HDFS-16064 to an older Hadoop version, the new unit 
test "testDeleteCorruptReplicaForUnderReplicatedBlock" started failing 
unexpectedly.

Upon deep-diving into this unit test failure, I identified a bug in HDFS corrupt 
replica invalidation that results in the following datanode exception:
{quote}2022-07-16 08:07:52,041 [BP-958471676-X-1657973243350 heartbeating to 
localhost/127.0.0.1:61365] WARN  datanode.DataNode 
(BPServiceActor.java:processCommand(887)) - Error processing datanode Command
java.io.IOException: Failed to delete 1 (out of 1) replica(s):
0) Failed to delete replica blk_1073741825_1005: GenerationStamp not matched, 
existing replica is blk_1073741825_1001
        at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2139)
        at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2034)
        at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:735)
        at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:680)
        at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:883)
        at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:678)
        at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:849)
        at java.lang.Thread.run(Thread.java:750)
{quote}
 * The issue is that the Namenode is sending the wrong generationStamp to the 
datanode. By adding some additional logs, I was able to determine the root cause: 
the generationStamp sent in the DNA_INVALIDATE is based on the [generationStamp of 
the block sent in the block 
report|https://github.com/apache/hadoop/blob/8774f178686487007dcf8c418c989b785a529000/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L3733]
 * the problem is that the datanode with the corrupt block replica (the one that 
receives the DNA_INVALIDATE) is not necessarily the same datanode that sent the 
block report
 * this can cause the above exception when the corrupt block replica on the 
datanode receiving the DNA_INVALIDATE and the block replica on the datanode that 
sent the block report have different generationStamps

The solution is to store the corrupt replica's generationStamp in the 
CorruptReplicasMap, and then extract this correct generationStamp value when 
sending the DNA_INVALIDATE to the datanode; a sketch of the idea follows.
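For illustration only, here is a minimal, self-contained sketch of that 
bookkeeping. It is not Hadoop's actual CorruptReplicasMap/BlockManager code; the 
class and method names (CorruptReplicaTracker, recordCorruptReplica, 
genStampForInvalidate) are hypothetical. It only shows the idea of keying the 
corrupt replica's own generationStamp by (block, datanode) so the DNA_INVALIDATE 
carries the stamp of the replica actually being deleted.
{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical sketch, not Hadoop's CorruptReplicasMap API.
class CorruptReplicaTracker {

  // Key: block id plus the datanode that holds the corrupt replica.
  private static final class Key {
    final long blockId;
    final String datanodeUuid;

    Key(long blockId, String datanodeUuid) {
      this.blockId = blockId;
      this.datanodeUuid = datanodeUuid;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof Key)) {
        return false;
      }
      Key k = (Key) o;
      return blockId == k.blockId && datanodeUuid.equals(k.datanodeUuid);
    }

    @Override
    public int hashCode() {
      return Objects.hash(blockId, datanodeUuid);
    }
  }

  // Remember the generation stamp of the corrupt replica as it exists on that
  // specific datanode, not the stamp carried by whichever block report
  // happened to trigger the corruption marking.
  private final Map<Key, Long> corruptReplicaGenStamps = new HashMap<>();

  // Called when a replica on a given datanode is marked corrupt.
  void recordCorruptReplica(long blockId, String datanodeUuid, long replicaGenStamp) {
    corruptReplicaGenStamps.put(new Key(blockId, datanodeUuid), replicaGenStamp);
  }

  // Called when building the DNA_INVALIDATE command for that datanode: prefer
  // the stored stamp so the datanode can match and delete its own replica,
  // falling back to the reported stamp if nothing was recorded.
  long genStampForInvalidate(long blockId, String datanodeUuid, long reportedGenStamp) {
    return corruptReplicaGenStamps.getOrDefault(new Key(blockId, datanodeUuid), reportedGenStamp);
  }
}
{code}
With bookkeeping like this, the invalidate sent to the datanode in the quoted 
exception would carry the stamp of the replica actually stored there 
(blk_1073741825_1001) rather than the _1005 stamp taken from another datanode's 
block report, avoiding the "GenerationStamp not matched" failure.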

 
h2. Failed Test - Before the fix
{quote}> mvn test 
-Dtest=TestDecommission#testDeleteCorruptReplicaForUnderReplicatedBlock

 

[INFO] Results:
[INFO] 
[ERROR] Failures: 
[ERROR]   TestDecommission.testDeleteCorruptReplicaForUnderReplicatedBlock:2035 
Node 127.0.0.1:61366 failed to complete decommissioning. numTrackedNodes=1 , 
numPendingNodes=0 , adminState=Decommission In Progress , 
nodesWithReplica=[127.0.0.1:61366, 127.0.0.1:61419]
{quote}
Logs:
{quote}> cat 
target/surefire-reports/org.apache.hadoop.hdfs.TestDecommission-output.txt | 
grep 'Expected Replicas:\|XXX\|FINALIZED\|Block now\|Failed to delete'


2022-07-16 08:07:45,891 [Listener at localhost/61378] INFO  
hdfs.TestDecommission 
(TestDecommission.java:testDeleteCorruptReplicaForUnderReplicatedBlock(1942)) - 
Block now has 2 corrupt replicas on [127.0.0.1:61370 , 127.0.0.1:61375] and 1 
live replica on 127.0.0.1:61366
2022-07-16 08:07:45,913 [Listener at localhost/61378] INFO  
hdfs.TestDecommission 
(TestDecommission.java:testDeleteCorruptReplicaForUnderReplicatedBlock(1974)) - 
Block now has 2 corrupt replicas on [127.0.0.1:61370 , 127.0.0.1:61375] and 1 
decommissioning replica on 127.0.0.1:61366
XXX invalidateBlock dn=127.0.0.1:61415 , blk=1073741825_1001
XXX postponeBlock dn=127.0.0.1:61415 , blk=1073741825_1001
XXX invalidateBlock dn=127.0.0.1:61419 , blk=1073741825_1003
XXX addToInvalidates dn=127.0.0.1:61419 , blk=1073741825_1003
XXX addBlocksToBeInvalidated dn=127.0.0.1:61419 , blk=1073741825_1003
XXX rescanPostponedMisreplicatedBlocks blk=1073741825_1005
XXX DNA_INVALIDATE dn=/127.0.0.1:61419 , blk=1073741825_1003
XXX invalidate(on DN) dn=/127.0.0.1:61419 , invalidBlk=blk_1073741825_1003 , 
blkByIdAndGenStamp = FinalizedReplica, blk_1073741825_1003, FINALIZED
2022-07-16 08:07:49,084 [BP-958471676-X-1657973243350 heartbeating to 
localhost/127.0.0.1:61365] INFO  impl.FsDatasetAsyncDiskService 
(FsDatasetAsyncDiskService.java:deleteAsync(226)) - Scheduling 
blk_1073741825_1003 replica FinalizedReplica, blk_1073741825_1003, FINALIZED
XXX addBlock