Re: [VOTE] Release Apache Hadoop Thirdparty 1.2.0 RC0

2024-02-04 Thread Shuyan Zhang
+1 (non-binding)

- Verified hashes
- LICENSE and NOTICE are included.
- Rat check is ok. `mvn clean apache-rat:check`
- `mvn clean install` works well


slfan1989 wrote on Fri, Feb 2, 2024 at 11:11:

> Thank you very much for the review! I will avoid the diff.
>
> Best Regards,
> Shilun Fan.
>
> On Fri, Feb 2, 2024 at 9:59 AM Takanobu Asanuma wrote:
>
> > It also looks good to me, except for the diff.
> >
> > * Verified signatures and hashes
> > * Reviewed the documents
> > * Successfully built from source with `mvn clean install`
> > * Successfully compiled Hadoop trunk and branch-3.4 using the Hadoop
> > thirdparty 1.2.0
> >
> > Anyway, since hadoop-thirdparty-1.1.1 has some high vulnerabilities,
> > hadoop-thirdparty-1.2.0 would be required for Hadoop-3.4.0.
> >
> > Thanks,
> > - Takanobu
> >
> > On Fri, Feb 2, 2024 at 4:45, slfan1989 wrote:
> >
> > > Thank you for helping to review Hadoop-Thirdparty-1.2.0-RC0 and
> > > providing feedback!
> > >
> > > I followed the "how to release" documentation and tried to package it
> > > using create-release and the Dockerfile, but I couldn't package it
> > > successfully as-is; some modifications were required before compilation.
> > > I had to submit a pull request to fix this issue before
> > > Hadoop-Thirdparty-1.2.0-RC0 could compile.
> > >
> > > This is an area that needs improvement. We should ensure that the source
> > > tarball is consistent with the tag.
> > >
> > > On Fri, Feb 2, 2024 at 2:25 AM Ayush Saxena wrote:
> > >
> > > >
> > > > There is some diff b/w the git tag & the src tar; the Dockerfile & the
> > > > create-release are different. Why?
> > > >
> > > > Files hadoop-thirdparty/dev-support/bin/create-release and
> > > > hadoop-thirdparty-1.2.0-src/dev-support/bin/create-release differ
> > > >
> > > > Files hadoop-thirdparty/dev-support/docker/Dockerfile and
> > > > hadoop-thirdparty-1.2.0-src/dev-support/docker/Dockerfile differ
> > > >
> > > >
> > > > ayushsaxena@ayushsaxena hadoop-thirdparty-1.2.0-RC0 % diff
> > > > hadoop-thirdparty/dev-support/bin/create-release
> > > > hadoop-thirdparty-1.2.0-src/dev-support/bin/create-release
> > > >
> > > > 444,446c444,446
> > > >
> > > > < echo "RUN groupadd --non-unique -g ${group_id} ${user_name}"
> > > >
> > > > < echo "RUN useradd -g ${group_id} -u ${user_id} -m ${user_name}"
> > > >
> > > > < echo "RUN chown -R ${user_name} /home/${user_name}"
> > > >
> > > > ---
> > > >
> > > > > echo "RUN groupadd --non-unique -g ${group_id} ${user_name}; exit 0;"
> > > >
> > > > > echo "RUN useradd -g ${group_id} -u ${user_id} -m ${user_name}; exit 0;"
> > > >
> > > > > echo "RUN chown -R ${user_name} /home/${user_name}; exit 0;"
> > > >
> > > > ayushsaxena@ayushsaxena hadoop-thirdparty-1.2.0-RC0 % diff
> > > > hadoop-thirdparty/dev-support/docker/Dockerfile
> > > > hadoop-thirdparty-1.2.0-src/dev-support/docker/Dockerfile
> > > >
> > > > 103a104,105
> > > >
> > > > > RUN rm -f /etc/maven/settings.xml && ln -s /home/root/.m2/settings.xml /etc/maven/settings.xml
> > > >
> > > > >
> > > >
> > > > 126a129,130
> > > >
> > > > > RUN pip2 install setuptools-scm==5.0.2
> > > >
> > > > > RUN pip2 install lazy-object-proxy==1.5.0
> > > >
> > > > 159d162
> > > >
> > > > <
> > > >
> > > >
> > > >
> > > >
> > > > Other things look Ok,
> > > > * Built from source
> > > > * Verified Checksums
> > > > * Verified Signatures
> > > > * Validated files have ASF header
> > > >
> > > > Not sure if having a diff b/w the git tag & src tar is ok; it doesn't look
> > > > like a core code change though. Can anybody check & confirm?
> > > >
> > > > -Ayush
> > > >
> > > >
> > > > On Thu, 1 Feb 2024 at 13:39, Xiaoqiao He wrote:
> > > >
> > > >> Gentle ping. @Ayush Saxena  @Steve Loughran
> > > >>  @inigo...@apache.org 
> > > >> @Masatake
> > > >> Iwasaki  and some other folks.
> > > >>
> > > >> On Wed, Jan 31, 2024 at 10:17 AM slfan1989 wrote:
> > > >>
> > > >> > Thank you for the review and vote! Looking forward to other folks
> > > >> > helping with voting and verification.
> > > >> >
> > > >> > Best Regards,
> > > >> > Shilun Fan.
> > > >> >
> > > >> > On Tue, Jan 30, 2024 at 6:20 PM Xiaoqiao He <hexiaoq...@apache.org> wrote:
> > > >> >
> > > >> > > Thanks Shilun for driving it and making it happen.
> > > >> > >
> > > >> > > +1(binding).
> > > >> > >
> > > >> > > [x] Checksums and PGP signatures are valid.
> > > >> > > [x] LICENSE files exist.
> > > >> > > [x] NOTICE is included.
> > > >> > > [x] Rat check is ok. `mvn clean apache-rat:check`
> > > >> > > [x] Built from source works well: `mvn clean install`
> > > >> > > [x] Built Hadoop trunk with updated thirdparty successfully
> > > >> > > (including the updated protobuf shaded path).
> > > >> > >
> > > >> > > BTW, hadoop-thirdparty-1.2.0 will be included in release-3.4.0; hope
> > > >> > > we could finish this vote before 2024/02/06(UTC) if there are no
> > > >> > > concerns. Thanks all.

[jira] [Created] (HDFS-16099) Make bpServiceToActive to be volatile

2021-06-29 Thread Shuyan Zhang (Jira)
Shuyan Zhang created HDFS-16099:
---

 Summary: Make bpServiceToActive to be volatile
 Key: HDFS-16099
 URL: https://issues.apache.org/jira/browse/HDFS-16099
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Reporter: Shuyan Zhang
Assignee: Shuyan Zhang


BPOfferService#bpServiceToActive is not volatile, which may cause 
_commandProcessingThread_ to read an out-of-date reference to the active namenode.
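The visibility hazard can be sketched with a toy example (illustrative only, not the actual BPOfferService code): declaring a shared field `volatile` guarantees that a write by one thread becomes visible to a spinning reader.

```java
public class VolatileDemo {
    // With 'volatile', a write by one thread is promptly visible to others;
    // without it, the spinning reader below could loop forever because the
    // JIT may hoist the field read out of the loop.
    static volatile boolean active = false;

    static boolean waitForActive() throws InterruptedException {
        Thread writer = new Thread(() -> active = true);
        writer.start();
        while (!active) {          // reader observes the write thanks to volatile
            Thread.onSpinWait();
        }
        writer.join();
        return active;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("saw active = " + waitForActive());
    }
}
```

The fix in the issue is the same one-word change: mark the shared reference `volatile` so the command-processing thread cannot cache a stale value.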



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-16146) All three replicas are lost due to not adding a new DataNode in time

2021-07-29 Thread Shuyan Zhang (Jira)
Shuyan Zhang created HDFS-16146:
---

 Summary: All three replicas are lost due to not adding a new 
DataNode in time
 Key: HDFS-16146
 URL: https://issues.apache.org/jira/browse/HDFS-16146
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode, hdfs
Reporter: Shuyan Zhang
Assignee: Shuyan Zhang


We have a three-replica file, and all replicas of a block were lost while the 
default datanode replacement strategy was in use. It happened like this:
1. addBlock() applies for a new block and successfully connects three datanodes 
(dn1, dn2 and dn3) to build a pipeline;
2. Data is written;
3. dn1 hits an error and is kicked out. At this point the number of remaining 
datanodes in the pipeline is > 1, so according to the replacement strategy there 
is no need to add a new datanode;
4. After writing completes, the pipeline enters PIPELINE_CLOSE;
5. dn2 hits an error and is kicked out. But because the pipeline is already in 
the close phase, addDatanode2ExistingPipeline() decides to hand over the task of 
transferring the replica to the NameNode. At this point, there is only one 
datanode left in the pipeline;
6. dn3 hits an error, and all replicas are lost.
If we add a new datanode in step 5, we can avoid losing all replicas in this 
case. An error during PIPELINE_CLOSE and an error during DATA_STREAMING carry 
the same risk of losing replicas, so we should not skip adding a new datanode 
during PIPELINE_CLOSE.






[jira] [Created] (HDFS-16683) All method metrics related to the rpc protocol should be initialized

2022-07-25 Thread Shuyan Zhang (Jira)
Shuyan Zhang created HDFS-16683:
---

 Summary: All method metrics related to the rpc protocol should be 
initialized
 Key: HDFS-16683
 URL: https://issues.apache.org/jira/browse/HDFS-16683
 Project: Hadoop HDFS
  Issue Type: Bug
 Environment: When an RPC protocol is used, the metrics of 
protocol-related methods should be initialized; otherwise, metric information 
will be incomplete. For example, when we call 
HAServiceProtocol#monitorHealth(), only the metric for monitorHealth() is 
initialized, and the metric for transitionToStandby() is still not reported. 
This incompleteness caused a little trouble for our monitoring system.
The root cause is that the parameter passed by RpcEngine to 
MutableRatesWithAggregation#init(java.lang.Class) is always XXXProtocolPB, 
which inherits from BlockingInterface and does not implement any methods. 
We should fix this bug.
Reporter: Shuyan Zhang









[jira] [Created] (HDFS-16735) Reduce the number of HeartbeatManager loops

2022-08-21 Thread Shuyan Zhang (Jira)
Shuyan Zhang created HDFS-16735:
---

 Summary: Reduce the number of HeartbeatManager loops
 Key: HDFS-16735
 URL: https://issues.apache.org/jira/browse/HDFS-16735
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Shuyan Zhang


HeartbeatManager only processes one dead datanode (and failed storage) per 
round in heartbeatCheck(), that is to say, if there are ten failed storages, 
all datanode states need to be scanned 10 times, which is unnecessary. This 
patch makes the number of bad storages processed per scan configurable.
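The effect of batching can be sketched with a little arithmetic (illustrative names; the real patch adds a configuration key not shown here):

```java
public class HeartbeatScanDemo {
    // Returns how many full scans over all datanode states are needed to
    // handle 'failures' failed storages when each scan may process up to
    // 'maxPerScan' of them.
    static int scansNeeded(int failures, int maxPerScan) {
        return (failures + maxPerScan - 1) / maxPerScan;  // ceiling division
    }

    public static void main(String[] args) {
        System.out.println(scansNeeded(10, 1));  // old behaviour: 10 scans
        System.out.println(scansNeeded(10, 5));  // batched: 2 scans
    }
}
```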






[jira] [Created] (HDFS-16939) Fix the thread safety bug in LowRedundancyBlocks

2023-03-02 Thread Shuyan Zhang (Jira)
Shuyan Zhang created HDFS-16939:
---

 Summary: Fix the thread safety bug in LowRedundancyBlocks
 Key: HDFS-16939
 URL: https://issues.apache.org/jira/browse/HDFS-16939
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Reporter: Shuyan Zhang
Assignee: Shuyan Zhang


The remove method in LowRedundancyBlocks is not protected by synchronized. The 
method is private and is called by BlockManager. As a result, priorityQueues 
is at risk of being accessed concurrently by multiple threads.
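A minimal sketch of the fix pattern, with illustrative names (not the actual LowRedundancyBlocks code): every access path to the shared queue, including the private remove path, must hold the same monitor.

```java
import java.util.ArrayList;
import java.util.List;

public class SyncQueues {
    private final List<Long> priorityQueue = new ArrayList<>();

    public synchronized void add(long blockId) {
        priorityQueue.add(blockId);
    }

    // The bug described above: if this method were NOT synchronized while the
    // others are, a caller on another thread could corrupt the list mid-update.
    public synchronized boolean remove(long blockId) {
        return priorityQueue.remove(Long.valueOf(blockId));
    }

    public synchronized int size() {
        return priorityQueue.size();
    }
}
```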






[jira] [Created] (HDFS-16958) Fix bug in processing EC excess redundancy

2023-03-17 Thread Shuyan Zhang (Jira)
Shuyan Zhang created HDFS-16958:
---

 Summary: Fix bug in processing EC excess redundancy 
 Key: HDFS-16958
 URL: https://issues.apache.org/jira/browse/HDFS-16958
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Shuyan Zhang
Assignee: Shuyan Zhang


When processing excess redundancy, the number of internal blocks is computed by 
traversing `nonExcess`. This is inaccurate, because `nonExcess` excludes 
replicas in abnormal states, such as corrupt or maintenance replicas. 
`numOfTarget` may therefore be smaller than the actual value, which results in 
an inaccurate generated `excessTypes`.






[jira] [Created] (HDFS-16964) Improve processing of excess redundancy after failover

2023-03-24 Thread Shuyan Zhang (Jira)
Shuyan Zhang created HDFS-16964:
---

 Summary: Improve processing of excess redundancy after failover
 Key: HDFS-16964
 URL: https://issues.apache.org/jira/browse/HDFS-16964
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Shuyan Zhang


After failover, a block with excess redundancy cannot be processed until no 
replica is stale, because the stale ones may already have been deleted. That is 
to say, we must wait for the FBRs of all datanodes on which the block resides 
before deleting the redundant replicas. This is unnecessary: we can bypass 
stale replicas when dealing with excess replicas, and delete non-stale excess 
replicas in a more timely manner.






[jira] [Created] (HDFS-16974) Consider load of every volume when choosing target

2023-04-09 Thread Shuyan Zhang (Jira)
Shuyan Zhang created HDFS-16974:
---

 Summary: Consider load of every volume when choosing target
 Key: HDFS-16974
 URL: https://issues.apache.org/jira/browse/HDFS-16974
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Shuyan Zhang


The current target choosing policy only considers the load of the entire 
datanode. If both DN1 and DN2 have an `xceiverCount` of 100, but DN1 has 10 
volumes to write to while DN2 has only 1, then the pressure on DN2 is actually 
much greater than on DN1. This patch adds a configuration that allows us to 
avoid nodes with too much pressure on a single volume when choosing targets, so 
as not to overload datanodes with few volumes or slow down writes.






[jira] [Created] (HDFS-16986) EC: Fix locationBudget in getListing()

2023-04-22 Thread Shuyan Zhang (Jira)
Shuyan Zhang created HDFS-16986:
---

 Summary: EC: Fix locationBudget in getListing()
 Key: HDFS-16986
 URL: https://issues.apache.org/jira/browse/HDFS-16986
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Shuyan Zhang


The current `locationBudget` is estimated using the `block_replication` in 
`FileStatus`, which is unreasonable for EC files, because it counts the number 
of locations of an EC block as 1. We should consider the ErasureCodingPolicy 
of the files to keep the meaning of `locationBudget` consistent.
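The accounting difference can be sketched as follows (illustrative names, not the actual NameNode API): a replicated block consumes `replication` location slots from the budget, while an EC block consumes one slot per internal block, e.g. 9 for RS(6,3).

```java
public class LocationBudgetDemo {
    // How many location slots one block should charge against locationBudget.
    static int locationsPerBlock(boolean erasureCoded, short replication,
                                 int dataUnits, int parityUnits) {
        return erasureCoded ? dataUnits + parityUnits : replication;
    }

    public static void main(String[] args) {
        // A 3-replica block holds 3 locations.
        System.out.println(locationsPerBlock(false, (short) 3, 0, 0)); // 3
        // An RS(6,3) block group holds 9 locations, not its "replication" of 1.
        System.out.println(locationsPerBlock(true, (short) 1, 6, 3));  // 9
    }
}
```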






[jira] [Created] (HDFS-16999) Fix wrong use of processFirstBlockReport()

2023-05-04 Thread Shuyan Zhang (Jira)
Shuyan Zhang created HDFS-16999:
---

 Summary: Fix wrong use of processFirstBlockReport()
 Key: HDFS-16999
 URL: https://issues.apache.org/jira/browse/HDFS-16999
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Shuyan Zhang


`processFirstBlockReport()` is used to process the first block report from a 
datanode. It does not calculate the `toRemove` list, because it assumes that 
the namenode holds no metadata about the datanode yet. However, if a datanode 
is re-registered after restarting, its `blockReportCount` is reset to 0. That 
is to say, the first block report after a datanode restarts will also be 
processed by `processFirstBlockReport()`. This is unreasonable because the 
metadata of the datanode already exists in the namenode at that time, and if 
redundant replica metadata is not removed in time, blocks with insufficient 
replicas cannot be reconstructed in time, which increases the risk of missing 
blocks. In summary, `processFirstBlockReport()` should only be used when the 
namenode restarts, not when a datanode restarts.






[jira] [Created] (HDFS-17021) GENSTAMP_MISMATCH replica can not be removed by invalidateCorruptReplicas()

2023-05-18 Thread Shuyan Zhang (Jira)
Shuyan Zhang created HDFS-17021:
---

 Summary: GENSTAMP_MISMATCH replica can not be removed by 
invalidateCorruptReplicas()
 Key: HDFS-17021
 URL: https://issues.apache.org/jira/browse/HDFS-17021
 Project: Hadoop HDFS
  Issue Type: Bug
 Environment: If a replica is corrupted due to a generation stamp 
mismatch, the corresponding datanode stores a wrong generation stamp while 
`invalidateCorruptReplicas()` sends the right generation stamp to the datanode. 
Therefore, the check on the datanode cannot pass, as discussed in 
[https://github.com/apache/hadoop/pull/5643], resulting in the corrupted 
replica never being successfully deleted.
Reporter: Shuyan Zhang
Assignee: Shuyan Zhang









[jira] [Resolved] (HDFS-17021) GENSTAMP_MISMATCH replica can not be removed by invalidateCorruptReplicas()

2023-05-21 Thread Shuyan Zhang (Jira)


 [ https://issues.apache.org/jira/browse/HDFS-17021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shuyan Zhang resolved HDFS-17021.
-
Resolution: Not A Bug

> GENSTAMP_MISMATCH replica can not be removed by invalidateCorruptReplicas()
> ---
>
> Key: HDFS-17021
> URL: https://issues.apache.org/jira/browse/HDFS-17021
> Project: Hadoop HDFS
>  Issue Type: Bug
>    Reporter: Shuyan Zhang
>    Assignee: Shuyan Zhang
>Priority: Major
>
> If a replica is corrupted due to a generation stamp mismatch, the corresponding 
> datanode stores a wrong generation stamp while `invalidateCorruptReplicas()` 
> sends the right generation stamp to the datanode. Therefore, the check on the 
> datanode cannot pass, as discussed in 
> [https://github.com/apache/hadoop/pull/5643], resulting in the corrupted 
> replica never being successfully deleted.






[jira] [Created] (HDFS-17037) Consider nonDfsUsed when running balancer

2023-06-04 Thread Shuyan Zhang (Jira)
Shuyan Zhang created HDFS-17037:
---

 Summary: Consider nonDfsUsed when running balancer
 Key: HDFS-17037
 URL: https://issues.apache.org/jira/browse/HDFS-17037
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Shuyan Zhang
Assignee: Shuyan Zhang


When we run the balancer with the `BalancingPolicy.Node` policy, our goal is to 
balance storage across datanodes. But in the current implementation, the 
balancer doesn't account for non-dfs storage used on the datanodes, which can 
make the situation worse for datanodes that are already strained on storage.
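The arithmetic behind the problem can be sketched as follows (field names are illustrative): ignoring non-DFS usage understates how full a node really is, so the balancer may move blocks onto an already-strained node.

```java
public class UtilizationDemo {
    // Utilization as the balancer currently sees it: DFS usage only.
    static double dfsOnlyUtil(long dfsUsed, long capacity) {
        return 100.0 * dfsUsed / capacity;
    }

    // Utilization including non-DFS usage, i.e. how full the disks really are.
    static double withNonDfsUtil(long dfsUsed, long nonDfsUsed, long capacity) {
        return 100.0 * (dfsUsed + nonDfsUsed) / capacity;
    }

    public static void main(String[] args) {
        long capacity = 100L, dfsUsed = 30L, nonDfsUsed = 50L;
        // The balancer sees 30% and may move blocks TO this node,
        // even though its disks are actually 80% full.
        System.out.println(dfsOnlyUtil(dfsUsed, capacity));                // 30.0
        System.out.println(withNonDfsUtil(dfsUsed, nonDfsUsed, capacity)); // 80.0
    }
}
```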






[jira] [Created] (HDFS-17049) Fix duplicate block group ids generated by SequentialBlockGroupIdGenerator

2023-06-13 Thread Shuyan Zhang (Jira)
Shuyan Zhang created HDFS-17049:
---

 Summary: Fix duplicate block group ids generated by 
SequentialBlockGroupIdGenerator
 Key: HDFS-17049
 URL: https://issues.apache.org/jira/browse/HDFS-17049
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Shuyan Zhang


When I used multiple clients to write EC files concurrently, I found that 
NameNode generated the same block group ID for different files:

```
2023-06-13 20:09:59,514 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_-9223372036854697568_14389 for /ec-test/10/4068034329705654124
2023-06-13 20:09:59,514 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_-9223372036854697568_14390 for /ec-test/19/7042966144171770731
```

After diving into `SequentialBlockGroupIdGenerator`, I found that the current 
implementation of `nextValue` is not thread-safe.

This problem must be fixed.
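A thread-safe sequential generator can be sketched with `AtomicLong` (illustrative names, not the actual SequentialBlockGroupIdGenerator API): the atomic read-modify-write prevents two concurrent callers from receiving the same id.

```java
import java.util.concurrent.atomic.AtomicLong;

public class SafeIdGenerator {
    private final AtomicLong current;

    public SafeIdGenerator(long start) {
        current = new AtomicLong(start);
    }

    // An unsynchronized 'return ++value' can hand the same id to two
    // concurrent callers; incrementAndGet() performs the whole
    // read-increment-write as one atomic step.
    public long nextValue() {
        return current.incrementAndGet();
    }
}
```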






[jira] [Created] (HDFS-17089) Close child files systems in ViewFileSystem when cache is disabled.

2023-07-16 Thread Shuyan Zhang (Jira)
Shuyan Zhang created HDFS-17089:
---

 Summary: Close child files systems in ViewFileSystem when cache is 
disabled.
 Key: HDFS-17089
 URL: https://issues.apache.org/jira/browse/HDFS-17089
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Shuyan Zhang


When the cache is disabled (namely, `fs.viewfs.enable.inner.cache=false` and 
`fs.*.impl.disable.cache=true`), even if `FileSystem.close()` is called, the 
client cannot truly close the child file systems inside a ViewFileSystem. This 
caused our long-running clients to leak resources continuously.
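A toy sketch of the fix pattern (not the actual ViewFileSystem code): when caching is disabled, a wrapper file system must propagate `close()` to its child file systems, otherwise each child leaks its resources.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class ViewFsCloseDemo {
    static class ChildFs implements Closeable {
        boolean closed = false;
        @Override public void close() { closed = true; }
    }

    static class ViewFs implements Closeable {
        final List<ChildFs> children = new ArrayList<>();
        @Override public void close() throws IOException {
            // The missing step in the bug: close each mounted child file system.
            for (ChildFs c : children) {
                c.close();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        ViewFs fs = new ViewFs();
        fs.children.add(new ChildFs());
        fs.close();
        System.out.println(fs.children.get(0).closed);  // true
    }
}
```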






[jira] [Created] (HDFS-17094) EC: Fix bug in block recovery when there are stale datanodes

2023-07-18 Thread Shuyan Zhang (Jira)
Shuyan Zhang created HDFS-17094:
---

 Summary: EC: Fix bug in block recovery when there are stale 
datanodes
 Key: HDFS-17094
 URL: https://issues.apache.org/jira/browse/HDFS-17094
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Shuyan Zhang


When a block recovery occurs, `RecoveryTaskStriped` in the datanode expects 
`rBlock.getLocations()` and `rBlock.getBlockIndices()` to be in one-to-one 
correspondence. However, if some locations are in a stale state when the 
NameNode handles the heartbeat, this correspondence is disrupted. In detail, 
there are no stale locations in `recoveryLocations`, but the block indices 
array is still complete (i.e. it contains the indices of all the locations). 
This causes `BlockRecoveryWorker.RecoveryTaskStriped#recover` to generate a 
wrong internal block ID, and the corresponding datanode cannot find the 
replica, making the recovery process fail. This bug needs to be fixed.






[jira] [Created] (HDFS-17112) Show decommission duration in JMX and HTML

2023-07-20 Thread Shuyan Zhang (Jira)
Shuyan Zhang created HDFS-17112:
---

 Summary: Show decommission duration in JMX and HTML
 Key: HDFS-17112
 URL: https://issues.apache.org/jira/browse/HDFS-17112
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Shuyan Zhang


Expose the decommission duration on the JMX page. This is very useful 
information when decommissioning a batch of datanodes in a cluster.






[jira] [Created] (HDFS-17134) RBF: Fix duplicate results of getListing through Router.

2023-07-28 Thread Shuyan Zhang (Jira)
Shuyan Zhang created HDFS-17134:
---

 Summary: RBF: Fix duplicate results of getListing through Router.
 Key: HDFS-17134
 URL: https://issues.apache.org/jira/browse/HDFS-17134
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Shuyan Zhang


The results of `getListing` in the NameNode are sorted based on `byte[]`, while 
the Router side sorts based on `String`. If there are special characters in a 
path, the sorting result of the router is inconsistent with the namenode. This 
may result in the client receiving duplicate `getListing` results due to a 
wrong `startAfter` parameter.

For example, the namenode returns [path1, path2, path3], while the router 
returns [path1, path3, path2] to the client. The client will then pass 
`startAfter` as `path2` on the next iteration, so it will receive `path3` 
again.

We need to fix the Router code so that the order of its results is the same as 
the NameNode's.
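The ordering mismatch can be reproduced in isolation (this sketch is not the Router code): unsigned UTF-8 byte order follows code point order, while `String.compareTo` compares UTF-16 code units, so the two orders disagree for characters outside the BMP.

```java
import java.nio.charset.StandardCharsets;

public class SortOrderDemo {
    // Unsigned lexicographic comparison of UTF-8 bytes (NameNode-style order).
    static int compareBytes(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < Math.min(x.length, y.length); i++) {
            int d = (x[i] & 0xff) - (y[i] & 0xff);
            if (d != 0) return d;
        }
        return x.length - y.length;
    }

    public static void main(String[] args) {
        String bmp = "\uFFFD";         // U+FFFD, one UTF-16 code unit
        String supp = "\uD800\uDC00";  // U+10000, a surrogate pair
        // String order says bmp > supp, byte order says bmp < supp:
        // two sides sorting the same names can disagree.
        System.out.println(bmp.compareTo(supp) > 0);     // true
        System.out.println(compareBytes(bmp, supp) < 0); // true
    }
}
```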






[jira] [Created] (HDFS-17150) EC: Fix the bug of failed lease recovery.

2023-08-09 Thread Shuyan Zhang (Jira)
Shuyan Zhang created HDFS-17150:
---

 Summary: EC: Fix the bug of failed lease recovery.
 Key: HDFS-17150
 URL: https://issues.apache.org/jira/browse/HDFS-17150
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Shuyan Zhang


If the client crashes without writing the minimum number of internal blocks 
required by the EC policy, the lease recovery process for the corresponding 
unclosed file may fail repeatedly. Taking the RS(6,3) policy as an example, the 
timeline is as follows:
1. The client writes some data to only 5 datanodes;
2. The client crashes;
3. The NN fails over;
4. Now the result of `uc.getNumExpectedLocations()` depends entirely on block 
reports, and there are 5 datanodes reporting internal blocks;
5. When the lease hard limit expires, the NN issues a block recovery command;
6. The datanode checks the command and finds that the number of internal blocks 
is insufficient, resulting in an error and recovery failure;
7. The lease hard limit expires again, the NN issues a block recovery command 
again, and the recovery fails again...

When the number of internal blocks written by the client is less than 6, the 
block group is actually unrecoverable. We should treat this situation like the 
case where the number of replicas is 0 for replicated files, i.e., directly 
remove the last block group and close the file.

 






[jira] [Created] (HDFS-17151) EC: Fix wrong metadata in BlockInfoStriped after recovery

2023-08-10 Thread Shuyan Zhang (Jira)
Shuyan Zhang created HDFS-17151:
---

 Summary: EC: Fix wrong metadata in BlockInfoStriped after recovery
 Key: HDFS-17151
 URL: https://issues.apache.org/jira/browse/HDFS-17151
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Shuyan Zhang


When a datanode completes a block recovery, it calls the 
`commitBlockSynchronization` method to notify the NN of the new locations of 
the block. For an EC block group, the NN determines the index of each internal 
block based on the position of the DatanodeID in the parameter `newtargets`.

If the internal blocks written by the client don't have contiguous indices, the 
current datanode code can cause the NN to record incorrect block metadata. For 
simplicity, let's take RS(3,2) as an example. The timeline of the problem is as 
follows:
1. The client plans to write internal blocks with indices [0,1,2,3,4] to 
datanodes [dn0, dn1, dn2, dn3, dn4] respectively. But dn1 is unreachable, so 
the client only writes data to the remaining 4 datanodes;
2. The client crashes;
3. The NN fails over;
4. Now the content of `uc.getExpectedStorageLocations()` depends entirely on 
block reports, and now it is ;
5. When the lease hard limit expires, the NN issues a block recovery command;
6. The datanode that receives the recovery command fills `DatanodeID[] newLocs` 
with [dn0, null, dn2, dn3, dn4];
7. The serialization process filters out null values, so the parameter passed 
to the NN becomes [dn0, dn2, dn3, dn4];
8. The NN mistakenly believes that dn2 stores the internal block with index 1, 
dn3 stores the internal block with index 2, and so on.

The above timeline is just an example; other situations can lead to the same 
error, such as a pipeline update on the client side. We should fix this bug.
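The positional-mapping bug can be sketched in isolation (names are illustrative, not the actual serialization code): locations and block indices correspond by position, so dropping null locations without dropping the matching index shifts everything after the gap.

```java
import java.util.ArrayList;
import java.util.List;

public class IndexMappingDemo {
    // What the serializer effectively does: silently drop null entries.
    static List<String> dropNulls(String[] locs) {
        List<String> out = new ArrayList<>();
        for (String l : locs) {
            if (l != null) out.add(l);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] locs = {"dn0", null, "dn2", "dn3", "dn4"};
        byte[] indices = {0, 1, 2, 3, 4};
        List<String> sent = dropNulls(locs);
        // After filtering, position 1 of 'sent' is dn2, but indices[1] is
        // still 1, so the receiver pairs dn2 with internal-block index 1:
        // the wrong internal block.
        System.out.println(sent);  // [dn0, dn2, dn3, dn4]
        System.out.println(sent.get(1) + " paired with index " + indices[1]);
    }
}
```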






[jira] [Created] (HDFS-17154) EC: Fix bug in updateBlockForPipeline after failover

2023-08-11 Thread Shuyan Zhang (Jira)
Shuyan Zhang created HDFS-17154:
---

 Summary: EC: Fix bug in updateBlockForPipeline after failover
 Key: HDFS-17154
 URL: https://issues.apache.org/jira/browse/HDFS-17154
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Shuyan Zhang


In the method `updateBlockForPipeline`, the NameNode uses the 
`BlockUnderConstructionFeature` of a BlockInfo to generate the member 
`blockIndices` of `LocatedStripedBlock`. The NameNode then uses `blockIndices` 
to generate block tokens for the client.

However, if there is a failover, the location info in 
`BlockUnderConstructionFeature` may be incomplete, which results in the absence 
of the corresponding block tokens.

When the client receives these incomplete block tokens, it throws an NPE 
because `updatedBlks[i]` is null.

The NameNode should simply return block tokens for all indices to the client; 
the client can pick whichever it needs.

 






[jira] [Created] (HDFS-17190) EC: Fix bug for OIV processing XAttr.

2023-09-13 Thread Shuyan Zhang (Jira)
Shuyan Zhang created HDFS-17190:
---

 Summary: EC: Fix bug for OIV processing XAttr.
 Key: HDFS-17190
 URL: https://issues.apache.org/jira/browse/HDFS-17190
 Project: Hadoop HDFS
  Issue Type: Bug
 Environment: When we need to use OIV to print EC information for a 
directory, `PBImageTextWriter#getErasureCodingPolicyName` is called. Currently, 
this method uses `XATTR_ERASURECODING_POLICY.contains(xattr.getName())` to 
filter and obtain the EC XAttr, which is very dangerous. If we have an XAttr 
whose name happens to be a substring of `hdfs.erasurecoding.policy`, then 
`getErasureCodingPolicyName` will return the wrong result. Our internal 
production environment has customized some XAttrs, and this bug caused errors 
in the OIV parsing results when using the `-ec` option.
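The substring pitfall can be demonstrated directly (the constant mirrors `hdfs.erasurecoding.policy`; the custom XAttr name below is hypothetical): `contains()` accepts any XAttr whose name is a substring of the EC constant, while an exact `equals()` does not.

```java
public class XAttrMatchDemo {
    static final String EC_POLICY_XATTR = "hdfs.erasurecoding.policy";

    // Buggy: matches any XAttr whose name is a substring of the constant.
    static boolean buggyMatch(String xattrName) {
        return EC_POLICY_XATTR.contains(xattrName);
    }

    // Fixed: exact comparison only.
    static boolean fixedMatch(String xattrName) {
        return EC_POLICY_XATTR.equals(xattrName);
    }

    public static void main(String[] args) {
        String custom = "coding.policy";  // hypothetical custom XAttr name
        System.out.println(buggyMatch(custom));  // true: a false positive
        System.out.println(fixedMatch(custom));  // false
    }
}
```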
Reporter: Shuyan Zhang









[jira] [Created] (HDFS-17197) Show file replication when listing corrupt files.

2023-09-17 Thread Shuyan Zhang (Jira)
Shuyan Zhang created HDFS-17197:
---

 Summary: Show file replication when listing corrupt files.
 Key: HDFS-17197
 URL: https://issues.apache.org/jira/browse/HDFS-17197
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Shuyan Zhang


Files with different replication factors have different reliability guarantees. 
We need to pay attention to corrupted files with a specified replication 
greater than or equal to 3. So, when listing corrupt files, it would be useful 
to display the corresponding replication of each file.






[jira] [Created] (HDFS-17204) EC: Reduce unnecessary log when processing excess redundancy.

2023-09-21 Thread Shuyan Zhang (Jira)
Shuyan Zhang created HDFS-17204:
---

 Summary: EC: Reduce unnecessary log when processing excess 
redundancy.
 Key: HDFS-17204
 URL: https://issues.apache.org/jira/browse/HDFS-17204
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Shuyan Zhang


This is a follow-up of 
[HDFS-16964|https://issues.apache.org/jira/browse/HDFS-16964]. We now bypass 
stale replicas when dealing with redundancy. This may result in redundant 
replicas not being in the `nonExcess` set when we enter 
`BlockManager#chooseExcessRedundancyStriped` (because the datanode where the 
redundant replicas are located has not sent an FBR yet, those replicas are 
filtered out and not added to the `nonExcess` set). A further result is that no 
excess storage type is selected and the log "excess types chosen for block..." 
is printed. When a failover occurs, a large number of datanodes become stale, 
which causes the NameNode to print a large number of unnecessary logs.
This issue needs to be fixed, otherwise performance after a failover will be 
affected.






[jira] [Created] (HDFS-15963) Unreleased volume references cause an infinite loop

2021-04-10 Thread Shuyan Zhang (Jira)
Shuyan Zhang created HDFS-15963:
---

 Summary: Unreleased volume references cause an infinite loop
 Key: HDFS-15963
 URL: https://issues.apache.org/jira/browse/HDFS-15963
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Reporter: Shuyan Zhang


When BlockSender throws an exception because the metadata cannot be found, the 
volume reference obtained by the thread is not released. This causes the thread 
that is trying to remove the volume to wait forever, falling into an infinite loop.

{code:java}
boolean checkVolumesRemoved() {
  Iterator<FsVolumeImpl> it = volumesBeingRemoved.iterator();
  while (it.hasNext()) {
    FsVolumeImpl volume = it.next();
    if (!volume.checkClosed()) {
      return false;
    }
    it.remove();
  }
  return true;
}

boolean checkClosed() {
  // With a leaked reference, this condition is always true.
  if (this.reference.getReferenceCount() > 0) {
    FsDatasetImpl.LOG.debug("The reference count for {} is {}, wait to be 0.",
        this, reference.getReferenceCount());
    return false;
  }
  return true;
}
{code}
At the same time, because the removing thread holds checkDirsLock while 
removing the volume, other threads trying to acquire the same lock are 
permanently blocked.
Similar problems also occur in RamDiskAsyncLazyPersistService and 
FsDatasetAsyncDiskService.
This patch releases the three previously unreleased volume references.
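The leak pattern and its fix can be sketched with a minimal, self-contained example (all names are hypothetical stand-ins, not the actual Hadoop classes): the sender must release its volume reference on the failure path, otherwise the removal thread's checkClosed() never observes a zero count.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class VolumeRefSketch {
  // Stand-in for FsVolumeImpl's reference counter.
  static final AtomicInteger refCount = new AtomicInteger();

  static void acquire() { refCount.incrementAndGet(); }
  static void release() { refCount.decrementAndGet(); }

  // Buggy shape: the reference leaks if the metadata lookup throws.
  static void sendBlockLeaky() {
    acquire();
    throw new RuntimeException("meta-data not found"); // reference never released
  }

  // Fixed shape: release the reference on the failure path too.
  static void sendBlockFixed() {
    acquire();
    try {
      throw new RuntimeException("meta-data not found");
    } finally {
      release(); // the volume can now be removed; checkClosed() sees count == 0
    }
  }

  public static void main(String[] args) {
    try { sendBlockLeaky(); } catch (RuntimeException ignored) { }
    System.out.println("after leaky send, refCount = " + refCount.get()); // 1
    release(); // manually undo the leak for the demo
    try { sendBlockFixed(); } catch (RuntimeException ignored) { }
    System.out.println("after fixed send, refCount = " + refCount.get()); // 0
  }
}
```

With the leaky shape, checkVolumesRemoved() above would spin forever; with the fixed shape it terminates.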







[jira] [Created] (HDFS-17227) EC: Fix bug in choosing targets when racks is not enough.

2023-10-16 Thread Shuyan Zhang (Jira)
Shuyan Zhang created HDFS-17227:
---

 Summary: EC: Fix bug in choosing targets when racks is not enough.
 Key: HDFS-17227
 URL: https://issues.apache.org/jira/browse/HDFS-17227
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Shuyan Zhang


*Bug description*
If,
1. There is a striped block blockinfo1, which has an excess replica on 
datanodeA.
2. blockinfo1 has an internal block that needs reconstruction.
3. The number of racks is less than the number of internal blocks of blockinfo1.
Then the NN may choose datanodeA as the target for reconstructing the internal 
block, resulting in two internal blocks of blockinfo1 on datanodeA, which 
causes confusion. 

*Root cause and solution*
When we use `BlockPlacementPolicyRackFaultTolerant` to choose targets and the 
number of racks is insufficient, `chooseEvenlyFromRemainingRacks` is called. 
Currently, `chooseEvenlyFromRemainingRacks` calls `chooseOnce` with 
`newExcludeNodes` as the parameter instead of `excludedNodes`. When we choose 
targets for reconstructing internal blocks, `newExcludeNodes` only includes the 
datanodes that contain live replicas; it does not include datanodes that hold 
excess replicas. As a result, a datanode with an excess replica may be chosen.
I don't think we need `newExcludeNodes` at all; we can simply pass 
`excludedNodes` to `chooseOnce`.
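The root cause can be illustrated with a toy chooser (all names hypothetical; this is not the Hadoop placement code): a freshly built exclude set that omits excess-replica holders lets the chooser pick a node the full excluded set would have rejected.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ExcludeNodesSketch {
  // Hypothetical stand-in for target choosing: pick the first candidate
  // that is not in the excluded set.
  static String chooseOnce(List<String> candidates, Set<String> excluded) {
    for (String dn : candidates) {
      if (!excluded.contains(dn)) {
        return dn;
      }
    }
    return null;
  }

  public static void main(String[] args) {
    List<String> candidates = Arrays.asList("datanodeA", "datanodeB");
    // Full excluded set: live-replica holders AND excess-replica holders.
    Set<String> excludedNodes = new HashSet<>(Arrays.asList("datanodeA"));
    // Buggy shape: a fresh set built only from live replicas,
    // so datanodeA (which holds an excess replica) is missing.
    Set<String> newExcludeNodes = new HashSet<>();

    System.out.println(chooseOnce(candidates, newExcludeNodes)); // datanodeA: wrong
    System.out.println(chooseOnce(candidates, excludedNodes));   // datanodeB: correct
  }
}
```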






[jira] [Resolved] (HDFS-17243) Add the parameter storage type for getBlocks method

2023-11-05 Thread Shuyan Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuyan Zhang resolved HDFS-17243.
-
Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Add the parameter storage type for getBlocks method
> ---
>
> Key: HDFS-17243
> URL: https://issues.apache.org/jira/browse/HDFS-17243
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> When Balancer is running, it is found that there are many logs, such as 
> {code:java}
> INFO  balancer.Dispatcher (Dispatcher.java:markMovedIfGoodBlock(306)) - No 
> striped internal block on source xxx:50010:SSD, block blk_-xxx_xxx 
> size=982142783. Skipping.
> {code}
> These logs show that the Balancer cannot balance an SSD-typed source, which 
> causes the Balancer to frequently fetch blocks from the NN via the getBlocks RPC.
> The main reason is that the storage type of the current Source is SSD, but 
> getBlocks currently returns the list of all blocks belonging to the datanode, 
> so we need to add a storage type parameter to the getBlocks method.






[jira] [Resolved] (HDFS-17152) Fix the documentation of count command in FileSystemShell.md

2023-12-11 Thread Shuyan Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuyan Zhang resolved HDFS-17152.
-
Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Fix the documentation of count command in FileSystemShell.md
> 
>
> Key: HDFS-17152
> URL: https://issues.apache.org/jira/browse/HDFS-17152
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.4.0
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> count -q means show quotas and usage.
> count -u means show quotas.
> We should fix this minor documentation error.






[jira] [Resolved] (HDFS-17275) Judge whether the block has been deleted in the block report

2023-12-26 Thread Shuyan Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuyan Zhang resolved HDFS-17275.
-
Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
 Assignee: lei w
   Resolution: Fixed

> Judge whether the block has been deleted in the block report
> 
>
> Key: HDFS-17275
> URL: https://issues.apache.org/jira/browse/HDFS-17275
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: lei w
>Assignee: lei w
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> Now we use the asynchronous thread MarkedDeleteBlockScrubber to delete blocks. 
> During block report processing, we may do some useless block-related 
> calculations for blocks that haven't yet been added to invalidateBlocks.






[jira] [Resolved] (HDFS-17283) Change the name of variable SECOND in HdfsClientConfigKeys

2024-01-04 Thread Shuyan Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuyan Zhang resolved HDFS-17283.
-
   Fix Version/s: 3.4.0
Hadoop Flags: Reviewed
Target Version/s: 3.4.0  (was: 3.5.0)
  Resolution: Fixed

> Change the name of variable SECOND in HdfsClientConfigKeys
> --
>
> Key: HDFS-17283
> URL: https://issues.apache.org/jira/browse/HDFS-17283
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Affects Versions: 3.3.6
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>







[jira] [Resolved] (HDFS-17289) Considering the size of non-lastBlocks equals to complete block size can cause append failure.

2024-01-13 Thread Shuyan Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuyan Zhang resolved HDFS-17289.
-
   Fix Version/s: 3.4.0
Hadoop Flags: Reviewed
Target Version/s: 3.4.0  (was: 3.5.0)
  Resolution: Fixed

> Considering the size of non-lastBlocks equals to complete block size can 
> cause append failure.
> --
>
> Key: HDFS-17289
> URL: https://issues.apache.org/jira/browse/HDFS-17289
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.3.6
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>







[jira] [Resolved] (HDFS-17291) DataNode metric bytesWritten is not totally accurate in some situations.

2024-01-13 Thread Shuyan Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuyan Zhang resolved HDFS-17291.
-
   Fix Version/s: 3.4.0
Hadoop Flags: Reviewed
Target Version/s: 3.4.0  (was: 3.5.0)
  Resolution: Fixed

> DataNode metric bytesWritten is not totally accurate in some situations.
> 
>
> Key: HDFS-17291
> URL: https://issues.apache.org/jira/browse/HDFS-17291
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.6
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> As the title describes, the DataNode metric bytesWritten is not totally 
> accurate in some situations, such as failure recovery and re-sending data. We 
> should fix it.






[jira] [Resolved] (HDFS-17337) RPC RESPONSE time seems not exactly accurate when using FSEditLogAsync.

2024-01-17 Thread Shuyan Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuyan Zhang resolved HDFS-17337.
-
   Fix Version/s: 3.4.0
Hadoop Flags: Reviewed
Target Version/s: 3.4.0  (was: 3.5.0)
  Resolution: Fixed

> RPC RESPONSE time seems not exactly accurate when using FSEditLogAsync.
> ---
>
> Key: HDFS-17337
> URL: https://issues.apache.org/jira/browse/HDFS-17337
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.6
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> Currently, FSEditLogAsync is enabled by default. 
> We have below codes in method Server$RpcCall#run:
>  
> {code:java}
>       if (!isResponseDeferred()) {
>         long deltaNanos = Time.monotonicNowNanos() - startNanos;
>         ProcessingDetails details = getProcessingDetails();
>         details.set(Timing.PROCESSING, deltaNanos, TimeUnit.NANOSECONDS);
>         deltaNanos -= details.get(Timing.LOCKWAIT, TimeUnit.NANOSECONDS);
>         deltaNanos -= details.get(Timing.LOCKSHARED, TimeUnit.NANOSECONDS);
>         deltaNanos -= details.get(Timing.LOCKEXCLUSIVE, TimeUnit.NANOSECONDS);
>         details.set(Timing.LOCKFREE, deltaNanos, TimeUnit.NANOSECONDS);
>         startNanos = Time.monotonicNowNanos();
>         setResponseFields(value, responseParams);
>         sendResponse();
>         deltaNanos = Time.monotonicNowNanos() - startNanos;
>         details.set(Timing.RESPONSE, deltaNanos, TimeUnit.NANOSECONDS);
>       } else {
>         if (LOG.isDebugEnabled()) {
>           LOG.debug("Deferring response for callId: " + this.callId);
>         }
>       }{code}
> It computes the Timing.RESPONSE of an RpcCall as *Time.monotonicNowNanos() - 
> startNanos*.
> However, when async edit logging is used, the response is not sent here but in 
> FSEditLogAsync.RpcEdit#logSyncNotify.
> This causes the Timing.RESPONSE of an RpcCall to be inaccurate.
> {code:java}
>     @Override
>     public void logSyncNotify(RuntimeException syncEx) {
>       try {
>         if (syncEx == null) {
>           call.sendResponse();
>         } else {
>           call.abortResponse(syncEx);
>         }
>       } catch (Exception e) {} // don't care if not sent.
>     } {code}
>  
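The timing problem above can be sketched with plain timers (hypothetical names; this is not the actual fix, which would need to record Timing.RESPONSE inside logSyncNotify): the response delta must be computed where the response is actually sent, not where the call was deferred, otherwise the measured value includes the edit-log sync wait.

```java
public class ResponseTimingSketch {
  // Recorded "RESPONSE" time for the last call, in nanoseconds.
  static long responseNanos;

  // Compute the send delta at the point where the response is actually sent.
  static void sendResponseAndRecord(long sendStartNanos, Runnable send) {
    send.run(); // stand-in for call.sendResponse()
    responseNanos = System.nanoTime() - sendStartNanos;
  }

  public static void main(String[] args) throws InterruptedException {
    long deferredAt = System.nanoTime();
    Thread.sleep(5); // stand-in for waiting on the async edit-log sync
    // Correct: measure only the send itself, inside the sync-notify callback.
    sendResponseAndRecord(System.nanoTime(), () -> { /* send bytes */ });
    // Naive: measuring from the deferral point also counts the sync wait.
    long naive = System.nanoTime() - deferredAt;
    System.out.println("recordedNs=" + responseNanos + " naiveNs=" + naive);
  }
}
```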






[jira] [Resolved] (HDFS-17331) Fix Blocks are always -1 and DataNode`s version are always UNKNOWN in federationhealth.html

2024-01-18 Thread Shuyan Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuyan Zhang resolved HDFS-17331.
-
   Fix Version/s: 3.5.0
Hadoop Flags: Reviewed
Target Version/s: 3.5.0
Assignee: lei w
  Resolution: Fixed

> Fix Blocks are always -1 and DataNode`s version are always UNKNOWN in 
> federationhealth.html
> ---
>
> Key: HDFS-17331
> URL: https://issues.apache.org/jira/browse/HDFS-17331
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: lei w
>Assignee: lei w
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
> Attachments: After fix.png, Before fix.png
>
>
> Blocks are always -1 and DataNode`s version are always UNKNOWN in 
> federationhealth.html






[jira] [Resolved] (HDFS-17293) First packet data + checksum size will be set to 516 bytes when writing to a new block.

2024-01-21 Thread Shuyan Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuyan Zhang resolved HDFS-17293.
-
Fix Version/s: 3.5.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> First packet data + checksum size will be set to 516 bytes when writing to a 
> new block.
> ---
>
> Key: HDFS-17293
> URL: https://issues.apache.org/jira/browse/HDFS-17293
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.3.6
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> The first packet's data + checksum size will be set to 516 bytes when writing 
> to a new block.
> In the method computePacketChunkSize, the parameters psize and csize would be 
> (0, 512) when writing to a new block. It would be better to use writePacketSize.






[jira] [Resolved] (HDFS-17346) Fix DirectoryScanner check mark the normal blocks as corrupt.

2024-01-24 Thread Shuyan Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuyan Zhang resolved HDFS-17346.
-
   Fix Version/s: 3.5.0
Hadoop Flags: Reviewed
Target Version/s: 3.5.0
  Resolution: Fixed

> Fix DirectoryScanner check mark the normal blocks as corrupt.
> -
>
> Key: HDFS-17346
> URL: https://issues.apache.org/jira/browse/HDFS-17346
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> DirectoryScanner may mark normal blocks as corrupt and report them to the 
> NameNode; blocks then show up as corrupt even though they are actually healthy.
> This can happen if appending and DirectoryScanner run at the same time, and 
> the probability is very high.
> *Root cause:*
> * Create a file such as blk_xxx_1001, whose diskFile is 
> "file:/XXX/current/finalized/blk_xxx" and diskMetaFile is 
> "file:/XXX/current/finalized/blk_xxx_1001.meta"
> * Run DirectoryScanner; it first creates a BlockPoolReport.ScanInfo recording 
> blockFile "file:/XXX/current/finalized/blk_xxx" and metaFile 
> "file:/XXX/current/finalized/blk_xxx_1001.meta"
> * Simultaneously, another thread completes an append for blk_xxx; now the 
> diskFile is "file:/XXX/current/finalized/blk_xxx", the diskMetaFile is 
> "file:/XXX/current/finalized/blk_xxx_1002.meta", the memDataFile is 
> "file:/XXX/current/finalized/blk_xxx", and the memMetaFile is 
> "file:/XXX/current/finalized/blk_xxx_1002.meta"
> * DirectoryScanner continues; because the generation stamp of the metadata 
> file in memory differs from that of the metadata file in scanInfo, the 
> scanInfo object is added to the list of differences
> * FsDatasetImpl#checkAndUpdate then traverses the list of differences; because 
> the recorded diskMetaFile "/XXX/current/finalized/blk_xxx_1001.meta" no longer 
> exists, isRegular is false
> {code:java}
> final boolean isRegular = FileUtil.isRegularFile(diskMetaFile, false) && 
> FileUtil.isRegularFile(diskFile, false);
> {code}
> * So the normal block is marked as corrupt and reported to the NameNode
> {code:java}
> } else if (!isRegular) {
>   corruptBlock = new Block(memBlockInfo);
>   LOG.warn("Block:{} is not a regular file.", corruptBlock.getBlockId());
> }
> {code}






[jira] [Resolved] (HDFS-17339) BPServiceActor should skip cacheReport when one blockPool does not have CacheBlock on this DataNode

2024-01-25 Thread Shuyan Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuyan Zhang resolved HDFS-17339.
-
   Fix Version/s: 3.5.0
Hadoop Flags: Reviewed
Target Version/s: 3.5.0
  Resolution: Fixed

> BPServiceActor should skip cacheReport when one blockPool does not have 
> CacheBlock on this DataNode
> ---
>
> Key: HDFS-17339
> URL: https://issues.apache.org/jira/browse/HDFS-17339
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: lei w
>Assignee: lei w
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> Now, a DataNode will send a cacheReport to every NameNode when its 
> CacheCapacitySize is not zero. But sometimes not all NameNodes have cached 
> blocks on this DataNode. So BPServiceActor should skip the cacheReport when a 
> blockPool has no cached blocks on this DataNode; this reduces unnecessary lock 
> contention on the NameNode.






[jira] [Resolved] (HDFS-17342) Fix DataNode may invalidates normal block causing missing block

2024-02-06 Thread Shuyan Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuyan Zhang resolved HDFS-17342.
-
Hadoop Flags: Reviewed
Target Version/s: 3.5.0
  Resolution: Fixed

> Fix DataNode may invalidates normal block causing missing block
> ---
>
> Key: HDFS-17342
> URL: https://issues.apache.org/jira/browse/HDFS-17342
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> When users read an appended file, occasional exceptions may occur, such as 
> org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: xxx.
> This can happen if one thread is reading the block while the writer thread is 
> finalizing it simultaneously.
> *Root cause:*
> # The reader thread obtains an RBW replica from the VolumeMap, such as 
> blk_xxx_xxx[RBW]; the data file should be in /XXX/rbw/blk_xxx.
> # Simultaneously, the writer thread finalizes this block, moving it from the 
> RBW directory to the finalized directory: the data file is moved from 
> /XXX/rbw/blk_xxx to /XXX/finalized/blk_xxx.
> # The reader thread attempts to open the data input stream but encounters a 
> FileNotFoundException, because the data file /XXX/rbw/blk_xxx or the meta file 
> /XXX/rbw/blk_xxx_xxx doesn't exist at this moment.
> # The reader thread treats this block as corrupt and removes the replica from 
> the volume map, and the DataNode reports the deleted block to the NameNode.
> # The NameNode removes this replica for the block.
> # If the file's replication is 1, the file becomes a missing block until this 
> DataNode runs the DirectoryScanner again.
> As described above, the FileNotFoundException the reader thread encounters is 
> expected, because the file has merely been moved.
> So we need to add a double check to the invalidateMissingBlock logic that 
> verifies whether the data file or meta file exists, to avoid similar cases.
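The double check can be sketched as follows (hypothetical names and paths; not the actual FsDatasetImpl code): before invalidating, re-check whether the block file still exists at any expected location, since finalization may have moved it rather than deleted it.

```java
import java.io.File;

public class InvalidateCheckSketch {
  // Only treat the replica as missing if the block file exists neither in
  // the RBW directory nor in the finalized directory. A FileNotFoundException
  // on the RBW path alone is not enough: the writer thread may have just
  // moved the file from rbw/ to finalized/.
  static boolean shouldInvalidate(File rbwBlock, File finalizedBlock) {
    return !rbwBlock.exists() && !finalizedBlock.exists();
  }

  public static void main(String[] args) {
    File rbw = new File("/tmp/rbw/blk_1001");            // assume it was moved away
    File finalizedMissing = new File("/tmp/finalized/blk_absent");
    // Both paths absent: the block is genuinely missing.
    System.out.println(shouldInvalidate(rbw, finalizedMissing)); // true
    // If the finalized copy exists, the block was only moved: keep it.
    File finalizedPresent = new File(".");               // any existing path
    System.out.println(shouldInvalidate(rbw, finalizedPresent)); // false
  }
}
```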






[jira] [Resolved] (HDFS-17345) Add a metrics to record block report generating cost time

2024-03-06 Thread Shuyan Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuyan Zhang resolved HDFS-17345.
-
Fix Version/s: 3.5.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Add a metrics to record block report generating cost time
> -
>
> Key: HDFS-17345
> URL: https://issues.apache.org/jira/browse/HDFS-17345
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.5.0
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> Currently, we record the block report send time via the blockReports metric.
> We should add another metric to record the block report creation time:
> {code:java}
> long brCreateCost = brSendStartTime - brCreateStartTime; {code}
> It is useful for measuring the performance of creating block reports.
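A minimal sketch of the proposed measurement (hypothetical names mirroring the snippet above, not the actual DataNode code): timestamp the start of report creation and the start of the send, and derive the creation cost from the difference.

```java
public class BlockReportTimingSketch {
  // Creation cost is the gap between starting to build the report and
  // starting to send it, matching `brSendStartTime - brCreateStartTime`.
  static long createCost(long brCreateStartTime, long brSendStartTime) {
    return brSendStartTime - brCreateStartTime;
  }

  public static void main(String[] args) {
    long brCreateStartTime = System.nanoTime();
    StringBuilder report = new StringBuilder(); // stand-in for building the report
    for (int i = 0; i < 1000; i++) {
      report.append("blk_").append(i).append('\n');
    }
    long brSendStartTime = System.nanoTime();
    // ... the report would be sent to the NameNode here ...
    System.out.println("brCreateCostNs=" + createCost(brCreateStartTime, brSendStartTime));
  }
}
```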






[jira] [Resolved] (HDFS-17408) Reduce the number of quota calculations in FSDirRenameOp

2024-04-01 Thread Shuyan Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuyan Zhang resolved HDFS-17408.
-
   Fix Version/s: 3.5.0
Hadoop Flags: Reviewed
Target Version/s: 3.5.0
  Resolution: Fixed

> Reduce the number of quota calculations in FSDirRenameOp
> 
>
> Key: HDFS-17408
> URL: https://issues.apache.org/jira/browse/HDFS-17408
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: lei w
>Assignee: lei w
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> During the execution of the rename operation, we first calculate the quota 
> for the source INode using verifyQuotaForRename, and at the same time, we 
> calculate the quota for the target INode. Subsequently, in 
> RenameOperation#removeSrc, RenameOperation#removeSrc4OldRename, and 
> RenameOperation#addSourceToDestination, the quota for the source directory is 
> calculated again. In exceptional cases, RenameOperation#restoreDst and 
> RenameOperation#restoreSource will also perform quota calculations for the 
> source and target directories. In fact, many of the quota calculations are 
> redundant and unnecessary, so we should optimize them away.






[jira] [Resolved] (HDFS-17383) Datanode current block token should come from active NameNode in HA mode

2024-04-15 Thread Shuyan Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuyan Zhang resolved HDFS-17383.
-
   Fix Version/s: 3.5.0
Hadoop Flags: Reviewed
Target Version/s: 3.5.0
Assignee: lei w
  Resolution: Fixed

> Datanode current block token should come from active NameNode in HA mode
> 
>
> Key: HDFS-17383
> URL: https://issues.apache.org/jira/browse/HDFS-17383
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: lei w
>Assignee: lei w
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
> Attachments: reproduce.diff
>
>
> We found that transfer block failed during a NameNode upgrade; the specific 
> error reported was that block token verification failed. The reason is that 
> during the datanode transfer-block process, the source datanode uses a block 
> token it generates itself, with a keyid that comes from the ANN or SBN. 
> However, because the newly upgraded NN has only just started, a keyid held by 
> the source datanode may not yet be held by the target datanode, so the write 
> fails. The attachment shows how to reproduce this situation.






[jira] [Resolved] (HDFS-17365) EC: Add extra redundancy configuration in checkStreamerFailures to prevent data loss.

2025-08-22 Thread Shuyan Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuyan Zhang resolved HDFS-17365.
-
Fix Version/s: 3.5.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> EC: Add extra redundancy configuration in checkStreamerFailures to prevent 
> data loss.
> 
>
> Key: HDFS-17365
> URL: https://issues.apache.org/jira/browse/HDFS-17365
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: ec
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>



