Re: [PR] HDFS-17397. Choose another DN as soon as possible, when encountering network issues [hadoop]

via GitHub Thu, 29 Feb 2024 17:18:55 -0800


xleoken commented on code in PR #6591:
URL: https://github.com/apache/hadoop/pull/6591#discussion_r1508372843



##########
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java:
##########
@@ -1182,10 +1182,12 @@ public void run() {
             if (begin != null) {
               long duration = Time.monotonicNowNanos() - begin;
               if (TimeUnit.NANOSECONDS.toMillis(duration) > dfsclientSlowLogThresholdMs) {
-                LOG.info("Slow ReadProcessor read fields for block " + block
+                final String msg = "Slow ReadProcessor read fields for block " + block
                     + " took " + TimeUnit.NANOSECONDS.toMillis(duration) + "ms (threshold="
                     + dfsclientSlowLogThresholdMs + "ms); ack: " + ack
-                    + ", targets: " + Arrays.asList(targets));
+                    + ", targets: " + Arrays.asList(targets);
+                LOG.warn(msg);
+                throw new IOException(msg);

Review Comment:
   @ZanderXu 
   
   > How to identify this case
   
   When the client takes longer than `dfsclientSlowLogThresholdMs` to read an ack.
   
   > Which datanode should be marked as a bad or slow DN
   
   The datanodes that are in a poor network environment.
   
   > Maybe Datastreamer can identify this case and recovery it through PipelineRecovery
   
   The core issue is that when the response time between the client and the DN exceeds `dfsclientSlowLogThresholdMs`, the client only prints a log and takes no action. We should print the log and also throw an `IOException`.
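
As a rough illustration of the behavior this change aims for, here is a stand-alone sketch. It is not the actual `DataStreamer` code: the class `SlowAckCheck`, the method `checkAckDuration`, and the block name are hypothetical, and logging is reduced to `System.err`. It assumes the caller treats the `IOException` as a failed pipeline and moves on to another DN.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-alone sketch; not the real DataStreamer code. */
public class SlowAckCheck {

  /**
   * Warns and throws when reading an ack took longer than thresholdMs.
   * beginNanos and endNanos are monotonic timestamps in nanoseconds.
   */
  public static void checkAckDuration(long beginNanos, long endNanos,
      long thresholdMs, String block) throws IOException {
    long durationMs = TimeUnit.NANOSECONDS.toMillis(endNanos - beginNanos);
    if (durationMs > thresholdMs) {
      String msg = "Slow ReadProcessor read fields for block " + block
          + " took " + durationMs + "ms (threshold=" + thresholdMs + "ms)";
      // Stand-in for LOG.warn(msg) in the patch.
      System.err.println("WARN " + msg);
      // Surface the slow ack as an error instead of only logging it.
      throw new IOException(msg);
    }
  }

  public static void main(String[] args) {
    long begin = System.nanoTime();
    // Simulate an ack that arrived 50 ms after the read started.
    long end = begin + TimeUnit.MILLISECONDS.toNanos(50);
    try {
      checkAckDuration(begin, end, 30, "blk_example");
      System.out.println("ack within threshold");
    } catch (IOException e) {
      // Assumed caller behavior: treat this as a pipeline failure.
      System.out.println("pipeline failure: " + e.getMessage());
    }
  }
}
```

The comparison is strict, as in the patch, so an ack that takes exactly the threshold does not throw.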
   
   > but I don't think your modification is a good solution.
   
   Maybe you're right, but this may be the simplest modification. After applying this patch, we solved the slow DN problem in our production environment.



