Dimas Shidqi Parikesit created HDFS-17768:
---------------------------------------------

             Summary: Observer namenode network delay causing empty block location for getBatchedListing
                 Key: HDFS-17768
                 URL: https://issues.apache.org/jira/browse/HDFS-17768
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
    Affects Versions: 3.4.1
            Reporter: Dimas Shidqi Parikesit


In our testing with the latest HDFS version (e8a64d0), we found a case similar to HDFS-16732 occurring in getBatchedListing. If the observer namenode's block report is delayed during a getBatchedListing call, one or more of the listing results will contain blocks without locations.

Steps to reproduce this bug:
# Start a cluster with 1 observer namenode
# Create an empty file
# Inject network delay between the observer namenode and the active namenode to delay the block report (or add a sleep to the BlockReportProcessingThread of the observer)
# Append to the file to add a block
# Send a batchedListPaths request using the client API
# Check that the result contains a block without locations

HDFS-16732 and HDFS-13924 added a check to getBlockLocations, getFileInfo, and getListing that verifies whether the returned blocks have valid locations. Missing locations indicate that the observer namenode is not up to date with the active namenode.

We propose adding the same check to getBatchedListing. If any sub-listing returns blocks without locations, the call throws ObserverRetryOnActiveException and exits early. The entire batched listing request is then retried on the active namenode.

Your insights are very much appreciated. We will continue to follow up on this issue until it is resolved.
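
For illustration, the proposed check can be sketched roughly as follows. This is a minimal, self-contained sketch and not the actual patch: Block, FileEntry, SubListing, RetryOnActiveException, and checkBatchedListing are hypothetical stand-ins for the corresponding HDFS types and namenode logic, and the real change would presumably reuse the existing observer check on the located blocks of each entry.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative sketch only: every type below is a hypothetical stand-in,
// not the real Hadoop class of a similar name.
class BatchedListingCheckSketch {

    /** Stand-in for ObserverRetryOnActiveException. */
    static class RetryOnActiveException extends RuntimeException {
        RetryOnActiveException(String message) {
            super(message);
        }
    }

    /** A block together with the datanodes currently known to hold it. */
    static final class Block {
        final long id;
        final List<String> locations;

        Block(long id, List<String> locations) {
            this.id = id;
            this.locations = locations;
        }
    }

    /** One file in a listing, with its blocks. */
    static final class FileEntry {
        final String path;
        final List<Block> blocks;

        FileEntry(String path, List<Block> blocks) {
            this.path = path;
            this.blocks = blocks;
        }
    }

    /** The result for one path of a batched listing request. */
    static final class SubListing {
        final List<FileEntry> entries;

        SubListing(List<FileEntry> entries) {
            this.entries = entries;
        }
    }

    /**
     * On an observer, fail the whole batch as soon as any sub-listing
     * contains a block without locations, so the caller can retry the
     * request on the active namenode.
     */
    static void checkBatchedListing(boolean isObserver, List<SubListing> batch) {
        if (!isObserver) {
            return; // only the observer can lag behind on block reports
        }
        for (SubListing sub : batch) {
            for (FileEntry file : sub.entries) {
                for (Block block : file.blocks) {
                    if (block.locations == null || block.locations.isEmpty()) {
                        throw new RetryOnActiveException("Block " + block.id
                            + " of " + file.path + " has no locations on the observer");
                    }
                }
            }
        }
    }

    public static void main(String[] args) {
        Block located = new Block(1L, Arrays.asList("dn1", "dn2"));
        Block stale = new Block(2L, Collections.<String>emptyList());
        SubListing fresh = new SubListing(
            Arrays.asList(new FileEntry("/a/f1", Arrays.asList(located))));
        SubListing lagging = new SubListing(
            Arrays.asList(new FileEntry("/b/f2", Arrays.asList(located, stale))));

        // Every block has locations, so the check passes silently.
        checkBatchedListing(true, Arrays.asList(fresh));

        try {
            // One sub-listing has a block without locations: the whole batch fails.
            checkBatchedListing(true, Arrays.asList(fresh, lagging));
            System.out.println("served by observer");
        } catch (RetryOnActiveException e) {
            System.out.println("retry on active: " + e.getMessage());
        }
    }
}
```

Failing the whole batch, rather than only the affected sub-listing, mirrors the proposal above: the client retries the complete request on the active namenode and never has to merge results from two namenodes.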