pan3793 commented on code in PR #50765:
URL: https://github.com/apache/spark/pull/50765#discussion_r2357607758


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala:
##########
@@ -47,26 +47,25 @@ import org.apache.spark.util.NextIterator
  * that need to be prepended to each row.
  *
  * @param partitionValues value of partition columns to be prepended to each 
row.
- * @param filePath URI of the file to read
  * @param start the beginning offset (in bytes) of the block.
  * @param length number of bytes to read.
- * @param modificationTime The modification time of the input file, in 
milliseconds.
- * @param fileSize The length of the input file (not the block), in bytes.
+ * @param fileStatus The FileStatus instance of the file to read.
  * @param otherConstantMetadataColumnValues The values of any additional 
constant metadata columns.
  */
 case class PartitionedFile(
     partitionValues: InternalRow,
-    filePath: SparkPath,
     start: Long,
     length: Long,
+    fileStatus: FileStatus,

Review Comment:
   @sunchao thanks for your suggestion. After an offline discussion with 
@cloud-fan, I understand his concerns about the overhead of `FileStatus`. Let 
me summarize the conclusion and my thoughts:
   
   1. There are many Hadoop FileSystem implementations; `getFileStatus` might 
be cheap, or backed by an executor-side cache, in some of them, but in our 
case (HDFS with RBF) it is relatively heavy.
   2. There is an upcoming optimization to replace `FileStatusCache` with 
`PathCache` (which carries only the necessary metadata) on the driver side to 
reduce driver memory usage.
   3. @cloud-fan suggests constructing the `FileStatus` directly on the 
executor side.
   
   So, I'm going to split this PR into two parts:
   1. I will experiment with (3), though I can only test it on HDFS cases (w/ 
and w/o RBF, w/ and w/o EC).
   2. Spin the rest of the executor-side changes off into a dedicated PR.
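   To make point (3) concrete, here is a minimal sketch (not code from this 
PR) of what constructing a `FileStatus` on the executor side could look like, 
using Hadoop's public `FileStatus` constructor and the metadata that 
`PartitionedFile` already carries (`filePath`, `fileSize`, 
`modificationTime`, per the javadoc being removed above). The helper name, 
the replication factor, and the block size below are assumptions for 
illustration only; the point is that no `FileSystem.getFileStatus` RPC is 
needed:

   ```scala
   import org.apache.hadoop.fs.{FileStatus, Path}

   object ExecutorSideFileStatus {
     // Hypothetical helper: build a FileStatus from metadata the driver
     // already planned into the scan, avoiding a NameNode/RBF round trip.
     // Replication and block size are placeholder values; a real
     // implementation would need to decide what to do about fields that
     // PartitionedFile does not carry.
     def fromKnownMetadata(
         filePath: String,
         fileSize: Long,
         modificationTime: Long): FileStatus = {
       // FileStatus(length, isdir, block_replication, blocksize,
       //            modification_time, path)
       new FileStatus(fileSize, false, 1, 128L * 1024 * 1024,
         modificationTime, new Path(filePath))
     }
   }
   ```

   Whether such a partially-populated `FileStatus` is acceptable to all 
downstream consumers is exactly what the HDFS experiments in step (1) would 
need to validate.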



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org