Re: [PR] [WIP] Reduce HDFS NameNode RPC on vectorized Parquet reader [spark]

via GitHub Wed, 30 Apr 2025 02:50:17 -0700


pan3793 commented on code in PR #50765:
URL: https://github.com/apache/spark/pull/50765#discussion_r2068334750



##########
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:
##########
@@ -89,24 +90,29 @@ public abstract class SpecificParquetRecordReaderBase<T> extends RecordReader<Vo
   @Override
   public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptContext)
       throws IOException, InterruptedException {
-    initialize(inputSplit, taskAttemptContext, Option.empty());
+    initialize(inputSplit, taskAttemptContext, Option.empty(), Option.empty(), Option.empty());
   }
 
   public void initialize(
       InputSplit inputSplit,
       TaskAttemptContext taskAttemptContext,
+      Option<HadoopInputFile> inputFile,
+      Option<SeekableInputStream> inputStream,
       Option<ParquetMetadata> fileFooter) throws IOException, InterruptedException {
     Configuration configuration = taskAttemptContext.getConfiguration();
     FileSplit split = (FileSplit) inputSplit;
     this.file = split.getPath();
+    ParquetReadOptions options = HadoopReadOptions
+        .builder(configuration, file)
+        .withRange(split.getStart(), split.getStart() + split.getLength())
+        .build();
     ParquetFileReader fileReader;
-    if (fileFooter.isDefined()) {
-      fileReader = new ParquetFileReader(configuration, file, fileFooter.get());

Review Comment:
   This constructor internally calls `HadoopInputFile.fromPath(file, configuration)`, which in turn calls `fs.getFileStatus(path)` and so issues an unnecessary `GetFileInfo` RPC to the NameNode:
   
   ```
     public static HadoopInputFile fromPath(Path path, Configuration conf) throws IOException {
       FileSystem fs = path.getFileSystem(conf);
       return new HadoopInputFile(fs, fs.getFileStatus(path), conf);
     }
   ```
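
   To make the cost concrete, here is a minimal, self-contained sketch of the pattern. It is not the PR's code: `CountingFileSystem`, `StatusStub`, and `InputFileStub` are hypothetical stand-ins that use no Hadoop or Parquet classes, and each counted lookup stands for one `GetFileInfo` round trip. The `fromStatus` path loosely models what the diff does by passing an already-resolved `Option<HadoopInputFile>` into `initialize`. I believe `HadoopInputFile.fromStatus(FileStatus, Configuration)` is the real counterpart in parquet-hadoop, but treat that as an assumption to verify.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration only; none of these are Hadoop or Parquet classes.
final class FileStatusReuseSketch {

  /** Immutable file metadata, playing the role of a file status. */
  static final class StatusStub {
    final String path;
    final long length;

    StatusStub(String path, long length) {
      this.path = path;
      this.length = length;
    }
  }

  /** In-memory file system that counts metadata lookups (one per simulated GetFileInfo RPC). */
  static final class CountingFileSystem {
    private final Map<String, Long> lengths = new HashMap<>();
    private int lookups = 0;

    void addFile(String path, long length) {
      lengths.put(path, length);
    }

    StatusStub getFileStatus(String path) {
      lookups++; // each call stands for one round trip to the metadata server
      Long length = lengths.get(path);
      if (length == null) {
        throw new IllegalArgumentException("No such file: " + path);
      }
      return new StatusStub(path, length);
    }

    int lookups() {
      return lookups;
    }
  }

  /** Input-file handle with two factories: resolve by path, or reuse a known status. */
  static final class InputFileStub {
    final StatusStub status;

    private InputFileStub(StatusStub status) {
      this.status = status;
    }

    // Same shape as the quoted fromPath: always performs one lookup.
    static InputFileStub fromPath(CountingFileSystem fs, String path) {
      return new InputFileStub(fs.getFileStatus(path));
    }

    // Reuses a status the caller already holds: no lookup.
    static InputFileStub fromStatus(StatusStub status) {
      return new InputFileStub(status);
    }
  }

  /** Builds a handle by path `opens` times and returns the number of lookups performed. */
  static int lookupsWhenResolvingEachTime(int opens) {
    CountingFileSystem fs = new CountingFileSystem();
    fs.addFile("/data/part-0.parquet", 1024L);
    for (int i = 0; i < opens; i++) {
      InputFileStub.fromPath(fs, "/data/part-0.parquet");
    }
    return fs.lookups();
  }

  /** Resolves the status once, then builds `opens` handles from it. */
  static int lookupsWhenReusingStatus(int opens) {
    CountingFileSystem fs = new CountingFileSystem();
    fs.addFile("/data/part-0.parquet", 1024L);
    StatusStub status = fs.getFileStatus("/data/part-0.parquet"); // the only lookup
    for (int i = 0; i < opens; i++) {
      InputFileStub.fromStatus(status);
    }
    return fs.lookups();
  }

  public static void main(String[] args) {
    System.out.println("resolve each time: " + lookupsWhenResolvingEachTime(3) + " lookups");
    System.out.println("reuse status:      " + lookupsWhenReusingStatus(3) + " lookups");
  }
}
```

   In the sketch the lookup count grows with every handle built from a path, while reuse keeps it at one. How much this saves in the real reader depends on how many times the file is opened per task.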


