jerolba opened a new pull request, #3227:
URL: https://github.com/apache/parquet-java/pull/3227

   Avoid LongStream reading files and use an ad-hoc Long Iterator
   
   ### Rationale for this change
   
   Profiling the load of a Parquet file with Java Mission Control, I've noticed 
that the InternalParquetRecordReader 
[LongStream](https://github.com/apache/parquet-java/blob/1f1e07bbf750fba228851c2d63470c3da5726831/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java#L323)
 consumes a significant amount of time:
   
   
![image](https://github.com/user-attachments/assets/31b78224-1901-4418-b988-4f5123604b46)
   
   This LongStream can be replaced with a simpler Long Iterator that iterates 
from 0 (inclusive) to pages.getRowCount() (exclusive).
   
   To measure the overhead, I've created a test 
[project](https://github.com/jerolba/parquet-rowindexiterator) that replaces the 
`InternalParquetRecordReader` class with a version using a Long Iterator (the 
same change as proposed in this PR).
   
   The execution time is sensitive to the state of the JVM, but running the 
benchmark multiple times shows that LongStream is slower than LongIterator 
by 1% to 4%, depending on the run.
   
   ### What changes are included in this PR?
   
   A new `LongIterator` that implements `PrimitiveIterator.OfLong` and replaces 
the `LongStream.range(0, pages.getRowCount()).iterator()` call.
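
As a rough illustration of the idea, an iterator of this kind could look like the sketch below. The class name `RangeLongIterator`, its constructor, and its fields are hypothetical and chosen only for this example; the actual `LongIterator` in the PR may be structured differently.

```java
import java.util.NoSuchElementException;
import java.util.PrimitiveIterator;

// Hypothetical sketch, not the PR's actual code: yields the same values as
// LongStream.range(startInclusive, endExclusive).iterator() without the
// stream pipeline behind it.
final class RangeLongIterator implements PrimitiveIterator.OfLong {
    private final long endExclusive;
    private long next;

    RangeLongIterator(long startInclusive, long endExclusive) {
        this.next = startInclusive;
        this.endExclusive = endExclusive;
    }

    @Override
    public boolean hasNext() {
        return next < endExclusive;
    }

    @Override
    public long nextLong() {
        // Honor the Iterator contract when the range is exhausted.
        if (next >= endExclusive) {
            throw new NoSuchElementException();
        }
        return next++;
    }
}
```

Under these assumptions, `new RangeLongIterator(0, pages.getRowCount())` would stand in for `LongStream.range(0, pages.getRowCount()).iterator()`, and callers that only use `hasNext()` and `nextLong()` would not need to change.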
   
   ### Are these changes tested?
   
   Not directly, but the change is covered by existing tests.
   
   ### Are there any user-facing changes?
   
   No
   
   Closes #3226
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

