>
> 1/ Make changes to parquet-java to pass this info down when opening the
> file.
> 2/ Make changes in each underlying input stream implementation to make
> use of this info.


I'm still trying to understand exactly what is being proposed.  Would it
be correct (or at least close) to say the goal is effectively to make a
new abstract InputStream that is object-store aware, so the business
logic of reading (i.e. vectored reads, closed-range reads, etc.) is
expressed in the input stream while the backing store is pluggable?  I
think the assumption here is that the business logic would likely change
more quickly than the underlying object storage APIs?  Is the scope
broader or narrower than this?
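
To make sure we're picturing the same thing, here is roughly the shape I
have in mind; every name below is hypothetical, just a minimal sketch to
anchor the discussion:

    // Hypothetical sketch only; names and shapes are illustrative.
    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.ByteBuffer;
    import java.util.List;
    import java.util.concurrent.CompletableFuture;

    /** A requested byte range within an object. */
    record ByteRange(long offset, int length) {}

    /** Pluggable backing store: the only store-specific piece. */
    interface ObjectStoreClient {
        CompletableFuture<ByteBuffer> getRange(String key, ByteRange range);
        long contentLength(String key) throws IOException;
    }

    /**
     * Store-agnostic stream holding the reading "business logic" (footer
     * caching, vectored/closed-range reads, prefetch policy); raw range
     * fetches are delegated to whichever ObjectStoreClient (S3, GCS,
     * ABFS, ...) is plugged in.
     */
    abstract class ObjectStoreInputStream extends InputStream {
        protected final ObjectStoreClient client;
        protected final String key;

        protected ObjectStoreInputStream(ObjectStoreClient client, String key) {
            this.client = client;
            this.key = key;
        }

        /** Vectored read: fetch several column-chunk ranges in parallel. */
        public abstract List<CompletableFuture<ByteBuffer>> readRanges(
            List<ByteRange> ranges);
    }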

IIUC, if this is specific to Parquet file reading, the Parquet project
might be a good place to at least start prototyping what this would look
like.  Or is there a reason that a separate project would be necessary
in the short term?

Thanks,
Micah

On Fri, Nov 21, 2025 at 6:49 AM Andrew Lamb <[email protected]> wrote:

> > What I’m suggesting here is that we work to get rid of this duplication,
> and have a common Apache project with a single implementation of an
> optimized stream. In my mind, this brings the Parquet java library closer
> to the underlying data stream it relies on. And If we can establish some
> common ground here, in the future, we can start looking at more changes we
> can make to the parquet java library itself.
>
> Makes total sense to me.
>
> Thanks for the clarification
>
> Andrew
>
> On Fri, Nov 21, 2025 at 9:18 AM Suhail, Ahmar <[email protected]>
> wrote:
>
>> Thanks Andrew,
>>
>> I think you’re referring to adding the right APIs to the parquet-java
>> library. The readVectored() API was added to parquet-java a couple of
>> years ago (thanks to Mukund and Steve), PR here:
>> https://github.com/apache/parquet-java/pull/1139.
>>
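As an aside, for anyone on the thread who has not used it, the Hadoop
side of that API looks roughly like the following (a minimal, untested
sketch; the bucket, path, offsets and lengths are placeholders):

    // Sketch of Hadoop's vectored read API (PositionedReadable/FileRange,
    // Hadoop 3.3.5 and later); see the FSDataInputStream javadocs for
    // exact semantics.
    import java.nio.ByteBuffer;
    import java.util.Arrays;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileRange;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class VectoredReadExample {
        public static void main(String[] args) throws Exception {
            Path path = new Path("s3a://example-bucket/data/file.parquet");
            FileSystem fs = path.getFileSystem(new Configuration());

            // Column-chunk ranges the Parquet reader wants (made-up offsets).
            List<FileRange> ranges = Arrays.asList(
                FileRange.createFileRange(4L, 1_000_000),
                FileRange.createFileRange(2_000_000L, 500_000));

            try (FSDataInputStream in = fs.open(path)) {
                // The underlying stream (S3A, GCS, ...) may coalesce the
                // ranges and fetch them in parallel.
                in.readVectored(ranges, ByteBuffer::allocate);
                for (FileRange r : ranges) {
                    ByteBuffer data = r.getData().get(); // waits for that range
                    // ... hand 'data' to the Parquet decoder ...
                }
            }
        }
    }
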
>> The issue then becomes that the underlying streams, eg: the
>> S3AInputStream [1] in S3A, or the S3InputStream [2] in S3FileIO, must
>> provide implementations for this. And currently we end up with
>> implementations by each cloud provider, for each file system. Eg:
>> Google’s Hadoop connector has its own GoogleHadoopFSInputStream [3].
>>
>> What I’m suggesting here is that we work to get rid of this duplication,
>> and have a common Apache project with a single implementation of an
>> optimized stream. In my mind, this brings the Parquet java library closer
>> to the underlying data stream it relies on. And If we can establish some
>> common ground here, in the future, we can start looking at more changes we
>> can make to the parquet java library itself.
>>
>> As an example, if we wanted to make a change to allow parquet-java to
>> pass down the boundaries of the current split, so that optimized input
>> streams can get all the relevant columns for all row groups in that
>> split, we would have to:
>>
>> 1/ Make changes to parquet-java to pass this info down when opening the
>> file.
>> 2/ Make changes in each underlying input stream implementation to make
>> use of this info.
>>
>> A common project focused on optimisations means we should only need to do
>> this once and can share the work/maintenance.
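
To make that example concrete, I imagine "passing the split down" could
look something like the sketch below -- an entirely hypothetical API;
nothing like this exists in parquet-java today:

    // Entirely hypothetical sketch of passing split boundaries to the stream.
    import java.io.IOException;

    /** Split boundaries the engine already knows when it opens the file. */
    record SplitRange(long start, long end) {}

    /** Hypothetical hook an optimized input stream could implement. */
    interface SplitAwareInputStream {
        /**
         * Called once before reading begins, so the stream can prefetch the
         * column chunks of every row group that falls inside this split.
         */
        void setSplit(SplitRange split) throws IOException;
    }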
>>
>> Hopefully I understood what you were saying correctly! But please do let
>> me know in case I’ve missed the point completely 😊
>>
>> Thanks,
>> Ahmar
>>
>> [1]:
>> https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java
>> [2]:
>> https://github.com/apache/iceberg/blob/main/aws/src/main/java/org/apache/iceberg/aws/s3/S3InputStream.java
>> [3]:
>> https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-gcp/src/main/java/org/apache/hadoop/fs/gs/GoogleHadoopFSInputStream.java
>>
>> From: Andrew Lamb <[email protected]>
>> Date: Thursday, 20 November 2025 at 11:10
>> Subject: RE: [EXTERNAL] [DISCUSS] Creating an Apache project for Parquet
>> reader optimisations
>>
>> One approach, which I think has served us well in the Rust ecosystem, has
>> been to keep the Parquet implementation in a separate library and
>> carefully design APIs that enable downstream optimizations, rather than
>> maintaining multiple, more tightly integrated implementations in different
>> query engines.
>>
>> Specifically, have you considered adding the appropriate APIs to the
>> parquet-java codebase (for example, to get the ranges needed to prefetch
>> given a set of filters)? It would take non-trivial care to design these
>> APIs correctly, but you could then plausibly use them to implement the
>> system-specific optimizations you describe. It may be hard to implement
>> Parquet optimizations purely at the stream level without the more detailed
>> information known to the decoder.
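
As I read it, the kind of API Andrew describes might look something like
this on the parquet-java side (a purely hypothetical shape, just to make
the discussion concrete; only the metadata/filter types referenced are
existing parquet-java classes):

    // Purely hypothetical: "which byte ranges should I prefetch for this read?"
    import java.util.List;
    import org.apache.parquet.filter2.predicate.FilterPredicate;
    import org.apache.parquet.hadoop.metadata.ParquetMetadata;

    interface PrefetchPlanner {
        /** A byte range within the Parquet file that the decoder will need. */
        record Range(long offset, long length) {}

        /**
         * Given the footer metadata, the projected columns and the
         * pushed-down filter, return the byte ranges worth prefetching
         * (eg: the column chunks that survive row-group pruning).
         */
        List<Range> rangesToPrefetch(ParquetMetadata footer,
                                     List<String> projectedColumns,
                                     FilterPredicate filter);
    }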
>>
>> I realize it is more common to have the Parquet reader/writer in the
>> actual engines (e.g. Spark and Trino), but doing so means trying to
>> optimize / implement best practices requires duplicated effort. Of course
>> this comes with the tradeoffs of having to manage requirements across
>> multiple engines, coordinate release schedules, etc.
>>
>> Examples of some generic APIs in arrow-rs's Parquet reader are:
>> 1. Filter evaluation API (note it is not part of a query engine)[1]
>> 2. PushDecoder to separate IO from parquet decoding[2]
>>
>> Andrew
>>
>> [1]:
>> https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.RowFilter.html
>> [2]:
>> https://github.com/apache/arrow-rs/blob/fea605cb16f7524cb69a197bfa581a1d4f5fe5d0/parquet/src/arrow/push_decoder/mod.rs#L218-L233
>>
>> On Wed, Nov 19, 2025 at 8:28 AM Ahmar Suhail <[email protected]> wrote:
>> Hey everyone,
>>
>> I'm part of the S3 team at AWS, and a PMC member on the Hadoop project,
>> contributing mainly to S3A. I would like to start a discussion on
>> collaborating on a single Apache-level project, which would implement
>> Parquet input-stream optimisations like readVectored() in a unified
>> place, rather than having vendor-specific implementations.
>>
>> Last year, my team started working on an analytics accelerator for S3
>> <https://github.com/awslabs/analytics-accelerator-s3> (AAL), with the
>> goal of improving query performance for Spark workloads by implementing
>> client-side best practices. You can find more details about the project
>> in this doc
>> <https://docs.google.com/document/d/13shy0RWotwfWC_qQksb95PXdi-vSUCKQyDzjoExQEN0/edit?tab=t.0#heading=h.3lc3p7s26rnw>,
>> which was shared on the Iceberg mailing lists earlier this year; the
>> Iceberg issue to integrate this as the default stream is here
>> <https://github.com/apache/iceberg/issues/14350>.
>>
>> The team at Google has gcs-analytics-core
>> <https://github.com/GoogleCloudPlatform/gcs-analytics-core>, which
>> implements Parquet stream-level optimizations and was released in
>> September of this year; Iceberg issue here
>> <https://github.com/apache/iceberg/issues/14326>.
>>
>> Most parquet reader optimisations are not vendor-specific, with the major
>> feature set required being:
>>
>>    - Parquet footer prefetching and caching - Prefetch the last X bytes
>>    (eg: 32KB) to avoid the "Parquet footer dance" and cache them (see the
>>    sketch after this list).
>>    - Vectored reads - Lets the parquet reader pass in a list of columns
>>    that can be prefetched in parallel.
>>    - Sequential prefetching - Useful for speeding up cases where the whole
>>    Parquet object is going to be read, eg: DistCP, and should help with
>>    compaction as well.
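
For readers less familiar with the "footer dance", the sketch below is
roughly what footer prefetching/caching boils down to: read the object
tail once, then slice the footer out of the cached buffer instead of
issuing two extra GETs. Sizes and names are illustrative only.

    // Illustrative sketch of serving the Parquet footer from a prefetched tail.
    // The last 8 bytes of a Parquet file are a 4-byte little-endian footer
    // length followed by the "PAR1" magic.
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    final class FooterCache {
        private static final int FOOTER_LENGTH_AND_MAGIC = 8;

        private final ByteBuffer tail; // cached last N bytes (eg: 32KB) of the object

        FooterCache(ByteBuffer tail) {
            this.tail = tail;
        }

        /** Return the serialized footer without the usual extra GETs. */
        ByteBuffer footer() throws IOException {
            ByteBuffer t = tail.duplicate().order(ByteOrder.LITTLE_ENDIAN);
            int footerLen = t.getInt(t.limit() - FOOTER_LENGTH_AND_MAGIC);
            int footerStart = t.limit() - FOOTER_LENGTH_AND_MAGIC - footerLen;
            if (footerStart < 0) {
                // Footer is bigger than the prefetched tail: a real
                // implementation would fall back to a ranged GET here.
                throw new IOException("footer larger than prefetched tail");
            }
            ByteBuffer footer = t.duplicate();
            footer.position(footerStart);
            footer.limit(t.limit() - FOOTER_LENGTH_AND_MAGIC);
            return footer.slice();
        }
    }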
>>
>>
>> With this in mind, I would like to propose the following:
>>
>>    - A new ASF project (top-level, or a sub-project of the existing
>>    Hadoop/Iceberg projects).
>>    - The project has the goal of bringing stream-reading best practices
>>    into one place. Eg: for Parquet, it implements footer prefetching and
>>    caching, vectored reads, etc.
>>    - It implements non-format-specific best practices/optimisations, eg:
>>    sequential prefetching and reading small objects in a single GET.
>>    - It is integrated into upstream projects like Iceberg and Hadoop as a
>>    replacement/alternative for the current input stream implementations.
>>
>> We can structure it similarly to how Hadoop and Iceberg are structured
>> today:
>>
>>    - A shared logical layer (think of it as similar to hadoop-common),
>>    where the common logic goes. Ideally, 80% of the code ends up here
>>    (optimisations, memory management, thread pools, etc.).
>>    - A light vendor-specific client layer (kind of like the
>>    hadoop-aws/gcp/abfs modules), where any store-specific logic ends up. I
>>    imagine different cloud stores will have different requirements on
>>    things like optimal request sizes, concurrency, and certain features
>>    that are not common.
>> Note: These are all high-level ideas, influenced by the direction AAL has
>> taken in the last year, and perhaps there is a different, more optimal way
>> to do this altogether.
>>
>> From TPC-DS benchmarking my team has done, there looks to be a 10% query
>> read performance gain that can be achieved through the optimisations
>> listed above, and through collaboration we can likely drive this number up
>> further. For example, it would be great to discuss how Spark and the
>> Parquet reader can pass any additional information they have to the stream
>> (similar to vectored reads), which can help read performance.
>>
>> In my opinion, there is a lot of opportunity here, and collaborating on a
>> single, shared ASF project helps us achieve it faster, both in terms of
>> adoption across upstream projects (eg: Hadoop, Iceberg, Trino), and long
>> term maintenance of libraries like these. It also gives us an opportunity
>> to combine our knowledge in this space, and react to upcoming changes in
>> the Parquet format.
>>
>> If this sounds good, as a next step I can schedule a sync
>> post-Thanksgiving to brainstorm ideas and next steps.
>>
>> Thank you, and looking forward to hearing your thoughts.
>>
>> Ahmar
>>
>
