Re: [DISCUSS] Creating an Apache project for Parquet reader optimisations

Julien Le Dem Fri, 05 Dec 2025 16:01:21 -0800

{catching up on email}
It also sounds good to me to start in Parquet.
I think we have a lot of options. It can be in parquet-java. If that
becomes unwieldy, we could create a separate component. And eventually this
could even graduate out as its own project if really needed (DataFusion
graduated out of Arrow).
We have also wanted to decouple the parquet-java code from the HDFS APIs
forever, so this could be an opportunity to evolve in that direction (not
a requirement! I'm not trying to expand the scope more than needed here).
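
As a rough illustration of the layering discussed further down this thread
(shared read logic on top of a small, pluggable store layer with no Hadoop
types in it), here is a minimal sketch. Every name in it (RangeReader,
FooterCachingReader, InMemoryStore) is hypothetical and exists only for this
example; none of it is taken from parquet-java, Hadoop, Iceberg, AAL or
gcs-analytics-core, and error handling is left out.

```java
import java.util.Arrays;

/** Hypothetical store-specific layer: the only part a vendor module would implement. */
interface RangeReader {
    long length();

    /** Returns {@code length} bytes starting at {@code offset}. */
    byte[] readRange(long offset, int length);
}

/** Hypothetical shared layer: fetches the object tail once and serves footer reads from it. */
final class FooterCachingReader {
    private final RangeReader store;
    private final byte[] tail;     // cached last bytes of the object
    private final long tailStart;  // offset of tail[0] within the object

    FooterCachingReader(RangeReader store, int tailSize) {
        this.store = store;
        long size = store.length();
        int n = (int) Math.min((long) tailSize, size);
        this.tailStart = size - n;
        // One request up front instead of separate reads for footer length and footer.
        this.tail = store.readRange(tailStart, n);
    }

    byte[] read(long offset, int length) {
        boolean inTail = offset >= tailStart && offset + length <= tailStart + tail.length;
        if (inTail) {
            int from = (int) (offset - tailStart);
            return Arrays.copyOfRange(tail, from, from + length);
        }
        // Anything outside the cached tail goes to the backing store.
        return store.readRange(offset, length);
    }
}

/** In-memory stand-in for an object store (assumption for the example); counts requests. */
final class InMemoryStore implements RangeReader {
    private final byte[] data;
    int requests = 0;

    InMemoryStore(byte[] data) {
        this.data = data;
    }

    @Override
    public long length() {
        return data.length;
    }

    @Override
    public byte[] readRange(long offset, int length) {
        requests++;
        return Arrays.copyOfRange(data, (int) offset, (int) offset + length);
    }
}
```

With something shaped like this, a reader that first asks for the last 8 bytes
and then for the footer itself would cause a single request to the store, and
a different backend would only need its own RangeReader.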


On Tue, Dec 2, 2025 at 10:34 AM Suhail, Ahmar <[email protected]> wrote:

> Thanks Chris and Micah, this sounds good to me.
>
> As a next step, I'll start a separate discussion on the parquet dev
> mailing list.
>
> On 02/12/2025, 18:27, "Chris Nauroth" <[email protected]> wrote:
>
> I agree that starting within Parquet itself seems like a good starting
> point. It's not clear to me at this point if the proposed scope is large
> enough to warrant its own top-level ASF project. If that scope grows larger
> over time, then there is precedent for creating a spin-off project with the
> sub-community of contributors becoming the initial committers and PMC
> members of that new project.
>
>
> Chris Nauroth
>
>
>
>
> On Fri, Nov 21, 2025 at 11:56 AM Micah Kornfield <[email protected]>
> wrote:
>
>
> > >
> > > 1/ Make changes to parquet java to pass this info down when opening the
> > > file.
> > > 2/ Each underlying input stream implementation would have to make changes
> > > to make use of this info.
> >
> >
> > I'm still trying to understand exactly what is being proposed. Would it be
> > correct (or at least close) to say that the goal is effectively to make a
> > new abstract InputStream that is object store aware, so that the business
> > logic of reading (i.e. vectored reads, closed range reads, etc.) is
> > expressed in the input stream while the backing object store is pluggable?
> > I think the assumption here is that the business logic would likely change
> > more quickly than the underlying object storage APIs? Is the scope broader
> > or narrower than this?
> >
> > IIUC, and this is specific to Parquet file reading, the Parquet project
> > might be a good place to at least start prototyping what this would look
> > like. Or is there a reason that a separate project would be necessary in
> > the short term?
> >
> > Thanks,
> > Micah
> >
> > On Fri, Nov 21, 2025 at 6:49 AM Andrew Lamb <[email protected]>
> > wrote:
> >
> > > > What I’m suggesting here is that we work to get rid of this
> > > > duplication, and have a common Apache project with a single
> > > > implementation of an optimized stream. In my mind, this brings the
> > > > Parquet java library closer to the underlying data stream it relies
> > > > on. And if we can establish some common ground here, in the future, we
> > > > can start looking at more changes we can make to the parquet java
> > > > library itself.
> > >
> > > Makes total sense to me.
> > >
> > > Thanks for the clarification
> > >
> > > Andrew
> > >
> > > On Fri, Nov 21, 2025 at 9:18 AM Suhail, Ahmar <[email protected]>
> > > wrote:
> > >
> > >> Thanks Andrew,
> > >>
> > >> I think you’re referring to adding the right APIs into the parquet-java
> > >> library. The readVectored() API was added to parquet-java a couple of
> > >> years ago (thanks to Mukund and Steve), PR here:
> > >> https://github.com/apache/parquet-java/pull/1139.
> > >>
> > >> The issue then becomes that the underlying streams, e.g. the
> > >> S3AInputStream [1] in S3A, or the S3InputStream [2] in S3FileIO, must
> > >> provide implementations for this. And currently we end up with
> > >> implementations by each cloud provider, for each file system. E.g.
> > >> Google’s equivalent in hadoop-gcp is GoogleHadoopFSInputStream [3].
> > >>
> > >> What I’m suggesting here is that we work to get rid of this
> > >> duplication, and have a common Apache project with a single
> > >> implementation of an optimized stream. In my mind, this brings the
> > >> Parquet java library closer to the underlying data stream it relies on.
> > >> And if we can establish some common ground here, in the future, we can
> > >> start looking at more changes we can make to the parquet java library
> > >> itself.
> > >>
> > >> As an example, if we wanted to make a change to allow parquet-java to
> > >> pass down the boundaries of the current split, so optimized input
> > >> streams can get all the relevant columns for all row groups in the
> > >> current split, we would have to:
> > >>
> > >> 1/ Make changes to parquet java to pass this info down when opening the
> > >> file.
> > >> 2/ Each underlying input stream implementation would have to make
> > >> changes to make use of this info.
> > >>
> > >> A common project focused on optimisations means we should only need to
> > >> do this once and can share the work/maintenance.
> > >>
> > >> Hopefully I understood what you were saying correctly! But please do let
> > >> me know in case I’ve missed the point completely 😊
> > >>
> > >> Thanks,
> > >> Ahmar
> > >>
> > >> [1]:
> > >> https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java
> > >> [2]:
> > >> https://github.com/apache/iceberg/blob/main/aws/src/main/java/org/apache/iceberg/aws/s3/S3InputStream.java
> > >> [3]:
> > >> https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-gcp/src/main/java/org/apache/hadoop/fs/gs/GoogleHadoopFSInputStream.java
> > >>
> > >> From: Andrew Lamb <[email protected]>
> > >> Reply to: "[email protected]" <[email protected]>
> > >> Date: Thursday, 20 November 2025 at 11:10
> > >> To: "[email protected]" <[email protected]>
> > >> Cc: "[email protected]" <[email protected]>,
> > >> "[email protected]" <[email protected]>,
> > >> "[email protected]" <[email protected]>,
> > >> "[email protected]" <[email protected]>,
> > >> "[email protected]" <[email protected]>,
> > >> "Ratnasingham, Kannan" <[email protected]>,
> > >> "Summers, Carl" <[email protected]>,
> > >> "Peace, Andrew" <[email protected]>,
> > >> "[email protected]" <[email protected]>,
> > >> "Basik, Fuat" <[email protected]>,
> > >> "[email protected]" <[email protected]>,
> > >> "[email protected]" <[email protected]>,
> > >> "[email protected]" <[email protected]>,
> > >> "[email protected]" <[email protected]>
> > >> Subject: RE: [EXTERNAL] [DISCUSS] Creating an Apache project for Parquet
> > >> reader optimisations
> > >>
> > >> One approach, which I think has served us well in the Rust ecosystem,
> > >> has been to keep the Parquet implementation in a separate library, and
> > >> carefully design APIs that enable downstream optimizations, rather than
> > >> multiple more tightly integrated implementations in different query
> > >> engines.
> > >>
> > >> Specifically, have you considered adding the appropriate APIs to the
> > >> parquet-java codebase (for example, to get the ranges needed to prefetch
> > >> given a set of filters)? It would take non-trivial care to design these
> > >> APIs correctly, but you could then plausibly use them to implement the
> > >> system specific optimizations you describe. It may be hard to implement
> > >> parquet optimizations as a stream without more detailed information
> > >> known to the decoder.
> > >>
> > >> I realize it is more common to have the Parquet reader/writer in the
> > >> actual engines (e.g. Spark and Trino) but doing so means trying to
> > >> optimize / implement best practices requires duplicated effort. Of
> > >> course this comes with tradeoffs of having to manage requirements across
> > >> multiple engines and coordinate release schedules, etc.
> > >>
> > >> Examples of some generic APIs in arrow-rs's Parquet reader are:
> > >> 1. Filter evaluation API (note it is not part of a query engine)[1]
> > >> 2. PushDecoder to separate IO from parquet decoding[2]
> > >>
> > >> Andrew
> > >>
> > >> [1]:
> > >> https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.RowFilter.html
> > >> [2]:
> > >> https://github.com/apache/arrow-rs/blob/fea605cb16f7524cb69a197bfa581a1d4f5fe5d0/parquet/src/arrow/push_decoder/mod.rs#L218-L233
> > >>
> > >> On Wed, Nov 19, 2025 at 8:28 AM Ahmar Suhail <[email protected]> wrote:
> > >> Hey everyone,
> > >>
> > >> I'm part of the S3 team at AWS, and a PMC member on the Hadoop project,
> > >> contributing mainly to S3A. I would like to start a discussion on
> > >> collaborating on a single Apache level project, which will implement
> > >> parquet input stream level optimisations like readVectored() in a
> > >> unified place, rather than having vendor specific implementations.
> > >>
> > >> Last year, my team started working on an analytics accelerator for S3
> > >> <https://github.com/awslabs/analytics-accelerator-s3> (AAL), with the
> > >> goal of improving query performance for Spark workloads by implementing
> > >> client side best practices. You can find more details about the project
> > >> in this doc
> > >> <https://docs.google.com/document/d/13shy0RWotwfWC_qQksb95PXdi-vSUCKQyDzjoExQEN0/edit?tab=t.0#heading=h.3lc3p7s26rnw>,
> > >> which was shared on the Iceberg mailing lists earlier this year, and the
> > >> Iceberg issue to integrate this as the default stream here
> > >> <https://github.com/apache/iceberg/issues/14350>.
> > >>
> > >> The team at Google has gcs-analytics-core
> > >> <https://github.com/GoogleCloudPlatform/gcs-analytics-core>, which
> > >> implements Parquet stream level optimizations and was released in
> > >> September of this year; the Iceberg issue is here
> > >> <https://github.com/apache/iceberg/issues/14326>.
> > >>
> > >> Most parquet reader optimisations are not vendor specific, with the
> > >> major feature set required being:
> > >>
> > >> - Parquet footer prefetching and caching - Prefetch the last X bytes
> > >>   (eg: 32KB) to avoid the "Parquet Footer dance" and cache them.
> > >> - Vectored reads - Lets the parquet-reader pass in a list of columns
> > >>   that can be prefetched in parallel.
> > >> - Sequential Prefetching - Useful for speeding up things where the
> > >>   whole Parquet object is going to be read eg: DistCP, and should help
> > >>   with compaction as well.
> > >>
> > >>
> > >> With this in mind, I would like to propose the following:
> > >>
> > >> - A new ASF project (top level or a sub project of the existing
> > >>   hadoop/iceberg projects).
> > >> - Project has a goal of bringing stream reading best practices into
> > >>   one place. Eg: For parquet, it implements footer prefetching and
> > >>   caching, vectored reads etc.
> > >> - Implements non-format specific best practices/optimisations: eg:
> > >>   Sequential prefetching and reading small objects in a single GET.
> > >> - Is integrated into upstream projects like Iceberg and Hadoop as a
> > >>   replacement/alternative for the current input stream implementations.
> > >>
> > >> We can structure it similar to how Hadoop and Iceberg are today:
> > >>
> > >> - A shared logical layer (think of it similar to hadoop-common), where
> > >>   the common logic goes. Ideally, 80% of the code ends up here
> > >>   (optimisations, memory management, thread pools etc.)
> > >> - A light vendor specific client layer (Kind of like the
> > >>   hadoop-aws/gcp/abfs modules), where any store specific logic ends up.
> > >>   I imagine different cloud stores will have different requirements on
> > >>   things like optimal request sizes, concurrency and certain features
> > >>   that are not common.
> > >>
> > >> Note: These are all high level ideas, influenced by the direction AAL
> > >> has taken in the last year, and perhaps there is a different, more
> > >> optimal way to do this altogether.
> > >>
> > >> From TPC-DS benchmarking my team has done, there looks to be a 10% query
> > >> read performance gain that can be achieved through the above listed
> > >> optimisations, and through collaboration, we can likely drive this
> > >> number up further. For example, it would be great to discuss how Spark
> > >> and the Parquet reader can pass any additional information they have to
> > >> the stream (similar to vectored reads), which can help read performance.
> > >>
> > >> In my opinion, there is a lot of opportunity here, and collaborating on
> > >> a single, shared ASF project helps us achieve it faster, both in terms
> > >> of adoption across upstream projects (eg: Hadoop, Iceberg, Trino), and
> > >> long term maintenance of libraries like these. It also gives us an
> > >> opportunity to combine our knowledge in this space, and react to
> > >> upcoming changes in the Parquet format.
> > >>
> > >> If this sounds good, as a next step I can schedule a sync post
> > >> Thanksgiving to brainstorm ideas and next steps.
> > >>
> > >> Thank you, and looking forward to hearing your thoughts.
> > >>
> > >> Ahmar
> > >>
