> Main advantages of having a separate project are: 1/ The library can easily be extended to other formats in the future, 2/ It is independent of parquet-java versions, as there are customers on much older parquet/hadoop versions who won't be able to benefit and can use the analytics accelerator library right now.
I don't have a strong opinion here. As you know, as an Apache project PMC member, there are trade-offs between trying to start a standalone project and at least doing the initial work in an existing project. I think there is an assumption that if it lands in Parquet it would be coupled into the parquet-java release process. I would guess this isn't a requirement and it could have its own release cycle (and possibly a separate repo to prevent dependencies from creeping in). Active parquet-java maintainers should chime in here though (Hadoop also seems like a potential home as well, but I would have similar concerns if the new I/O library was tightly coupled to Hadoop packages).

> I'll join the parquet sync this Wednesday, and if possible we can discuss more there?

The sync might be lightly attended this week due to the Thanksgiving holiday. It might pay to see if there are any objections by putting a specific proposal on the Parquet mailing list (i.e. possibly a new thread)?

Cheers,
Micah

On Mon, Nov 24, 2025 at 4:08 AM Suhail, Ahmar <[email protected]> wrote:

> Thanks Micah,
>
> Yes, that is quite close to what is being proposed. For reference, you can take a quick look at the existing project [1], and its integration into Iceberg [2].
>
> There are pros and cons to both approaches: adding this into the Parquet project vs maintaining a separate project.
>
> The project has a decent amount of code currently (some of it can be cut for sure), but because there is prefetching involved for the optimisations, you end up needing:
>
> 1/ A block manager, where blocks of prefetched data can be stored.
> 2/ Some way to manage memory and clean up when limits are hit (we're using the Caffeine cache for this).
> 3/ Some logic to manage sequential prefetching (how much to prefetch / when to prefetch).
>
> Just wondering if the Parquet project would be ok with all this code?
>
> Main advantages of having a separate project are: 1/ The library can easily be extended to other formats in the future, 2/ It is independent of parquet-java versions, as there are customers on much older parquet/hadoop versions who won't be able to benefit and can use the analytics accelerator library right now.
>
> I'll join the parquet sync this Wednesday, and if possible we can discuss more there?
>
> [1]: https://github.com/ahmarsuhail/analytics-accelerator-s3
> [2]: https://github.com/apache/iceberg/blob/main/aws/src/main/java/org/apache/iceberg/aws/s3/S3InputFile.java#L77
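For illustration only, the block manager and memory-limit pieces described in points 1/ and 2/ above could look roughly like the sketch below. It assumes the Caffeine cache API; the BlockManager, BlockKey and Block names and the 512 MB budget are hypothetical and not taken from AAL.

    import com.github.benmanes.caffeine.cache.Cache;
    import com.github.benmanes.caffeine.cache.Caffeine;

    import java.time.Duration;

    // Hypothetical sketch: a memory-bounded store for prefetched blocks, using
    // Caffeine for weight-based eviction and cleanup once a byte budget is hit.
    public final class BlockManager {

        // Identifies one fixed-size block of one object (layout is hypothetical).
        record BlockKey(String objectKey, long blockIndex) {}

        // A prefetched block of bytes.
        record Block(byte[] data) {}

        private final Cache<BlockKey, Block> blocks = Caffeine.newBuilder()
                // Evict by total bytes held rather than by entry count.
                .maximumWeight(512L * 1024 * 1024)                  // e.g. a 512 MB budget
                .weigher((BlockKey key, Block block) -> block.data().length)
                // Drop blocks that have not been read for a while.
                .expireAfterAccess(Duration.ofMinutes(5))
                .build();

        public Block get(BlockKey key) {
            return blocks.getIfPresent(key);
        }

        public void put(BlockKey key, Block block) {
            blocks.put(key, block);
        }
    }

Weight-based eviction is what lets "cleanup when limits are hit" fall out of the cache itself rather than requiring a separate reaper thread.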
> On 21/11/2025, 19:58, "Micah Kornfield" <[email protected]> wrote:
>
> > 1/ Make changes to parquet java to pass this info down when opening the file.
> > 2/ Each underlying input stream implementation would have to make changes to make use of this info.
>
> I'm still trying to understand exactly what is being proposed. Would it be correct (or at least close) to say that the goal is effectively to make a new abstract InputStream that is object store aware, so that the business logic of reading (i.e. vectored reads, closed range reads, etc.) is expressed in the input stream while the backing store is pluggable? I think the assumption here is that the business logic would likely change more quickly than the underlying object storage APIs? Is the scope broader or narrower than this?
>
> IIUC, and this is specific to Parquet file reading, the Parquet project might be a good place to at least start prototyping what this would look like. Or is there a reason that a separate project would be necessary in the short term?
>
> Thanks,
> Micah
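One purely illustrative way to read the question above is sketched below: the read logic lives in a single stream implementation and only the store client is pluggable. None of these interfaces exist today; ObjectClient and ObjectStoreInputStream are hypothetical names.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.util.List;
    import java.util.concurrent.CompletableFuture;

    // Hypothetical sketch: shared read "business logic" on top of a pluggable store.
    interface ObjectClient {
        // The only store-specific operation: fetch one byte range of an object.
        CompletableFuture<ByteBuffer> getRange(String key, long offset, int length);

        long contentLength(String key) throws IOException;
    }

    final class ObjectStoreInputStream {

        // A byte range requested by the reader.
        record Range(long offset, int length) {}

        private final ObjectClient client;
        private final String key;

        ObjectStoreInputStream(ObjectClient client, String key) {
            this.client = client;
            this.key = key;
        }

        // Vectored read: issue all ranges in parallel against the backing store,
        // regardless of whether that store is S3, GCS, ABFS, ...
        List<CompletableFuture<ByteBuffer>> readVectored(List<Range> ranges) {
            return ranges.stream()
                    .map(r -> client.getRange(key, r.offset(), r.length()))
                    .toList();
        }
    }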
> On Fri, Nov 21, 2025 at 6:49 AM Andrew Lamb <[email protected]> wrote:
>
> > > What I'm suggesting here is that we work to get rid of this duplication, and have a common Apache project with a single implementation of an optimized stream. In my mind, this brings the Parquet java library closer to the underlying data stream it relies on. And if we can establish some common ground here, in the future, we can start looking at more changes we can make to the parquet java library itself.
> >
> > Makes total sense to me.
> >
> > Thanks for the clarification
> >
> > Andrew
> >
> > On Fri, Nov 21, 2025 at 9:18 AM Suhail, Ahmar <[email protected]> wrote:
> >
> > > Thanks Andrew,
> > >
> > > I think you're referring to adding the right APIs into the parquet-java library. The readVectored() API was added to parquet-java a couple of years ago (thanks to Mukund and Steve), PR here: https://github.com/apache/parquet-java/pull/1139.
> > >
> > > The issue then becomes that the underlying streams, e.g. the S3AInputStream [1] in S3A, or the S3InputStream [2] in S3FileIO, must provide implementations for this. And currently we end up with implementations by each cloud provider, for each file system; e.g. Google's equivalent is GoogleHadoopFSInputStream [3].
> > >
> > > What I'm suggesting here is that we work to get rid of this duplication, and have a common Apache project with a single implementation of an optimized stream. In my mind, this brings the Parquet java library closer to the underlying data stream it relies on. And if we can establish some common ground here, in the future, we can start looking at more changes we can make to the parquet java library itself.
> > >
> > > As an example, if we wanted to make a change to allow parquet-java to pass down the boundaries of the current split, so optimized input streams can get all the relevant columns for all row groups in the current split, we would have to:
> > >
> > > 1/ Make changes to parquet-java to pass this info down when opening the file.
> > > 2/ Have each underlying input stream implementation make changes to make use of this info.
> > >
> > > A common project focused on optimisations means we should only need to do this once and can share the work/maintenance.
> > >
> > > Hopefully I understood what you were saying correctly! But please do let me know in case I've missed the point completely 😊
> > >
> > > Thanks,
> > > Ahmar
> > >
> > > [1]: https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java
> > > [2]: https://github.com/apache/iceberg/blob/main/aws/src/main/java/org/apache/iceberg/aws/s3/S3InputStream.java
> > > [3]: https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-gcp/src/main/java/org/apache/hadoop/fs/gs/GoogleHadoopFSInputStream.java
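For context on the readVectored() call pattern discussed above, the sketch below shows how a caller drives it through the Hadoop FileSystem API, assuming Hadoop 3.3.5+ where FileRange and readVectored() exist. The path and byte ranges are invented; in parquet-java the ranges would come from the column chunk metadata of the row groups being read.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileRange;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.util.List;

    // Sketch of a vectored read against the Hadoop FileSystem API.
    public final class VectoredReadExample {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            Path file = new Path(args[0]);

            // e.g. two column chunks of the current row group (hypothetical offsets/lengths)
            List<FileRange> ranges = List.of(
                    FileRange.createFileRange(4L, 1_048_576),
                    FileRange.createFileRange(9_000_000L, 2_097_152));

            try (FileSystem fs = file.getFileSystem(conf);
                 FSDataInputStream in = fs.open(file)) {
                // The underlying stream (S3AInputStream, GoogleHadoopFSInputStream, ...)
                // decides how to coalesce and parallelise these ranges.
                in.readVectored(ranges, ByteBuffer::allocate);

                for (FileRange range : ranges) {
                    ByteBuffer data = range.getData().join(); // completes once the range is fetched
                    System.out.println("read " + data.remaining() + " bytes at offset " + range.getOffset());
                }
            }
        }
    }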
> > > From: Andrew Lamb <[email protected]>
> > > Reply to: [email protected]
> > > Date: Thursday, 20 November 2025 at 11:10
> > > To: [email protected]
> > > Cc: [email protected], "Ratnasingham, Kannan" <[email protected]>, "Summers, Carl" <[email protected]>, "Peace, Andrew" <[email protected]>, "Basik, Fuat" <[email protected]>
> > > Subject: RE: [EXTERNAL] [DISCUSS] Creating an Apache project for Parquet reader optimisations
> > >
> > > One approach, which I think has served us well in the Rust ecosystem, has been to keep the Parquet implementation in a separate library, and carefully design APIs that enable downstream optimizations, rather than multiple more tightly integrated implementations in different query engines.
> > >
> > > Specifically, have you considered adding the appropriate APIs to the parquet-java codebase (for example, to get the ranges needed to prefetch given a set of filters)? It would take non-trivial care to design these APIs correctly, but you could then plausibly use them to implement the system-specific optimizations you describe. It may be hard to implement Parquet optimizations as a stream without more detailed information known to the decoder.
> > >
> > > I realize it is more common to have the Parquet reader/writer in the actual engines (e.g. Spark and Trino), but doing so means that trying to optimize / implement best practices requires duplicated effort. Of course this comes with the tradeoffs of having to manage requirements across multiple engines, coordinate release schedules, etc.
> > >
> > > Examples of some generic APIs in arrow-rs's Parquet reader are:
> > > 1. Filter evaluation API (note it is not part of a query engine) [1]
> > > 2. PushDecoder to separate IO from Parquet decoding [2]
> > >
> > > Andrew
> > >
> > > [1]: https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.RowFilter.html
> > > [2]: https://github.com/apache/arrow-rs/blob/fea605cb16f7524cb69a197bfa581a1d4f5fe5d0/parquet/src/arrow/push_decoder/mod.rs#L218-L233
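As a strawman for the kind of parquet-java API suggested above (exposing the ranges a filtered read will need, so an I/O layer can prefetch them), something like the sketch below might be a starting point. No such interface exists today and every name in it is hypothetical.

    import java.util.List;

    // Hypothetical sketch only: a way for the reader to describe, up front, which
    // byte ranges a read with a given projection/filter will touch.
    interface ReadPlan {

        // One byte range of the file, e.g. a column chunk or a dictionary page.
        record ByteRange(long offset, long length) {}

        // Ranges needed for the projected columns of the row groups that survive
        // filter/statistics pruning (plus the footer, if not already cached).
        List<ByteRange> rangesToPrefetch();
    }

    // The stream receives the plan before decoding starts and can schedule the
    // GETs (coalescing, parallelism, prefetch depth) however it likes.
    interface PrefetchingInputStream {
        void willNeed(ReadPlan plan);
    }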
> > > On Wed, Nov 19, 2025 at 8:28 AM Ahmar Suhail <[email protected]> wrote:
> > >
> > > Hey everyone,
> > >
> > > I'm part of the S3 team at AWS, and a PMC member on the Hadoop project, contributing mainly to S3A. I would like to start a discussion on collaborating on a single Apache-level project, which will implement Parquet input stream level optimisations like readVectored() in a unified place, rather than having vendor-specific implementations.
> > >
> > > Last year, my team started working on an analytics accelerator for S3 (AAL) <https://github.com/awslabs/analytics-accelerator-s3>, with the goal of improving query performance for Spark workloads by implementing client-side best practices. You can find more details about the project in this doc <https://docs.google.com/document/d/13shy0RWotwfWC_qQksb95PXdi-vSUCKQyDzjoExQEN0/edit?tab=t.0#heading=h.3lc3p7s26rnw>, which was shared on the Iceberg mailing lists earlier this year, and the Iceberg issue to integrate this as the default stream is here <https://github.com/apache/iceberg/issues/14350>.
> > >
> > > The team at Google has gcs-analytics-core <https://github.com/GoogleCloudPlatform/gcs-analytics-core>, which implements Parquet stream level optimizations and was released in September of this year; Iceberg issue here <https://github.com/apache/iceberg/issues/14326>.
> > >
> > > Most Parquet reader optimisations are not vendor specific, with the major feature set required being:
> > >
> > > - Parquet footer prefetching and caching - Prefetch the last X bytes (e.g. 32KB) to avoid the "Parquet footer dance" and cache them.
> > > - Vectored reads - Lets the Parquet reader pass in a list of columns that can be prefetched in parallel.
> > > - Sequential prefetching - Useful for speeding up cases where the whole Parquet object is going to be read, e.g. DistCp, and should help with compaction as well.
> > >
> > > With this in mind, I would like to propose the following:
> > >
> > > - A new ASF project (top level, or a sub-project of the existing Hadoop/Iceberg projects).
> > > - The project has a goal of bringing stream reading best practices into one place. E.g. for Parquet, it implements footer prefetching and caching, vectored reads, etc.
> > > - It implements non-format-specific best practices/optimisations, e.g. sequential prefetching and reading small objects in a single GET.
> > > - It is integrated into upstream projects like Iceberg and Hadoop as a replacement/alternative for the current input stream implementations.
> > >
> > > We can structure it similarly to how Hadoop and Iceberg are today:
> > >
> > > - A shared logical layer (think of it as similar to hadoop-common), where the common logic goes. Ideally, 80% of the code ends up here (optimisations, memory management, thread pools etc.).
> > > - A light vendor-specific client layer (kind of like the hadoop-aws/gcp/abfs modules), where any store-specific logic ends up. I imagine different cloud stores will have different requirements on things like optimal request sizes, concurrency and certain features that are not common.
> > >
> > > Note: These are all high-level ideas, influenced by the direction AAL has taken in the last year, and perhaps there is a different, more optimal way to do this altogether.
> > >
> > > From the TPC-DS benchmarking my team has done, there looks to be a 10% query read performance gain that can be achieved through the above listed optimisations, and through collaboration we can likely drive this number up further. For example, it would be great to discuss how Spark and the Parquet reader can pass any additional information they have to the stream (similar to vectored reads), which can help read performance.
> > >
> > > In my opinion, there is a lot of opportunity here, and collaborating on a single, shared ASF project helps us achieve it faster, both in terms of adoption across upstream projects (e.g. Hadoop, Iceberg, Trino) and long-term maintenance of libraries like these. It also gives us an opportunity to combine our knowledge in this space, and react to upcoming changes in the Parquet format.
> > >
> > > If this sounds good, as a next step I can schedule a sync post-Thanksgiving to brainstorm ideas and next steps.
> > >
> > > Thank you, and looking forward to hearing your thoughts.
> > >
> > > Ahmar
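To make the "footer prefetching" item above concrete: the optimisation relies only on the Parquet file layout (footer metadata, then a 4-byte little-endian footer length, then the "PAR1" magic). The sketch below checks whether a speculatively fetched tail already contains the whole footer; the FooterPrefetch class and the 32 KB tail guess are hypothetical, and how the tail bytes are fetched is left to the caller.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;
    import java.nio.charset.StandardCharsets;

    // Hypothetical sketch of footer prefetching: the caller speculatively reads the
    // last TAIL_GUESS bytes of the file in one request, then uses this helper to
    // check whether the whole footer already landed in that tail.
    public final class FooterPrefetch {

        // How many trailing bytes to fetch speculatively (a guess, e.g. 32 KB).
        public static final int TAIL_GUESS = 32 * 1024;

        private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

        /**
         * @param tail       the last {@code tail.length} bytes of the file
         * @param fileLength total length of the file in bytes
         * @return the footer metadata bytes if they fit inside {@code tail}, else null
         */
        public static ByteBuffer footerFromTail(byte[] tail, long fileLength) throws IOException {
            int n = tail.length;
            if (n < 8) {
                throw new IOException("file too small to be a Parquet file");
            }
            // A Parquet file ends with the 4-byte "PAR1" magic.
            for (int i = 0; i < 4; i++) {
                if (tail[n - 4 + i] != MAGIC[i]) {
                    throw new IOException("not a Parquet file (missing PAR1 magic)");
                }
            }
            // The 4 bytes before the magic hold the footer length, little-endian.
            int footerLength = ByteBuffer.wrap(tail, n - 8, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
            if (footerLength < 0 || (long) footerLength + 8 > fileLength) {
                throw new IOException("corrupt footer length: " + footerLength);
            }
            if ((long) footerLength + 8 > n) {
                return null; // footer is larger than the prefetched tail; caller must issue another read
            }
            return ByteBuffer.wrap(tail, n - 8 - footerLength, footerLength);
        }
    }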
