> Main advantages of having a separate project are: 1/ The library can easily be extended to other formats in the future, 2/ It is independent of parquet-java versions, as there are customers on much older parquet/hadoop versions who won't be able to benefit and can use the analytics accelerator library right now.
I don't have a strong opinion here. As you know, as an Apache project PMC member, there are trade-offs between trying to start a standalone project and at least doing the initial work in an existing project. I think there is an assumption that if it lands in Parquet it would be coupled into the parquet-java release process. I would guess this isn't a requirement and it could have its own release cycle (and possibly a separate repo to prevent dependencies from creeping in). Active parquet-java maintainers should chime in here though (Hadoop also seems like a potential home as well, but I would have similar concerns if the new I/O library was tightly coupled to Hadoop packages).

> I'll join the parquet sync this Wednesday, and if possible we can discuss more there?

The sync might be lightly attended this week due to the Thanksgiving holiday. It might pay to see if there are any objections by putting a specific proposal on the Parquet mailing list (i.e. possibly a new thread)?

Cheers,
Micah

On Mon, Nov 24, 2025 at 4:08 AM Suhail, Ahmar <[email protected]> wrote:

> Thanks Micah,
>
> Yes, that is quite close to what is being proposed. For reference, you can take a quick look at the existing project [1], and its integration into Iceberg [2].
>
> There are pros and cons to both approaches: adding this into the Parquet project vs maintaining a separate project.
>
> The project has a decent amount of code currently (some of it can be cut for sure), but because there is prefetching involved for the optimisations, you end up needing:
>
> 1/ A block manager, where blocks of prefetched data can be stored.
> 2/ Some way to manage memory and clean up when limits are hit (we're using the Caffeine cache for this).
> 3/ Some logic to manage sequential prefetching (how much to prefetch / when to prefetch).
>
> Just wondering if the Parquet project would be ok with all this code?
>
> Main advantages of having a separate project are: 1/ The library can easily be extended to other formats in the future, 2/ It is independent of parquet-java versions, as there are customers on much older parquet/hadoop versions who won't be able to benefit and can use the analytics accelerator library right now.
>
> I'll join the parquet sync this Wednesday, and if possible we can discuss more there?
>
> [1]: https://github.com/ahmarsuhail/analytics-accelerator-s3
> [2]: https://github.com/apache/iceberg/blob/main/aws/src/main/java/org/apache/iceberg/aws/s3/S3InputFile.java#L77
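For illustration only, the block manager and memory-limit pieces described in points 1/ and 2/ above could look roughly like the sketch below. It assumes the Caffeine cache API; the BlockManager, BlockKey and Block names and the 512 MB budget are hypothetical and not taken from AAL.

    import com.github.benmanes.caffeine.cache.Cache;
    import com.github.benmanes.caffeine.cache.Caffeine;

    import java.time.Duration;

    // Hypothetical sketch: a memory-bounded store for prefetched blocks, using
    // Caffeine for weight-based eviction and cleanup once a byte budget is hit.
    public final class BlockManager {

        // Identifies one fixed-size block of one object (layout is hypothetical).
        record BlockKey(String objectKey, long blockIndex) {}

        // A prefetched block of bytes.
        record Block(byte[] data) {}

        private final Cache<BlockKey, Block> blocks = Caffeine.newBuilder()
                // Evict by total bytes held rather than by entry count.
                .maximumWeight(512L * 1024 * 1024)                  // e.g. a 512 MB budget
                .weigher((BlockKey key, Block block) -> block.data().length)
                // Drop blocks that have not been read for a while.
                .expireAfterAccess(Duration.ofMinutes(5))
                .build();

        public Block get(BlockKey key) {
            return blocks.getIfPresent(key);
        }

        public void put(BlockKey key, Block block) {
            blocks.put(key, block);
        }
    }

Weight-based eviction is what lets "cleanup when limits are hit" fall out of the cache itself rather than requiring a separate reaper thread.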
> On 21/11/2025, 19:58, "Micah Kornfield" <[email protected]> wrote:
>
> > 1/ Make changes to parquet java to pass this info down when opening the file.
> > 2/ Each underlying input stream implementation would have to make changes to make use of this info.
>
> I'm still trying to understand exactly what is being proposed. Would it be correct (or at least close) to say that the goal is effectively to make a new abstract InputStream that is object store aware, so that the business logic of reading (i.e. vectored reads, closed range reads, etc.) is expressed in the input stream while the backing store is pluggable? I think the assumption here is that the business logic would likely change more quickly than the underlying object storage APIs? Is the scope broader or narrower than this?
>
> IIUC, and this is specific to Parquet file reading, the Parquet project might be a good place to at least start prototyping what this would look like. Or is there a reason that a separate project would be necessary in the short term?
>
> Thanks,
> Micah
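One purely illustrative way to read the question above is sketched below: the read logic lives in a single stream implementation and only the store client is pluggable. None of these interfaces exist today; ObjectClient and ObjectStoreInputStream are hypothetical names.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.util.List;
    import java.util.concurrent.CompletableFuture;

    // Hypothetical sketch: shared read "business logic" on top of a pluggable store.
    interface ObjectClient {
        // The only store-specific operation: fetch one byte range of an object.
        CompletableFuture<ByteBuffer> getRange(String key, long offset, int length);

        long contentLength(String key) throws IOException;
    }

    final class ObjectStoreInputStream {

        // A byte range requested by the reader.
        record Range(long offset, int length) {}

        private final ObjectClient client;
        private final String key;

        ObjectStoreInputStream(ObjectClient client, String key) {
            this.client = client;
            this.key = key;
        }

        // Vectored read: issue all ranges in parallel against the backing store,
        // regardless of whether that store is S3, GCS, ABFS, ...
        List<CompletableFuture<ByteBuffer>> readVectored(List<Range> ranges) {
            return ranges.stream()
                    .map(r -> client.getRange(key, r.offset(), r.length()))
                    .toList();
        }
    }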
> On Fri, Nov 21, 2025 at 6:49 AM Andrew Lamb <[email protected]> wrote:
>
> > > What I'm suggesting here is that we work to get rid of this duplication, and have a common Apache project with a single implementation of an optimized stream. In my mind, this brings the Parquet java library closer to the underlying data stream it relies on. And if we can establish some common ground here, in the future, we can start looking at more changes we can make to the parquet java library itself.
> >
> > Makes total sense to me.
> >
> > Thanks for the clarification
> >
> > Andrew
> >
> > On Fri, Nov 21, 2025 at 9:18 AM Suhail, Ahmar <[email protected]> wrote:
> >
> > > Thanks Andrew,
> > >
> > > I think you're referring to adding the right APIs into the parquet-java library. The readVectored() API was added to parquet-java a couple of years ago (thanks to Mukund and Steve), PR here: https://github.com/apache/parquet-java/pull/1139.
> > >
> > > The issue then becomes that the underlying streams, e.g. the S3AInputStream [1] in S3A, or the S3InputStream [2] in S3FileIO, must provide implementations for this. And currently we end up with implementations by each cloud provider, for each file system; e.g. Google's equivalent is GoogleHadoopFSInputStream [3].
> > >
> > > What I'm suggesting here is that we work to get rid of this duplication, and have a common Apache project with a single implementation of an optimized stream. In my mind, this brings the Parquet java library closer to the underlying data stream it relies on. And if we can establish some common ground here, in the future, we can start looking at more changes we can make to the parquet java library itself.
> > >
> > > As an example, if we wanted to make a change to allow parquet-java to pass down the boundaries of the current split, so optimized input streams can get all the relevant columns for all row groups in the current split, we would have to:
> > >
> > > 1/ Make changes to parquet-java to pass this info down when opening the file.
> > > 2/ Have each underlying input stream implementation make changes to make use of this info.
> > >
> > > A common project focused on optimisations means we should only need to do this once and can share the work/maintenance.
> > >
> > > Hopefully I understood what you were saying correctly! But please do let me know in case I've missed the point completely 😊
> > >
> > > Thanks,
> > > Ahmar
> > >
> > > [1]: https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java
> > > [2]: https://github.com/apache/iceberg/blob/main/aws/src/main/java/org/apache/iceberg/aws/s3/S3InputStream.java
> > > [3]: https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-gcp/src/main/java/org/apache/hadoop/fs/gs/GoogleHadoopFSInputStream.java
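For context on the readVectored() call pattern discussed above, the sketch below shows how a caller drives it through the Hadoop FileSystem API, assuming Hadoop 3.3.5+ where FileRange and readVectored() exist. The path and byte ranges are invented; in parquet-java the ranges would come from the column chunk metadata of the row groups being read.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileRange;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.util.List;

    // Sketch of a vectored read against the Hadoop FileSystem API.
    public final class VectoredReadExample {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            Path file = new Path(args[0]);

            // e.g. two column chunks of the current row group (hypothetical offsets/lengths)
            List<FileRange> ranges = List.of(
                    FileRange.createFileRange(4L, 1_048_576),
                    FileRange.createFileRange(9_000_000L, 2_097_152));

            try (FileSystem fs = file.getFileSystem(conf);
                 FSDataInputStream in = fs.open(file)) {
                // The underlying stream (S3AInputStream, GoogleHadoopFSInputStream, ...)
                // decides how to coalesce and parallelise these ranges.
                in.readVectored(ranges, ByteBuffer::allocate);

                for (FileRange range : ranges) {
                    ByteBuffer data = range.getData().join(); // completes once the range is fetched
                    System.out.println("read " + data.remaining() + " bytes at offset " + range.getOffset());
                }
            }
        }
    }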
> > > From: Andrew Lamb <[email protected]>
> > > Reply to: [email protected]
> > > Date: Thursday, 20 November 2025 at 11:10
> > > To: [email protected]
> > > Cc: [email protected], "Ratnasingham, Kannan" <[email protected]>, "Summers, Carl" <[email protected]>, "Peace, Andrew" <[email protected]>, "Basik, Fuat" <[email protected]>
> > > Subject: RE: [EXTERNAL] [DISCUSS] Creating an Apache project for Parquet reader optimisations
> > >
> > > One approach, which I think has served us well in the Rust ecosystem, has been to keep the Parquet implementation in a separate library, and carefully design APIs that enable downstream optimizations, rather than multiple more tightly integrated implementations in different query engines.
> > >
> > > Specifically, have you considered adding the appropriate APIs to the parquet-java codebase (for example, to get the ranges needed to prefetch given a set of filters)? It would take non-trivial care to design these APIs correctly, but you could then plausibly use them to implement the system-specific optimizations you describe. It may be hard to implement Parquet optimizations as a stream without more detailed information known to the decoder.
> > >
> > > I realize it is more common to have the Parquet reader/writer in the actual engines (e.g. Spark and Trino), but doing so means that trying to optimize / implement best practices requires duplicated effort. Of course this comes with the tradeoffs of having to manage requirements across multiple engines, coordinate release schedules, etc.
> > >
> > > Examples of some generic APIs in arrow-rs's Parquet reader are:
> > > 1. Filter evaluation API (note it is not part of a query engine) [1]
> > > 2. PushDecoder to separate IO from Parquet decoding [2]
> > >
> > > Andrew
> > >
> > > [1]: https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.RowFilter.html
> > > [2]: https://github.com/apache/arrow-rs/blob/fea605cb16f7524cb69a197bfa581a1d4f5fe5d0/parquet/src/arrow/push_decoder/mod.rs#L218-L233
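As a strawman for the kind of parquet-java API suggested above (exposing the ranges a filtered read will need, so an I/O layer can prefetch them), something like the sketch below might be a starting point. No such interface exists today and every name in it is hypothetical.

    import java.util.List;

    // Hypothetical sketch only: a way for the reader to describe, up front, which
    // byte ranges a read with a given projection/filter will touch.
    interface ReadPlan {

        // One byte range of the file, e.g. a column chunk or a dictionary page.
        record ByteRange(long offset, long length) {}

        // Ranges needed for the projected columns of the row groups that survive
        // filter/statistics pruning (plus the footer, if not already cached).
        List<ByteRange> rangesToPrefetch();
    }

    // The stream receives the plan before decoding starts and can schedule the
    // GETs (coalescing, parallelism, prefetch depth) however it likes.
    interface PrefetchingInputStream {
        void willNeed(ReadPlan plan);
    }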
> > > On Wed, Nov 19, 2025 at 8:28 AM Ahmar Suhail <[email protected]> wrote:
> > >
> > > Hey everyone,
> > >
> > > I'm part of the S3 team at AWS, and a PMC member on the Hadoop project, contributing mainly to S3A. I would like to start a discussion on collaborating on a single Apache-level project, which will implement Parquet input stream level optimisations like readVectored() in a unified place, rather than having vendor-specific implementations.
> > >
> > > Last year, my team started working on an analytics accelerator for S3 (AAL) <https://github.com/awslabs/analytics-accelerator-s3>, with the goal of improving query performance for Spark workloads by implementing client-side best practices. You can find more details about the project in this doc <https://docs.google.com/document/d/13shy0RWotwfWC_qQksb95PXdi-vSUCKQyDzjoExQEN0/edit?tab=t.0#heading=h.3lc3p7s26rnw>, which was shared on the Iceberg mailing lists earlier this year, and the Iceberg issue to integrate this as the default stream is here <https://github.com/apache/iceberg/issues/14350>.
> > >
> > > The team at Google has gcs-analytics-core <https://github.com/GoogleCloudPlatform/gcs-analytics-core>, which implements Parquet stream level optimizations and was released in September of this year; Iceberg issue here <https://github.com/apache/iceberg/issues/14326>.
> > >
> > > Most Parquet reader optimisations are not vendor specific, with the major feature set required being:
> > >
> > > - Parquet footer prefetching and caching - Prefetch the last X bytes (e.g. 32KB) to avoid the "Parquet footer dance" and cache them.
> > > - Vectored reads - Lets the Parquet reader pass in a list of columns that can be prefetched in parallel.
> > > - Sequential prefetching - Useful for speeding up cases where the whole Parquet object is going to be read, e.g. DistCp, and should help with compaction as well.
> > >
> > > With this in mind, I would like to propose the following:
> > >
> > > - A new ASF project (top level, or a sub-project of the existing Hadoop/Iceberg projects).
> > > - The project has a goal of bringing stream reading best practices into one place. E.g. for Parquet, it implements footer prefetching and caching, vectored reads, etc.
> > > - It implements non-format-specific best practices/optimisations, e.g. sequential prefetching and reading small objects in a single GET.
> > > - It is integrated into upstream projects like Iceberg and Hadoop as a replacement/alternative for the current input stream implementations.
> > >
> > > We can structure it similarly to how Hadoop and Iceberg are today:
> > >
> > > - A shared logical layer (think of it as similar to hadoop-common), where the common logic goes. Ideally, 80% of the code ends up here (optimisations, memory management, thread pools etc.).
> > > - A light vendor-specific client layer (kind of like the hadoop-aws/gcp/abfs modules), where any store-specific logic ends up. I imagine different cloud stores will have different requirements on things like optimal request sizes, concurrency and certain features that are not common.
> > >
> > > Note: These are all high-level ideas, influenced by the direction AAL has taken in the last year, and perhaps there is a different, more optimal way to do this altogether.
> > >
> > > From the TPC-DS benchmarking my team has done, there looks to be a 10% query read performance gain that can be achieved through the above listed optimisations, and through collaboration we can likely drive this number up further. For example, it would be great to discuss how Spark and the Parquet reader can pass any additional information they have to the stream (similar to vectored reads), which can help read performance.
> > >
> > > In my opinion, there is a lot of opportunity here, and collaborating on a single, shared ASF project helps us achieve it faster, both in terms of adoption across upstream projects (e.g. Hadoop, Iceberg, Trino) and long-term maintenance of libraries like these. It also gives us an opportunity to combine our knowledge in this space, and react to upcoming changes in the Parquet format.
> > >
> > > If this sounds good, as a next step I can schedule a sync post-Thanksgiving to brainstorm ideas and next steps.
> > >
> > > Thank you, and looking forward to hearing your thoughts.
> > >
> > > Ahmar
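To make the "footer prefetching" item above concrete: the optimisation relies only on the Parquet file layout (footer metadata, then a 4-byte little-endian footer length, then the "PAR1" magic). The sketch below checks whether a speculatively fetched tail already contains the whole footer; the FooterPrefetch class and the 32 KB tail guess are hypothetical, and how the tail bytes are fetched is left to the caller.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;
    import java.nio.charset.StandardCharsets;

    // Hypothetical sketch of footer prefetching: the caller speculatively reads the
    // last TAIL_GUESS bytes of the file in one request, then uses this helper to
    // check whether the whole footer already landed in that tail.
    public final class FooterPrefetch {

        // How many trailing bytes to fetch speculatively (a guess, e.g. 32 KB).
        public static final int TAIL_GUESS = 32 * 1024;

        private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

        /**
         * @param tail       the last {@code tail.length} bytes of the file
         * @param fileLength total length of the file in bytes
         * @return the footer metadata bytes if they fit inside {@code tail}, else null
         */
        public static ByteBuffer footerFromTail(byte[] tail, long fileLength) throws IOException {
            int n = tail.length;
            if (n < 8) {
                throw new IOException("file too small to be a Parquet file");
            }
            // A Parquet file ends with the 4-byte "PAR1" magic.
            for (int i = 0; i < 4; i++) {
                if (tail[n - 4 + i] != MAGIC[i]) {
                    throw new IOException("not a Parquet file (missing PAR1 magic)");
                }
            }
            // The 4 bytes before the magic hold the footer length, little-endian.
            int footerLength = ByteBuffer.wrap(tail, n - 8, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
            if (footerLength < 0 || (long) footerLength + 8 > fileLength) {
                throw new IOException("corrupt footer length: " + footerLength);
            }
            if ((long) footerLength + 8 > n) {
                return null; // footer is larger than the prefetched tail; caller must issue another read
            }
            return ByteBuffer.wrap(tail, n - 8 - footerLength, footerLength);
        }
    }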
