I received feedback from Alkis regarding their Parquet optimization work. Their internal testing shows promising results for reducing metadata size and improving parsing performance. They plan to formalize a proposal for these Parquet enhancements in the near future.
Meanwhile, I'm putting together our horizontal sharding proposal as a complementary approach. Even with the Parquet metadata improvements, horizontal sharding would provide additional benefits for:
- More efficient column-level updates
- Streamlined column additions
- Better handling of dominant columns that can cause RowGroup size imbalances (placing these in separate files could significantly improve performance)

Thanks, Peter

Péter Váry <peter.vary.apa...@gmail.com> wrote (on 28 May 2025, Wed, 15:39):
> I would be happy to put together a proposal based on the inputs I got here.
>
> Thanks everyone for your thoughts! I will try to incorporate all of this.
>
> Thanks, Peter
>
> Daniel Weeks <dwe...@apache.org> wrote (on 27 May 2025, Tue, 20:07):
>> I feel like we have two different issues we're talking about here that aren't necessarily tied (though solutions may address both): 1) wide tables, 2) adding columns.
>>
>> Wide tables are definitely a problem where Parquet has limitations. I'm optimistic about the ongoing work that Fokko mentioned to help improve Parquet footers/stats in this area. There are always limitations in how this scales, since wide rows lead to small row groups and the cost to reconstitute a row gets more expensive, but for read-heavy cases that project subsets of columns this should significantly improve performance.
>>
>> Adding columns to an existing dataset is something that comes up periodically, but there's a lot of complexity involved. Parquet does support referencing columns in separate files per the spec, but to my knowledge there's no implementation that takes advantage of this. This does allow for approaches where you separate/rewrite just the footers, or various other tricks, but these approaches get complicated quickly, and the number of readers that can consume those representations would initially be very limited.
>>
>> A larger problem with splitting columns across files is that there are a lot of assumptions about how data is laid out in both readers and writers. For example, aligning row groups and correctly handling split calculation is very complicated if you're trying to split rows across files. Other features are also impacted, like deletes, which reference the file to which they apply: they would need to account for deletes applying to multiple files, and those references would need to be updated if columns are added.
>>
>> I believe there are a lot of interesting approaches to addressing these use cases, but we'd really need a thorough proposal that explores all of these scenarios. The last thing we would want is to introduce incompatibilities within the format that result in incompatible features.
>>
>> -Dan
>>
>> On Tue, May 27, 2025 at 10:02 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>> Point definitely taken. We really should probably POC some of these ideas and see what we are actually dealing with. (He said without volunteering to do the work :P)
>>>
>>> On Tue, May 27, 2025 at 11:55 AM Selcuk Aya <selcuk....@snowflake.com.invalid> wrote:
>>>> Yes, having to rewrite the whole file is not ideal, but I believe most of the cost of rewriting a file comes from decompression, encoding, stats calculations, etc. If you are adding new values for some columns but keeping the rest of the columns in the file the same, then a good deal of the rewrite cost can be optimized away.
>>>> I am not saying this is better than writing to a separate file; I am not sure how much worse it is, though.
>>>>
>>>> On Tue, May 27, 2025 at 9:40 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>> I think that "after the fact" modification is one of the requirements here, i.e. updating a single column without rewriting the whole file. If we have to write new metadata for the file, aren't we in the same boat as having to rewrite the whole file?
>>>>>
>>>>> On Tue, May 27, 2025 at 11:27 AM Selcuk Aya <selcuk....@snowflake.com.invalid> wrote:
>>>>>> If files represent column projections of a table rather than all of the columns in the table, then any read that spans these files needs to identify what constitutes a row. Lance DB, for example, has vertical partitioning across columns but also horizontal partitioning across rows, such that within each horizontal partition (fragment) the same number of rows exists in each vertical partition, which I think is necessary to make whole/partial row construction cheap. If this is the case, there is no reason not to achieve the same data layout inside a single columnar file with a lean header. I think the only valid argument for a separate file is adding a new set of columns to an existing table, but even then I am not sure a separate file is absolutely necessary for good performance.
>>>>>>
>>>>>> Selcuk
>>>>>>
>>>>>> On Tue, May 27, 2025 at 9:18 AM Devin Smith <devinsm...@deephaven.io.invalid> wrote:
>>>>>>> There's a `file_path` field in the Parquet ColumnChunk structure:
>>>>>>> https://github.com/apache/parquet-format/blob/apache-parquet-format-2.11.0/src/main/thrift/parquet.thrift#L959-L962
>>>>>>>
>>>>>>> I'm not sure what tooling actually supports this, though. It could be interesting to see what the history of this is:
>>>>>>> https://lists.apache.org/thread/rcv1cxndp113shjybfcldh6nq1t3lcq3
>>>>>>> https://lists.apache.org/thread/k5nv310yp315fttcz213l8o0vmnd7vyw
>>>>>>>
>>>>>>> On Tue, May 27, 2025 at 8:59 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>>>> I have to agree that while there can be some fixes in Parquet, we fundamentally need a way to split a "row group" or something like that between separate files. If that's something we can do in the Parquet project, that would be great, but it feels like we need to start exploring more drastic options than footer encoding.
>>>>>>>>
>>>>>>>> On Mon, May 26, 2025 at 8:42 PM Gang Wu <ust...@gmail.com> wrote:
>>>>>>>>> I agree with Steven that there are limitations to what Parquet can do.
>>>>>>>>>
>>>>>>>>> In addition to adding new columns requiring a rewrite of all files, files of wide tables may suffer from performance problems like the following:
>>>>>>>>> - Poor compression of row groups, because there are so many columns that even a small number of rows can reach the row group threshold.
>>>>>>>>> - Dominating columns (e.g. blobs) may contribute 99% of the size of a row group, leading to unbalanced column chunks and deteriorating row group compression.
>>>>>>>>> - Similar to adding new columns, partial updates also require rewriting all columns of the affected rows.
>>>>>>>>> IIRC, some table formats already support splitting columns into different files:
>>>>>>>>> - The Lance manifest splits a fragment [1] into one or more data files.
>>>>>>>>> - Apache Hudi has the concept of column families [2].
>>>>>>>>> - Apache Paimon supports sequence groups [3] for partial updates.
>>>>>>>>>
>>>>>>>>> Although Parquet could introduce the concept of logical and physical files to manage the column-to-file mapping, this looks like yet another manifest file design, which duplicates the purpose of Iceberg. These might be something worth exploring in Iceberg.
>>>>>>>>>
>>>>>>>>> [1] https://lancedb.github.io/lance/format.html#fragments
>>>>>>>>> [2] https://github.com/apache/hudi/blob/master/rfc/rfc-80/rfc-80.md
>>>>>>>>> [3] https://paimon.apache.org/docs/0.9/primary-key-table/merge-engine/partial-update/#sequence-group
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Gang
>>>>>>>>>
>>>>>>>>> On Tue, May 27, 2025 at 7:03 AM Steven Wu <stevenz...@gmail.com> wrote:
>>>>>>>>>> The Parquet metadata proposal (linked by Fokko) mainly addresses the read performance problems caused by bloated metadata.
>>>>>>>>>>
>>>>>>>>>> What Peter described in the description seems useful for some ML feature-engineering workloads. A new set of features/columns is added to the table; currently, Iceberg would require rewriting all data files to combine old and new columns (write amplification). Similarly, in the past the community has also talked about the use case of updating a single column, which would likewise require rewriting all data files.
>>>>>>>>>>
>>>>>>>>>> On Mon, May 26, 2025 at 2:42 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>> Do you have the link at hand for the thread where this was discussed on the Parquet list? The docs seem quite old and the PR stale, so I would like to understand the situation better. If it is possible to do this in Parquet, that would be great, but Avro and ORC would still suffer.
>>>>>>>>>>>
>>>>>>>>>>> Amogh Jahagirdar <2am...@gmail.com> wrote (on 26 May 2025, Mon, 22:07):
>>>>>>>>>>>> Hey Peter,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for bringing this issue up. I think I agree with Fokko; wide tables leading to Parquet metadata bloat and poor Thrift deserialization performance is a long-standing issue that I believe there's motivation in the community to address. So to me it seems better to address it in Parquet itself rather than have the Iceberg library facilitate a pattern that works around the limitations.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Amogh Jahagirdar
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, May 26, 2025 at 1:42 PM Fokko Driesprong <fo...@apache.org> wrote:
>>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for bringing this up. Wouldn't it make more sense to fix this in Parquet itself? It has been a long-running issue for Parquet, and there is still active interest from the community.
>>>>>>>>>>>>> There is a PR to replace the footer with FlatBuffers, which dramatically improves performance <https://github.com/apache/arrow/pull/43793>. The underlying proposal can be found here <https://docs.google.com/document/d/1PQpY418LkIDHMFYCY8ne_G-CFpThK15LLpzWYbc7rFU/edit?tab=t.0#heading=h.atbrz9ch6nfa>.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>> Fokko
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, 26 May 2025 at 20:35, yun zou <yunzou.colost...@gmail.com> wrote:
>>>>>>>>>>>>>> +1, I am really interested in this topic. Performance has always been a problem when dealing with wide tables, not just for read/write but also during compilation. Most of the ML use cases typically exhibit a vectorized read/write pattern; I am also wondering if there is any way at the metadata level to help the whole compilation and execution process. I do not have any answer for this yet, but I would be really interested in exploring it further.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>> Yun
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, May 26, 2025 at 9:14 AM Pucheng Yang <py...@pinterest.com.invalid> wrote:
>>>>>>>>>>>>>>> Hi Peter, I am interested in this proposal. What's more, I am curious whether there is a similar story on the write side as well (how to generate these split files), and specifically, are you targeting feature backfill use cases in ML?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, May 26, 2025 at 6:29 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>>>>>>> Hi Team,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In machine learning use cases, it's common to encounter tables with a very high number of columns - sometimes even in the range of several thousand. I've seen cases with up to 15,000 columns. Storing such wide tables in a single Parquet file is often suboptimal, as Parquet can become a bottleneck, even when only a subset of columns is queried.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> A common approach to mitigate this is to split the data across multiple Parquet files. With the upcoming File Format API, we could introduce a layer that combines these files into a single iterator, enabling efficient reading of wide and very wide tables.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> To support this, we would need to revise the metadata specification. Instead of the current `_file` column, we could introduce a `_files` column containing:
>>>>>>>>>>>>>>>> - `_file_column_ids`: the column IDs present in each file
>>>>>>>>>>>>>>>> - `_file_path`: the path to the corresponding file
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Has there been any prior discussion around this idea? Is anyone else interested in exploring this further?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>> Peter
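
To make the `_files` idea quoted above a bit more concrete, here is a minimal, illustrative sketch of how a logical data file backed by several column-subset files could be modeled, and how a scan could open only the files it needs. This is not part of the Iceberg spec or any existing API; the names (ColumnShard, ShardedDataFile, shardsFor) and paths are assumptions made up for discussion.

```java
import java.util.List;

/**
 * Illustrative sketch only: one possible shape for the proposed "_files" metadata,
 * where a single logical data file is backed by several physical files, each holding
 * a subset of the table's columns for the same rows. Names and paths are hypothetical.
 */
public class ShardedDataFileSketch {

  /** One physical file holding a subset of columns (by field ID) for the same rows. */
  record ColumnShard(String filePath, List<Integer> columnIds) {}

  /** The logical file: all shards cover the same rows, with disjoint column sets. */
  record ShardedDataFile(long recordCount, List<ColumnShard> shards) {}

  /** Pick only the shards a scan actually needs for the projected column IDs. */
  static List<ColumnShard> shardsFor(ShardedDataFile file, List<Integer> projectedColumnIds) {
    return file.shards().stream()
        .filter(shard -> shard.columnIds().stream().anyMatch(projectedColumnIds::contains))
        .toList();
  }

  public static void main(String[] args) {
    // A wide table split into a "core" shard and a separate shard for a dominant blob column.
    ShardedDataFile file = new ShardedDataFile(
        1_000_000L,
        List.of(
            new ColumnShard("s3://bucket/table/data/part-00000-core.parquet", List.of(1, 2, 3)),
            new ColumnShard("s3://bucket/table/data/part-00000-blob.parquet", List.of(4))));

    // A scan projecting columns 1 and 3 only needs to open the core shard.
    System.out.println(shardsFor(file, List.of(1, 3)));
  }
}
```

Reconstructing a full row under such a layout would require the reader layer to zip the per-shard iterators positionally, which is where the equal-row-count requirement raised earlier in the thread (the Lance fragment discussion) comes in.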