"Fewer commit conflicts" means that the tables representing column families are updated independently, rather than having to serialize commits to a single table. Perhaps with a wide table solution the commit logic could be enhanced to support things like concurrent overwrites to independent column families, but it seems like that would be fairly involved.
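To make the comparison concrete, here is a minimal sketch of the join/view approach we are evaluating, written as PySpark against Iceberg tables. The table names, columns, and bucket count are made up for illustration, and the exact configuration needed to get a storage-partitioned join varies by Spark and Iceberg version, so treat this as a sketch rather than a recipe:

# Sketch only: hypothetical names and bucket count; SPJ configs vary by version.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Storage-partitioned joins avoid the shuffle when both sides are bucketed
# identically on the join key.
spark.conf.set("spark.sql.sources.v2.bucketing.enabled", "true")
spark.conf.set("spark.sql.iceberg.planning.preserve-data-grouping", "true")

# One table per "column family", all bucketed the same way on the row key,
# so each pipeline commits only to its own table.
spark.sql("""
    CREATE TABLE IF NOT EXISTS db.features_a (
        entity_id BIGINT, f1 DOUBLE, f2 DOUBLE)
    USING iceberg
    PARTITIONED BY (bucket(128, entity_id))
""")
spark.sql("""
    CREATE TABLE IF NOT EXISTS db.features_b (
        entity_id BIGINT, f3 DOUBLE, f4 DOUBLE)
    USING iceberg
    PARTITIONED BY (bucket(128, entity_id))
""")

# A view (catalog or temporary, depending on catalog support) presents the
# unified "wide table" to readers.
spark.sql("""
    CREATE OR REPLACE VIEW db.features_wide AS
    SELECT a.entity_id, a.f1, a.f2, b.f3, b.f4
    FROM db.features_a a
    JOIN db.features_b b ON a.entity_id = b.entity_id
""")

Swapping or nesting a column family then becomes a change to the view definition, and each table keeps its own, smaller metadata tree.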
> On May 29, 2025, at 7:16 PM, Steven Wu <stevenz...@gmail.com> wrote:
>
> Bryan, interesting approach to split horizontally across multiple tables.
>
> A few potential downsides:
> * operational overhead: the tables need to be managed consistently and probably in some coordinated way
> * complex reads
> * maybe fragile to enforce correctness (during the join). It is more robust to enforce stitching correctness at the file-group level in the file reader and writer if it is built into the table format.
>
>> fewer commit conflicts
>
> Can you elaborate on this one? Are those tables populated by streaming or batch pipelines?
>
> On Thu, May 29, 2025 at 5:03 PM Bryan Keller <brya...@gmail.com> wrote:
>> Hi everyone,
>>
>> We have been investigating a wide table format internally for a similar use case, i.e. we have wide ML tables with features generated by different pipelines and teams but want a unified view of the data. We are comparing that against separate tables joined together using a shuffle-less join (e.g. a storage-partitioned join), along with a corresponding view.
>>
>> The join/view approach seems to give us much of what we need, with some added benefits like splitting up the metadata, fewer commit conflicts, and the ability to share, nest, and swap "column families". The downsides are that table management is split across multiple tables, it requires engine support for shuffle-less joins for best performance, and even then, scans probably won't be as optimal.
>>
>> I'm curious if anyone has further thoughts on the two?
>>
>> -Bryan
>>
>>> On May 29, 2025, at 8:18 AM, Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>
>>> I received feedback from Alkis regarding their Parquet optimization work. Their internal testing shows promising results for reducing metadata size and improving parsing performance. They plan to formalize a proposal for these Parquet enhancements in the near future.
>>>
>>> Meanwhile, I'm putting together our horizontal sharding proposal as a complementary approach. Even with the Parquet metadata improvements, horizontal sharding would provide additional benefits for:
>>> - More efficient column-level updates
>>> - Streamlined column additions
>>> - Better handling of dominant columns that can cause RowGroup size imbalances (placing these in separate files could significantly improve performance)
>>>
>>> Thanks, Peter
>>>
>>> On Wed, May 28, 2025 at 15:39, Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>> I would be happy to put together a proposal based on the inputs we got here.
>>>>
>>>> Thanks everyone for your thoughts! I will try to incorporate all of this.
>>>>
>>>> Thanks, Peter
>>>>
>>>> On Tue, May 27, 2025 at 20:07, Daniel Weeks <dwe...@apache.org> wrote:
>>>>> I feel like we have two different issues we're talking about here that aren't necessarily tied (though solutions may address both): 1) wide tables, 2) adding columns.
>>>>>
>>>>> Wide tables are definitely a problem where Parquet has limitations. I'm optimistic about the ongoing work to help improve Parquet footers/stats in this area that Fokko mentioned.
>>>>> There are always limitations in how this scales, as wide rows lead to small row groups and the cost to reconstitute a row gets more expensive, but for cases that are read heavy and project subsets of columns, this should significantly improve performance.
>>>>>
>>>>> Adding columns to an existing dataset is something that comes up periodically, but there's a lot of complexity involved in this. Parquet does support referencing columns in separate files per the spec, but there's no implementation that takes advantage of this to my knowledge. This does allow for approaches where you separate/rewrite just the footers, or various other tricks, but these approaches get complicated quickly and the number of readers that can consume those representations would initially be very limited.
>>>>>
>>>>> A larger problem for splitting columns across files is that there are a lot of assumptions about how data is laid out in both readers and writers. For example, aligning row groups and correctly handling split calculation is very complicated if you're trying to split rows across files. Other features are also impacted, like deletes, which reference the file to which they apply: they would need to account for deletes applying to multiple files, and those references would need updating if columns are added.
>>>>>
>>>>> I believe there are a lot of interesting approaches to addressing these use cases, but we'd really need a thorough proposal that explores all of these scenarios. The last thing we would want is to introduce incompatibilities within the format that result in incompatible features.
>>>>>
>>>>> -Dan
>>>>>
>>>>> On Tue, May 27, 2025 at 10:02 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>> Point definitely taken. We really should probably POC some of these ideas and see what we are actually dealing with. (He said, without volunteering to do the work :P)
>>>>>>
>>>>>> On Tue, May 27, 2025 at 11:55 AM Selcuk Aya <selcuk....@snowflake.com.invalid> wrote:
>>>>>>> Yes, having to rewrite the whole file is not ideal, but I believe most of the cost of rewriting a file comes from decompression, encoding, stats calculations, etc. If you are adding new values for some columns but keeping the rest of the columns in the file the same, then a good part of the rewrite cost can be optimized away. I am not saying this is better than writing to a separate file; I am just not sure how much worse it is.
>>>>>>>
>>>>>>> On Tue, May 27, 2025 at 9:40 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>>>> I think that "after the fact" modification is one of the requirements here, i.e. updating a single column without rewriting the whole file. If we have to write new metadata for the file, aren't we in the same boat as having to rewrite the whole file?
>>>>>>>>
>>>>>>>> On Tue, May 27, 2025 at 11:27 AM Selcuk Aya <selcuk....@snowflake.com.invalid> wrote:
>>>>>>>>> If files represent column projections of a table rather than all of the columns in the table, then any read that spans these files needs to identify what constitutes a row.
>>>>>>>>> Lance DB, for example, has vertical partitioning across columns but also horizontal partitioning across rows, such that within each horizontal partition (fragment) the same number of rows exists in each vertical partition, which I think is necessary to make whole/partial row construction cheap. If this is the case, there is no reason not to achieve the same data layout inside a single columnar file with a lean header. I think the only valid argument for a separate file is adding a new set of columns to an existing table, but even then I am not sure a separate file is absolutely necessary for good performance.
>>>>>>>>>
>>>>>>>>> Selcuk
>>>>>>>>>
>>>>>>>>> On Tue, May 27, 2025 at 9:18 AM Devin Smith <devinsm...@deephaven.io.invalid> wrote:
>>>>>>>>>> There's a `file_path` field in the parquet ColumnChunk structure: https://github.com/apache/parquet-format/blob/apache-parquet-format-2.11.0/src/main/thrift/parquet.thrift#L959-L962
>>>>>>>>>>
>>>>>>>>>> I'm not sure what tooling actually supports this, though. It could be interesting to see what the history of it is: https://lists.apache.org/thread/rcv1cxndp113shjybfcldh6nq1t3lcq3, https://lists.apache.org/thread/k5nv310yp315fttcz213l8o0vmnd7vyw
>>>>>>>>>>
>>>>>>>>>> On Tue, May 27, 2025 at 8:59 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>>>>>>> I have to agree that while there can be some fixes in Parquet, we fundamentally need a way to split a "row group" or something like that between separate files. If that's something we can do in the Parquet project, that would be great, but it feels like we need to start exploring more drastic options than footer encoding.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, May 26, 2025 at 8:42 PM Gang Wu <ust...@gmail.com> wrote:
>>>>>>>>>>>> I agree with Steven that there are limitations to what Parquet can do.
>>>>>>>>>>>>
>>>>>>>>>>>> In addition to adding new columns requiring a rewrite of all files, files of wide tables may suffer from bad performance like below:
>>>>>>>>>>>> - Poor compression of row groups, because there are so many columns that even a small number of rows can reach the row group threshold.
>>>>>>>>>>>> - Dominating columns (e.g. blobs) may contribute 99% of the size of a row group, leading to unbalanced column chunks and deteriorating row group compression.
>>>>>>>>>>>> - Similar to adding new columns, partial update also requires rewriting all columns of the affected rows.
>>>>>>>>>>>>
>>>>>>>>>>>> IIRC, some table formats already support splitting columns into different files:
>>>>>>>>>>>> - The Lance manifest splits a fragment [1] into one or more data files.
>>>>>>>>>>>> - Apache Hudi has the concept of column families [2].
>>>>>>>>>>>> - Apache Paimon supports sequence groups [3] for partial update.
>>>>>>>>>>>>
>>>>>>>>>>>> Although Parquet could introduce the concept of logical and physical files to manage the column-to-file mapping, that looks like yet another manifest file design, which duplicates the purpose of Iceberg. These might be something worth exploring in Iceberg.
>>>>>>>>>>>>
>>>>>>>>>>>> [1] https://lancedb.github.io/lance/format.html#fragments
>>>>>>>>>>>> [2] https://github.com/apache/hudi/blob/master/rfc/rfc-80/rfc-80.md
>>>>>>>>>>>> [3] https://paimon.apache.org/docs/0.9/primary-key-table/merge-engine/partial-update/#sequence-group
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Gang
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, May 27, 2025 at 7:03 AM Steven Wu <stevenz...@gmail.com> wrote:
>>>>>>>>>>>>> The Parquet metadata proposal (linked by Fokko) mainly addresses read performance problems caused by bloated metadata.
>>>>>>>>>>>>>
>>>>>>>>>>>>> What Peter described in the description seems useful for some ML feature engineering workloads, where a new set of features/columns is added to the table. Currently, Iceberg would require rewriting all data files to combine old and new columns (write amplification). Similarly, in the past the community also talked about the use case of updating a single column, which would likewise require rewriting all data files.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, May 26, 2025 at 2:42 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>>>>> Do you have the link at hand for the thread where this was discussed on the Parquet list? The docs seem quite old, and the PR stale, so I would like to understand the situation better. If it is possible to do this in Parquet, that would be great, but Avro and ORC would still suffer.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, May 26, 2025 at 22:07, Amogh Jahagirdar <2am...@gmail.com> wrote:
>>>>>>>>>>>>>>> Hey Peter,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for bringing this issue up. I think I agree with Fokko; wide tables leading to Parquet metadata bloat and poor Thrift deserialization performance is a long-standing issue that I believe there's motivation in the community to address. So to me it seems better to address it in Parquet itself rather than have the Iceberg library facilitate a pattern that works around the limitations.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Amogh Jahagirdar
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, May 26, 2025 at 1:42 PM Fokko Driesprong <fo...@apache.org> wrote:
>>>>>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for bringing this up. Wouldn't it make more sense to fix this in Parquet itself? It has been a long-running issue on Parquet, and there is still active interest from the community. There is a PR to replace the footer with FlatBuffers, which dramatically improves performance <https://github.com/apache/arrow/pull/43793>. The underlying proposal can be found here <https://docs.google.com/document/d/1PQpY418LkIDHMFYCY8ne_G-CFpThK15LLpzWYbc7rFU/edit?tab=t.0#heading=h.atbrz9ch6nfa>.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>> Fokko
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, May 26, 2025 at 20:35, yun zou <yunzou.colost...@gmail.com> wrote:
>>>>>>>>>>>>>>>>> +1, I am really interested in this topic. Performance has always been a problem when dealing with wide tables, not just for reads/writes but also during compilation. Most of the ML use cases typically exhibit a vectorized read/write pattern, and I am also wondering if there is any way at the metadata level to help the whole compilation and execution process. I do not have any answer for this yet, but I would be really interested in exploring it further.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>>>>> Yun
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mon, May 26, 2025 at 9:14 AM Pucheng Yang <py...@pinterest.com.invalid> wrote:
>>>>>>>>>>>>>>>>>> Hi Peter, I am interested in this proposal. What's more, I am curious if there is a similar story on the write side as well (how to generate these split files), and specifically, are you targeting feature backfill use cases in ML?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Mon, May 26, 2025 at 6:29 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>> Hi Team,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> In machine learning use cases, it's common to encounter tables with a very high number of columns - sometimes even in the range of several thousand. I've seen cases with up to 15,000 columns. Storing such wide tables in a single Parquet file is often suboptimal, as Parquet can become a bottleneck, even when only a subset of columns is queried.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> A common approach to mitigate this is to split the data across multiple Parquet files. With the upcoming File Format API, we could introduce a layer that combines these files into a single iterator, enabling efficient reading of wide and very wide tables.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> To support this, we would need to revise the metadata specification. Instead of the current `_file` column, we could introduce a `_files` column containing:
>>>>>>>>>>>>>>>>>>> - `_file_column_ids`: the column IDs present in each file
>>>>>>>>>>>>>>>>>>> - `_file_path`: the path to the corresponding file
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Has there been any prior discussion around this idea? Is anyone else interested in exploring this further?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>> Peter
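Peter, to make the reader side of your `_files` idea concrete for myself: once the metadata records which column IDs live in which file, the stitching is essentially a row-aligned horizontal concatenation of the per-file column subsets. A rough pyarrow illustration of that invariant, assuming the split files hold the same rows in the same order (file and column names are made up):

# Illustration only: two Parquet files holding disjoint column subsets of the
# same rows, in the same order; paths and columns are placeholders.
import pyarrow.parquet as pq

base = pq.read_table("part-00000-base.parquet")       # e.g. id plus f1..f1000
extra = pq.read_table("part-00000-features.parquet")  # e.g. f1001..f2000

# Row alignment is the invariant the format/reader would have to guarantee.
assert base.num_rows == extra.num_rows

# Stitch the column subsets back into one "wide" table.
wide = base
for name in extra.column_names:
    wide = wide.append_column(name, extra[name])

print(wide.num_columns, wide.num_rows)

The hard parts a spec would need to pin down are the ones discussed above: keeping row groups/fragments aligned across the files, and how deletes and split planning reference a row that now spans several files.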
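On Devin's note about the `file_path` field in Parquet's ColumnChunk: pyarrow exposes it through its metadata API, so it is easy to check whether any writer in a given pipeline actually populates it (most writers appear to leave it empty). A small sketch, with a placeholder file name:

# Check whether any column chunk points at data stored in a different file.
# "example.parquet" is a placeholder path.
import pyarrow.parquet as pq

meta = pq.ParquetFile("example.parquet").metadata
for rg in range(meta.num_row_groups):
    for col in range(meta.num_columns):
        chunk = meta.row_group(rg).column(col)
        # file_path is empty for almost all writers today; a non-empty value
        # would mean the chunk's data lives outside this file.
        if chunk.file_path:
            print(rg, chunk.path_in_schema, "->", chunk.file_path)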