Bryan, interesting approach to split horizontally across multiple tables. A few potential downsides:
* Operational overhead: the tables need to be managed consistently, and probably in some coordinated way.
* More complex reads.
* Possibly fragile correctness enforcement (during the join). It would be more robust to enforce the stitching correctness at the file-group level in the file reader and writer, if that is built into the table format.
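Just to make sure I'm picturing the same layout, here is a minimal PySpark sketch of the split-table + view setup as I understand it (the `demo` catalog, namespace, table/column names, and bucket width are made up for illustration; it assumes an Iceberg-enabled Spark session, a view-capable catalog, and the usual storage-partitioned-join session configs):

from pyspark.sql import SparkSession

# Assumes an Iceberg catalog named "demo" is already configured for this session.
spark = SparkSession.builder.getOrCreate()

# Two "column families" written by different pipelines, bucketed identically on
# the entity key so engines that support storage-partitioned joins can stitch
# them without a shuffle.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.ml.features_base (
        entity_id BIGINT, f1 DOUBLE, f2 DOUBLE)
    USING iceberg
    PARTITIONED BY (bucket(64, entity_id))
""")
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.ml.features_extra (
        entity_id BIGINT, f3 DOUBLE, f4 DOUBLE)
    USING iceberg
    PARTITIONED BY (bucket(64, entity_id))
""")

# The unified "wide table" is just a view over the join; a TEMP VIEW works too
# if the catalog does not support views. Whether this is an inner or outer join,
# and whether both pipelines wrote the same set of entity_ids, is exactly the
# stitching correctness that has to be enforced at query time here.
spark.sql("""
    CREATE OR REPLACE VIEW demo.ml.features_wide AS
    SELECT b.entity_id, b.f1, b.f2, e.f3, e.f4
    FROM demo.ml.features_base b
    JOIN demo.ml.features_extra e ON b.entity_id = e.entity_id
""")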
> fewer commit conflicts

Can you elaborate on this one? Are those tables populated by streaming or batch pipelines?

On Thu, May 29, 2025 at 5:03 PM Bryan Keller <brya...@gmail.com> wrote:
> Hi everyone, > > We have been investigating a wide table format internally for a similar > use case, i.e. we have wide ML tables with features generated by different > pipelines and teams but want a unified view of the data. We are comparing > that against separate tables joined together using a shuffle-less join > (e.g. storage partition join), along with a corresponding view. > > The join/view approach seems to give us much of we need, with some added > benefits like splitting up the metadata, fewer commit conflicts, and > ability to share, nest, and swap "column families". The downsides are table > management is split across multiple tables, it requires engine support of > shuffle-less joins for best performance, and even then, scans probably > won't be as optimal. > > I'm curious if anyone had further thoughts on the two? > > -Bryan > > > > On May 29, 2025, at 8:18 AM, Péter Váry <peter.vary.apa...@gmail.com> > wrote: > > I received feedback from Alkis regarding their Parquet optimization work. > Their internal testing shows promising results for reducing metadata size > and improving parsing performance. They plan to formalize a proposal for > these Parquet enhancements in the near future. > > Meanwhile, I'm putting together our horizontal sharding proposal as a > complementary approach. Even with the Parquet metadata improvements, > horizontal sharding would provide additional benefits for: > > - More efficient column-level updates > - Streamlined column additions > - Better handling of dominant columns that can cause RowGroup size > imbalances (placing these in separate files could significantly improve > performance) > > Thanks, Peter > > > > Péter Váry <peter.vary.apa...@gmail.com> ezt írta (időpont: 2025. máj. > 28., Sze, 15:39): > >> I would be happy to put together a proposal based on the inputs got here. >> >> Thanks everyone for your thoughts! >> I will try to incorporate all of this. >> >> Thanks, Peter >> >> Daniel Weeks <dwe...@apache.org> ezt írta (időpont: 2025. máj. 27., K, >> 20:07): >> >>> I feel like we have two different issues we're talking about here that >>> aren't necessarily tied (though solutions may address both): 1) wide >>> tables, 2) adding columns >>> >>> Wide tables are definitely a problem where parquet has limitations. I'm >>> optimistic about the ongoing work to help improve parquet footers/stats in >>> this area that Fokko mentioned. There are always limitations in how this >>> scales as wide rows lead to small row groups and the cost to reconstitute a >>> row gets more expensive, but for cases that are read heavy and projecting >>> subsets of columns should significantly improve performance. >>> >>> Adding columns to an existing dataset is something that comes up >>> periodically, but there's a lot of complexity involved in this. Parquet >>> does support referencing columns in separate files per the spec, but >>> there's no implementation that takes advantage of this to my knowledge. >>> This does allow for approaches where you separate/rewrite just the footers >>> or various other tricks, but these approaches get complicated quickly and >>> the number of readers that can consume those representations would >>> initially be very limited.
>>> >>> A larger problem for splitting columns across files is that there are a >>> lot of assumptions about how data is laid out in both readers and writers. >>> For example, aligning row groups and correctly handling split calculation >>> is very complicated if you're trying to split rows across files. Other >>> features are also impacted like deletes, which reference the file to which >>> they apply and would need to account for deletes applying to multiple files >>> and needing to update those references if columns are added. >>> >>> I believe there are a lot of interesting approaches to addressing these >>> use cases, but we'd really need a thorough proposal that explores all of >>> these scenarios. The last thing we would want is to introduce >>> incompatibilities within the format that result in incompatible features. >>> >>> -Dan >>> >>> On Tue, May 27, 2025 at 10:02 AM Russell Spitzer < >>> russell.spit...@gmail.com> wrote: >>> >>>> Point definitely taken. We really should probably POC some of >>>> these ideas and see what we are actually dealing with. (He said without >>>> volunteering to do the work :P) >>>> >>>> On Tue, May 27, 2025 at 11:55 AM Selcuk Aya >>>> <selcuk....@snowflake.com.invalid> wrote: >>>> >>>>> Yes having to rewrite the whole file is not ideal but I believe most >>>>> of the cost of rewriting a file comes from decompression, encoding, stats >>>>> calculations etc. If you are adding new values for some columns but are >>>>> keeping the rest of the columns the same in the file, then a bunch of >>>>> rewrite cost can be optimized away. I am not saying this is better than >>>>> writing to a separate file, I am not sure how much worse it is though. >>>>> >>>>> On Tue, May 27, 2025 at 9:40 AM Russell Spitzer < >>>>> russell.spit...@gmail.com> wrote: >>>>> >>>>>> I think that "after the fact" modification is one of the requirements >>>>>> here, IE: Updating a single column without rewriting the whole file. >>>>>> If we have to write new metadata for the file aren't we in the same >>>>>> boat as having to rewrite the whole file? >>>>>> >>>>>> On Tue, May 27, 2025 at 11:27 AM Selcuk Aya >>>>>> <selcuk....@snowflake.com.invalid> wrote: >>>>>> >>>>>>> If files represent column projections of a table rather than the >>>>>>> whole columns in the table, then any read that reads across these files >>>>>>> needs to identify what constitutes a row. Lance DB for example has >>>>>>> vertical >>>>>>> partitioning across columns but also horizontal partitioning across rows >>>>>>> such that in each horizontal partitioning(fragment), the same number of >>>>>>> rows exist in each vertical partition, which I think is necessary to >>>>>>> make >>>>>>> whole/partial row construction cheap. If this is the case, there is no >>>>>>> reason not to achieve the same data layout inside a single columnar file >>>>>>> with a lean header. I think the only valid argument for a separate file >>>>>>> is >>>>>>> adding a new set of columns to an existing table, but even then I am not >>>>>>> sure a separate file is absolutely necessary for good performance. >>>>>>> >>>>>>> Selcuk >>>>>>> >>>>>>> On Tue, May 27, 2025 at 9:18 AM Devin Smith >>>>>>> <devinsm...@deephaven.io.invalid> wrote: >>>>>>> >>>>>>>> There's a `file_path` field in the parquet ColumnChunk structure, >>>>>>>> https://github.com/apache/parquet-format/blob/apache-parquet-format-2.11.0/src/main/thrift/parquet.thrift#L959-L962 >>>>>>>> >>>>>>>> I'm not sure what tooling actually supports this though. 
Could be >>>>>>>> interesting to see what the history of this is. >>>>>>>> https://lists.apache.org/thread/rcv1cxndp113shjybfcldh6nq1t3lcq3, >>>>>>>> https://lists.apache.org/thread/k5nv310yp315fttcz213l8o0vmnd7vyw >>>>>>>> >>>>>>>> On Tue, May 27, 2025 at 8:59 AM Russell Spitzer < >>>>>>>> russell.spit...@gmail.com> wrote: >>>>>>>> >>>>>>>>> I have to agree that while there can be some fixes in Parquet, we >>>>>>>>> fundamentally need a way to split a "row group" >>>>>>>>> or something like that between separate files. If that's >>>>>>>>> something we can do in the parquet project that would be great >>>>>>>>> but it feels like we need to start exploring more drastic options >>>>>>>>> than footer encoding. >>>>>>>>> >>>>>>>>> On Mon, May 26, 2025 at 8:42 PM Gang Wu <ust...@gmail.com> wrote: >>>>>>>>> >>>>>>>>>> I agree with Steven that there are limitations that Parquet >>>>>>>>>> cannot do. >>>>>>>>>> >>>>>>>>>> In addition to adding new columns by rewriting all files, files >>>>>>>>>> of wide tables may suffer from bad performance like below: >>>>>>>>>> - Poor compression of row groups because there are too many >>>>>>>>>> columns and even a small number of rows can reach the row group >>>>>>>>>> threshold. >>>>>>>>>> - Dominating columns (e.g. blobs) may contribute to 99% size of a >>>>>>>>>> row group, leading to unbalanced column chunks and deteriorate the >>>>>>>>>> row >>>>>>>>>> group compression. >>>>>>>>>> - Similar to adding new columns, partial update also requires >>>>>>>>>> rewriting all columns of the affected rows. >>>>>>>>>> >>>>>>>>>> IIRC, some table formats already support splitting columns into >>>>>>>>>> different files: >>>>>>>>>> - Lance manifest splits a fragment [1] into one or more data >>>>>>>>>> files. >>>>>>>>>> - Apache Hudi has the concept of column family [2]. >>>>>>>>>> - Apache Paimon supports sequence groups [3] for partial update. >>>>>>>>>> >>>>>>>>>> Although Parquet can introduce the concept of logical file and >>>>>>>>>> physical file to manage the columns to file mapping, this looks like >>>>>>>>>> yet >>>>>>>>>> another manifest file design which duplicates the purpose of Iceberg. >>>>>>>>>> These might be something worth exploring in Iceberg. >>>>>>>>>> >>>>>>>>>> [1] https://lancedb.github.io/lance/format.html#fragments >>>>>>>>>> [2] >>>>>>>>>> https://github.com/apache/hudi/blob/master/rfc/rfc-80/rfc-80.md >>>>>>>>>> [3] >>>>>>>>>> https://paimon.apache.org/docs/0.9/primary-key-table/merge-engine/partial-update/#sequence-group >>>>>>>>>> >>>>>>>>>> Best, >>>>>>>>>> Gang >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Tue, May 27, 2025 at 7:03 AM Steven Wu <stevenz...@gmail.com> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> The Parquet metadata proposal (linked by Fokko) is mainly >>>>>>>>>>> addressing the read performance due to bloated metadata. >>>>>>>>>>> >>>>>>>>>>> What Peter described in the description seems useful for some ML >>>>>>>>>>> workload of feature engineering. A new set of features/columns are >>>>>>>>>>> added to >>>>>>>>>>> the table. Currently, Iceberg would require rewriting all data >>>>>>>>>>> files to >>>>>>>>>>> combine old and new columns (write amplification). Similarly, in >>>>>>>>>>> the past >>>>>>>>>>> the community also talked about the use cases of updating a single >>>>>>>>>>> column, >>>>>>>>>>> which would require rewriting all data files. 
>>>>>>>>>>> >>>>>>>>>>> On Mon, May 26, 2025 at 2:42 PM Péter Váry < >>>>>>>>>>> peter.vary.apa...@gmail.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> Do you have the link at hand for the thread where this was >>>>>>>>>>>> discussed on the Parquet list? >>>>>>>>>>>> The docs seem quite old, and the PR stale, so I would like to >>>>>>>>>>>> understand the situation better. >>>>>>>>>>>> If it is possible to do this in Parquet, that would be great, >>>>>>>>>>>> but Avro, ORC would still suffer. >>>>>>>>>>>> >>>>>>>>>>>> Amogh Jahagirdar <2am...@gmail.com> ezt írta (időpont: 2025. >>>>>>>>>>>> máj. 26., H, 22:07): >>>>>>>>>>>> >>>>>>>>>>>>> Hey Peter, >>>>>>>>>>>>> >>>>>>>>>>>>> Thanks for bringing this issue up. I think I agree with Fokko; >>>>>>>>>>>>> the issue of wide tables leading to Parquet metadata bloat and >>>>>>>>>>>>> poor Thrift >>>>>>>>>>>>> deserialization performance is a long standing issue that I >>>>>>>>>>>>> believe there's >>>>>>>>>>>>> motivation in the community to address. So to me it seems better >>>>>>>>>>>>> to address >>>>>>>>>>>>> it in Parquet itself rather than Iceberg library facilitate a >>>>>>>>>>>>> pattern which >>>>>>>>>>>>> works around the limitations. >>>>>>>>>>>>> >>>>>>>>>>>>> Thanks, >>>>>>>>>>>>> Amogh Jahagirdar >>>>>>>>>>>>> >>>>>>>>>>>>> On Mon, May 26, 2025 at 1:42 PM Fokko Driesprong < >>>>>>>>>>>>> fo...@apache.org> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Hi Peter, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks for bringing this up. Wouldn't it make more sense to >>>>>>>>>>>>>> fix this in Parquet itself? It has been a long-running issue on >>>>>>>>>>>>>> Parquet, >>>>>>>>>>>>>> and there is still active interest from the community. There is >>>>>>>>>>>>>> a PR to >>>>>>>>>>>>>> replace the footer with FlatBuffers, which dramatically >>>>>>>>>>>>>> improves performance >>>>>>>>>>>>>> <https://github.com/apache/arrow/pull/43793>. The underlying >>>>>>>>>>>>>> proposal can be found here >>>>>>>>>>>>>> <https://docs.google.com/document/d/1PQpY418LkIDHMFYCY8ne_G-CFpThK15LLpzWYbc7rFU/edit?tab=t.0#heading=h.atbrz9ch6nfa> >>>>>>>>>>>>>> . >>>>>>>>>>>>>> >>>>>>>>>>>>>> Kind regards, >>>>>>>>>>>>>> Fokko >>>>>>>>>>>>>> >>>>>>>>>>>>>> Op ma 26 mei 2025 om 20:35 schreef yun zou < >>>>>>>>>>>>>> yunzou.colost...@gmail.com>: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> +1, I am really interested in this topic. Performance has >>>>>>>>>>>>>>> always been a problem when dealing with wide tables, not just >>>>>>>>>>>>>>> read/write, >>>>>>>>>>>>>>> but also during compilation. Most of the ML use cases typically >>>>>>>>>>>>>>> exhibit a >>>>>>>>>>>>>>> vectorized read/write pattern, I am also wondering if there is >>>>>>>>>>>>>>> any way at >>>>>>>>>>>>>>> the metadata level to help the whole compilation and execution >>>>>>>>>>>>>>> process. I >>>>>>>>>>>>>>> do not have any answer fo this yet, but I would be really >>>>>>>>>>>>>>> interested in >>>>>>>>>>>>>>> exploring this further. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Best Regards, >>>>>>>>>>>>>>> Yun >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Mon, May 26, 2025 at 9:14 AM Pucheng Yang >>>>>>>>>>>>>>> <py...@pinterest.com.invalid> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Hi Peter, I am interested in this proposal. What's more, I >>>>>>>>>>>>>>>> am curious if there is a similar story on the write side as >>>>>>>>>>>>>>>> well (how to >>>>>>>>>>>>>>>> generate these splitted files) and specifically, are you >>>>>>>>>>>>>>>> targeting feature >>>>>>>>>>>>>>>> backfill use cases in ML use? 
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Mon, May 26, 2025 at 6:29 AM Péter Váry < >>>>>>>>>>>>>>>> peter.vary.apa...@gmail.com> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Hi Team, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> In machine learning use-cases, it's common to encounter >>>>>>>>>>>>>>>>> tables with a very high number of columns - sometimes even in >>>>>>>>>>>>>>>>> the range of >>>>>>>>>>>>>>>>> several thousand. I've seen cases with up to 15,000 columns. >>>>>>>>>>>>>>>>> Storing such >>>>>>>>>>>>>>>>> wide tables in a single Parquet file is often suboptimal, as >>>>>>>>>>>>>>>>> Parquet can >>>>>>>>>>>>>>>>> become a bottleneck, even when only a subset of columns is >>>>>>>>>>>>>>>>> queried. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> A common approach to mitigate this is to split the data >>>>>>>>>>>>>>>>> across multiple Parquet files. With the upcoming File Format >>>>>>>>>>>>>>>>> API, we could >>>>>>>>>>>>>>>>> introduce a layer that combines these files into a single >>>>>>>>>>>>>>>>> iterator, >>>>>>>>>>>>>>>>> enabling efficient reading of wide and very wide tables. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> To support this, we would need to revise the metadata >>>>>>>>>>>>>>>>> specification. Instead of the current `_file` column, we >>>>>>>>>>>>>>>>> could introduce a >>>>>>>>>>>>>>>>> _files column containing: >>>>>>>>>>>>>>>>> - `_file_column_ids`: the column IDs present in each file >>>>>>>>>>>>>>>>> - `_file_path`: the path to the corresponding file >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Has there been any prior discussion around this idea? >>>>>>>>>>>>>>>>> Is anyone else interested in exploring this further? >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Best regards, >>>>>>>>>>>>>>>>> Peter >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >
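P.S. For concreteness, a rough sketch of what a single entry of the `_files` structure Peter describes might carry, reusing the two field names from his mail; the Python types and the example values are my guesses, not part of any proposal.

from dataclasses import dataclass
from typing import List

@dataclass
class FileSlice:
    # Path of the physical file holding this subset of columns.
    _file_path: str
    # Iceberg field IDs of the columns stored in that file.
    _file_column_ids: List[int]

# A logical data file of a wide table would then be a list of slices whose
# column-id sets are disjoint and together cover the table schema, e.g.:
wide_file = [
    FileSlice("s3://bucket/tbl/data/part-00000-a.parquet", [1, 2, 3]),
    FileSlice("s3://bucket/tbl/data/part-00000-b.parquet", [4, 5]),
]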