IMO, the main drawback of the view solution is the complexity of maintaining consistency across the tables if we want to use features like time travel, incremental scan, branch & tag, encryption, etc.
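For concreteness, the join/view layout under discussion might look roughly like the following PySpark sketch. It assumes an Iceberg catalog named `demo`; all namespace, table, view, and column names are hypothetical, and persisting the view depends on the catalog's view support.

from pyspark.sql import SparkSession

# A sketch, not a recipe: two Iceberg tables acting as "column families",
# bucketed the same way so engines that implement storage-partitioned joins
# can stitch rows back together without a shuffle. Catalog, namespace, table,
# and column names are all made up.
spark = SparkSession.builder.getOrCreate()

spark.sql("CREATE NAMESPACE IF NOT EXISTS demo.ml")

spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.ml.features_base (
        entity_id BIGINT, f1 DOUBLE, f2 DOUBLE)
    USING iceberg
    PARTITIONED BY (bucket(128, entity_id))
""")

spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.ml.features_embeddings (
        entity_id BIGINT, emb ARRAY<FLOAT>)
    USING iceberg
    PARTITIONED BY (bucket(128, entity_id))
""")

# The "wide table" is just a view over the join; each column-family table is
# committed to independently. Persisting the view needs a catalog with view
# support; a temporary view behaves the same way within a single session.
spark.sql("""
    CREATE VIEW IF NOT EXISTS demo.ml.features_wide AS
    SELECT b.entity_id, b.f1, b.f2, e.emb
    FROM demo.ml.features_base b
    JOIN demo.ml.features_embeddings e ON b.entity_id = e.entity_id
""")

Each column-family table can then be written by its own pipeline, which is where the fewer-commit-conflicts argument below comes from; the consistency concern above applies to exactly this kind of layout.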
On Fri, May 30, 2025 at 12:55 PM Bryan Keller <brya...@gmail.com> wrote:

> Fewer commit conflicts, meaning the tables representing column families are updated independently, rather than having to serialize commits to a single table. Perhaps with a wide table solution the commit logic could be enhanced to support things like concurrent overwrites to independent column families, but it seems like it would be fairly involved.
>
> On May 29, 2025, at 7:16 PM, Steven Wu <stevenz...@gmail.com> wrote:
>
> Bryan, interesting approach to split horizontally across multiple tables.
>
> A few potential downsides:
> * operational overhead: tables need to be managed consistently and probably in some coordinated way
> * complex reads
> * it may be fragile to enforce correctness (during the join). It is more robust to enforce stitching correctness at the file-group level in the file reader and writer if it is built into the table format.
>
>> fewer commit conflicts
>
> Can you elaborate on this one? Are those tables populated by streaming or batch pipelines?
>
> On Thu, May 29, 2025 at 5:03 PM Bryan Keller <brya...@gmail.com> wrote:
>
>> Hi everyone,
>>
>> We have been investigating a wide table format internally for a similar use case, i.e. we have wide ML tables with features generated by different pipelines and teams but want a unified view of the data. We are comparing that against separate tables joined together using a shuffle-less join (e.g. storage-partitioned join), along with a corresponding view.
>>
>> The join/view approach seems to give us much of what we need, with some added benefits like splitting up the metadata, fewer commit conflicts, and the ability to share, nest, and swap "column families". The downsides are that table management is split across multiple tables, it requires engine support for shuffle-less joins for best performance, and even then, scans probably won't be as optimal.
>>
>> I'm curious if anyone has further thoughts on the two?
>>
>> -Bryan
>>
>> On May 29, 2025, at 8:18 AM, Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>
>> I received feedback from Alkis regarding their Parquet optimization work. Their internal testing shows promising results for reducing metadata size and improving parsing performance. They plan to formalize a proposal for these Parquet enhancements in the near future.
>>
>> Meanwhile, I'm putting together our horizontal sharding proposal as a complementary approach. Even with the Parquet metadata improvements, horizontal sharding would provide additional benefits for:
>>
>> - More efficient column-level updates
>> - Streamlined column additions
>> - Better handling of dominant columns that can cause RowGroup size imbalances (placing these in separate files could significantly improve performance)
>>
>> Thanks, Peter
>>
>> Péter Váry <peter.vary.apa...@gmail.com> wrote on Wed, May 28, 2025 at 15:39:
>>
>>> I would be happy to put together a proposal based on the input we got here.
>>>
>>> Thanks everyone for your thoughts! I will try to incorporate all of this.
>>>
>>> Thanks, Peter
>>>
>>> Daniel Weeks <dwe...@apache.org> wrote on Tue, May 27, 2025 at 20:07:
>>>
>>>> I feel like we have two different issues we're talking about here that aren't necessarily tied (though solutions may address both): 1) wide tables, 2) adding columns.
>>>>
>>>> Wide tables are definitely a problem where Parquet has limitations. I'm optimistic about the ongoing work that Fokko mentioned to help improve Parquet footers/stats in this area. There are always limits to how this scales, since wide rows lead to small row groups and the cost to reconstitute a row gets more expensive, but for cases that are read-heavy and project subsets of columns it should significantly improve performance.
>>>>
>>>> Adding columns to an existing dataset is something that comes up periodically, but there's a lot of complexity involved in this. Parquet does support referencing columns in separate files per the spec, but there's no implementation that takes advantage of this to my knowledge. This does allow for approaches where you separate/rewrite just the footers or various other tricks, but these approaches get complicated quickly and the number of readers that can consume those representations would initially be very limited.
>>>>
>>>> A larger problem for splitting columns across files is that there are a lot of assumptions about how data is laid out in both readers and writers. For example, aligning row groups and correctly handling split calculation is very complicated if you're trying to split rows across files. Other features are also impacted, like deletes, which reference the file to which they apply and would need to account for deletes applying to multiple files and needing to update those references if columns are added.
>>>>
>>>> I believe there are a lot of interesting approaches to addressing these use cases, but we'd really need a thorough proposal that explores all of these scenarios. The last thing we would want is to introduce incompatibilities within the format that result in incompatible features.
>>>>
>>>> -Dan
>>>>
>>>> On Tue, May 27, 2025 at 10:02 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>
>>>>> Point definitely taken. We really should probably POC some of these ideas and see what we are actually dealing with. (He said without volunteering to do the work :P)
>>>>>
>>>>> On Tue, May 27, 2025 at 11:55 AM Selcuk Aya <selcuk....@snowflake.com.invalid> wrote:
>>>>>
>>>>>> Yes, having to rewrite the whole file is not ideal, but I believe most of the cost of rewriting a file comes from decompression, encoding, stats calculations, etc. If you are adding new values for some columns but are keeping the rest of the columns the same in the file, then a lot of the rewrite cost can be optimized away. I am not saying this is better than writing to a separate file; I am just not sure how much worse it is.
>>>>>>
>>>>>> On Tue, May 27, 2025 at 9:40 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>>
>>>>>>> I think that "after the fact" modification is one of the requirements here, i.e. updating a single column without rewriting the whole file. If we have to write new metadata for the file, aren't we in the same boat as having to rewrite the whole file?
>>>>>>>
>>>>>>> On Tue, May 27, 2025 at 11:27 AM Selcuk Aya <selcuk....@snowflake.com.invalid> wrote:
>>>>>>>
>>>>>>>> If files represent column projections of a table rather than all of the columns in the table, then any read that spans these files needs to identify what constitutes a row. LanceDB, for example, has vertical partitioning across columns but also horizontal partitioning across rows, such that within each horizontal partition (fragment) the same number of rows exists in each vertical partition, which I think is necessary to make whole/partial row construction cheap. If this is the case, there is no reason not to achieve the same data layout inside a single columnar file with a lean header. I think the only valid argument for a separate file is adding a new set of columns to an existing table, but even then I am not sure a separate file is absolutely necessary for good performance.
>>>>>>>>
>>>>>>>> Selcuk
>>>>>>>>
>>>>>>>> On Tue, May 27, 2025 at 9:18 AM Devin Smith <devinsm...@deephaven.io.invalid> wrote:
>>>>>>>>
>>>>>>>>> There's a `file_path` field in the Parquet ColumnChunk structure: https://github.com/apache/parquet-format/blob/apache-parquet-format-2.11.0/src/main/thrift/parquet.thrift#L959-L962
>>>>>>>>>
>>>>>>>>> I'm not sure what tooling actually supports this, though. It could be interesting to see what the history of this is: https://lists.apache.org/thread/rcv1cxndp113shjybfcldh6nq1t3lcq3, https://lists.apache.org/thread/k5nv310yp315fttcz213l8o0vmnd7vyw
>>>>>>>>>
>>>>>>>>> On Tue, May 27, 2025 at 8:59 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> I have to agree that while there can be some fixes in Parquet, we fundamentally need a way to split a "row group" or something like that between separate files. If that's something we can do in the Parquet project, that would be great, but it feels like we need to start exploring more drastic options than footer encoding.
>>>>>>>>>>
>>>>>>>>>> On Mon, May 26, 2025 at 8:42 PM Gang Wu <ust...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I agree with Steven that there are limitations to what Parquet can do.
>>>>>>>>>>>
>>>>>>>>>>> In addition to adding new columns requiring a rewrite of all files, files of wide tables may suffer from poor performance like below:
>>>>>>>>>>> - Poor compression of row groups, because there are so many columns that even a small number of rows can reach the row group threshold.
>>>>>>>>>>> - Dominating columns (e.g. blobs) may contribute 99% of the size of a row group, leading to unbalanced column chunks and deteriorating row group compression.
>>>>>>>>>>> - Similar to adding new columns, a partial update also requires rewriting all columns of the affected rows.
>>>>>>>>>>>
>>>>>>>>>>> IIRC, some table formats already support splitting columns into different files:
>>>>>>>>>>> - The Lance manifest splits a fragment [1] into one or more data files.
>>>>>>>>>>> - Apache Hudi has the concept of a column family [2].
>>>>>>>>>>> - Apache Paimon supports sequence groups [3] for partial update.
>>>>>>>>>>>
>>>>>>>>>>> Although Parquet could introduce the concept of logical files and physical files to manage the column-to-file mapping, this looks like yet another manifest file design, which duplicates the purpose of Iceberg. These might be something worth exploring in Iceberg.
>>>>>>>>>>>
>>>>>>>>>>> [1] https://lancedb.github.io/lance/format.html#fragments
>>>>>>>>>>> [2] https://github.com/apache/hudi/blob/master/rfc/rfc-80/rfc-80.md
>>>>>>>>>>> [3] https://paimon.apache.org/docs/0.9/primary-key-table/merge-engine/partial-update/#sequence-group
>>>>>>>>>>>
>>>>>>>>>>> Best, Gang
>>>>>>>>>>>
>>>>>>>>>>> On Tue, May 27, 2025 at 7:03 AM Steven Wu <stevenz...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> The Parquet metadata proposal (linked by Fokko) mainly addresses read performance problems caused by bloated metadata.
>>>>>>>>>>>>
>>>>>>>>>>>> What Peter described in the description seems useful for some ML feature-engineering workloads. A new set of features/columns is added to the table. Currently, Iceberg would require rewriting all data files to combine the old and new columns (write amplification). Similarly, in the past the community has also talked about the use case of updating a single column, which would likewise require rewriting all data files.
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, May 26, 2025 at 2:42 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Do you have the link at hand for the thread where this was discussed on the Parquet list? The docs seem quite old, and the PR stale, so I would like to understand the situation better. If it is possible to do this in Parquet, that would be great, but Avro and ORC would still suffer.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Amogh Jahagirdar <2am...@gmail.com> wrote on Mon, May 26, 2025 at 22:07:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hey Peter,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for bringing this issue up. I think I agree with Fokko; wide tables leading to Parquet metadata bloat and poor Thrift deserialization performance is a long-standing issue that I believe the community has motivation to address. So to me it seems better to address it in Parquet itself rather than have the Iceberg library facilitate a pattern that works around the limitations.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks, Amogh Jahagirdar
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, May 26, 2025 at 1:42 PM Fokko Driesprong <fo...@apache.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for bringing this up. Wouldn't it make more sense to fix this in Parquet itself? It has been a long-running issue on Parquet, and there is still active interest from the community. There is a PR to replace the footer with FlatBuffers, which dramatically improves performance <https://github.com/apache/arrow/pull/43793>. The underlying proposal can be found here <https://docs.google.com/document/d/1PQpY418LkIDHMFYCY8ne_G-CFpThK15LLpzWYbc7rFU/edit?tab=t.0#heading=h.atbrz9ch6nfa>.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Kind regards, Fokko
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, May 26, 2025 at 20:35, yun zou <yunzou.colost...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> +1, I am really interested in this topic. Performance has always been a problem when dealing with wide tables, not just for reads/writes but also during query compilation. Most ML use cases typically exhibit a vectorized read/write pattern; I am also wondering if there is any way at the metadata level to help the whole compilation and execution process. I do not have any answers for this yet, but I would be really interested in exploring it further.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best Regards, Yun
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, May 26, 2025 at 9:14 AM Pucheng Yang <py...@pinterest.com.invalid> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Peter, I am interested in this proposal. What's more, I am curious whether there is a similar story on the write side as well (how to generate these split files), and specifically, are you targeting feature backfill use cases in ML?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mon, May 26, 2025 at 6:29 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Team,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> In machine learning use cases, it's common to encounter tables with a very high number of columns - sometimes in the range of several thousand. I've seen cases with up to 15,000 columns. Storing such wide tables in a single Parquet file is often suboptimal, as Parquet can become a bottleneck, even when only a subset of columns is queried.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> A common approach to mitigate this is to split the data across multiple Parquet files. With the upcoming File Format API, we could introduce a layer that combines these files into a single iterator, enabling efficient reading of wide and very wide tables.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> To support this, we would need to revise the metadata specification. Instead of the current `_file` column, we could introduce a `_files` column containing:
>>>>>>>>>>>>>>>>>> - `_file_column_ids`: the column IDs present in each file
>>>>>>>>>>>>>>>>>> - `_file_path`: the path to the corresponding file
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Has there been any prior discussion around this idea? Is anyone else interested in exploring this further?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Best regards, Peter
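To make the metadata change in the proposal above concrete, a single logical data-file entry carrying the proposed `_files` column might look something like the Python strawman below. Only the `_file_path` and `_file_column_ids` names come from the proposal; every other field, value, and the helper function are made-up illustrations, not spec.

# Illustrative sketch only: one logical data file split into two physical
# Parquet files, described by the proposed `_files` structure.
logical_data_file = {
    "record_count": 1_000_000,
    "_files": [
        {
            "_file_path": "s3://bucket/table/data/part-00000-base.parquet",
            "_file_column_ids": list(range(1, 501)),    # original feature columns
        },
        {
            "_file_path": "s3://bucket/table/data/part-00000-backfill.parquet",
            "_file_column_ids": list(range(501, 521)),  # newly added features
        },
    ],
}

def files_for_projection(entry, projected_ids):
    """Pick the physical files a reader would need to stitch the projected columns."""
    return [
        f["_file_path"]
        for f in entry["_files"]
        if set(projected_ids) & set(f["_file_column_ids"])
    ]

# A query touching only the backfilled columns would open a single physical file.
print(files_for_projection(logical_data_file, projected_ids=[510, 515]))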
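On the Parquet side, the ColumnChunk `file_path` field that Devin points to earlier in the thread can be inspected today through pyarrow's metadata API; a small check along these lines (the file name is hypothetical) suggests that files produced by common writers typically leave it unset, since the chunks live in the same file as the footer.

import pyarrow.parquet as pq

# Walk every column chunk and print its file_path, which is normally empty/None
# because the chunk data sits in the same file as the metadata.
md = pq.ParquetFile("part-00000.parquet").metadata  # hypothetical local file
for rg in range(md.num_row_groups):
    for col in range(md.num_columns):
        chunk = md.row_group(rg).column(col)
        print(rg, chunk.path_in_schema, repr(chunk.file_path))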