I would be happy to put together a proposal based on the input gathered here. Thanks everyone for your thoughts! I will try to incorporate all of this.
Thanks, Peter

Daniel Weeks <dwe...@apache.org> wrote (on Tue, May 27, 2025, 20:07):

> I feel like we have two different issues we're talking about here that aren't necessarily tied (though solutions may address both): 1) wide tables, 2) adding columns
>
> Wide tables are definitely a problem where Parquet has limitations. I'm optimistic about the ongoing work to help improve Parquet footers/stats in this area that Fokko mentioned. There are always limits to how this scales, as wide rows lead to small row groups and the cost to reconstitute a row gets more expensive, but for cases that are read heavy and project subsets of columns it should significantly improve performance.
>
> Adding columns to an existing dataset is something that comes up periodically, but there's a lot of complexity involved. Parquet does support referencing columns in separate files per the spec, but there's no implementation that takes advantage of this to my knowledge. This does allow for approaches where you separate/rewrite just the footers, or various other tricks, but these approaches get complicated quickly and the number of readers that could consume those representations would initially be very limited.
>
> A larger problem for splitting columns across files is that there are a lot of assumptions about how data is laid out in both readers and writers. For example, aligning row groups and correctly handling split calculation is very complicated if you're trying to split rows across files. Other features are also impacted, like deletes, which reference the file to which they apply and would need to account for deletes applying to multiple files, with those references updated if columns are added.
>
> I believe there are a lot of interesting approaches to addressing these use cases, but we'd really need a thorough proposal that explores all of these scenarios. The last thing we would want is to introduce incompatibilities within the format that result in incompatible features.
>
> -Dan
>
> On Tue, May 27, 2025 at 10:02 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>
>> Point definitely taken. We really should probably POC some of these ideas and see what we are actually dealing with. (He said without volunteering to do the work :P)
>>
>> On Tue, May 27, 2025 at 11:55 AM Selcuk Aya <selcuk....@snowflake.com.invalid> wrote:
>>
>>> Yes, having to rewrite the whole file is not ideal, but I believe most of the cost of rewriting a file comes from decompression, encoding, stats calculation, etc. If you are adding new values for some columns but keeping the rest of the columns in the file the same, then a lot of that rewrite cost can be optimized away. I am not saying this is better than writing to a separate file; I am just not sure how much worse it is.
>>>
>>> On Tue, May 27, 2025 at 9:40 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>
>>>> I think that "after the fact" modification is one of the requirements here, i.e. updating a single column without rewriting the whole file. If we have to write new metadata for the file, aren't we in the same boat as having to rewrite the whole file?
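To make the write amplification discussed above concrete, here is a minimal sketch (using pyarrow; the library choice, paths, and column names are illustrative and not taken from the thread) of how adding one column to a wide table currently means decoding, re-encoding, and rewriting every existing column chunk, while a read that projects a few columns stays cheap:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Hypothetical wide feature table: 2,000 columns, a few rows.
    n_cols = 2_000
    table = pa.table({f"feature_{i:04d}": [1.0, 2.0, 3.0] for i in range(n_cols)})
    pq.write_table(table, "/tmp/wide_features.parquet")

    # A read-heavy query that projects a small subset only fetches those column chunks.
    subset = pq.read_table("/tmp/wide_features.parquet",
                           columns=["feature_0001", "feature_0002"])

    # Adding a single new feature, however, currently means reading every column,
    # decoding/re-encoding all of them, and writing a brand-new file.
    full = pq.read_table("/tmp/wide_features.parquet")
    full = full.append_column("feature_new", pa.array([4.0, 5.0, 6.0]))
    pq.write_table(full, "/tmp/wide_features_v2.parquet")

Selcuk's suggestion above is that the decode/encode work for the untouched columns could largely be skipped, though their bytes still end up being copied into a new file.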
>>>> On Tue, May 27, 2025 at 11:27 AM Selcuk Aya <selcuk....@snowflake.com.invalid> wrote:
>>>>
>>>>> If files represent column projections of a table rather than the full set of columns in the table, then any read that spans these files needs to identify what constitutes a row. LanceDB, for example, has vertical partitioning across columns but also horizontal partitioning across rows, such that within each horizontal partition (fragment) the same number of rows exists in each vertical partition, which I think is necessary to make whole/partial row construction cheap. If this is the case, there is no reason not to achieve the same data layout inside a single columnar file with a lean header. I think the only valid argument for a separate file is adding a new set of columns to an existing table, but even then I am not sure a separate file is absolutely necessary for good performance.
>>>>>
>>>>> Selcuk
>>>>>
>>>>> On Tue, May 27, 2025 at 9:18 AM Devin Smith <devinsm...@deephaven.io.invalid> wrote:
>>>>>
>>>>>> There's a `file_path` field in the Parquet ColumnChunk structure: https://github.com/apache/parquet-format/blob/apache-parquet-format-2.11.0/src/main/thrift/parquet.thrift#L959-L962
>>>>>>
>>>>>> I'm not sure what tooling actually supports this, though. It could be interesting to see what the history of this is: https://lists.apache.org/thread/rcv1cxndp113shjybfcldh6nq1t3lcq3, https://lists.apache.org/thread/k5nv310yp315fttcz213l8o0vmnd7vyw
>>>>>>
>>>>>> On Tue, May 27, 2025 at 8:59 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>>
>>>>>>> I have to agree that while there can be some fixes in Parquet, we fundamentally need a way to split a "row group" or something like that between separate files. If that's something we can do in the Parquet project, that would be great, but it feels like we need to start exploring more drastic options than footer encoding.
>>>>>>>
>>>>>>> On Mon, May 26, 2025 at 8:42 PM Gang Wu <ust...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I agree with Steven that there are limits to what Parquet can do.
>>>>>>>>
>>>>>>>> In addition to adding new columns requiring a rewrite of all files, files of wide tables may suffer from poor performance in other ways:
>>>>>>>> - Poor compression of row groups, because there are so many columns that even a small number of rows can reach the row group threshold.
>>>>>>>> - Dominating columns (e.g. blobs) may contribute 99% of the size of a row group, leading to unbalanced column chunks and further deteriorating row group compression.
>>>>>>>> - Similar to adding new columns, partial updates also require rewriting all columns of the affected rows.
>>>>>>>>
>>>>>>>> IIRC, some table formats already support splitting columns into different files:
>>>>>>>> - A Lance manifest splits a fragment [1] into one or more data files.
>>>>>>>> - Apache Hudi has the concept of column families [2].
>>>>>>>> - Apache Paimon supports sequence groups [3] for partial updates.
>>>>>>>>
>>>>>>>> Although Parquet could introduce the concept of logical and physical files to manage the column-to-file mapping, this looks like yet another manifest file design, which duplicates the purpose of Iceberg. These might be something worth exploring in Iceberg.
>>>>>>>>
>>>>>>>> [1] https://lancedb.github.io/lance/format.html#fragments
>>>>>>>> [2] https://github.com/apache/hudi/blob/master/rfc/rfc-80/rfc-80.md
>>>>>>>> [3] https://paimon.apache.org/docs/0.9/primary-key-table/merge-engine/partial-update/#sequence-group
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Gang
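As a concrete look at the `file_path` field Devin points to above, here is a small sketch (assuming pyarrow, which is not mentioned in the thread) that prints the field for every column chunk. Mainstream writers leave it unset, which matches Dan's note that no implementation takes advantage of it to his knowledge:

    import pyarrow as pa
    import pyarrow.parquet as pq

    pq.write_table(pa.table({"a": [1, 2], "b": ["x", "y"]}), "/tmp/demo.parquet")

    meta = pq.read_metadata("/tmp/demo.parquet")
    for rg in range(meta.num_row_groups):
        for col in range(meta.num_columns):
            chunk = meta.row_group(rg).column(col)
            # The spec allows file_path to point at a different physical file,
            # but writers in wide use leave it unset, so this typically prints ''.
            print(chunk.path_in_schema, repr(chunk.file_path))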
>>>>>>>> On Tue, May 27, 2025 at 7:03 AM Steven Wu <stevenz...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> The Parquet metadata proposal (linked by Fokko) mainly addresses the read performance issues caused by bloated metadata.
>>>>>>>>>
>>>>>>>>> What Peter described in the description seems useful for some ML feature-engineering workloads: a new set of features/columns is added to the table, and currently Iceberg would require rewriting all data files to combine the old and new columns (write amplification). Similarly, in the past the community also talked about the use case of updating a single column, which would likewise require rewriting all data files.
>>>>>>>>>
>>>>>>>>> On Mon, May 26, 2025 at 2:42 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Do you have the link at hand for the thread where this was discussed on the Parquet list? The docs seem quite old, and the PR stale, so I would like to understand the situation better. If it is possible to do this in Parquet, that would be great, but Avro and ORC would still suffer.
>>>>>>>>>>
>>>>>>>>>> Amogh Jahagirdar <2am...@gmail.com> wrote (on Mon, May 26, 2025, 22:07):
>>>>>>>>>>
>>>>>>>>>>> Hey Peter,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for bringing this issue up. I think I agree with Fokko; wide tables leading to Parquet metadata bloat and poor Thrift deserialization performance is a long-standing issue, and I believe there's motivation in the community to address it. So to me it seems better to address it in Parquet itself rather than have the Iceberg library facilitate a pattern that works around the limitations.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Amogh Jahagirdar
>>>>>>>>>>>
>>>>>>>>>>> On Mon, May 26, 2025 at 1:42 PM Fokko Driesprong <fo...@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for bringing this up. Wouldn't it make more sense to fix this in Parquet itself? It has been a long-running issue in Parquet, and there is still active interest from the community. There is a PR to replace the footer with FlatBuffers, which dramatically improves performance <https://github.com/apache/arrow/pull/43793>. The underlying proposal can be found here <https://docs.google.com/document/d/1PQpY418LkIDHMFYCY8ne_G-CFpThK15LLpzWYbc7rFU/edit?tab=t.0#heading=h.atbrz9ch6nfa>.
>>>>>>>>>>>>
>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>> Fokko
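To put rough numbers on the metadata bloat discussed above, here is a small sketch (pyarrow again; the column counts and paths are arbitrary) that writes progressively wider files and measures the Thrift footer size and decode time, a cost paid in full on every file open regardless of how few columns a query projects:

    import time
    import pyarrow as pa
    import pyarrow.parquet as pq

    for n_cols in (100, 1_000, 5_000):
        path = f"/tmp/wide_{n_cols}.parquet"
        pq.write_table(pa.table({f"c{i}": [0] for i in range(n_cols)}), path)

        start = time.perf_counter()
        meta = pq.read_metadata(path)  # decodes the entire Thrift footer
        elapsed_ms = (time.perf_counter() - start) * 1000

        print(f"{n_cols:>5} columns: footer {meta.serialized_size:>9} bytes, "
              f"decoded in {elapsed_ms:.1f} ms")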
>>>>>>>>>>>> On Mon, May 26, 2025 at 20:35, yun zou <yunzou.colost...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> +1, I am really interested in this topic. Performance has always been a problem when dealing with wide tables, not just for reads and writes but also during compilation. Most ML use cases typically exhibit a vectorized read/write pattern, so I am also wondering whether there is any way at the metadata level to help the whole compilation and execution process. I do not have any answer for this yet, but I would be really interested in exploring it further.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>> Yun
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, May 26, 2025 at 9:14 AM Pucheng Yang <py...@pinterest.com.invalid> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Peter, I am interested in this proposal. What's more, I am curious whether there is a similar story on the write side as well (how to generate these split files), and specifically whether you are targeting feature backfill use cases in ML.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, May 26, 2025 at 6:29 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Team,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In machine learning use cases, it's common to encounter tables with a very high number of columns - sometimes even in the range of several thousand. I've seen cases with up to 15,000 columns. Storing such wide tables in a single Parquet file is often suboptimal, as Parquet can become a bottleneck, even when only a subset of columns is queried.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> A common approach to mitigate this is to split the data across multiple Parquet files. With the upcoming File Format API, we could introduce a layer that combines these files into a single iterator, enabling efficient reading of wide and very wide tables.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> To support this, we would need to revise the metadata specification. Instead of the current `_file` column, we could introduce a `_files` column containing:
>>>>>>>>>>>>>>> - `_file_column_ids`: the column IDs present in each file
>>>>>>>>>>>>>>> - `_file_path`: the path to the corresponding file
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Has there been any prior discussion around this idea? Is anyone else interested in exploring this further?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>> Peter
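For readers skimming the archive, here is a purely hypothetical sketch of the metadata shape Peter describes above; the field names follow his email, while the Python classes, types, and paths are illustrative and not part of any Iceberg specification:

    from dataclasses import dataclass, field

    @dataclass
    class FileSlice:
        file_path: str              # _file_path: one physical data file
        file_column_ids: list[int]  # _file_column_ids: column IDs stored in that file

    @dataclass
    class DataFileEntry:
        # Today an entry points at a single `_file`; the idea is a `_files` list
        # whose slices together cover the same set of rows.
        files: list[FileSlice] = field(default_factory=list)

    entry = DataFileEntry(files=[
        FileSlice("s3://bucket/table/data/base-00001.parquet", [1, 2, 3]),
        FileSlice("s3://bucket/table/data/new-features-00001.parquet", [4, 5]),
    ])

    # A reader projecting column IDs {2, 5} would open both slices and zip them
    # row by row, which only works if every slice holds the same rows in the
    # same order - the alignment concern raised earlier in the thread.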