For the record, link from a user requesting this feature: https://github.com/apache/iceberg/issues/11634
On Mon, Jun 2, 2025, 12:34 Péter Váry <peter.vary.apa...@gmail.com> wrote:
> Hi Bart,
>
> Thanks for your answer!
> I've pulled out some text from your thorough and well-organized response to make it easier to highlight my comments.
>
> > It would be well possible to tune parquet writers to write very large row groups when a large string column dominates. [..]
>
> What would you do if there are multiple "optimal" sizes - let's say a string column where dictionary encoding could be optimal, and maybe some other differently sized columns?
>
> > You know that a row group is very large, so you might then shard it by row ranges. Each parallel reader would have to filter out the rows that weren't assigned to it. With Parquet page skipping, each reader could avoid reading the large-string column pages for rows that weren't assigned to it.
>
> I might be wrong, but page skipping relies on page headers, which are stored inline with the data itself. When downloading data from blob stores this could be less than ideal. This makes the idea of storing row-group boundaries in the Iceberg metadata feel more appealing to me. Of course, we need to perform row-index-range-based skipping for some files, but page-level skipping could also help optimize it - if we decide it's necessary.
>
> > If you use column-specific files, then you actually need to read the parquet footers of *all the separate column files*. That's 2x the number of I/Os.
>
> Agreed, this is a valid point, as long as each footer fits into a single read - which is true when the configuration is correct.
>
> > There's a third option, which is to use column-specific files (or groups of columns in a file) that form a single Parquet structure with cross-file references (which is already in the Parquet standard, albeit not implemented anywhere).
>
> We have talked about this internally, but we saw several disadvantages:
> - It is not implemented anywhere - which means that if we start using it, everyone needs a new reader
> - If I understand correctly, the cross-file references are for column chunks - we want to avoid too much fragmentation
> - It becomes hard to ensure that the file is really immutable
> - We still have to optimize the page alignment for reads.
>
> > I agree that it's an interesting idea, but it does add a lot of complexity, and I'm not convinced that it's better from a performance standpoint (metadata size increase, more I/Os). If we can get away with a better row group sizing policy, wouldn't that be preferable?
>
> That's a great question regarding the complexity. I'm still working through all the implications myself, but I believe we can encapsulate this behind the Iceberg File Format API. That way, it becomes available across all file formats and shields the rest of the codebase from the underlying complexity.
>
> Your point about performance is valid, especially in the context of full table scans. However, with these very wide tables, full scans are quite rare. If the column families are well-designed, we can actually improve performance across many columns/queries - not just a select few.
>
> Additionally, this approach enables frequently requested features like adding or updating column families without rewriting the entire table.
>
> Thanks,
> Peter
>
> Bart Samwel <b...@databricks.com.invalid> wrote (Mon, Jun 2, 2025, 10:21):
>
>> On Fri, May 30, 2025 at 8:35 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>
>>> Consider this example
>>> Imagine a table with one large string column and many small numeric columns.
>>>
>>> Scenario 1: Single File
>>> - All columns are written into a single file.
>>> - The RowGroup size is small due to the large string column dominating the layout.
>>
>> This is an assumption that may not be necessary. It would be well possible to tune parquet writers to write very large row groups when a large string column dominates. Such a string column would probably not get dictionary encoded anyway, so it would effectively end up with a couple of values per 1MB Parquet page. The other columns would get decent-sized pages, and the overall row group size would be appropriate for getting good compression on those smaller columns.
>>
>> What would be the downside of this approach?
>> - When you're only reading the integer columns it is exactly the same as if the columns had been in a file by themselves. You just don't read the large column chunk.
>> - I think it adds some complexity to the distributed/parallel reading of the row groups when the large string column is included in the selected set of columns. You know that a row group is very large, so you might then shard it by row ranges. Each parallel reader would have to filter out the rows that weren't assigned to it. With Parquet page skipping, each reader could avoid reading the large-string column pages for rows that weren't assigned to it.
>>
>> Ultimately I think the parallel reading problem here is *nearly* the same regardless of whether you use one XL row group or separate files. You need to know the exact row group / page boundaries within each file in order to decide how to shard the read. And then you need to do row-index-range based skipping on at least *some* of the input columns.
>> - With XL row groups, in order to shard the row group into evenly sized chunks, you need to actually read the parquet footer first, because you need to know the row group boundaries within each file, and ideally even the page boundaries within each row group so that you can align your row ranges with those boundaries.
>> - If you use column-specific files, then you actually need to read the parquet footers of *all the separate column files*. That's 2x the number of I/Os. These I/Os can be done in parallel, but they will contribute to throttling on cloud object stores.
>>
>> So with XL row groups, distributed read planning can be done in one I/O, while column-specific files require more I/Os. Either that, or you need to store *even more* information in the metadata (namely all of these boundaries). The column-specific files also require more I/Os to read later (because you end up having to read two footers), which adds up especially if you read the large string column, which means you parallelize the read into many small chunks.
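The footer-based planning step described above boils down to a single read: one Parquet footer already lists every row-group boundary needed to shard an XL row group across readers. Purely as an illustration, a minimal sketch of that step, assuming pyarrow; the file name is hypothetical and not from this thread:

    import pyarrow.parquet as pq

    # One footer read yields all row-group boundaries of the (hypothetical) data file.
    md = pq.ParquetFile("wide-table-data-file.parquet").metadata

    row_offset = 0
    for i in range(md.num_row_groups):
        rg = md.row_group(i)
        # (first row index, row count, byte size) is what a planner would use to
        # assign row ranges or whole row groups to parallel readers.
        print(i, row_offset, rg.num_rows, rg.total_byte_size)
        row_offset += rg.num_rows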
>>> - The numeric columns are not compacted efficiently.
>>>
>>> Scenario 2: Column-Specific Files
>>> - One file is written for the string column, and another for the numeric columns.
>>> - The RowGroup size for the string column remains small, but the numeric columns benefit from optimal RowGroup sizing.
>>
>> There's a third option, which is to use column-specific files (or groups of columns in a file) that form a single Parquet structure with cross-file references (which is already in the Parquet standard, albeit not implemented anywhere). This approach has several advantages over the other options:
>>
>> 1. All of the metadata required for distributed reads is in one place (one parquet footer), making distributed read planning require fewer I/Os, and reducing the pressure to move all of that information to the table-level metadata as well.
>> 2. Flexible structure. Different files can have a different distribution of columns over files, and you don't have to remember the per-file distribution in the metadata.
>> 3. More scalable: you can have a file per column if you want, if your column sizes are wildly variable, without bloating the table-level metadata with information about more files.
>> 4. You can add/replace an entire column just by writing one extra file (with the new column contents, plus a new footer for the entire file that simply points to the old files for the existing data that wasn't modified).
>> 5. Relatively simple to implement in existing Parquet readers compared to "read multiple parquets and zip them together".
>>
>>> Query Performance Impact:
>>> - If a query only reads one of the numeric columns:
>>>    - Scenario 1: Requires reading many small column chunks.
>>>    - Scenario 2: Reads a single, continuous column chunk - much more efficient.
>>>
>>> Queries reading only columns that are stored in a single file will see improvements. Cross-file queries will have over-reading, which might or might not be balanced out by reading bigger continuous chunks. Full table scans will definitely have a performance penalty, but that is not the goal here.
>>
>>> > And aren't Parquet pages already providing these unaligned sizes?
>>>
>>> Parquet pages do offer some flexibility in size, but they operate at a lower level and are still bound by the RowGroup structure. What I'm proposing is a higher-level abstraction that allows us to group columns into independently optimized Physical Files, each with its own RowGroup sizing strategy. This could allow us to better optimize for queries where only a small number of columns are projected from a wide table.
>>
>> I agree that it's an interesting idea, but it does add a lot of complexity, and I'm not convinced that it's better from a performance standpoint (metadata size increase, more I/Os). If we can get away with a better row group sizing policy, wouldn't that be preferable?
>>
>>> Bart Samwel <b...@databricks.com.invalid> wrote (Fri, May 30, 2025, 16:03):
>>>
>>>> On Fri, May 30, 2025 at 3:33 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>
>>>>> One key advantage of introducing Physical Files is the flexibility to vary RowGroup sizes across columns. For instance, wide string columns could benefit from smaller RowGroups to reduce memory pressure, while numeric columns could use larger RowGroups to improve compression and scan efficiency. Rather than enforcing strict row group alignment across all columns, we can explore optimizing read split sizes and write-time RowGroup sizes independently - striking a balance between performance and storage costs for different data types and queries.
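The per-column-group sizing idea quoted above can be sketched with today's writers simply by writing each column group to its own file with its own row-group size. Illustrative only; pyarrow, the column names, and the concrete sizes are assumptions, not something agreed in this thread:

    import pyarrow.parquet as pq

    def write_column_groups(table, prefix):
        # Large string column: small row groups to bound memory and page sizes.
        pq.write_table(table.select(["large_text"]),
                       f"{prefix}-strings.parquet", row_group_size=8_192)
        # Narrow numeric columns: much larger row groups for better compression and scans.
        pq.write_table(table.select(["feature_1", "feature_2", "feature_3"]),
                       f"{prefix}-numerics.parquet", row_group_size=1_048_576)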
>>>> That actually sounds very complicated if you want to split file reads in a distributed system. If you want to read across column groups, then you always end up over-reading on one of them if they are not aligned.
>>>>
>>>> And aren't Parquet pages already providing these unaligned sizes?
>>>>
>>>>> Gang Wu <ust...@gmail.com> wrote (Fri, May 30, 2025, 8:09):
>>>>>
>>>>>> IMO, the main drawback for the view solution is the complexity of maintaining consistency across tables if we want to use features like time travel, incremental scan, branch & tag, encryption, etc.
>>>>>>
>>>>>> On Fri, May 30, 2025 at 12:55 PM Bryan Keller <brya...@gmail.com> wrote:
>>>>>>
>>>>>>> Fewer commit conflicts meaning the tables representing column families are updated independently, rather than having to serialize commits to a single table. Perhaps with a wide table solution the commit logic could be enhanced to support things like concurrent overwrites to independent column families, but it seems like it would be fairly involved.
>>>>>>>
>>>>>>> On May 29, 2025, at 7:16 PM, Steven Wu <stevenz...@gmail.com> wrote:
>>>>>>>
>>>>>>> Bryan, interesting approach to split horizontally across multiple tables.
>>>>>>>
>>>>>>> A few potential downsides:
>>>>>>> * operational overhead. tables need to be managed consistently and probably in some coordinated way
>>>>>>> * complex read
>>>>>>> * maybe fragile to enforce correctness (during join). It is more robust to enforce stitching correctness at the file-group level, in the file reader and writer, if built into the table format.
>>>>>>>
>>>>>>> > fewer commit conflicts
>>>>>>>
>>>>>>> Can you elaborate on this one? Are those tables populated by streaming or batch pipelines?
>>>>>>>
>>>>>>> On Thu, May 29, 2025 at 5:03 PM Bryan Keller <brya...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> We have been investigating a wide table format internally for a similar use case, i.e. we have wide ML tables with features generated by different pipelines and teams but want a unified view of the data. We are comparing that against separate tables joined together using a shuffle-less join (e.g. storage partition join), along with a corresponding view.
>>>>>>>>
>>>>>>>> The join/view approach seems to give us much of what we need, with some added benefits like splitting up the metadata, fewer commit conflicts, and the ability to share, nest, and swap "column families". The downsides are that table management is split across multiple tables, it requires engine support for shuffle-less joins for best performance, and even then, scans probably won't be as optimal.
>>>>>>>>
>>>>>>>> I'm curious if anyone has further thoughts on the two?
>>>>>>>>
>>>>>>>> -Bryan
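The join/view alternative Bryan, Steven, and Gang discuss above keeps each column family in its own Iceberg table and stitches them behind a view. A rough sketch, assuming Spark SQL; the catalog, table, view, and column names are made up, and a genuinely shuffle-less (storage-partitioned) join additionally requires matching bucketing on both tables plus engine support:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # One table per column family, joined on the row key behind a single view.
    spark.sql("""
        CREATE OR REPLACE VIEW ml.user_features_wide AS
        SELECT b.user_id, b.label, f.embedding, f.feature_v2
        FROM ml.user_features_base b
        JOIN ml.user_features_extra f
          ON b.user_id = f.user_id
    """)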
>>>>>>>> On May 29, 2025, at 8:18 AM, Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> I received feedback from Alkis regarding their Parquet optimization work. Their internal testing shows promising results for reducing metadata size and improving parsing performance. They plan to formalize a proposal for these Parquet enhancements in the near future.
>>>>>>>>
>>>>>>>> Meanwhile, I'm putting together our horizontal sharding proposal as a complementary approach. Even with the Parquet metadata improvements, horizontal sharding would provide additional benefits for:
>>>>>>>> - More efficient column-level updates
>>>>>>>> - Streamlined column additions
>>>>>>>> - Better handling of dominant columns that can cause RowGroup size imbalances (placing these in separate files could significantly improve performance)
>>>>>>>>
>>>>>>>> Thanks, Peter
>>>>>>>>
>>>>>>>> Péter Váry <peter.vary.apa...@gmail.com> wrote (Wed, May 28, 2025, 15:39):
>>>>>>>>
>>>>>>>>> I would be happy to put together a proposal based on the input I got here.
>>>>>>>>>
>>>>>>>>> Thanks everyone for your thoughts!
>>>>>>>>> I will try to incorporate all of this.
>>>>>>>>>
>>>>>>>>> Thanks, Peter
>>>>>>>>>
>>>>>>>>> Daniel Weeks <dwe...@apache.org> wrote (Tue, May 27, 2025, 20:07):
>>>>>>>>>
>>>>>>>>>> I feel like we have two different issues we're talking about here that aren't necessarily tied (though solutions may address both): 1) wide tables, 2) adding columns
>>>>>>>>>>
>>>>>>>>>> Wide tables are definitely a problem where parquet has limitations. I'm optimistic about the ongoing work to help improve parquet footers/stats in this area that Fokko mentioned. There are always limitations in how this scales, as wide rows lead to small row groups and the cost to reconstitute a row gets more expensive, but for read-heavy cases that project subsets of columns this should significantly improve performance.
>>>>>>>>>>
>>>>>>>>>> Adding columns to an existing dataset is something that comes up periodically, but there's a lot of complexity involved in this. Parquet does support referencing columns in separate files per the spec, but there's no implementation that takes advantage of this to my knowledge. This does allow for approaches where you separate/rewrite just the footers or various other tricks, but these approaches get complicated quickly and the number of readers that can consume those representations would initially be very limited.
>>>>>>>>>>
>>>>>>>>>> A larger problem for splitting columns across files is that there are a lot of assumptions about how data is laid out in both readers and writers. For example, aligning row groups and correctly handling split calculation is very complicated if you're trying to split rows across files. Other features are also impacted, like deletes, which reference the file to which they apply and would need to account for deletes applying to multiple files, and those references would need to be updated if columns are added.
>>>>>>>>>>
>>>>>>>>>> I believe there are a lot of interesting approaches to addressing these use cases, but we'd really need a thorough proposal that explores all of these scenarios. The last thing we would want is to introduce incompatibilities within the format that result in incompatible features.
>>>>>>>>>>
>>>>>>>>>> -Dan
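To make Dan's split-calculation point concrete: once column groups live in separate files with differently sized row groups, a planner has to merge the files' row-group boundaries before cutting row ranges, and any cut that isn't on a shared boundary over-reads one of the files. A pure-Python sketch with made-up boundary lists:

    def merged_split_boundaries(*per_file_boundaries):
        """Union of row-index boundaries across the column-group files of one logical file."""
        boundaries = set()
        for file_boundaries in per_file_boundaries:
            boundaries.update(file_boundaries)
        return sorted(boundaries)

    # File A (string column): row groups end at rows 10_000, 20_000, 30_000.
    # File B (numeric columns): a single row group ending at row 30_000.
    print(merged_split_boundaries([10_000, 20_000, 30_000], [30_000]))
    # -> [10000, 20000, 30000]; a split cut anywhere else forces over-reading file A or B.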
>>>>>>>>>> On Tue, May 27, 2025 at 10:02 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>>>>>>> Point definitely taken. We really should probably POC some of these ideas and see what we are actually dealing with. (He said without volunteering to do the work :P)
>>>>>>>>>>> On Tue, May 27, 2025 at 11:55 AM Selcuk Aya <selcuk....@snowflake.com.invalid> wrote:
>>>>>>>>>>>> Yes, having to rewrite the whole file is not ideal, but I believe most of the cost of rewriting a file comes from decompression, encoding, stats calculations etc. If you are adding new values for some columns but are keeping the rest of the columns the same in the file, then a bunch of rewrite cost can be optimized away. I am not saying this is better than writing to a separate file, I am not sure how much worse it is though.
>>>>>>>>>>>> On Tue, May 27, 2025 at 9:40 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>>>>>>>>> I think that "after the fact" modification is one of the requirements here, i.e. updating a single column without rewriting the whole file. If we have to write new metadata for the file, aren't we in the same boat as having to rewrite the whole file?
>>>>>>>>>>>>> On Tue, May 27, 2025 at 11:27 AM Selcuk Aya <selcuk....@snowflake.com.invalid> wrote:
>>>>>>>>>>>>>> If files represent column projections of a table rather than the whole columns in the table, then any read that reads across these files needs to identify what constitutes a row. Lance DB for example has vertical partitioning across columns but also horizontal partitioning across rows, such that in each horizontal partition (fragment) the same number of rows exists in each vertical partition, which I think is necessary to make whole/partial row construction cheap. If this is the case, there is no reason not to achieve the same data layout inside a single columnar file with a lean header. I think the only valid argument for a separate file is adding a new set of columns to an existing table, but even then I am not sure a separate file is absolutely necessary for good performance.
>>>>>>>>>>>>>> Selcuk
>>>>>>>>>>>>>> On Tue, May 27, 2025 at 9:18 AM Devin Smith <devinsm...@deephaven.io.invalid> wrote:
>>>>>>>>>>>>>>> There's a `file_path` field in the parquet ColumnChunk structure, https://github.com/apache/parquet-format/blob/apache-parquet-format-2.11.0/src/main/thrift/parquet.thrift#L959-L962
>>>>>>>>>>>>>>> I'm not sure what tooling actually supports this though. Could be interesting to see what the history of this is.
>>>>>>>>>>>>>>> https://lists.apache.org/thread/rcv1cxndp113shjybfcldh6nq1t3lcq3,
>>>>>>>>>>>>>>> https://lists.apache.org/thread/k5nv310yp315fttcz213l8o0vmnd7vyw
>>>>>>>>>>>>>>> On Tue, May 27, 2025 at 8:59 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>>>>>>>>>>>> I have to agree that while there can be some fixes in Parquet, we fundamentally need a way to split a "row group" or something like that between separate files. If that's something we can do in the parquet project, that would be great, but it feels like we need to start exploring more drastic options than footer encoding.
>>>>>>>>>>>>>>>> On Mon, May 26, 2025 at 8:42 PM Gang Wu <ust...@gmail.com> wrote:
>>>>>>>>>>>>>>>>> I agree with Steven that there are limitations that Parquet cannot address.
>>>>>>>>>>>>>>>>> In addition to adding new columns requiring rewriting all files, files of wide tables may suffer from bad performance like below:
>>>>>>>>>>>>>>>>> - Poor compression of row groups, because there are too many columns and even a small number of rows can reach the row group threshold.
>>>>>>>>>>>>>>>>> - Dominating columns (e.g. blobs) may contribute 99% of the size of a row group, leading to unbalanced column chunks and deteriorating the row group compression.
>>>>>>>>>>>>>>>>> - Similar to adding new columns, partial update also requires rewriting all columns of the affected rows.
>>>>>>>>>>>>>>>>> IIRC, some table formats already support splitting columns into different files:
>>>>>>>>>>>>>>>>> - Lance manifest splits a fragment [1] into one or more data files.
>>>>>>>>>>>>>>>>> - Apache Hudi has the concept of column family [2].
>>>>>>>>>>>>>>>>> - Apache Paimon supports sequence groups [3] for partial update.
>>>>>>>>>>>>>>>>> Although Parquet could introduce the concept of a logical file and physical files to manage the column-to-file mapping, this looks like yet another manifest file design, which duplicates the purpose of Iceberg. This might be something worth exploring in Iceberg.
>>>>>>>>>>>>>>>>> [1] https://lancedb.github.io/lance/format.html#fragments
>>>>>>>>>>>>>>>>> [2] https://github.com/apache/hudi/blob/master/rfc/rfc-80/rfc-80.md
>>>>>>>>>>>>>>>>> [3] https://paimon.apache.org/docs/0.9/primary-key-table/merge-engine/partial-update/#sequence-group
>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>> Gang
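The `file_path` field Devin points to above is surfaced by pyarrow's footer metadata, so it is easy to check whether a given file's column chunks reference other files. An illustrative sketch only - pyarrow and the file name are assumptions, and for ordinary self-contained Parquet files the field is simply empty:

    import pyarrow.parquet as pq

    md = pq.ParquetFile("data-file.parquet").metadata
    for rg in range(md.num_row_groups):
        for col in range(md.num_columns):
            chunk = md.row_group(rg).column(col)
            # file_path is empty for column chunks stored in this file itself;
            # a non-empty value would point at an external data file.
            print(chunk.path_in_schema, chunk.file_path or "<stored in this file>")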
>>>>>>>>>>>>>>>>> On Tue, May 27, 2025 at 7:03 AM Steven Wu <stevenz...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>> The Parquet metadata proposal (linked by Fokko) is mainly addressing the read performance due to bloated metadata.
>>>>>>>>>>>>>>>>>> What Peter described in the description seems useful for some ML feature-engineering workloads. A new set of features/columns are added to the table. Currently, Iceberg would require rewriting all data files to combine old and new columns (write amplification). Similarly, in the past the community also talked about the use case of updating a single column, which would require rewriting all data files.
>>>>>>>>>>>>>>>>>> On Mon, May 26, 2025 at 2:42 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>> Do you have the link at hand for the thread where this was discussed on the Parquet list?
>>>>>>>>>>>>>>>>>>> The docs seem quite old, and the PR stale, so I would like to understand the situation better.
>>>>>>>>>>>>>>>>>>> If it is possible to do this in Parquet, that would be great, but Avro and ORC would still suffer.
>>>>>>>>>>>>>>>>>>> Amogh Jahagirdar <2am...@gmail.com> wrote (Mon, May 26, 2025, 22:07):
>>>>>>>>>>>>>>>>>>>> Hey Peter,
>>>>>>>>>>>>>>>>>>>> Thanks for bringing this issue up. I think I agree with Fokko; the issue of wide tables leading to Parquet metadata bloat and poor Thrift deserialization performance is a long-standing issue that I believe there's motivation in the community to address. So to me it seems better to address it in Parquet itself rather than have the Iceberg library facilitate a pattern that works around the limitations.
>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>> Amogh Jahagirdar
>>>>>>>>>>>>>>>>>>>> On Mon, May 26, 2025 at 1:42 PM Fokko Driesprong <fo...@apache.org> wrote:
>>>>>>>>>>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>>>>>>>>>> Thanks for bringing this up. Wouldn't it make more sense to fix this in Parquet itself? It has been a long-running issue on Parquet, and there is still active interest from the community. There is a PR to replace the footer with FlatBuffers, which dramatically improves performance <https://github.com/apache/arrow/pull/43793>. The underlying proposal can be found here <https://docs.google.com/document/d/1PQpY418LkIDHMFYCY8ne_G-CFpThK15LLpzWYbc7rFU/edit?tab=t.0#heading=h.atbrz9ch6nfa>.
>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>> Fokko
>>>>>>>>>>>>>>>>>>>>> On Mon, May 26, 2025 at 20:35, yun zou <yunzou.colost...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>> +1, I am really interested in this topic. Performance has always been a problem when dealing with wide tables, not just read/write, but also during compilation. Most of the ML use cases typically exhibit a vectorized read/write pattern; I am also wondering if there is any way at the metadata level to help the whole compilation and execution process. I do not have any answer for this yet, but I would be really interested in exploring this further.
>>>>>>>>>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>>>>>>>>>> Yun
>>>>>>>>>>>>>>>>>>>>>> On Mon, May 26, 2025 at 9:14 AM Pucheng Yang <py...@pinterest.com.invalid> wrote:
>>>>>>>>>>>>>>>>>>>>>>> Hi Peter, I am interested in this proposal. What's more, I am curious if there is a similar story on the write side as well (how to generate these split files) and, specifically, are you targeting feature backfill use cases in ML?
>>>>>>>>>>>>>>>>>>>>>>> On Mon, May 26, 2025 at 6:29 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>> Hi Team,
>>>>>>>>>>>>>>>>>>>>>>>> In machine learning use-cases, it's common to encounter tables with a very high number of columns - sometimes even in the range of several thousand. I've seen cases with up to 15,000 columns. Storing such wide tables in a single Parquet file is often suboptimal, as Parquet can become a bottleneck, even when only a subset of columns is queried.
>>>>>>>>>>>>>>>>>>>>>>>> A common approach to mitigate this is to split the data across multiple Parquet files. With the upcoming File Format API, we could introduce a layer that combines these files into a single iterator, enabling efficient reading of wide and very wide tables.
>>>>>>>>>>>>>>>>>>>>>>>> To support this, we would need to revise the metadata specification. Instead of the current `_file` column, we could introduce a `_files` column containing:
>>>>>>>>>>>>>>>>>>>>>>>> - `_file_column_ids`: the column IDs present in each file
>>>>>>>>>>>>>>>>>>>>>>>> - `_file_path`: the path to the corresponding file
>>>>>>>>>>>>>>>>>>>>>>>> Has there been any prior discussion around this idea?
>>>>>>>>>>>>>>>>>>>>>>>> Is anyone else interested in exploring this further?
>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>> Peter
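The reading side of the `_files` idea above amounts to opening the physical files listed for a logical data file and stitching their column subsets back into one row stream. A rough sketch of that stitching step, assuming pyarrow and hypothetical file names; a real implementation would sit behind the File Format API and would have to guarantee row alignment between the files:

    import pyarrow as pa
    import pyarrow.parquet as pq

    base = pq.read_table("data-00001-base.parquet")        # original column family
    extra = pq.read_table("data-00001-features.parquet")   # added feature columns, same row order

    assert base.num_rows == extra.num_rows  # the format would have to enforce this
    wide = pa.table({**{name: base[name] for name in base.column_names},
                     **{name: extra[name] for name in extra.column_names}})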