"Fewer commit conflicts" means that the tables representing column families are updated independently, rather than having to serialize commits to a single table. Perhaps with a wide table solution the commit logic could be enhanced to support things like concurrent overwrites to independent column families, but it seems like that would be fairly involved.
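To make the comparison concrete, here is a minimal sketch of the join/view approach we are evaluating, written as PySpark against Iceberg tables. The table names, columns, and bucket count are made up for illustration, and the exact configuration needed to get a storage-partitioned join varies by Spark and Iceberg version, so treat this as a sketch rather than a recipe:

# Sketch only: hypothetical names and bucket count; SPJ configs vary by version.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Storage-partitioned joins avoid the shuffle when both sides are bucketed
# identically on the join key.
spark.conf.set("spark.sql.sources.v2.bucketing.enabled", "true")
spark.conf.set("spark.sql.iceberg.planning.preserve-data-grouping", "true")

# One table per "column family", all bucketed the same way on the row key,
# so each pipeline commits only to its own table.
spark.sql("""
    CREATE TABLE IF NOT EXISTS db.features_a (
        entity_id BIGINT, f1 DOUBLE, f2 DOUBLE)
    USING iceberg
    PARTITIONED BY (bucket(128, entity_id))
""")
spark.sql("""
    CREATE TABLE IF NOT EXISTS db.features_b (
        entity_id BIGINT, f3 DOUBLE, f4 DOUBLE)
    USING iceberg
    PARTITIONED BY (bucket(128, entity_id))
""")

# A view (catalog or temporary, depending on catalog support) presents the
# unified "wide table" to readers.
spark.sql("""
    CREATE OR REPLACE VIEW db.features_wide AS
    SELECT a.entity_id, a.f1, a.f2, b.f3, b.f4
    FROM db.features_a a
    JOIN db.features_b b ON a.entity_id = b.entity_id
""")

Swapping or nesting a column family then becomes a change to the view definition, and each table keeps its own, smaller metadata tree.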
> On May 29, 2025, at 7:16 PM, Steven Wu <stevenz...@gmail.com> wrote:
>
> Bryan, interesting approach to split horizontally across multiple tables.
>
> A few potential downsides:
> * operational overhead: the tables need to be managed consistently and probably in some coordinated way
> * complex reads
> * maybe fragile to enforce correctness (during the join). It is more robust to enforce stitching correctness at the file-group level in the file reader and writer if it is built into the table format.
>
>> fewer commit conflicts
>
> Can you elaborate on this one? Are those tables populated by streaming or batch pipelines?
>
> On Thu, May 29, 2025 at 5:03 PM Bryan Keller <brya...@gmail.com> wrote:
>> Hi everyone,
>>
>> We have been investigating a wide table format internally for a similar use case, i.e. we have wide ML tables with features generated by different pipelines and teams but want a unified view of the data. We are comparing that against separate tables joined together using a shuffle-less join (e.g. a storage-partitioned join), along with a corresponding view.
>>
>> The join/view approach seems to give us much of what we need, with some added benefits like splitting up the metadata, fewer commit conflicts, and the ability to share, nest, and swap "column families". The downsides are that table management is split across multiple tables, it requires engine support for shuffle-less joins for best performance, and even then, scans probably won't be as optimal.
>>
>> I'm curious if anyone has further thoughts on the two?
>>
>> -Bryan
>>
>>> On May 29, 2025, at 8:18 AM, Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>
>>> I received feedback from Alkis regarding their Parquet optimization work. Their internal testing shows promising results for reducing metadata size and improving parsing performance. They plan to formalize a proposal for these Parquet enhancements in the near future.
>>>
>>> Meanwhile, I'm putting together our horizontal sharding proposal as a complementary approach. Even with the Parquet metadata improvements, horizontal sharding would provide additional benefits for:
>>> - More efficient column-level updates
>>> - Streamlined column additions
>>> - Better handling of dominant columns that can cause RowGroup size imbalances (placing these in separate files could significantly improve performance)
>>>
>>> Thanks, Peter
>>>
>>> On Wed, May 28, 2025 at 15:39, Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>> I would be happy to put together a proposal based on the inputs we got here.
>>>>
>>>> Thanks everyone for your thoughts! I will try to incorporate all of this.
>>>>
>>>> Thanks, Peter
>>>>
>>>> On Tue, May 27, 2025 at 20:07, Daniel Weeks <dwe...@apache.org> wrote:
>>>>> I feel like we have two different issues we're talking about here that aren't necessarily tied (though solutions may address both): 1) wide tables, 2) adding columns.
>>>>>
>>>>> Wide tables are definitely a problem where Parquet has limitations. I'm optimistic about the ongoing work to help improve Parquet footers/stats in this area that Fokko mentioned.
>>>>> There are always limitations in how this scales, as wide rows lead to small row groups and the cost to reconstitute a row gets more expensive, but for cases that are read heavy and project subsets of columns, this should significantly improve performance.
>>>>>
>>>>> Adding columns to an existing dataset is something that comes up periodically, but there's a lot of complexity involved in this. Parquet does support referencing columns in separate files per the spec, but there's no implementation that takes advantage of this to my knowledge. This does allow for approaches where you separate/rewrite just the footers, or various other tricks, but these approaches get complicated quickly and the number of readers that can consume those representations would initially be very limited.
>>>>>
>>>>> A larger problem for splitting columns across files is that there are a lot of assumptions about how data is laid out in both readers and writers. For example, aligning row groups and correctly handling split calculation is very complicated if you're trying to split rows across files. Other features are also impacted, like deletes, which reference the file to which they apply: they would need to account for deletes applying to multiple files, and those references would need updating if columns are added.
>>>>>
>>>>> I believe there are a lot of interesting approaches to addressing these use cases, but we'd really need a thorough proposal that explores all of these scenarios. The last thing we would want is to introduce incompatibilities within the format that result in incompatible features.
>>>>>
>>>>> -Dan
>>>>>
>>>>> On Tue, May 27, 2025 at 10:02 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>> Point definitely taken. We really should probably POC some of these ideas and see what we are actually dealing with. (He said, without volunteering to do the work :P)
>>>>>>
>>>>>> On Tue, May 27, 2025 at 11:55 AM Selcuk Aya <selcuk....@snowflake.com.invalid> wrote:
>>>>>>> Yes, having to rewrite the whole file is not ideal, but I believe most of the cost of rewriting a file comes from decompression, encoding, stats calculations, etc. If you are adding new values for some columns but keeping the rest of the columns in the file the same, then a good part of the rewrite cost can be optimized away. I am not saying this is better than writing to a separate file; I am just not sure how much worse it is.
>>>>>>>
>>>>>>> On Tue, May 27, 2025 at 9:40 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>>>> I think that "after the fact" modification is one of the requirements here, i.e. updating a single column without rewriting the whole file. If we have to write new metadata for the file, aren't we in the same boat as having to rewrite the whole file?
>>>>>>>>
>>>>>>>> On Tue, May 27, 2025 at 11:27 AM Selcuk Aya <selcuk....@snowflake.com.invalid> wrote:
>>>>>>>>> If files represent column projections of a table rather than all of the columns in the table, then any read that spans these files needs to identify what constitutes a row.
>>>>>>>>> Lance DB, for example, has vertical partitioning across columns but also horizontal partitioning across rows, such that within each horizontal partition (fragment) the same number of rows exists in each vertical partition, which I think is necessary to make whole/partial row construction cheap. If this is the case, there is no reason not to achieve the same data layout inside a single columnar file with a lean header. I think the only valid argument for a separate file is adding a new set of columns to an existing table, but even then I am not sure a separate file is absolutely necessary for good performance.
>>>>>>>>>
>>>>>>>>> Selcuk
>>>>>>>>>
>>>>>>>>> On Tue, May 27, 2025 at 9:18 AM Devin Smith <devinsm...@deephaven.io.invalid> wrote:
>>>>>>>>>> There's a `file_path` field in the parquet ColumnChunk structure: https://github.com/apache/parquet-format/blob/apache-parquet-format-2.11.0/src/main/thrift/parquet.thrift#L959-L962
>>>>>>>>>>
>>>>>>>>>> I'm not sure what tooling actually supports this, though. It could be interesting to see what the history of it is: https://lists.apache.org/thread/rcv1cxndp113shjybfcldh6nq1t3lcq3, https://lists.apache.org/thread/k5nv310yp315fttcz213l8o0vmnd7vyw
>>>>>>>>>>
>>>>>>>>>> On Tue, May 27, 2025 at 8:59 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>>>>>>> I have to agree that while there can be some fixes in Parquet, we fundamentally need a way to split a "row group" or something like that between separate files. If that's something we can do in the Parquet project, that would be great, but it feels like we need to start exploring more drastic options than footer encoding.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, May 26, 2025 at 8:42 PM Gang Wu <ust...@gmail.com> wrote:
>>>>>>>>>>>> I agree with Steven that there are limitations to what Parquet can do.
>>>>>>>>>>>>
>>>>>>>>>>>> In addition to adding new columns requiring a rewrite of all files, files of wide tables may suffer from bad performance like below:
>>>>>>>>>>>> - Poor compression of row groups, because there are so many columns that even a small number of rows can reach the row group threshold.
>>>>>>>>>>>> - Dominating columns (e.g. blobs) may contribute 99% of the size of a row group, leading to unbalanced column chunks and deteriorating row group compression.
>>>>>>>>>>>> - Similar to adding new columns, partial update also requires rewriting all columns of the affected rows.
>>>>>>>>>>>>
>>>>>>>>>>>> IIRC, some table formats already support splitting columns into different files:
>>>>>>>>>>>> - The Lance manifest splits a fragment [1] into one or more data files.
>>>>>>>>>>>> - Apache Hudi has the concept of column families [2].
>>>>>>>>>>>> - Apache Paimon supports sequence groups [3] for partial update.
>>>>>>>>>>>>
>>>>>>>>>>>> Although Parquet could introduce the concept of logical and physical files to manage the column-to-file mapping, that looks like yet another manifest file design, which duplicates the purpose of Iceberg. These might be something worth exploring in Iceberg.
>>>>>>>>>>>>
>>>>>>>>>>>> [1] https://lancedb.github.io/lance/format.html#fragments
>>>>>>>>>>>> [2] https://github.com/apache/hudi/blob/master/rfc/rfc-80/rfc-80.md
>>>>>>>>>>>> [3] https://paimon.apache.org/docs/0.9/primary-key-table/merge-engine/partial-update/#sequence-group
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Gang
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, May 27, 2025 at 7:03 AM Steven Wu <stevenz...@gmail.com> wrote:
>>>>>>>>>>>>> The Parquet metadata proposal (linked by Fokko) mainly addresses read performance problems caused by bloated metadata.
>>>>>>>>>>>>>
>>>>>>>>>>>>> What Peter described in the description seems useful for some ML feature engineering workloads, where a new set of features/columns is added to the table. Currently, Iceberg would require rewriting all data files to combine old and new columns (write amplification). Similarly, in the past the community also talked about the use case of updating a single column, which would likewise require rewriting all data files.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, May 26, 2025 at 2:42 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>>>>> Do you have the link at hand for the thread where this was discussed on the Parquet list? The docs seem quite old, and the PR stale, so I would like to understand the situation better. If it is possible to do this in Parquet, that would be great, but Avro and ORC would still suffer.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, May 26, 2025 at 22:07, Amogh Jahagirdar <2am...@gmail.com> wrote:
>>>>>>>>>>>>>>> Hey Peter,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for bringing this issue up. I think I agree with Fokko; wide tables leading to Parquet metadata bloat and poor Thrift deserialization performance is a long-standing issue that I believe there's motivation in the community to address. So to me it seems better to address it in Parquet itself rather than have the Iceberg library facilitate a pattern that works around the limitations.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Amogh Jahagirdar
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, May 26, 2025 at 1:42 PM Fokko Driesprong <fo...@apache.org> wrote:
>>>>>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for bringing this up. Wouldn't it make more sense to fix this in Parquet itself? It has been a long-running issue on Parquet, and there is still active interest from the community. There is a PR to replace the footer with FlatBuffers, which dramatically improves performance <https://github.com/apache/arrow/pull/43793>. The underlying proposal can be found here <https://docs.google.com/document/d/1PQpY418LkIDHMFYCY8ne_G-CFpThK15LLpzWYbc7rFU/edit?tab=t.0#heading=h.atbrz9ch6nfa>.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>> Fokko
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, May 26, 2025 at 20:35, yun zou <yunzou.colost...@gmail.com> wrote:
>>>>>>>>>>>>>>>>> +1, I am really interested in this topic. Performance has always been a problem when dealing with wide tables, not just for reads/writes but also during compilation. Most of the ML use cases typically exhibit a vectorized read/write pattern, and I am also wondering if there is any way at the metadata level to help the whole compilation and execution process. I do not have any answer for this yet, but I would be really interested in exploring it further.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>>>>> Yun
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mon, May 26, 2025 at 9:14 AM Pucheng Yang <py...@pinterest.com.invalid> wrote:
>>>>>>>>>>>>>>>>>> Hi Peter, I am interested in this proposal. What's more, I am curious if there is a similar story on the write side as well (how to generate these split files), and specifically, are you targeting feature backfill use cases in ML?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Mon, May 26, 2025 at 6:29 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>> Hi Team,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> In machine learning use cases, it's common to encounter tables with a very high number of columns - sometimes even in the range of several thousand. I've seen cases with up to 15,000 columns. Storing such wide tables in a single Parquet file is often suboptimal, as Parquet can become a bottleneck, even when only a subset of columns is queried.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> A common approach to mitigate this is to split the data across multiple Parquet files. With the upcoming File Format API, we could introduce a layer that combines these files into a single iterator, enabling efficient reading of wide and very wide tables.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> To support this, we would need to revise the metadata specification. Instead of the current `_file` column, we could introduce a `_files` column containing:
>>>>>>>>>>>>>>>>>>> - `_file_column_ids`: the column IDs present in each file
>>>>>>>>>>>>>>>>>>> - `_file_path`: the path to the corresponding file
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Has there been any prior discussion around this idea? Is anyone else interested in exploring this further?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>> Peter
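Peter, to make the reader side of your `_files` idea concrete for myself: once the metadata records which column IDs live in which file, the stitching is essentially a row-aligned horizontal concatenation of the per-file column subsets. A rough pyarrow illustration of that invariant, assuming the split files hold the same rows in the same order (file and column names are made up):

# Illustration only: two Parquet files holding disjoint column subsets of the
# same rows, in the same order; paths and columns are placeholders.
import pyarrow.parquet as pq

base = pq.read_table("part-00000-base.parquet")       # e.g. id plus f1..f1000
extra = pq.read_table("part-00000-features.parquet")  # e.g. f1001..f2000

# Row alignment is the invariant the format/reader would have to guarantee.
assert base.num_rows == extra.num_rows

# Stitch the column subsets back into one "wide" table.
wide = base
for name in extra.column_names:
    wide = wide.append_column(name, extra[name])

print(wide.num_columns, wide.num_rows)

The hard parts a spec would need to pin down are the ones discussed above: keeping row groups/fragments aligned across the files, and how deletes and split planning reference a row that now spans several files.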
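On Devin's note about the `file_path` field in Parquet's ColumnChunk: pyarrow exposes it through its metadata API, so it is easy to check whether any writer in a given pipeline actually populates it (most writers appear to leave it empty). A small sketch, with a placeholder file name:

# Check whether any column chunk points at data stored in a different file.
# "example.parquet" is a placeholder path.
import pyarrow.parquet as pq

meta = pq.ParquetFile("example.parquet").metadata
for rg in range(meta.num_row_groups):
    for col in range(meta.num_columns):
        chunk = meta.row_group(rg).column(col)
        # file_path is empty for almost all writers today; a non-empty value
        # would mean the chunk's data lives outside this file.
        if chunk.file_path:
            print(rg, chunk.path_in_schema, "->", chunk.file_path)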