IMO, the main drawback of the view solution is the complexity of maintaining consistency across the tables if we want to use features like time travel, incremental scan, branch & tag, encryption, etc.
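For concreteness, the join/view layout under discussion might look roughly like the following PySpark sketch. It assumes an Iceberg catalog named `demo`; all namespace, table, view, and column names are hypothetical, and persisting the view depends on the catalog's view support.

from pyspark.sql import SparkSession

# A sketch, not a recipe: two Iceberg tables acting as "column families",
# bucketed the same way so engines that implement storage-partitioned joins
# can stitch rows back together without a shuffle. Catalog, namespace, table,
# and column names are all made up.
spark = SparkSession.builder.getOrCreate()

spark.sql("CREATE NAMESPACE IF NOT EXISTS demo.ml")

spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.ml.features_base (
        entity_id BIGINT, f1 DOUBLE, f2 DOUBLE)
    USING iceberg
    PARTITIONED BY (bucket(128, entity_id))
""")

spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.ml.features_embeddings (
        entity_id BIGINT, emb ARRAY<FLOAT>)
    USING iceberg
    PARTITIONED BY (bucket(128, entity_id))
""")

# The "wide table" is just a view over the join; each column-family table is
# committed to independently. Persisting the view needs a catalog with view
# support; a temporary view behaves the same way within a single session.
spark.sql("""
    CREATE VIEW IF NOT EXISTS demo.ml.features_wide AS
    SELECT b.entity_id, b.f1, b.f2, e.emb
    FROM demo.ml.features_base b
    JOIN demo.ml.features_embeddings e ON b.entity_id = e.entity_id
""")

Each column-family table can then be written by its own pipeline, which is where the fewer-commit-conflicts argument below comes from; the consistency concern above applies to exactly this kind of layout.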
On Fri, May 30, 2025 at 12:55 PM Bryan Keller <brya...@gmail.com> wrote:

> Fewer commit conflicts, meaning the tables representing column families are updated independently, rather than having to serialize commits to a single table. Perhaps with a wide table solution the commit logic could be enhanced to support things like concurrent overwrites to independent column families, but it seems like it would be fairly involved.
>
> On May 29, 2025, at 7:16 PM, Steven Wu <stevenz...@gmail.com> wrote:
>
> Bryan, interesting approach to split horizontally across multiple tables.
>
> A few potential downsides:
> * operational overhead: tables need to be managed consistently and probably in some coordinated way
> * complex reads
> * it may be fragile to enforce correctness (during the join). It is more robust to enforce stitching correctness at the file-group level in the file reader and writer if it is built into the table format.
>
>> fewer commit conflicts
>
> Can you elaborate on this one? Are those tables populated by streaming or batch pipelines?
>
> On Thu, May 29, 2025 at 5:03 PM Bryan Keller <brya...@gmail.com> wrote:
>
>> Hi everyone,
>>
>> We have been investigating a wide table format internally for a similar use case, i.e. we have wide ML tables with features generated by different pipelines and teams but want a unified view of the data. We are comparing that against separate tables joined together using a shuffle-less join (e.g. storage-partitioned join), along with a corresponding view.
>>
>> The join/view approach seems to give us much of what we need, with some added benefits like splitting up the metadata, fewer commit conflicts, and the ability to share, nest, and swap "column families". The downsides are that table management is split across multiple tables, it requires engine support for shuffle-less joins for best performance, and even then, scans probably won't be as optimal.
>>
>> I'm curious if anyone has further thoughts on the two?
>>
>> -Bryan
>>
>> On May 29, 2025, at 8:18 AM, Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>
>> I received feedback from Alkis regarding their Parquet optimization work. Their internal testing shows promising results for reducing metadata size and improving parsing performance. They plan to formalize a proposal for these Parquet enhancements in the near future.
>>
>> Meanwhile, I'm putting together our horizontal sharding proposal as a complementary approach. Even with the Parquet metadata improvements, horizontal sharding would provide additional benefits for:
>>
>> - More efficient column-level updates
>> - Streamlined column additions
>> - Better handling of dominant columns that can cause RowGroup size imbalances (placing these in separate files could significantly improve performance)
>>
>> Thanks, Peter
>>
>> Péter Váry <peter.vary.apa...@gmail.com> wrote on Wed, May 28, 2025 at 15:39:
>>
>>> I would be happy to put together a proposal based on the input we got here.
>>>
>>> Thanks everyone for your thoughts! I will try to incorporate all of this.
>>>
>>> Thanks, Peter
>>>
>>> Daniel Weeks <dwe...@apache.org> wrote on Tue, May 27, 2025 at 20:07:
>>>
>>>> I feel like we have two different issues we're talking about here that aren't necessarily tied (though solutions may address both): 1) wide tables, 2) adding columns.
>>>>
>>>> Wide tables are definitely a problem where Parquet has limitations. I'm optimistic about the ongoing work that Fokko mentioned to help improve Parquet footers/stats in this area. There are always limits to how this scales, since wide rows lead to small row groups and the cost to reconstitute a row gets more expensive, but for cases that are read-heavy and project subsets of columns it should significantly improve performance.
>>>>
>>>> Adding columns to an existing dataset is something that comes up periodically, but there's a lot of complexity involved in this. Parquet does support referencing columns in separate files per the spec, but there's no implementation that takes advantage of this to my knowledge. This does allow for approaches where you separate/rewrite just the footers or various other tricks, but these approaches get complicated quickly and the number of readers that can consume those representations would initially be very limited.
>>>>
>>>> A larger problem for splitting columns across files is that there are a lot of assumptions about how data is laid out in both readers and writers. For example, aligning row groups and correctly handling split calculation is very complicated if you're trying to split rows across files. Other features are also impacted, like deletes, which reference the file to which they apply and would need to account for deletes applying to multiple files and needing to update those references if columns are added.
>>>>
>>>> I believe there are a lot of interesting approaches to addressing these use cases, but we'd really need a thorough proposal that explores all of these scenarios. The last thing we would want is to introduce incompatibilities within the format that result in incompatible features.
>>>>
>>>> -Dan
>>>>
>>>> On Tue, May 27, 2025 at 10:02 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>
>>>>> Point definitely taken. We really should probably POC some of these ideas and see what we are actually dealing with. (He said without volunteering to do the work :P)
>>>>>
>>>>> On Tue, May 27, 2025 at 11:55 AM Selcuk Aya <selcuk....@snowflake.com.invalid> wrote:
>>>>>
>>>>>> Yes, having to rewrite the whole file is not ideal, but I believe most of the cost of rewriting a file comes from decompression, encoding, stats calculations, etc. If you are adding new values for some columns but are keeping the rest of the columns the same in the file, then a lot of the rewrite cost can be optimized away. I am not saying this is better than writing to a separate file; I am just not sure how much worse it is.
>>>>>>
>>>>>> On Tue, May 27, 2025 at 9:40 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>>
>>>>>>> I think that "after the fact" modification is one of the requirements here, i.e. updating a single column without rewriting the whole file. If we have to write new metadata for the file, aren't we in the same boat as having to rewrite the whole file?
>>>>>>>
>>>>>>> On Tue, May 27, 2025 at 11:27 AM Selcuk Aya <selcuk....@snowflake.com.invalid> wrote:
>>>>>>>
>>>>>>>> If files represent column projections of a table rather than all of the columns in the table, then any read that spans these files needs to identify what constitutes a row. LanceDB, for example, has vertical partitioning across columns but also horizontal partitioning across rows, such that within each horizontal partition (fragment) the same number of rows exists in each vertical partition, which I think is necessary to make whole/partial row construction cheap. If this is the case, there is no reason not to achieve the same data layout inside a single columnar file with a lean header. I think the only valid argument for a separate file is adding a new set of columns to an existing table, but even then I am not sure a separate file is absolutely necessary for good performance.
>>>>>>>>
>>>>>>>> Selcuk
>>>>>>>>
>>>>>>>> On Tue, May 27, 2025 at 9:18 AM Devin Smith <devinsm...@deephaven.io.invalid> wrote:
>>>>>>>>
>>>>>>>>> There's a `file_path` field in the Parquet ColumnChunk structure: https://github.com/apache/parquet-format/blob/apache-parquet-format-2.11.0/src/main/thrift/parquet.thrift#L959-L962
>>>>>>>>>
>>>>>>>>> I'm not sure what tooling actually supports this, though. It could be interesting to see what the history of this is: https://lists.apache.org/thread/rcv1cxndp113shjybfcldh6nq1t3lcq3, https://lists.apache.org/thread/k5nv310yp315fttcz213l8o0vmnd7vyw
>>>>>>>>>
>>>>>>>>> On Tue, May 27, 2025 at 8:59 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> I have to agree that while there can be some fixes in Parquet, we fundamentally need a way to split a "row group" or something like that between separate files. If that's something we can do in the Parquet project, that would be great, but it feels like we need to start exploring more drastic options than footer encoding.
>>>>>>>>>>
>>>>>>>>>> On Mon, May 26, 2025 at 8:42 PM Gang Wu <ust...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I agree with Steven that there are limitations to what Parquet can do.
>>>>>>>>>>>
>>>>>>>>>>> In addition to adding new columns requiring a rewrite of all files, files of wide tables may suffer from poor performance like below:
>>>>>>>>>>> - Poor compression of row groups, because there are so many columns that even a small number of rows can reach the row group threshold.
>>>>>>>>>>> - Dominating columns (e.g. blobs) may contribute 99% of the size of a row group, leading to unbalanced column chunks and deteriorating row group compression.
>>>>>>>>>>> - Similar to adding new columns, a partial update also requires rewriting all columns of the affected rows.
>>>>>>>>>>>
>>>>>>>>>>> IIRC, some table formats already support splitting columns into different files:
>>>>>>>>>>> - The Lance manifest splits a fragment [1] into one or more data files.
>>>>>>>>>>> - Apache Hudi has the concept of a column family [2].
>>>>>>>>>>> - Apache Paimon supports sequence groups [3] for partial update.
>>>>>>>>>>>
>>>>>>>>>>> Although Parquet could introduce the concept of logical files and physical files to manage the column-to-file mapping, this looks like yet another manifest file design, which duplicates the purpose of Iceberg. These might be something worth exploring in Iceberg.
>>>>>>>>>>>
>>>>>>>>>>> [1] https://lancedb.github.io/lance/format.html#fragments
>>>>>>>>>>> [2] https://github.com/apache/hudi/blob/master/rfc/rfc-80/rfc-80.md
>>>>>>>>>>> [3] https://paimon.apache.org/docs/0.9/primary-key-table/merge-engine/partial-update/#sequence-group
>>>>>>>>>>>
>>>>>>>>>>> Best, Gang
>>>>>>>>>>>
>>>>>>>>>>> On Tue, May 27, 2025 at 7:03 AM Steven Wu <stevenz...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> The Parquet metadata proposal (linked by Fokko) mainly addresses read performance problems caused by bloated metadata.
>>>>>>>>>>>>
>>>>>>>>>>>> What Peter described in the description seems useful for some ML feature-engineering workloads. A new set of features/columns is added to the table. Currently, Iceberg would require rewriting all data files to combine the old and new columns (write amplification). Similarly, in the past the community has also talked about the use case of updating a single column, which would likewise require rewriting all data files.
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, May 26, 2025 at 2:42 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Do you have the link at hand for the thread where this was discussed on the Parquet list? The docs seem quite old, and the PR stale, so I would like to understand the situation better. If it is possible to do this in Parquet, that would be great, but Avro and ORC would still suffer.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Amogh Jahagirdar <2am...@gmail.com> wrote on Mon, May 26, 2025 at 22:07:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hey Peter,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for bringing this issue up. I think I agree with Fokko; wide tables leading to Parquet metadata bloat and poor Thrift deserialization performance is a long-standing issue that I believe the community has motivation to address. So to me it seems better to address it in Parquet itself rather than have the Iceberg library facilitate a pattern that works around the limitations.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks, Amogh Jahagirdar
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, May 26, 2025 at 1:42 PM Fokko Driesprong <fo...@apache.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for bringing this up. Wouldn't it make more sense to fix this in Parquet itself? It has been a long-running issue on Parquet, and there is still active interest from the community. There is a PR to replace the footer with FlatBuffers, which dramatically improves performance <https://github.com/apache/arrow/pull/43793>. The underlying proposal can be found here <https://docs.google.com/document/d/1PQpY418LkIDHMFYCY8ne_G-CFpThK15LLpzWYbc7rFU/edit?tab=t.0#heading=h.atbrz9ch6nfa>.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Kind regards, Fokko
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, May 26, 2025 at 20:35, yun zou <yunzou.colost...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> +1, I am really interested in this topic. Performance has always been a problem when dealing with wide tables, not just for reads/writes but also during query compilation. Most ML use cases typically exhibit a vectorized read/write pattern; I am also wondering if there is any way at the metadata level to help the whole compilation and execution process. I do not have any answers for this yet, but I would be really interested in exploring it further.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best Regards, Yun
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, May 26, 2025 at 9:14 AM Pucheng Yang <py...@pinterest.com.invalid> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Peter, I am interested in this proposal. What's more, I am curious whether there is a similar story on the write side as well (how to generate these split files), and specifically, are you targeting feature backfill use cases in ML?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mon, May 26, 2025 at 6:29 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Team,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> In machine learning use cases, it's common to encounter tables with a very high number of columns - sometimes in the range of several thousand. I've seen cases with up to 15,000 columns. Storing such wide tables in a single Parquet file is often suboptimal, as Parquet can become a bottleneck, even when only a subset of columns is queried.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> A common approach to mitigate this is to split the data across multiple Parquet files. With the upcoming File Format API, we could introduce a layer that combines these files into a single iterator, enabling efficient reading of wide and very wide tables.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> To support this, we would need to revise the metadata specification. Instead of the current `_file` column, we could introduce a `_files` column containing:
>>>>>>>>>>>>>>>>>> - `_file_column_ids`: the column IDs present in each file
>>>>>>>>>>>>>>>>>> - `_file_path`: the path to the corresponding file
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Has there been any prior discussion around this idea? Is anyone else interested in exploring this further?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Best regards, Peter
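To make the metadata change in the proposal above concrete, a single logical data-file entry carrying the proposed `_files` column might look something like the Python strawman below. Only the `_file_path` and `_file_column_ids` names come from the proposal; every other field, value, and the helper function are made-up illustrations, not spec.

# Illustrative sketch only: one logical data file split into two physical
# Parquet files, described by the proposed `_files` structure.
logical_data_file = {
    "record_count": 1_000_000,
    "_files": [
        {
            "_file_path": "s3://bucket/table/data/part-00000-base.parquet",
            "_file_column_ids": list(range(1, 501)),    # original feature columns
        },
        {
            "_file_path": "s3://bucket/table/data/part-00000-backfill.parquet",
            "_file_column_ids": list(range(501, 521)),  # newly added features
        },
    ],
}

def files_for_projection(entry, projected_ids):
    """Pick the physical files a reader would need to stitch the projected columns."""
    return [
        f["_file_path"]
        for f in entry["_files"]
        if set(projected_ids) & set(f["_file_column_ids"])
    ]

# A query touching only the backfilled columns would open a single physical file.
print(files_for_projection(logical_data_file, projected_ids=[510, 515]))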
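On the Parquet side, the ColumnChunk `file_path` field that Devin points to earlier in the thread can be inspected today through pyarrow's metadata API; a small check along these lines (the file name is hypothetical) suggests that files produced by common writers typically leave it unset, since the chunks live in the same file as the footer.

import pyarrow.parquet as pq

# Walk every column chunk and print its file_path, which is normally empty/None
# because the chunk data sits in the same file as the metadata.
md = pq.ParquetFile("part-00000.parquet").metadata  # hypothetical local file
for rg in range(md.num_row_groups):
    for col in range(md.num_columns):
        chunk = md.row_group(rg).column(col)
        print(rg, chunk.path_in_schema, repr(chunk.file_path))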