Bryan, interesting approach to split horizontally across multiple tables.

A few potential downsides:
* Operational overhead: the tables need to be managed consistently, and
probably in a coordinated way.
* More complex reads.
* Possibly fragile correctness enforcement during the join. It is more robust
to enforce stitching correctness at the file-group level in the file reader
and writer, if that is built into the table format (see the sketch right
after this list).
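
To make that last point concrete, here is a rough sketch in plain Java of the
kind of invariant a reader or writer could enforce if the table format tracked
which column-split files form one file group. All of the type and field names
below are invented for illustration; nothing like this exists in Iceberg today.

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical descriptor for one physical file holding a subset of columns.
class SplitFile {
  final String path;
  final List<Integer> columnIds; // column IDs stored in this file
  final long rowCount;           // number of rows in this file

  SplitFile(String path, List<Integer> columnIds, long rowCount) {
    this.path = path;
    this.columnIds = columnIds;
    this.rowCount = rowCount;
  }
}

// Hypothetical group of files that together hold all columns for the same rows.
class FileGroup {
  final List<SplitFile> files;

  FileGroup(List<SplitFile> files) {
    if (files.isEmpty()) {
      throw new IllegalArgumentException("A file group needs at least one file");
    }
    this.files = files;
  }

  // A reader or writer can cheaply verify that the files can be stitched back
  // into rows: identical row counts and no column stored in more than one file.
  void validateStitching() {
    long expectedRows = files.get(0).rowCount;
    Set<Integer> seenColumns = new HashSet<>();
    for (SplitFile file : files) {
      if (file.rowCount != expectedRows) {
        throw new IllegalStateException("Row count mismatch in group: " + file.path);
      }
      for (Integer columnId : file.columnIds) {
        if (!seenColumns.add(columnId)) {
          throw new IllegalStateException(
              "Column " + columnId + " stored in more than one file: " + file.path);
        }
      }
    }
  }
}

With the join/view approach, an equivalent guarantee (same rows, same ordering,
no duplicate or missing keys) has to be re-established at query time instead,
which is the fragility I am worried about.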

> fewer commit conflicts

Can you elaborate on this one? Are those tables populated by streaming or
batch pipelines?

On Thu, May 29, 2025 at 5:03 PM Bryan Keller <brya...@gmail.com> wrote:

> Hi everyone,
>
> We have been investigating a wide table format internally for a similar
> use case, i.e. we have wide ML tables with features generated by different
> pipelines and teams but want a unified view of the data. We are comparing
> that against separate tables joined together using a shuffle-less join
> (e.g. storage partition join), along with a corresponding view.
>
> The join/view approach seems to give us much of what we need, with some
> added benefits like splitting up the metadata, fewer commit conflicts, and
> the ability to share, nest, and swap "column families". The downsides are
> that table management is split across multiple tables, it requires engine
> support for shuffle-less joins for best performance, and even then, scans
> probably won't be as optimal.
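
(Interjecting here to make sure I read the setup right: below is a rough
sketch, in Spark SQL driven from Java, of the join/view approach Bryan
describes. All table, column, and view names are invented, and the two config
keys are the ones I believe enable storage-partitioned joins on recent
Spark/Iceberg versions; please double-check them against the docs for your
versions.)

import org.apache.spark.sql.SparkSession;

public class FeatureViewSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("feature-view-sketch")
        // Assumed configs for shuffle-less (storage-partitioned) joins.
        .config("spark.sql.sources.v2.bucketing.enabled", "true")
        .config("spark.sql.iceberg.planning.preserve-data-grouping", "true")
        .getOrCreate();

    // Two "column family" tables owned by different pipelines, partitioned
    // identically so a join on the keys can avoid a shuffle.
    spark.sql("CREATE TABLE IF NOT EXISTS db.features_a ("
        + "entity_id BIGINT, ds DATE, f1 DOUBLE, f2 DOUBLE) "
        + "USING iceberg PARTITIONED BY (ds, bucket(128, entity_id))");
    spark.sql("CREATE TABLE IF NOT EXISTS db.features_b ("
        + "entity_id BIGINT, ds DATE, f3 DOUBLE, f4 DOUBLE) "
        + "USING iceberg PARTITIONED BY (ds, bucket(128, entity_id))");

    // The unified wide "table" is just a view over the join.
    spark.sql("CREATE OR REPLACE VIEW db.features_all AS "
        + "SELECT a.entity_id, a.ds, a.f1, a.f2, b.f3, b.f4 "
        + "FROM db.features_a a "
        + "JOIN db.features_b b ON a.entity_id = b.entity_id AND a.ds = b.ds");
  }
}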
>
> I'm curious if anyone had further thoughts on the two?
>
> -Bryan
>
>
>
> On May 29, 2025, at 8:18 AM, Péter Váry <peter.vary.apa...@gmail.com>
> wrote:
>
> I received feedback from Alkis regarding their Parquet optimization work.
> Their internal testing shows promising results for reducing metadata size
> and improving parsing performance. They plan to formalize a proposal for
> these Parquet enhancements in the near future.
>
> Meanwhile, I'm putting together our horizontal sharding proposal as a
> complementary approach. Even with the Parquet metadata improvements,
> horizontal sharding would provide additional benefits for:
>
>    - More efficient column-level updates
>    - Streamlined column additions
>    - Better handling of dominant columns that can cause RowGroup size
>    imbalances (placing these in separate files could significantly improve
>    performance)
>
> Thanks, Peter
>
>
>
> Péter Váry <peter.vary.apa...@gmail.com> wrote (on Wed, 28 May 2025 at
> 15:39):
>
>> I would be happy to put together a proposal based on the inputs I got here.
>>
>> Thanks everyone for your thoughts!
>> I will try to incorporate all of this.
>>
>> Thanks, Peter
>>
>> Daniel Weeks <dwe...@apache.org> wrote (on Tue, 27 May 2025 at 20:07):
>>
>>> I feel like we have two different issues we're talking about here that
>>> aren't necessarily tied (though solutions may address both): 1) wide
>>> tables, 2) adding columns
>>>
>>> Wide tables are definitely a problem where parquet has limitations. I'm
>>> optimistic about the ongoing work to help improve parquet footers/stats in
>>> this area that Fokko mentioned. There are always limitations in how this
>>> scales, as wide rows lead to small row groups and the cost to reconstitute
>>> a row gets more expensive, but for cases that are read heavy and project
>>> subsets of columns, it should significantly improve performance.
>>>
>>> Adding columns to an existing dataset is something that comes up
>>> periodically, but there's a lot of complexity involved in this.  Parquet
>>> does support referencing columns in separate files per the spec, but
>>> there's no implementation that takes advantage of this to my knowledge.
>>> This does allow for approaches where you separate/rewrite just the footers
>>> or various other tricks, but these approaches get complicated quickly and
>>> the number of readers that can consume those representations would
>>> initially be very limited.
>>>
>>> A larger problem for splitting columns across files is that there are a
>>> lot of assumptions about how data is laid out in both readers and writers.
>>> For example, aligning row groups and correctly handling split calculation
>>> is very complicated if you're trying to split rows across files. Other
>>> features are also impacted, like deletes, which reference the file to which
>>> they apply; these would need to account for deletes applying to multiple
>>> files and for updating those references if columns are added.
>>>
>>> I believe there are a lot of interesting approaches to addressing these
>>> use cases, but we'd really need a thorough proposal that explores all of
>>> these scenarios.  The last thing we would want is to introduce
>>> incompatibilities within the format that result in incompatible features.
>>>
>>> -Dan
>>>
>>> On Tue, May 27, 2025 at 10:02 AM Russell Spitzer <
>>> russell.spit...@gmail.com> wrote:
>>>
>>>> Point definitely taken. We really should probably POC some of
>>>> these ideas and see what we are actually dealing with. (He said without
>>>> volunteering to do the work :P)
>>>>
>>>> On Tue, May 27, 2025 at 11:55 AM Selcuk Aya
>>>> <selcuk....@snowflake.com.invalid> wrote:
>>>>
>>>>> Yes having to rewrite the whole file is not ideal but I believe most
>>>>> of the cost of rewriting a file comes from decompression, encoding, stats
>>>>> calculations etc. If you are adding new values for some columns but are
>>>>> keeping the rest of the columns the same in the file, then a bunch of
>>>>> rewrite cost can be optimized away. I am not saying this is better than
>>>>> writing to a separate file, I am not sure how much worse it is though.
>>>>>
>>>>> On Tue, May 27, 2025 at 9:40 AM Russell Spitzer <
>>>>> russell.spit...@gmail.com> wrote:
>>>>>
>>>>>> I think that "after the fact" modification is one of the requirements
>>>>>> here, IE: Updating a single column without rewriting the whole file.
>>>>>> If we have to write new metadata for the file aren't we in the same
>>>>>> boat as having to rewrite the whole file?
>>>>>>
>>>>>> On Tue, May 27, 2025 at 11:27 AM Selcuk Aya
>>>>>> <selcuk....@snowflake.com.invalid> wrote:
>>>>>>
>>>>>>> If files represent column projections of a table rather than the
>>>>>>> full set of columns in the table, then any read that reads across these files
>>>>>>> needs to identify what constitutes a row. Lance DB for example has 
>>>>>>> vertical
>>>>>>> partitioning across columns but also horizontal partitioning across rows
>>>>>>> such that in each horizontal partition (fragment), the same number of
>>>>>>> rows exists in each vertical partition, which I think is necessary to 
>>>>>>> make
>>>>>>> whole/partial row construction cheap. If this is the case, there is no
>>>>>>> reason not to achieve the same data layout inside a single columnar file
>>>>>>> with a lean header. I think the only valid argument for a separate file 
>>>>>>> is
>>>>>>> adding a new set of columns to an existing table, but even then I am not
>>>>>>> sure a separate file is absolutely necessary for good performance.
>>>>>>>
>>>>>>> Selcuk
>>>>>>>
>>>>>>> On Tue, May 27, 2025 at 9:18 AM Devin Smith
>>>>>>> <devinsm...@deephaven.io.invalid> wrote:
>>>>>>>
>>>>>>>> There's a `file_path` field in the parquet ColumnChunk structure,
>>>>>>>> https://github.com/apache/parquet-format/blob/apache-parquet-format-2.11.0/src/main/thrift/parquet.thrift#L959-L962
>>>>>>>>
>>>>>>>> I'm not sure what tooling actually supports this though. Could be
>>>>>>>> interesting to see what the history of this is.
>>>>>>>> https://lists.apache.org/thread/rcv1cxndp113shjybfcldh6nq1t3lcq3,
>>>>>>>> https://lists.apache.org/thread/k5nv310yp315fttcz213l8o0vmnd7vyw
>>>>>>>>
>>>>>>>> On Tue, May 27, 2025 at 8:59 AM Russell Spitzer <
>>>>>>>> russell.spit...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I have to agree that while there can be some fixes in Parquet, we
>>>>>>>>> fundamentally need a way to split a "row group"
>>>>>>>>> or something like that between separate files. If that's
>>>>>>>>> something we can do in the parquet project that would be great
>>>>>>>>> but it feels like we need to start exploring more drastic options
>>>>>>>>> than footer encoding.
>>>>>>>>>
>>>>>>>>> On Mon, May 26, 2025 at 8:42 PM Gang Wu <ust...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> I agree with Steven that there are limitations that Parquet
>>>>>>>>>> cannot do.
>>>>>>>>>>
>>>>>>>>>> In addition to adding new columns by rewriting all files, files
>>>>>>>>>> of wide tables may suffer from performance problems like the ones below:
>>>>>>>>>> - Poor compression of row groups because there are too many
>>>>>>>>>> columns and even a small number of rows can reach the row group 
>>>>>>>>>> threshold.
>>>>>>>>>> - Dominating columns (e.g. blobs) may contribute 99% of the size of
>>>>>>>>>> a row group, leading to unbalanced column chunks and deteriorating
>>>>>>>>>> the row group compression.
>>>>>>>>>> - Similar to adding new columns, partial update also requires
>>>>>>>>>> rewriting all columns of the affected rows.
>>>>>>>>>>
>>>>>>>>>> IIRC, some table formats already support splitting columns into
>>>>>>>>>> different files:
>>>>>>>>>> - Lance manifest splits a fragment [1] into one or more data
>>>>>>>>>> files.
>>>>>>>>>> - Apache Hudi has the concept of column family [2].
>>>>>>>>>> - Apache Paimon supports sequence groups [3] for partial update.
>>>>>>>>>>
>>>>>>>>>> Although Parquet could introduce the concept of logical and
>>>>>>>>>> physical files to manage the column-to-file mapping, this looks like
>>>>>>>>>> yet another manifest file design which duplicates the purpose of
>>>>>>>>>> Iceberg.
>>>>>>>>>> These might be something worth exploring in Iceberg.
>>>>>>>>>>
>>>>>>>>>> [1] https://lancedb.github.io/lance/format.html#fragments
>>>>>>>>>> [2]
>>>>>>>>>> https://github.com/apache/hudi/blob/master/rfc/rfc-80/rfc-80.md
>>>>>>>>>> [3]
>>>>>>>>>> https://paimon.apache.org/docs/0.9/primary-key-table/merge-engine/partial-update/#sequence-group
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Gang
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, May 27, 2025 at 7:03 AM Steven Wu <stevenz...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> The Parquet metadata proposal (linked by Fokko) is mainly
>>>>>>>>>>> addressing the read performance due to bloated metadata.
>>>>>>>>>>>
>>>>>>>>>>> What Peter described in the description seems useful for some ML
>>>>>>>>>>> feature engineering workloads. A new set of features/columns is
>>>>>>>>>>> added to
>>>>>>>>>>> the table. Currently, Iceberg  would require rewriting all data 
>>>>>>>>>>> files to
>>>>>>>>>>> combine old and new columns (write amplification). Similarly, in 
>>>>>>>>>>> the past
>>>>>>>>>>> the community also talked about the use cases of updating a single 
>>>>>>>>>>> column,
>>>>>>>>>>> which would require rewriting all data files.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, May 26, 2025 at 2:42 PM Péter Váry <
>>>>>>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Do you have the link at hand for the thread where this was
>>>>>>>>>>>> discussed on the Parquet list?
>>>>>>>>>>>> The docs seem quite old, and the PR stale, so I would like to
>>>>>>>>>>>> understand the situation better.
>>>>>>>>>>>> If it is possible to do this in Parquet, that would be great,
>>>>>>>>>>>> but Avro, ORC would still suffer.
>>>>>>>>>>>>
>>>>>>>>>>>> Amogh Jahagirdar <2am...@gmail.com> wrote (on Mon, 26 May 2025
>>>>>>>>>>>> at 22:07):
>>>>>>>>>>>>
>>>>>>>>>>>>> Hey Peter,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for bringing this issue up. I think I agree with Fokko;
>>>>>>>>>>>>> the issue of wide tables leading to Parquet metadata bloat and
>>>>>>>>>>>>> poor Thrift deserialization performance is a long-standing one
>>>>>>>>>>>>> that I believe there's motivation in the community to address. So
>>>>>>>>>>>>> to me it seems better to address it in Parquet itself rather than
>>>>>>>>>>>>> have the Iceberg library facilitate a pattern which works around
>>>>>>>>>>>>> the limitations.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Amogh Jahagirdar
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, May 26, 2025 at 1:42 PM Fokko Driesprong <
>>>>>>>>>>>>> fo...@apache.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for bringing this up. Wouldn't it make more sense to
>>>>>>>>>>>>>> fix this in Parquet itself? It has been a long-running issue on 
>>>>>>>>>>>>>> Parquet,
>>>>>>>>>>>>>> and there is still active interest from the community. There is 
>>>>>>>>>>>>>> a PR to
>>>>>>>>>>>>>> replace the footer with FlatBuffers, which dramatically
>>>>>>>>>>>>>> improves performance
>>>>>>>>>>>>>> <https://github.com/apache/arrow/pull/43793>. The underlying
>>>>>>>>>>>>>> proposal can be found here
>>>>>>>>>>>>>> <https://docs.google.com/document/d/1PQpY418LkIDHMFYCY8ne_G-CFpThK15LLpzWYbc7rFU/edit?tab=t.0#heading=h.atbrz9ch6nfa>
>>>>>>>>>>>>>> .
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>> Fokko
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, 26 May 2025 at 20:35, yun zou <
>>>>>>>>>>>>>> yunzou.colost...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> +1, I am really interested in this topic. Performance has
>>>>>>>>>>>>>>> always been a problem when dealing with wide tables, not just 
>>>>>>>>>>>>>>> read/write,
>>>>>>>>>>>>>>> but also during compilation. Most of the ML use cases typically 
>>>>>>>>>>>>>>> exhibit a
>>>>>>>>>>>>>>> vectorized read/write pattern; I am also wondering if there is
>>>>>>>>>>>>>>> any way at the metadata level to help the whole compilation and
>>>>>>>>>>>>>>> execution process. I do not have any answer for this yet, but I
>>>>>>>>>>>>>>> would be really
>>>>>>>>>>>>>>> interested in
>>>>>>>>>>>>>>> exploring this further.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>>> Yun
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, May 26, 2025 at 9:14 AM Pucheng Yang
>>>>>>>>>>>>>>> <py...@pinterest.com.invalid> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Peter, I am interested in this proposal. What's more, I
>>>>>>>>>>>>>>>> am curious if there is a similar story on the write side as 
>>>>>>>>>>>>>>>> well (how to
>>>>>>>>>>>>>>>> generate these split files) and specifically, are you
>>>>>>>>>>>>>>>> targeting feature
>>>>>>>>>>>>>>>> backfill use cases in ML use?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, May 26, 2025 at 6:29 AM Péter Váry <
>>>>>>>>>>>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Team,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> In machine learning use-cases, it's common to encounter
>>>>>>>>>>>>>>>>> tables with a very high number of columns - sometimes even in 
>>>>>>>>>>>>>>>>> the range of
>>>>>>>>>>>>>>>>> several thousand. I've seen cases with up to 15,000 columns. 
>>>>>>>>>>>>>>>>> Storing such
>>>>>>>>>>>>>>>>> wide tables in a single Parquet file is often suboptimal, as 
>>>>>>>>>>>>>>>>> Parquet can
>>>>>>>>>>>>>>>>> become a bottleneck, even when only a subset of columns is 
>>>>>>>>>>>>>>>>> queried.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> A common approach to mitigate this is to split the data
>>>>>>>>>>>>>>>>> across multiple Parquet files. With the upcoming File Format 
>>>>>>>>>>>>>>>>> API, we could
>>>>>>>>>>>>>>>>> introduce a layer that combines these files into a single 
>>>>>>>>>>>>>>>>> iterator,
>>>>>>>>>>>>>>>>> enabling efficient reading of wide and very wide tables.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> To support this, we would need to revise the metadata
>>>>>>>>>>>>>>>>> specification. Instead of the current `_file` column, we 
>>>>>>>>>>>>>>>>> could introduce a
>>>>>>>>>>>>>>>>> `_files` column containing:
>>>>>>>>>>>>>>>>> - `_file_column_ids`: the column IDs present in each file
>>>>>>>>>>>>>>>>> - `_file_path`: the path to the corresponding file
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Has there been any prior discussion around this idea?
>>>>>>>>>>>>>>>>> Is anyone else interested in exploring this further?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>
