I would be happy to put together a proposal based on the input gathered here.

Thanks everyone for your thoughts!
I will try to incorporate all of this.

Thanks, Peter

On Tue, May 27, 2025 at 8:07 PM Daniel Weeks <dwe...@apache.org> wrote:

> I feel like we have two different issues we're talking about here that
> aren't necessarily tied (though solutions may address both): 1) wide
> tables, 2) adding columns
>
> Wide tables are definitely a problem where Parquet has limitations. I'm
> optimistic about the ongoing work to help improve Parquet footers/stats in
> this area that Fokko mentioned. There are always limits to how this
> scales, as wide rows lead to small row groups and the cost to reconstitute
> a row gets more expensive, but for read-heavy cases that project subsets
> of columns this should significantly improve performance.
>
> Adding columns to an existing dataset is something that comes up
> periodically, but there's a lot of complexity involved in this.  Parquet
> does support referencing columns in separate files per the spec, but
> there's no implementation that takes advantage of this to my knowledge.
> This does allow for approaches where you separate/rewrite just the footers
> or use various other tricks, but these approaches get complicated quickly and
> the number of readers that can consume those representations would
> initially be very limited.
>
> A larger problem for splitting columns across files is that there are a
> lot of assumptions about how data is laid out in both readers and writers.
> For example, aligning row groups and correctly handling split calculation
> is very complicated if you're trying to split rows across files.  Other
> features are also impacted, like deletes, which reference the file to which
> they apply: they would need to account for deletes applying to multiple
> files and for updating those references when columns are added.
>
> I believe there are a lot of interesting approaches to addressing these
> use cases, but we'd really need a thorough proposal that explores all of
> these scenarios.  The last thing we would want is to introduce
> incompatibilities within the format that result in incompatible features.
>
> -Dan
>
> On Tue, May 27, 2025 at 10:02 AM Russell Spitzer <
> russell.spit...@gmail.com> wrote:
>
>> Point definitely taken. We really should probably POC some of these ideas
>> and see what we are actually dealing with. (He said without volunteering to
>> do the work :P)
>>
>> On Tue, May 27, 2025 at 11:55 AM Selcuk Aya
>> <selcuk....@snowflake.com.invalid> wrote:
>>
>>> Yes, having to rewrite the whole file is not ideal, but I believe most of
>>> the cost of rewriting a file comes from decompression, encoding, stats
>>> calculations, etc. If you are adding new values for some columns but are
>>> keeping the rest of the columns the same in the file, then much of the
>>> rewrite cost can be optimized away. I am not saying this is better than
>>> writing to a separate file; I am just not sure how much worse it is.
>>>
>>> On Tue, May 27, 2025 at 9:40 AM Russell Spitzer <
>>> russell.spit...@gmail.com> wrote:
>>>
>>>> I think that "after the fact" modification is one of the requirements
>>>> here, i.e., updating a single column without rewriting the whole file.
>>>> If we have to write new metadata for the file, aren't we in the same
>>>> boat as having to rewrite the whole file?
>>>>
>>>> On Tue, May 27, 2025 at 11:27 AM Selcuk Aya
>>>> <selcuk....@snowflake.com.invalid> wrote:
>>>>
>>>>> If files represent column projections of a table rather than all of the
>>>>> columns in the table, then any read that spans these files needs to
>>>>> identify what constitutes a row. Lance DB, for example, has vertical
>>>>> partitioning across columns but also horizontal partitioning across rows,
>>>>> such that within each horizontal partition (fragment) the same number of
>>>>> rows exists in each vertical partition, which I think is necessary to make
>>>>> whole/partial row construction cheap. If this is the case, there is no
>>>>> reason not to achieve the same data layout inside a single columnar file
>>>>> with a lean header. I think the only valid argument for a separate file is
>>>>> adding a new set of columns to an existing table, but even then I am not
>>>>> sure a separate file is absolutely necessary for good performance.
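>>>>>
>>>>> To make the alignment requirement concrete, here is a rough sketch
>>>>> (assuming pyarrow and made-up file names) of stitching two vertical
>>>>> partitions of the same fragment back into rows; it only works because
>>>>> both files carry the same rows in the same order:
>>>>>
>>>>> import pyarrow.parquet as pq
>>>>>
>>>>> # Two hypothetical vertical partitions of one fragment: one file holds
>>>>> # columns a/b, the other holds columns c/d, in identical row order.
>>>>> left = pq.read_table("fragment0_cols_ab.parquet")
>>>>> right = pq.read_table("fragment0_cols_cd.parquet")
>>>>>
>>>>> # The alignment invariant: both partitions expose exactly the same rows.
>>>>> assert left.num_rows == right.num_rows
>>>>>
>>>>> # Row reconstruction is then a cheap column concatenation, not a join.
>>>>> combined = left
>>>>> for name, column in zip(right.column_names, right.columns):
>>>>>     combined = combined.append_column(name, column)
>>>>>
>>>>> Without that shared row count and ordering, the stitch would need a join
>>>>> key, which is exactly the cost the fragment layout is designed to avoid.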
>>>>>
>>>>> Selcuk
>>>>>
>>>>> On Tue, May 27, 2025 at 9:18 AM Devin Smith
>>>>> <devinsm...@deephaven.io.invalid> wrote:
>>>>>
>>>>>> There's a `file_path` field in the parquet ColumnChunk structure,
>>>>>> https://github.com/apache/parquet-format/blob/apache-parquet-format-2.11.0/src/main/thrift/parquet.thrift#L959-L962
>>>>>>
>>>>>> I'm not sure what tooling actually supports this though. Could be
>>>>>> interesting to see what the history of this is.
>>>>>> https://lists.apache.org/thread/rcv1cxndp113shjybfcldh6nq1t3lcq3,
>>>>>> https://lists.apache.org/thread/k5nv310yp315fttcz213l8o0vmnd7vyw
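>>>>>>
>>>>>> For anyone who wants to poke at it, here is a rough sketch (assuming
>>>>>> pyarrow and a made-up file name) of checking whether that field is
>>>>>> actually populated in a given file:
>>>>>>
>>>>>> import pyarrow.parquet as pq
>>>>>>
>>>>>> # Walk the footer and print the file_path recorded for each column
>>>>>> # chunk. In practice it is almost always empty, meaning the chunk is
>>>>>> # stored in the same file as the footer.
>>>>>> md = pq.ParquetFile("example.parquet").metadata
>>>>>> for rg in range(md.num_row_groups):
>>>>>>     for col in range(md.num_columns):
>>>>>>         chunk = md.row_group(rg).column(col)
>>>>>>         print(rg, chunk.path_in_schema, repr(chunk.file_path))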
>>>>>>
>>>>>> On Tue, May 27, 2025 at 8:59 AM Russell Spitzer <
>>>>>> russell.spit...@gmail.com> wrote:
>>>>>>
>>>>>>> I have to agree that while there can be some fixes in Parquet, we
>>>>>>> fundamentally need a way to split a "row group"
>>>>>>> or something like that between separate files. If that's
>>>>>>> something we can do in the Parquet project, that would be great,
>>>>>>> but it feels like we need to start exploring more drastic options
>>>>>>> than footer encoding.
>>>>>>>
>>>>>>> On Mon, May 26, 2025 at 8:42 PM Gang Wu <ust...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I agree with Steven that there are things that Parquet simply cannot
>>>>>>>> do.
>>>>>>>>
>>>>>>>> In addition to adding new columns requiring all files to be rewritten,
>>>>>>>> files of wide tables may suffer from poor performance, for example:
>>>>>>>> - Poor compression of row groups, because there are so many columns
>>>>>>>> that even a small number of rows can reach the row group threshold.
>>>>>>>> - Dominating columns (e.g. blobs) may contribute 99% of a row group's
>>>>>>>> size, leading to unbalanced column chunks and deteriorating row
>>>>>>>> group compression.
>>>>>>>> - Similar to adding new columns, a partial update also requires
>>>>>>>> rewriting all columns of the affected rows.
>>>>>>>>
>>>>>>>> IIRC, some table formats already support splitting columns into
>>>>>>>> different files:
>>>>>>>> - Lance manifest splits a fragment [1] into one or more data files.
>>>>>>>> - Apache Hudi has the concept of column family [2].
>>>>>>>> - Apache Paimon supports sequence groups [3] for partial update.
>>>>>>>>
>>>>>>>> Although Parquet could introduce the concept of logical and physical
>>>>>>>> files to manage the column-to-file mapping, this looks like yet
>>>>>>>> another manifest file design, which duplicates the purpose of Iceberg.
>>>>>>>> This might be something worth exploring in Iceberg instead.
>>>>>>>>
>>>>>>>> [1] https://lancedb.github.io/lance/format.html#fragments
>>>>>>>> [2] https://github.com/apache/hudi/blob/master/rfc/rfc-80/rfc-80.md
>>>>>>>> [3]
>>>>>>>> https://paimon.apache.org/docs/0.9/primary-key-table/merge-engine/partial-update/#sequence-group
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Gang
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, May 27, 2025 at 7:03 AM Steven Wu <stevenz...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> The Parquet metadata proposal (linked by Fokko) mainly addresses
>>>>>>>>> read performance problems caused by bloated metadata.
>>>>>>>>>
>>>>>>>>> What Peter described in the description seems useful for some ML
>>>>>>>>> feature-engineering workloads, where a new set of features/columns is
>>>>>>>>> added to the table. Currently, Iceberg would require rewriting all
>>>>>>>>> data files to combine the old and new columns (write amplification).
>>>>>>>>> Similarly, in the past the community also talked about the use case
>>>>>>>>> of updating a single column, which would require rewriting all data
>>>>>>>>> files.
>>>>>>>>>
>>>>>>>>> On Mon, May 26, 2025 at 2:42 PM Péter Váry <
>>>>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Do you have the link at hand for the thread where this was
>>>>>>>>>> discussed on the Parquet list?
>>>>>>>>>> The docs seem quite old, and the PR stale, so I would like to
>>>>>>>>>> understand the situation better.
>>>>>>>>>> If it is possible to do this in Parquet, that would be great, but
>>>>>>>>>> Avro and ORC would still suffer.
>>>>>>>>>>
>>>>>>>>>> On Mon, May 26, 2025 at 10:07 PM Amogh Jahagirdar <2am...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey Peter,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for bringing this issue up. I think I agree with Fokko;
>>>>>>>>>>> wide tables leading to Parquet metadata bloat and poor Thrift
>>>>>>>>>>> deserialization performance is a long-standing issue that I believe
>>>>>>>>>>> there is motivation in the community to address. So to me it seems
>>>>>>>>>>> better to address it in Parquet itself rather than have the Iceberg
>>>>>>>>>>> library facilitate a pattern that works around the limitations.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Amogh Jahagirdar
>>>>>>>>>>>
>>>>>>>>>>> On Mon, May 26, 2025 at 1:42 PM Fokko Driesprong <
>>>>>>>>>>> fo...@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for bringing this up. Wouldn't it make more sense to fix
>>>>>>>>>>>> this in Parquet itself? It has been a long-running issue in
>>>>>>>>>>>> Parquet, and there is still active interest from the community.
>>>>>>>>>>>> There is a PR to replace the footer with FlatBuffers, which
>>>>>>>>>>>> dramatically improves performance
>>>>>>>>>>>> <https://github.com/apache/arrow/pull/43793>. The underlying
>>>>>>>>>>>> proposal can be found here
>>>>>>>>>>>> <https://docs.google.com/document/d/1PQpY418LkIDHMFYCY8ne_G-CFpThK15LLpzWYbc7rFU/edit?tab=t.0#heading=h.atbrz9ch6nfa>.
>>>>>>>>>>>>
>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>> Fokko
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, May 26, 2025 at 8:35 PM yun zou <
>>>>>>>>>>>> yunzou.colost...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> +1, I am really interested in this topic. Performance has
>>>>>>>>>>>>> always been a problem when dealing with wide tables, not just for
>>>>>>>>>>>>> read/write but also during compilation. Most of the ML use cases
>>>>>>>>>>>>> typically exhibit a vectorized read/write pattern; I am also
>>>>>>>>>>>>> wondering if there is any way at the metadata level to help the
>>>>>>>>>>>>> whole compilation and execution process. I do not have any answer
>>>>>>>>>>>>> for this yet, but I would be really interested in exploring this
>>>>>>>>>>>>> further.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>> Yun
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, May 26, 2025 at 9:14 AM Pucheng Yang
>>>>>>>>>>>>> <py...@pinterest.com.invalid> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Peter, I am interested in this proposal. What's more, I am
>>>>>>>>>>>>>> curious whether there is a similar story on the write side as
>>>>>>>>>>>>>> well (how to generate these split files), and specifically, are
>>>>>>>>>>>>>> you targeting feature backfill use cases in ML?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, May 26, 2025 at 6:29 AM Péter Váry <
>>>>>>>>>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Team,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In machine learning use-cases, it's common to encounter
>>>>>>>>>>>>>>> tables with a very high number of columns - sometimes even in 
>>>>>>>>>>>>>>> the range of
>>>>>>>>>>>>>>> several thousand. I've seen cases with up to 15,000 columns. 
>>>>>>>>>>>>>>> Storing such
>>>>>>>>>>>>>>> wide tables in a single Parquet file is often suboptimal, as 
>>>>>>>>>>>>>>> Parquet can
>>>>>>>>>>>>>>> become a bottleneck, even when only a subset of columns is 
>>>>>>>>>>>>>>> queried.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> A common approach to mitigate this is to split the data
>>>>>>>>>>>>>>> across multiple Parquet files. With the upcoming File Format 
>>>>>>>>>>>>>>> API, we could
>>>>>>>>>>>>>>> introduce a layer that combines these files into a single 
>>>>>>>>>>>>>>> iterator,
>>>>>>>>>>>>>>> enabling efficient reading of wide and very wide tables.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> To support this, we would need to revise the metadata
>>>>>>>>>>>>>>> specification. Instead of the current `_file` column, we could 
>>>>>>>>>>>>>>> introduce a
>>>>>>>>>>>>>>> _files column containing:
>>>>>>>>>>>>>>> - `_file_column_ids`: the column IDs present in each file
>>>>>>>>>>>>>>> - `_file_path`: the path to the corresponding file
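>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> As a strawman only (the field names are just the ones proposed
>>>>>>>>>>>>>>> above, nothing is specified yet), planning a projected scan over
>>>>>>>>>>>>>>> such an entry could look roughly like this:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> # Hypothetical _files metadata for one logical data file that is
>>>>>>>>>>>>>>> # physically split into a base file and a later "added columns" file.
>>>>>>>>>>>>>>> files = [
>>>>>>>>>>>>>>>     {"_file_path": "data/base-00000.parquet", "_file_column_ids": [1, 2, 3]},
>>>>>>>>>>>>>>>     {"_file_path": "data/extra-00000.parquet", "_file_column_ids": [4, 5]},
>>>>>>>>>>>>>>> ]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> def files_for_projection(projected_ids):
>>>>>>>>>>>>>>>     # Open only the physical files holding at least one requested column.
>>>>>>>>>>>>>>>     return [f["_file_path"]
>>>>>>>>>>>>>>>             for f in files
>>>>>>>>>>>>>>>             if set(f["_file_column_ids"]) & set(projected_ids)]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> print(files_for_projection([2, 5]))  # both files are needed
>>>>>>>>>>>>>>> print(files_for_projection([4]))     # only the newly added file is read
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The combined iterator would then zip the row-aligned readers of the
>>>>>>>>>>>>>>> selected files back into single rows.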
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Has there been any prior discussion around this idea?
>>>>>>>>>>>>>>> Is anyone else interested in exploring this further?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
