I received feedback from Alkis regarding their Parquet optimization work.
Their internal testing shows promising results for reducing metadata size
and improving parsing performance. They plan to formalize a proposal for
these Parquet enhancements in the near future.

Meanwhile, I'm putting together our horizontal sharding proposal as a
complementary approach. Even with the Parquet metadata improvements,
horizontal sharding would provide additional benefits for:

   - More efficient column-level updates
   - Streamlined column additions
   - Better handling of dominant columns that can cause row group size
   imbalances (placing these in separate files could significantly improve
   performance; see the rough sketch below)
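
To make this a bit more concrete while the proposal is still being written
up, here is a rough sketch in plain Java. It only illustrates the shape
discussed further down in this thread (a logical data file made up of
several physical files, each holding a subset of the column IDs, plus the
lookup a reader would do when projecting columns); none of the class or
field names below exist in Iceberg or Parquet today.

// Hypothetical sketch only -- the names are made up for illustration and
// are not part of any Iceberg or Parquet API.
import java.util.*;

public class ColumnSplitSketch {

  // One physical Parquet file holding a subset of the table's column IDs
  // (roughly the proposed `_file_path` + `_file_column_ids` pair).
  record FileSlice(String filePath, Set<Integer> columnIds) {}

  // One logical data file: several physical slices sharing the same rows.
  record FileGroup(List<FileSlice> slices) {
    // Return only the slices a reader has to open for a given projection.
    List<FileSlice> slicesFor(Set<Integer> projectedColumnIds) {
      List<FileSlice> needed = new ArrayList<>();
      for (FileSlice slice : slices) {
        if (!Collections.disjoint(slice.columnIds(), projectedColumnIds)) {
          needed.add(slice);
        }
      }
      return needed;
    }
  }

  public static void main(String[] args) {
    // A wide table split into a "narrow" slice and a dominant blob column.
    FileGroup group = new FileGroup(List.of(
        new FileSlice("data/part-0-narrow.parquet", Set.of(1, 2, 3, 4)),
        new FileSlice("data/part-0-blob.parquet", Set.of(5))));

    // Projecting columns 2 and 3 touches only the narrow file; the blob
    // file (or a newly added column's file) is never opened or rewritten.
    System.out.println(group.slicesFor(Set.of(2, 3)));
  }
}

The point is simply that dominant or newly added columns live in their own
files, so reads that do not touch them never open them, and updates or
column additions only rewrite the affected slice.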

Thanks, Peter



On Wed, May 28, 2025 at 3:39 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:

> I would be happy to put together a proposal based on the input received here.
>
> Thanks everyone for your thoughts!
> I will try to incorporate all of this.
>
> Thanks, Peter
>
> On Tue, May 27, 2025 at 8:07 PM Daniel Weeks <dwe...@apache.org> wrote:
>
>> I feel like we have two different issues we're talking about here that
>> aren't necessarily tied (though solutions may address both): 1) wide
>> tables, 2) adding columns
>>
>> Wide tables are definitely a problem where Parquet has limitations. I'm
>> optimistic about the ongoing work that Fokko mentioned to help improve
>> Parquet footers/stats in this area.  There are always limits to how this
>> scales, as wide rows lead to small row groups and the cost to reconstitute a
>> row gets more expensive, but cases that are read heavy and project only
>> subsets of columns should see significantly improved performance.
>>
>> Adding columns to an existing dataset is something that comes up
>> periodically, but there's a lot of complexity involved in this.  Parquet
>> does support referencing columns in separate files per the spec, but
>> there's no implementation that takes advantage of this to my knowledge.
>> This does allow for approaches where you separate/rewrite just the footers
>> or various other tricks, but these approaches get complicated quickly and
>> the number of readers that can consume those representations would
>> initially be very limited.
>>
>> A larger problem for splitting columns across files is that there are a
>> lot of assumptions about how data is laid out in both readers and writers.
>> For example, aligning row groups and correctly handling split calculation
>> become very complicated if a row's columns are split across files.  Other
>> features are also impacted, like deletes, which reference the file to which
>> they apply: they would need to account for deletes applying to multiple files
>> and for updating those references if columns are added.
>>
>> I believe there are a lot of interesting approaches to addressing these
>> use cases, but we'd really need a thorough proposal that explores all of
>> these scenarios.  The last thing we would want is to introduce
>> incompatibilities within the format that result in incompatible features.
>>
>> -Dan
>>
>> On Tue, May 27, 2025 at 10:02 AM Russell Spitzer <
>> russell.spit...@gmail.com> wrote:
>>
>>> Point definitely taken. We really should probably POC some of
>>> these ideas and see what we are actually dealing with. (He said without
>>> volunteering to do the work :P)
>>>
>>> On Tue, May 27, 2025 at 11:55 AM Selcuk Aya
>>> <selcuk....@snowflake.com.invalid> wrote:
>>>
>>>> Yes, having to rewrite the whole file is not ideal, but I believe most of
>>>> the cost of rewriting a file comes from decompression, encoding, stats
>>>> calculation, etc. If you are adding new values for some columns but are
>>>> keeping the rest of the columns in the file the same, then a good chunk of
>>>> the rewrite cost can be optimized away. I am not saying this is better than
>>>> writing to a separate file; I am just not sure how much worse it is.
>>>>
>>>> On Tue, May 27, 2025 at 9:40 AM Russell Spitzer <
>>>> russell.spit...@gmail.com> wrote:
>>>>
>>>>> I think that "after the fact" modification is one of the requirements
>>>>> here, i.e. updating a single column without rewriting the whole file.
>>>>> If we have to write new metadata for the file, aren't we in the same
>>>>> boat as having to rewrite the whole file?
>>>>>
>>>>> On Tue, May 27, 2025 at 11:27 AM Selcuk Aya
>>>>> <selcuk....@snowflake.com.invalid> wrote:
>>>>>
>>>>>> If files represent column projections of a table rather than all of the
>>>>>> columns in the table, then any read that spans these files needs to
>>>>>> identify what constitutes a row. Lance DB, for example, has vertical
>>>>>> partitioning across columns but also horizontal partitioning across rows,
>>>>>> such that within each horizontal partition (fragment) every vertical
>>>>>> partition holds the same number of rows, which I think is necessary to
>>>>>> make whole/partial row construction cheap. If this is the case, there is
>>>>>> no reason not to achieve the same data layout inside a single columnar
>>>>>> file with a lean header. I think the only valid argument for a separate
>>>>>> file is adding a new set of columns to an existing table, but even then I
>>>>>> am not sure a separate file is absolutely necessary for good performance.
>>>>>>
>>>>>> Selcuk
>>>>>>
>>>>>> On Tue, May 27, 2025 at 9:18 AM Devin Smith
>>>>>> <devinsm...@deephaven.io.invalid> wrote:
>>>>>>
>>>>>>> There's a `file_path` field in the parquet ColumnChunk structure,
>>>>>>> https://github.com/apache/parquet-format/blob/apache-parquet-format-2.11.0/src/main/thrift/parquet.thrift#L959-L962
>>>>>>>
>>>>>>> I'm not sure what tooling actually supports this though. Could be
>>>>>>> interesting to see what the history of this is.
>>>>>>> https://lists.apache.org/thread/rcv1cxndp113shjybfcldh6nq1t3lcq3,
>>>>>>> https://lists.apache.org/thread/k5nv310yp315fttcz213l8o0vmnd7vyw
>>>>>>>
>>>>>>> On Tue, May 27, 2025 at 8:59 AM Russell Spitzer <
>>>>>>> russell.spit...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I have to agree that while there can be some fixes in Parquet, we
>>>>>>>> fundamentally need a way to split a "row group"
>>>>>>>> or something like that across separate files. If that's
>>>>>>>> something we can do in the Parquet project, that would be great,
>>>>>>>> but it feels like we need to start exploring more drastic options
>>>>>>>> than footer encoding.
>>>>>>>>
>>>>>>>> On Mon, May 26, 2025 at 8:42 PM Gang Wu <ust...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I agree with Steven that there are limits to what Parquet can
>>>>>>>>> do.
>>>>>>>>>
>>>>>>>>> Besides new columns requiring a rewrite of all files, files of
>>>>>>>>> wide tables may also suffer from poor performance, for example:
>>>>>>>>> - Poor compression of row groups, because with so many columns even a
>>>>>>>>> small number of rows can reach the row group threshold.
>>>>>>>>> - Dominating columns (e.g. blobs) may contribute 99% of a row group's
>>>>>>>>> size, leading to unbalanced column chunks and worse row group
>>>>>>>>> compression.
>>>>>>>>> - Similar to adding new columns, a partial update also requires
>>>>>>>>> rewriting all columns of the affected rows.
>>>>>>>>>
>>>>>>>>> IIRC, some table formats already support splitting columns into
>>>>>>>>> different files:
>>>>>>>>> - Lance manifest splits a fragment [1] into one or more data files.
>>>>>>>>> - Apache Hudi has the concept of column family [2].
>>>>>>>>> - Apache Paimon supports sequence groups [3] for partial update.
>>>>>>>>>
>>>>>>>>> Although Parquet could introduce the concepts of logical and physical
>>>>>>>>> files to manage the column-to-file mapping, that looks like yet
>>>>>>>>> another manifest file design, which duplicates the purpose of Iceberg.
>>>>>>>>> This might be something worth exploring in Iceberg instead.
>>>>>>>>>
>>>>>>>>> [1] https://lancedb.github.io/lance/format.html#fragments
>>>>>>>>> [2]
>>>>>>>>> https://github.com/apache/hudi/blob/master/rfc/rfc-80/rfc-80.md
>>>>>>>>> [3]
>>>>>>>>> https://paimon.apache.org/docs/0.9/primary-key-table/merge-engine/partial-update/#sequence-group
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Gang
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, May 27, 2025 at 7:03 AM Steven Wu <stevenz...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> The Parquet metadata proposal (linked by Fokko) is mainly
>>>>>>>>>> addressing the read performance due to bloated metadata.
>>>>>>>>>>
>>>>>>>>>> What Peter described in the description seems useful for some ML
>>>>>>>>>> feature-engineering workloads. A new set of features/columns is added
>>>>>>>>>> to the table, and currently Iceberg would require rewriting all data
>>>>>>>>>> files to combine the old and new columns (write amplification).
>>>>>>>>>> Similarly, in the past the community also talked about use cases of
>>>>>>>>>> updating a single column, which would likewise require rewriting all
>>>>>>>>>> data files.
>>>>>>>>>>
>>>>>>>>>> On Mon, May 26, 2025 at 2:42 PM Péter Váry <
>>>>>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Do you have the link at hand for the thread where this was
>>>>>>>>>>> discussed on the Parquet list?
>>>>>>>>>>> The docs seem quite old, and the PR stale, so I would like to
>>>>>>>>>>> understand the situation better.
>>>>>>>>>>> If it is possible to do this in Parquet, that would be great,
>>>>>>>>>>> but Avro and ORC would still suffer.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, May 26, 2025 at 10:07 PM Amogh Jahagirdar <2am...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hey Peter,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for bringing this issue up. I think I agree with Fokko: wide
>>>>>>>>>>>> tables leading to Parquet metadata bloat and poor Thrift
>>>>>>>>>>>> deserialization performance is a long-standing issue that I believe
>>>>>>>>>>>> there is motivation in the community to address. So to me it seems
>>>>>>>>>>>> better to address it in Parquet itself rather than have the Iceberg
>>>>>>>>>>>> library facilitate a pattern which works around the limitations.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Amogh Jahagirdar
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, May 26, 2025 at 1:42 PM Fokko Driesprong <
>>>>>>>>>>>> fo...@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for bringing this up. Wouldn't it make more sense to
>>>>>>>>>>>>> fix this in Parquet itself? It has been a long-running issue on 
>>>>>>>>>>>>> Parquet,
>>>>>>>>>>>>> and there is still active interest from the community. There is a 
>>>>>>>>>>>>> PR to
>>>>>>>>>>>>> replace the footer with FlatBuffers, which dramatically
>>>>>>>>>>>>> improves performance
>>>>>>>>>>>>> <https://github.com/apache/arrow/pull/43793>. The underlying
>>>>>>>>>>>>> proposal can be found here
>>>>>>>>>>>>> <https://docs.google.com/document/d/1PQpY418LkIDHMFYCY8ne_G-CFpThK15LLpzWYbc7rFU/edit?tab=t.0#heading=h.atbrz9ch6nfa>
>>>>>>>>>>>>> .
>>>>>>>>>>>>>
>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>> Fokko
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, May 26, 2025 at 8:35 PM yun zou <yunzou.colost...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> +1, I am really interested in this topic. Performance has always
>>>>>>>>>>>>>> been a problem when dealing with wide tables, not just for
>>>>>>>>>>>>>> read/write but also during compilation. Most of the ML use cases
>>>>>>>>>>>>>> typically exhibit a vectorized read/write pattern; I am also
>>>>>>>>>>>>>> wondering if there is any way at the metadata level to help the
>>>>>>>>>>>>>> whole compilation and execution process. I do not have an answer
>>>>>>>>>>>>>> for this yet, but I would be really interested in exploring it
>>>>>>>>>>>>>> further.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>> Yun
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, May 26, 2025 at 9:14 AM Pucheng Yang
>>>>>>>>>>>>>> <py...@pinterest.com.invalid> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Peter, I am interested in this proposal. What's more, I am
>>>>>>>>>>>>>>> curious whether there is a similar story on the write side as
>>>>>>>>>>>>>>> well (how to generate these split files), and specifically, are
>>>>>>>>>>>>>>> you targeting feature backfill use cases in ML?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, May 26, 2025 at 6:29 AM Péter Váry <
>>>>>>>>>>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Team,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In machine learning use-cases, it's common to encounter
>>>>>>>>>>>>>>>> tables with a very high number of columns - sometimes even in 
>>>>>>>>>>>>>>>> the range of
>>>>>>>>>>>>>>>> several thousand. I've seen cases with up to 15,000 columns. 
>>>>>>>>>>>>>>>> Storing such
>>>>>>>>>>>>>>>> wide tables in a single Parquet file is often suboptimal, as 
>>>>>>>>>>>>>>>> Parquet can
>>>>>>>>>>>>>>>> become a bottleneck, even when only a subset of columns is 
>>>>>>>>>>>>>>>> queried.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> A common approach to mitigate this is to split the data
>>>>>>>>>>>>>>>> across multiple Parquet files. With the upcoming File Format 
>>>>>>>>>>>>>>>> API, we could
>>>>>>>>>>>>>>>> introduce a layer that combines these files into a single 
>>>>>>>>>>>>>>>> iterator,
>>>>>>>>>>>>>>>> enabling efficient reading of wide and very wide tables.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> To support this, we would need to revise the metadata
>>>>>>>>>>>>>>>> specification. Instead of the current `_file` column, we could
>>>>>>>>>>>>>>>> introduce a `_files` column containing:
>>>>>>>>>>>>>>>> - `_file_column_ids`: the column IDs present in each file
>>>>>>>>>>>>>>>> - `_file_path`: the path to the corresponding file
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Has there been any prior discussion around this idea?
>>>>>>>>>>>>>>>> Is anyone else interested in exploring this further?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
