For the record, here is a link from a user requesting this feature:
https://github.com/apache/iceberg/issues/11634

On Mon, Jun 2, 2025, 12:34 Péter Váry <peter.vary.apa...@gmail.com> wrote:

> Hi Bart,
>
> Thanks for your answer!
> I’ve pulled out some text from your thorough and well-organized response
> to make it easier to highlight my comments.
>
> > It would be well possible to tune parquet writers to write very large
> row groups when a large string column dominates. [..]
>
> What would you do if there are multiple "optimal" sizes - say, a string
> column where dictionary encoding could be optimal, plus some other
> differently sized columns?
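>
> Just to make the tension concrete: the row-group and page sizing knobs we have
> today apply to the whole file. A rough sketch using Iceberg's write.parquet.*
> table properties (names as documented; the 512 MB target is only an example):
>
>   import org.apache.iceberg.Table;
>
>   class TuneParquetWrites {
>     static void tune(Table table) {
>       table.updateProperties()
>           // one row-group size target shared by *all* columns in the file
>           .set("write.parquet.row-group-size-bytes", String.valueOf(512L * 1024 * 1024))
>           // data page and dictionary page sizes, again file-wide
>           .set("write.parquet.page-size-bytes", String.valueOf(1024 * 1024))
>           .set("write.parquet.dict-size-bytes", String.valueOf(2 * 1024 * 1024))
>           .commit();
>     }
>   }
>
> Whatever value we pick there is a single compromise shared by the
> dictionary-friendly string column and the small numeric columns.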
>
> > You know that a row group is very large, so you might then shard it by
> row ranges. Each parallel reader would have to filter out the rows that
> weren't assigned to it. With Parquet page skipping, each reader could avoid
> reading the large-string column pages for rows that weren't assigned to
> it.
>
> I might be wrong, but page skipping relies on page headers which are
> stored in-line with the data itself. When downloading data from blob stores
> this could be less than ideal. This makes the idea of storing row-group
> boundaries in the Iceberg metadata feel more appealing to me. Of course, we
> need to perform row-index-range-based skipping for some files, but
> page-level skipping could also help optimize it - if we decide it's
> necessary.
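>
> For illustration, the kind of split planning I have in mind - a rough sketch
> with made-up names, assuming the per-row-group row counts come from Iceberg
> metadata instead of the Parquet footer:
>
>   import java.util.ArrayList;
>   import java.util.Arrays;
>   import java.util.List;
>
>   class RowRangeSharding {
>     record RowRange(long firstRow, long rowCount) {}
>
>     // Split one large file into row ranges for parallel readers. A real
>     // implementation would snap the boundaries to row-group/page edges.
>     static List<RowRange> shard(long[] rowGroupRowCounts, int readers) {
>       long totalRows = Arrays.stream(rowGroupRowCounts).sum();
>       long chunk = (totalRows + readers - 1) / readers;  // rows per reader, rounded up
>       List<RowRange> ranges = new ArrayList<>();
>       for (long start = 0; start < totalRows; start += chunk) {
>         ranges.add(new RowRange(start, Math.min(chunk, totalRows - start)));
>       }
>       return ranges;  // each reader skips rows outside its assigned range
>     }
>   }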
>
> > If you use column-specific files, then you actually need to read the
> parquet footers of *all the separate column files*. That's 2x the number
> of I/Os.
>
> Agreed, this is a valid point - unless the footer fits into a single read,
> which is the case when the configuration is correct.
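>
> To spell out what I mean by a single read - a sketch, where RangeReader stands
> in for whatever ranged-read helper the FileIO gives us:
>
>   import java.nio.ByteBuffer;
>   import java.nio.ByteOrder;
>
>   class FooterFetch {
>     interface RangeReader { ByteBuffer read(long offset, int length); }
>
>     // Parquet files end with [footer][4-byte footer length]["PAR1"]. A
>     // speculative 1 MB tail read usually captures the whole footer, so read
>     // planning costs one request per file; an oversized footer costs a second.
>     static ByteBuffer readFooter(RangeReader io, long fileLength) {
>       int guess = 1 << 20;  // assumes the file is larger than 1 MB
>       ByteBuffer tail = io.read(fileLength - guess, guess).order(ByteOrder.LITTLE_ENDIAN);
>       int footerLen = tail.getInt(guess - 8);  // length word sits just before the magic
>       if (footerLen + 8 > guess) {
>         return io.read(fileLength - 8 - footerLen, footerLen);  // second read needed
>       }
>       return tail.slice(guess - 8 - footerLen, footerLen);  // footer already in the tail
>     }
>   }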
>
> > There's a third option, which is to use column-specific files (or
> groups of columns in a file) that form a single Parquet structure with
> cross-file references (which is already in the Parquet standard, albeit not
> implemented anywhere).
>
> We have talked about this internally, but we saw several disadvantages:
> - It is not implemented anywhere - which means that if we start using it,
> everyone needs a new reader
> - If I understand correctly, the cross-file references are for column
> chunks - we want to avoid too much fragmentation
> - It becomes hard to ensure that the file is really immutable
> - We still have to optimize the page alignment for reads.
>
> > I agree that it's an interesting idea, but it does add a lot of
> complexity, and I'm not convinced that it's better from a performance
> standpoint (metadata size increase, more I/Os). If we can get away with a
> better row group sizing policy, wouldn't that be preferable?
>
> That's a great question regarding the complexity. I'm still working
> through all the implications myself, but I believe we can encapsulate this
> behind the Iceberg File Format API. That way, it becomes available across
> all file formats and shields the rest of the codebase from the underlying
> complexity.
>
> Your point about performance is valid, especially in the context of full
> table scans. However, with these very wide tables, full scans are quite
> rare. If the column families are well-designed, we can actually improve
> performance across many columns/queries - not just a select few.
>
> Additionally, this approach enables frequently requested features like
> adding or updating column families without rewriting the entire table.
>
> Thanks,
> Peter
>
> Bart Samwel <b...@databricks.com.invalid> wrote (on Mon, Jun 2, 2025, 10:21):
>
>> On Fri, May 30, 2025 at 8:35 PM Péter Váry <peter.vary.apa...@gmail.com>
>> wrote:
>>
>>> Consider this example
>>> Imagine a table with one large string column and many small numeric
>>> columns.
>>>
>>> Scenario 1: Single File
>>>
>>>    - All columns are written into a single file.
>>>    - The RowGroup size is small due to the large string column
>>>    dominating the layout.
>>>
>>> This is an assumption that may not be necessary. It would be well
>> possible to tune parquet writers to write very large row groups when a
>> large string column dominates. Such a string column would probably not get
>> dictionary encoded anyway, so it would effectively end up with a couple of
>> values per 1MB Parquet page. The other columns would get decent-sized
>> pages, and the overall row group size would be appropriate for getting good
>> compression on those smaller columns.
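>>
>> To put rough numbers on this (purely illustrative): if the string column
>> averages 20 KB per value and the numeric columns are ~8 bytes each, a 128 MB
>> row group holds only ~6,500 rows, so each numeric column chunk is ~50 KB. A
>> 2 GB row group holds ~100,000 rows, giving ~800 KB per numeric column chunk -
>> much better for compression and scans - while the string column simply turns
>> into more 1 MB pages.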
>>
>> What would be the downside of this approach?
>>
>>    - When you're only reading the integer columns, it is exactly the same
>>    as if the columns had been in a file by themselves. You just don't read
>>    the large column chunk.
>>    - I think it adds some complexity to the distributed/parallel reading
>>    of the row groups when the large string column is included in the selected
>>    set of columns. You know that a row group is very large, so you might then
>>    shard it by row ranges. Each parallel reader would have to filter out the
>>    rows that weren't assigned to it. With Parquet page skipping, each reader
>>    could avoid reading the large-string column pages for rows that weren't
>>    assigned to it.
>>
>> Ultimately I think the parallel reading problem here is *nearly* the
>> same regardless of whether you use one XL row group or separate files. You
>> need to know the exact row group / page boundaries within each file
>> in order to decide how to shard the read. And then you need to do
>> row-index-range based skipping on at least *some* of the input columns.
>>
>>    - With XL row groups, in order to shard the row group into evenly
>>    sized chunks, you need to actually read the parquet footer first, because
>>    you need to know the row group boundaries within each file, and ideally
>>    even the page boundaries within each row group so that you can align your
>>    row ranges with those boundaries.
>>    - If you use column-specific files, then you actually need to read
>>    the parquet footers of *all the separate column files*. That's 2x the
>>    number of I/Os. These I/Os can be done in parallel, but they will
>>    contribute to throttling on cloud object stores.
>>
>> So with XL row groups, distributed read planning can be done in one I/O,
>> while column-specific files require more I/Os. Either that, or you need to
>> store *even more* information in the metadata (namely all of these
>> boundaries). The column-specific files also require more I/Os to read later
>> (because you end up having to read two footers), which adds up especially if
>> you read the large string column, which means you parallelize the read into
>> many small chunks.
>>
>>>
>>>    - The numeric columns are not compacted efficiently.
>>>
>>> Scenario 2: Column-Specific Files
>>>
>>>    - One file is written for the string column, and another for the
>>>    numeric columns.
>>>    - The RowGroup size for the string column remains small, but the
>>>    numeric columns benefit from optimal RowGroup sizing.
>>>
>>> There's a third option, which is to use column-specific files (or groups
>> of columns in a file) that form a single Parquet structure with cross-file
>> references (which is already in the Parquet standard, albeit not
>> implemented anywhere). This approach has several advantages over the other
>> options:
>>
>>    1. All of the metadata required for distributed reads is in one place
>>    (one parquet footer), making distributed read planning require fewer I/Os,
>>    and reducing the pressure to move all of that information to the
>>    table-level metadata as well.
>>    2. Flexible structure. Different files can have different
>>    distribution of columns over files, and you don't have to remember the
>>    per-file distribution in the metadata.
>>    3. More scalable: you can have a file per column if you want, if your
>>    column sizes are wildly variable, without bloating the table-level 
>> metadata
>>    with information about more files.
>>    4. You can add/replace an entire column just by writing one extra
>>    file (with the new column contents, plus a new footer for the entire file
>>    that simply points to the old files for the existing data that wasn't
>>    modified); see the sketch after this list.
>>    5. Relatively simple to implement in existing Parquet readers
>>    compared to "read multiple parquets and zip them together".
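>>
>> To make point 4 a bit more concrete, here is a purely hypothetical sketch -
>> the types are invented; the real hook would be the file_path field that
>> Parquet's ColumnChunk metadata already defines:
>>
>>   import java.util.ArrayList;
>>   import java.util.List;
>>
>>   class CrossFileFooters {
>>     // A "logical" footer whose column chunks may live in other physical files.
>>     record ColumnChunkRef(String filePath, long fileOffset, String columnPath) {}
>>     record LogicalRowGroup(long rowCount, List<ColumnChunkRef> columns) {}
>>     record LogicalFooter(List<LogicalRowGroup> rowGroups) {}
>>
>>     // Adding a column = one new data file + one new footer; old files untouched.
>>     static LogicalFooter addColumn(LogicalFooter old, String newDataFile,
>>                                    String columnPath, List<Long> chunkOffsets) {
>>       List<LogicalRowGroup> groups = new ArrayList<>();
>>       for (int i = 0; i < old.rowGroups().size(); i++) {
>>         LogicalRowGroup g = old.rowGroups().get(i);
>>         List<ColumnChunkRef> cols = new ArrayList<>(g.columns());  // keep old references
>>         cols.add(new ColumnChunkRef(newDataFile, chunkOffsets.get(i), columnPath));
>>         groups.add(new LogicalRowGroup(g.rowCount(), cols));
>>       }
>>       return new LogicalFooter(groups);
>>     }
>>   }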
>>
>>
>>
>>> Query Performance Impact:
>>>
>>>    - If a query only reads one of the numeric columns:
>>>       - Scenario 1: Requires reading many small column chunks.
>>>       - Scenario 2: Reads a single, continuous column chunk - much more
>>>       efficient.
>>>
>>> Queries that only read columns stored in a single file will see
>>> improvements. Cross-file queries will have over-reading, which might or
>>> might not be balanced out by reading bigger continuous chunks. Full table
>>> scans will definitely have a performance penalty, but that is not the goal
>>> here.
>>>
>>
>>
>>> > And aren't Parquet pages already providing these unaligned sizes?
>>>
>>> Parquet pages do offer some flexibility in size, but they operate at a
>>> lower level and are still bound by the RowGroup structure. What I’m
>>> proposing is a higher-level abstraction that allows us to group columns
>>> into independently optimized Physical Files, each with its own RowGroup
>>> sizing strategy. This could allow us to better optimize for queries where
>>> only a small number of columns are projected from a wide table.
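>>>
>>> As a strawman for what that abstraction could track per logical data file
>>> (hypothetical names, roughly the _file_column_ids / _file_path idea from my
>>> first mail):
>>>
>>>   import java.util.Collections;
>>>   import java.util.List;
>>>   import java.util.Set;
>>>
>>>   class PhysicalFileLayout {
>>>     // One logical data file backed by several physical files, each carrying
>>>     // a subset of column IDs and free to pick its own RowGroup sizing.
>>>     record PhysicalFile(String path, Set<Integer> columnIds) {}
>>>     record LogicalDataFile(long recordCount, List<PhysicalFile> physicalFiles) {}
>>>
>>>     // A projection only opens the physical files it actually needs.
>>>     static List<PhysicalFile> filesFor(LogicalDataFile file, Set<Integer> projectedIds) {
>>>       return file.physicalFiles().stream()
>>>           .filter(pf -> !Collections.disjoint(pf.columnIds(), projectedIds))
>>>           .toList();
>>>     }
>>>   }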
>>>
>>
>> I agree that it's an interesting idea, but it does add a lot of
>> complexity, and I'm not convinced that it's better from a performance
>> standpoint (metadata size increase, more I/Os). If we can get away with a
>> better row group sizing policy, wouldn't that be preferable?
>>
>>
>>
>>> Bart Samwel <b...@databricks.com.invalid> wrote (on Fri, May 30, 2025, 16:03):
>>>
>>>>
>>>>
>>>> On Fri, May 30, 2025 at 3:33 PM Péter Váry <peter.vary.apa...@gmail.com>
>>>> wrote:
>>>>
>>>>> One key advantage of introducing Physical Files is the flexibility to
>>>>> vary RowGroup sizes across columns. For instance, wide string columns
>>>>> could benefit from smaller RowGroups to reduce memory pressure, while
>>>>> numeric columns could use larger RowGroups to improve compression and scan
>>>>> efficiency. Rather than enforcing strict row group alignment across all
>>>>> columns, we can explore optimizing read split sizes and write-time
>>>>> RowGroup sizes independently - striking a balance between performance and
>>>>> storage costs for different data types and queries.
>>>>>
>>>>
>>>> That actually sounds very complicated if you want to split file reads
>>>> in a distributed system. If you want to read across column groups, then you
>>>> always end up over-reading on one of them if they are not aligned.
>>>>
>>>> And aren't Parquet pages already providing these unaligned sizes?
>>>>
>>>>> Gang Wu <ust...@gmail.com> wrote (on Fri, May 30, 2025, 8:09):
>>>>>
>>>>>> IMO, the main drawback for the view solution is the complexity of
>>>>>> maintaining consistency across tables if we want to use features like 
>>>>>> time
>>>>>> travel, incremental scan, branch & tag, encryption, etc.
>>>>>>
>>>>>> On Fri, May 30, 2025 at 12:55 PM Bryan Keller <brya...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Fewer commit conflicts, meaning the tables representing column
>>>>>>> families are updated independently, rather than having to serialize
>>>>>>> commits to a single table. Perhaps with a wide-table solution the commit
>>>>>>> logic could be enhanced to support things like concurrent overwrites to
>>>>>>> independent column families, but it seems like it would be fairly
>>>>>>> involved.
>>>>>>>
>>>>>>>
>>>>>>> On May 29, 2025, at 7:16 PM, Steven Wu <stevenz...@gmail.com> wrote:
>>>>>>>
>>>>>>> Bryan, interesting approach to split horizontally across multiple
>>>>>>> tables.
>>>>>>>
>>>>>>> A few potential downsides:
>>>>>>> * Operational overhead: tables need to be managed consistently and
>>>>>>> probably in some coordinated way.
>>>>>>> * More complex reads.
>>>>>>> * Maybe fragile for enforcing correctness (during the join). It is more
>>>>>>> robust to enforce stitching correctness at the file-group level in the
>>>>>>> file reader and writer if it is built into the table format.
>>>>>>>
>>>>>>> > fewer commit conflicts
>>>>>>>
>>>>>>> Can you elaborate on this one? Are those tables populated by
>>>>>>> streaming or batch pipelines?
>>>>>>>
>>>>>>> On Thu, May 29, 2025 at 5:03 PM Bryan Keller <brya...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> We have been investigating a wide table format internally for a
>>>>>>>> similar use case, i.e. we have wide ML tables with features generated 
>>>>>>>> by
>>>>>>>> different pipelines and teams but want a unified view of the data. We 
>>>>>>>> are
>>>>>>>> comparing that against separate tables joined together using a 
>>>>>>>> shuffle-less
>>>>>>>> join (e.g. storage partition join), along with a corresponding view.
>>>>>>>>
>>>>>>>> The join/view approach seems to give us much of what we need, with
>>>>>>>> some added benefits like splitting up the metadata, fewer commit
>>>>>>>> conflicts, and the ability to share, nest, and swap "column families".
>>>>>>>> The downsides are that table management is split across multiple
>>>>>>>> tables, it requires engine support for shuffle-less joins for best
>>>>>>>> performance, and even then, scans probably won't be as optimal.
>>>>>>>>
>>>>>>>> I'm curious if anyone had further thoughts on the two?
>>>>>>>>
>>>>>>>> -Bryan
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On May 29, 2025, at 8:18 AM, Péter Váry <
>>>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> I received feedback from Alkis regarding their Parquet optimization
>>>>>>>> work. Their internal testing shows promising results for reducing 
>>>>>>>> metadata
>>>>>>>> size and improving parsing performance. They plan to formalize a 
>>>>>>>> proposal
>>>>>>>> for these Parquet enhancements in the near future.
>>>>>>>>
>>>>>>>> Meanwhile, I'm putting together our horizontal sharding proposal as
>>>>>>>> a complementary approach. Even with the Parquet metadata improvements,
>>>>>>>> horizontal sharding would provide additional benefits for:
>>>>>>>>
>>>>>>>>    - More efficient column-level updates
>>>>>>>>    - Streamlined column additions
>>>>>>>>    - Better handling of dominant columns that can cause RowGroup
>>>>>>>>    size imbalances (placing these in separate files could significantly
>>>>>>>>    improve performance)
>>>>>>>>
>>>>>>>> Thanks, Peter
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Péter Váry <peter.vary.apa...@gmail.com> wrote (on Wed, May 28, 2025, 15:39):
>>>>>>>>
>>>>>>>>> I would be happy to put together a proposal based on the inputs
>>>>>>>>> gathered here.
>>>>>>>>>
>>>>>>>>> Thanks everyone for your thoughts!
>>>>>>>>> I will try to incorporate all of this.
>>>>>>>>>
>>>>>>>>> Thanks, Peter
>>>>>>>>>
>>>>>>>>> Daniel Weeks <dwe...@apache.org> wrote (on Tue, May 27, 2025, 20:07):
>>>>>>>>>
>>>>>>>>>> I feel like we have two different issues we're talking about here
>>>>>>>>>> that aren't necessarily tied (though solutions may address both): 1) 
>>>>>>>>>> wide
>>>>>>>>>> tables, 2) adding columns
>>>>>>>>>>
>>>>>>>>>> Wide tables are definitely a problem where parquet has limitations.
>>>>>>>>>> I'm optimistic about the ongoing work that Fokko mentioned to help
>>>>>>>>>> improve parquet footers/stats in this area. There are always
>>>>>>>>>> limitations in how this scales, as wide rows lead to small row groups
>>>>>>>>>> and the cost to reconstitute a row gets more expensive, but for cases
>>>>>>>>>> that are read-heavy and project subsets of columns, this should
>>>>>>>>>> significantly improve performance.
>>>>>>>>>>
>>>>>>>>>> Adding columns to an existing dataset is something that comes up
>>>>>>>>>> periodically, but there's a lot of complexity involved in this.  
>>>>>>>>>> Parquet
>>>>>>>>>> does support referencing columns in separate files per the spec, but
>>>>>>>>>> there's no implementation that takes advantage of this to my 
>>>>>>>>>> knowledge.
>>>>>>>>>> This does allow for approaches where you separate/rewrite just the 
>>>>>>>>>> footers
>>>>>>>>>> or various other tricks, but these approaches get complicated 
>>>>>>>>>> quickly and
>>>>>>>>>> the number of readers that can consume those representations would
>>>>>>>>>> initially be very limited.
>>>>>>>>>>
>>>>>>>>>> A larger problem for splitting columns across files is that there
>>>>>>>>>> are a lot of assumptions about how data is laid out in both readers 
>>>>>>>>>> and
>>>>>>>>>> writers.  For example, aligning row groups and correctly handling 
>>>>>>>>>> split
>>>>>>>>>> calculation is very complicated if you're trying to split rows across
>>>>>>>>>> files.  Other features are also impacted like deletes, which 
>>>>>>>>>> reference the
>>>>>>>>>> file to which they apply and would need to account for deletes 
>>>>>>>>>> applying to
>>>>>>>>>> multiple files and needing to update those references if columns are 
>>>>>>>>>> added.
>>>>>>>>>>
>>>>>>>>>> I believe there are a lot of interesting approaches to addressing
>>>>>>>>>> these use cases, but we'd really need a thorough proposal that 
>>>>>>>>>> explores all
>>>>>>>>>> of these scenarios.  The last thing we would want is to introduce
>>>>>>>>>> incompatibilities within the format that result in incompatible 
>>>>>>>>>> features.
>>>>>>>>>>
>>>>>>>>>> -Dan
>>>>>>>>>>
>>>>>>>>>> On Tue, May 27, 2025 at 10:02 AM Russell Spitzer <
>>>>>>>>>> russell.spit...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Point definitely taken. We really should probably POC some of
>>>>>>>>>>> these ideas and see what we are actually dealing with. (He said 
>>>>>>>>>>> without
>>>>>>>>>>> volunteering to do the work :P)
>>>>>>>>>>>
>>>>>>>>>>> On Tue, May 27, 2025 at 11:55 AM Selcuk Aya
>>>>>>>>>>> <selcuk....@snowflake.com.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Yes having to rewrite the whole file is not ideal but I believe
>>>>>>>>>>>> most of the cost of rewriting a file comes from decompression, 
>>>>>>>>>>>> encoding,
>>>>>>>>>>>> stats calculations etc. If you are adding new values for some 
>>>>>>>>>>>> columns but
>>>>>>>>>>>> are keeping the rest of the columns the same in the file, then a 
>>>>>>>>>>>> bunch of
>>>>>>>>>>>> rewrite cost can be optimized away. I am not saying this is better 
>>>>>>>>>>>> than
>>>>>>>>>>>> writing to a separate file, I am not sure how much worse it is 
>>>>>>>>>>>> though.
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, May 27, 2025 at 9:40 AM Russell Spitzer <
>>>>>>>>>>>> russell.spit...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I think that "after the fact" modification is one of the
>>>>>>>>>>>>> requirements here, IE: Updating a single column without rewriting 
>>>>>>>>>>>>> the whole
>>>>>>>>>>>>> file.
>>>>>>>>>>>>> If we have to write new metadata for the file aren't we in the
>>>>>>>>>>>>> same boat as having to rewrite the whole file?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, May 27, 2025 at 11:27 AM Selcuk Aya
>>>>>>>>>>>>> <selcuk....@snowflake.com.invalid> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> If files represent column projections of a table rather than
>>>>>>>>>>>>>> the whole columns in the table, then any read that reads across 
>>>>>>>>>>>>>> these files
>>>>>>>>>>>>>> needs to identify what constitutes a row. Lance DB for example 
>>>>>>>>>>>>>> has vertical
>>>>>>>>>>>>>> partitioning across columns but also horizontal partitioning 
>>>>>>>>>>>>>> across rows
>>>>>>>>>>>>>> such that in each horizontal partitioning(fragment), the same 
>>>>>>>>>>>>>> number of
>>>>>>>>>>>>>> rows exist in each vertical partition,  which I think is 
>>>>>>>>>>>>>> necessary to make
>>>>>>>>>>>>>> whole/partial row construction cheap. If this is the case, there 
>>>>>>>>>>>>>> is no
>>>>>>>>>>>>>> reason not to achieve the same data layout inside a single 
>>>>>>>>>>>>>> columnar file
>>>>>>>>>>>>>> with a lean header. I think the only valid argument for a 
>>>>>>>>>>>>>> separate file is
>>>>>>>>>>>>>> adding a new set of columns to an existing table, but even then 
>>>>>>>>>>>>>> I am not
>>>>>>>>>>>>>> sure a separate file is absolutely necessary for good 
>>>>>>>>>>>>>> performance.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Selcuk
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, May 27, 2025 at 9:18 AM Devin Smith
>>>>>>>>>>>>>> <devinsm...@deephaven.io.invalid> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> There's a `file_path` field in the parquet ColumnChunk
>>>>>>>>>>>>>>> structure,
>>>>>>>>>>>>>>> https://github.com/apache/parquet-format/blob/apache-parquet-format-2.11.0/src/main/thrift/parquet.thrift#L959-L962
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm not sure what tooling actually supports this though.
>>>>>>>>>>>>>>> Could be interesting to see what the history of this is.
>>>>>>>>>>>>>>> https://lists.apache.org/thread/rcv1cxndp113shjybfcldh6nq1t3lcq3,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> https://lists.apache.org/thread/k5nv310yp315fttcz213l8o0vmnd7vyw
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, May 27, 2025 at 8:59 AM Russell Spitzer <
>>>>>>>>>>>>>>> russell.spit...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I have to agree that while there can be some fixes in
>>>>>>>>>>>>>>>> Parquet, we fundamentally need a way to split a "row group"
>>>>>>>>>>>>>>>> or something like that between separate files. If that's
>>>>>>>>>>>>>>>> something we can do in the parquet project that would be great
>>>>>>>>>>>>>>>> but it feels like we need to start exploring more drastic
>>>>>>>>>>>>>>>> options than footer encoding.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, May 26, 2025 at 8:42 PM Gang Wu <ust...@gmail.com>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I agree with Steven that there are limitations that
>>>>>>>>>>>>>>>>> Parquet cannot do.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> In addition to adding new columns by rewriting all files,
>>>>>>>>>>>>>>>>> files of wide tables may suffer from bad performance like 
>>>>>>>>>>>>>>>>> below:
>>>>>>>>>>>>>>>>> - Poor compression of row groups because there are too
>>>>>>>>>>>>>>>>> many columns and even a small number of rows can reach the 
>>>>>>>>>>>>>>>>> row group
>>>>>>>>>>>>>>>>> threshold.
>>>>>>>>>>>>>>>>> - Dominating columns (e.g. blobs) may contribute to 99%
>>>>>>>>>>>>>>>>> size of a row group, leading to unbalanced column chunks and 
>>>>>>>>>>>>>>>>> deteriorate
>>>>>>>>>>>>>>>>> the row group compression.
>>>>>>>>>>>>>>>>> - Similar to adding new columns, partial update also
>>>>>>>>>>>>>>>>> requires rewriting all columns of the affected rows.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> IIRC, some table formats already support splitting columns
>>>>>>>>>>>>>>>>> into different files:
>>>>>>>>>>>>>>>>> - Lance manifest splits a fragment [1] into one or more
>>>>>>>>>>>>>>>>> data files.
>>>>>>>>>>>>>>>>> - Apache Hudi has the concept of column family [2].
>>>>>>>>>>>>>>>>> - Apache Paimon supports sequence groups [3] for partial
>>>>>>>>>>>>>>>>> update.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Although Parquet can introduce the concept of logical file
>>>>>>>>>>>>>>>>> and physical file to manage the columns to file mapping, this 
>>>>>>>>>>>>>>>>> looks like
>>>>>>>>>>>>>>>>> yet another manifest file design which duplicates the purpose 
>>>>>>>>>>>>>>>>> of Iceberg.
>>>>>>>>>>>>>>>>> These might be something worth exploring in Iceberg.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [1] https://lancedb.github.io/lance/format.html#fragments
>>>>>>>>>>>>>>>>> [2]
>>>>>>>>>>>>>>>>> https://github.com/apache/hudi/blob/master/rfc/rfc-80/rfc-80.md
>>>>>>>>>>>>>>>>> [3]
>>>>>>>>>>>>>>>>> https://paimon.apache.org/docs/0.9/primary-key-table/merge-engine/partial-update/#sequence-group
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>> Gang
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Tue, May 27, 2025 at 7:03 AM Steven Wu <
>>>>>>>>>>>>>>>>> stevenz...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The Parquet metadata proposal (linked by Fokko) is mainly
>>>>>>>>>>>>>>>>>> addressing the read performance due to bloated metadata.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> What Peter described in the description seems useful for
>>>>>>>>>>>>>>>>>> some ML workload of feature engineering. A new set of 
>>>>>>>>>>>>>>>>>> features/columns are
>>>>>>>>>>>>>>>>>> added to the table. Currently, Iceberg  would require 
>>>>>>>>>>>>>>>>>> rewriting all data
>>>>>>>>>>>>>>>>>> files to combine old and new columns (write amplification). 
>>>>>>>>>>>>>>>>>> Similarly, in
>>>>>>>>>>>>>>>>>> the past the community also talked about the use cases of 
>>>>>>>>>>>>>>>>>> updating a single
>>>>>>>>>>>>>>>>>> column, which would require rewriting all data files.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Mon, May 26, 2025 at 2:42 PM Péter Váry <
>>>>>>>>>>>>>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Do you have the link at hand for the thread where this
>>>>>>>>>>>>>>>>>>> was discussed on the Parquet list?
>>>>>>>>>>>>>>>>>>> The docs seem quite old, and the PR stale, so I would
>>>>>>>>>>>>>>>>>>> like to understand the situation better.
>>>>>>>>>>>>>>>>>>> If it is possible to do this in Parquet, that would be
>>>>>>>>>>>>>>>>>>> great, but Avro, ORC would still suffer.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Amogh Jahagirdar <2am...@gmail.com> wrote (on Mon, May 26, 2025, 22:07):
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hey Peter,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks for bringing this issue up. I think I agree with
>>>>>>>>>>>>>>>>>>>> Fokko; the issue of wide tables leading to Parquet 
>>>>>>>>>>>>>>>>>>>> metadata bloat and poor
>>>>>>>>>>>>>>>>>>>> Thrift deserialization performance is a long standing 
>>>>>>>>>>>>>>>>>>>> issue that I believe
>>>>>>>>>>>>>>>>>>>> there's motivation in the community to address. So to me it
>>>>>>>>>>>>>>>>>>>> seems better to address it in Parquet itself rather than have
>>>>>>>>>>>>>>>>>>>> the Iceberg library facilitate a pattern that works around the
>>>>>>>>>>>>>>>>>>>> limitations.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>> Amogh Jahagirdar
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Mon, May 26, 2025 at 1:42 PM Fokko Driesprong <
>>>>>>>>>>>>>>>>>>>> fo...@apache.org> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks for bringing this up. Wouldn't it make more
>>>>>>>>>>>>>>>>>>>>> sense to fix this in Parquet itself? It has been a 
>>>>>>>>>>>>>>>>>>>>> long-running issue on
>>>>>>>>>>>>>>>>>>>>> Parquet, and there is still active interest from the 
>>>>>>>>>>>>>>>>>>>>> community. There is a
>>>>>>>>>>>>>>>>>>>>> PR to replace the footer with FlatBuffers, which 
>>>>>>>>>>>>>>>>>>>>> dramatically
>>>>>>>>>>>>>>>>>>>>> improves performance
>>>>>>>>>>>>>>>>>>>>> <https://github.com/apache/arrow/pull/43793>. The
>>>>>>>>>>>>>>>>>>>>> underlying proposal can be found here
>>>>>>>>>>>>>>>>>>>>> <https://docs.google.com/document/d/1PQpY418LkIDHMFYCY8ne_G-CFpThK15LLpzWYbc7rFU/edit?tab=t.0#heading=h.atbrz9ch6nfa>
>>>>>>>>>>>>>>>>>>>>> .
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>> Fokko
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Mon, May 26, 2025 at 20:35, yun zou <yunzou.colost...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> +1, I am really interested in this topic. Performance
>>>>>>>>>>>>>>>>>>>>>> has always been a problem when dealing with wide tables, 
>>>>>>>>>>>>>>>>>>>>>> not just
>>>>>>>>>>>>>>>>>>>>>> read/write, but also during compilation. Most of the ML 
>>>>>>>>>>>>>>>>>>>>>> use cases typically
>>>>>>>>>>>>>>>>>>>>>> exhibit a vectorized read/write pattern, I am also 
>>>>>>>>>>>>>>>>>>>>>> wondering if there is
>>>>>>>>>>>>>>>>>>>>>> any way at the metadata level to help the whole 
>>>>>>>>>>>>>>>>>>>>>> compilation and execution
>>>>>>>>>>>>>>>>>>>>>> process. I do not have any answer for this yet, but I
>>>>>>>>>>>>>>>>>>>>>> would be really interested in exploring this further.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>>>>>>>>>> Yun
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Mon, May 26, 2025 at 9:14 AM Pucheng Yang
>>>>>>>>>>>>>>>>>>>>>> <py...@pinterest.com.invalid> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hi Peter, I am interested in this proposal. What's
>>>>>>>>>>>>>>>>>>>>>>> more, I am curious if there is a similar story on the 
>>>>>>>>>>>>>>>>>>>>>>> write side as well
>>>>>>>>>>>>>>>>>>>>>>> (how to generate these splitted files) and 
>>>>>>>>>>>>>>>>>>>>>>> specifically, are you targeting
>>>>>>>>>>>>>>>>>>>>>>> feature backfill use cases in ML use?
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Mon, May 26, 2025 at 6:29 AM Péter Váry <
>>>>>>>>>>>>>>>>>>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Hi Team,
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> In machine learning use-cases, it's common to
>>>>>>>>>>>>>>>>>>>>>>>> encounter tables with a very high number of columns - 
>>>>>>>>>>>>>>>>>>>>>>>> sometimes even in the
>>>>>>>>>>>>>>>>>>>>>>>> range of several thousand. I've seen cases with up to 
>>>>>>>>>>>>>>>>>>>>>>>> 15,000 columns.
>>>>>>>>>>>>>>>>>>>>>>>> Storing such wide tables in a single Parquet file is 
>>>>>>>>>>>>>>>>>>>>>>>> often suboptimal, as
>>>>>>>>>>>>>>>>>>>>>>>> Parquet can become a bottleneck, even when only a 
>>>>>>>>>>>>>>>>>>>>>>>> subset of columns is
>>>>>>>>>>>>>>>>>>>>>>>> queried.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> A common approach to mitigate this is to split the
>>>>>>>>>>>>>>>>>>>>>>>> data across multiple Parquet files. With the upcoming 
>>>>>>>>>>>>>>>>>>>>>>>> File Format API, we
>>>>>>>>>>>>>>>>>>>>>>>> could introduce a layer that combines these files into 
>>>>>>>>>>>>>>>>>>>>>>>> a single iterator,
>>>>>>>>>>>>>>>>>>>>>>>> enabling efficient reading of wide and very wide 
>>>>>>>>>>>>>>>>>>>>>>>> tables.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> To support this, we would need to revise the
>>>>>>>>>>>>>>>>>>>>>>>> metadata specification. Instead of the current `_file` 
>>>>>>>>>>>>>>>>>>>>>>>> column, we could
>>>>>>>>>>>>>>>>>>>>>>>> introduce a _files column containing:
>>>>>>>>>>>>>>>>>>>>>>>> - `_file_column_ids`: the column IDs present in
>>>>>>>>>>>>>>>>>>>>>>>> each file
>>>>>>>>>>>>>>>>>>>>>>>> - `_file_path`: the path to the corresponding file
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Has there been any prior discussion around this
>>>>>>>>>>>>>>>>>>>>>>>> idea?
>>>>>>>>>>>>>>>>>>>>>>>> Is anyone else interested in exploring this further?
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>
>>>>>>>
