Re: Wide tables in V4

2025-06-13 Thread Péter Váry
Hi everyone, I did some experiments with splitting up wide Parquet files into multiple column families. You can check the PR here: https://github.com/apache/iceberg/pull/13306. What the test does:
- Creates tables with 100/1000/1 columns, where the column type is double
- Generates …

Re: Wide tables in V4

2025-06-06 Thread Micah Kornfield
At a high-level we should probably work out if supporting wide tables with performant appends is something we want to invest effort into and focus on the lower level questions once that is resolved. I think it would be great to make this work, I think the main question is whether any PMC/community

Re: Wide tables in V4

2025-06-06 Thread Jean-Baptiste Onofré
Hi Peter, Thanks for your message. It's an interesting topic. Wouldn't it be more of a data file/Parquet "issue"? Especially with the data file API you are proposing, I think Iceberg should "delegate" to the data file layer (Parquet here) and Iceberg could be "agnostic". Regards, JB On Mon, May 26 …

Re: Wide tables in V4

2025-06-05 Thread Péter Váry
For the record, link from a user requesting this feature: https://github.com/apache/iceberg/issues/11634 On Mon, Jun 2, 2025, 12:34 Péter Váry wrote: > Hi Bart, > > Thanks for your answer! > I’ve pulled out some text from your thorough and well-organized response > to make it easier to highlight

Re: Wide tables in V4

2025-06-02 Thread Péter Váry
Hi Bart, Thanks for your answer! I’ve pulled out some text from your thorough and well-organized response to make it easier to highlight my comments. > It would be well possible to tune parquet writers to write very large row groups when a large string column dominates. [..] What would you do, i…

Re: Wide tables in V4

2025-06-02 Thread Bart Samwel
On Fri, May 30, 2025 at 8:35 PM Péter Váry wrote: > Consider this example > Imagine a table with one large string column and many small numeric > columns. > > Scenario 1: Single File > >- All columns are written into a single file. >- The RowGroup size is small due to the large string column …

Re: Wide tables in V4

2025-05-30 Thread Péter Váry
Consider this example: imagine a table with one large string column and many small numeric columns. Scenario 1: Single File
- All columns are written into a single file.
- The RowGroup size is small due to the large string column dominating the layout.
- The numeric columns are not com…
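Péter's scenario can be made concrete with some back-of-the-envelope arithmetic. All constants below are illustrative assumptions, not measurements from the thread: under a fixed row-group byte budget, one large string column drags every other column into tiny row groups, while giving the string column its own physical file lets the numeric file pack far more rows per row group.

```python
# Illustrative arithmetic only: a Parquet writer flushes a row group when
# its buffer reaches a byte budget, so per-row size decides rows per group.

ROW_GROUP_BUDGET = 128 * 1024 * 1024   # assumed 128 MiB target per row group
STRING_BYTES_PER_ROW = 64 * 1024       # assumed 64 KiB string value per row
NUMERIC_COLUMNS = 1000                 # many small numeric columns
NUMERIC_BYTES_PER_ROW = 8              # one double each

# Scenario 1: single file, the string column dominates the per-row size.
row_bytes = STRING_BYTES_PER_ROW + NUMERIC_COLUMNS * NUMERIC_BYTES_PER_ROW
rows_single = ROW_GROUP_BUDGET // row_bytes

# Scenario 2: the string column lives in its own physical file, so the
# numeric-only file packs far more rows into the same budget.
rows_numeric_only = ROW_GROUP_BUDGET // (NUMERIC_COLUMNS * NUMERIC_BYTES_PER_ROW)

print(f"rows per row group, single file:  {rows_single}")
print(f"rows per row group, numeric file: {rows_numeric_only}")
```

With these assumed sizes the single file fits roughly 1.8K rows per row group versus roughly 16.7K for the numeric-only file, which is why the numeric columns end up poorly compressed in the single-file layout.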

Re: Wide tables in V4

2025-05-30 Thread Péter Váry
> A larger problem for splitting columns across files is that there are a lot of assumptions about how data is laid out in both readers and writers. For example, aligning row groups and correctly handling split calculation is very complicated if you're trying to split rows across files. Other feat…

Re: Wide tables in V4

2025-05-30 Thread Bart Samwel
On Fri, May 30, 2025 at 3:33 PM Péter Váry wrote: > One key advantage of introducing Physical Files is the flexibility to vary > RowGroup sizes across columns. For instance, wide string columns could > benefit from smaller RowGroups to reduce memory pressure, while numeric > columns could use larger …

Re: Wide tables in V4

2025-05-29 Thread Gang Wu
IMO, the main drawback of the view solution is the complexity of maintaining consistency across tables if we want to use features like time travel, incremental scan, branch & tag, encryption, etc. On Fri, May 30, 2025 at 12:55 PM Bryan Keller wrote: > Fewer commit conflicts meaning the tables representing …

Re: Wide tables in V4

2025-05-29 Thread Bryan Keller
Fewer commit conflicts, meaning the tables representing column families are updated independently rather than having to serialize commits to a single table. Perhaps with a wide table solution the commit logic could be enhanced to support things like concurrent overwrites to independent column families …

Re: Wide tables in V4

2025-05-29 Thread Steven Wu
Bryan, interesting approach to split horizontally across multiple tables. A few potential downsides:
* operational overhead: tables need to be managed consistently and probably in some coordinated way
* complex reads
* maybe fragile to enforce correctness (during join). It is robust to enforce the …

Re: Wide tables in V4

2025-05-29 Thread Bryan Keller
Hi everyone, We have been investigating a wide table format internally for a similar use case, i.e. we have wide ML tables with features generated by different pipelines and teams but want a unified view of the data. We are comparing that against separate tables joined together using a shuffle-…

Re: Wide tables in V4

2025-05-29 Thread Péter Váry
I received feedback from Alkis regarding their Parquet optimization work. Their internal testing shows promising results for reducing metadata size and improving parsing performance. They plan to formalize a proposal for these Parquet enhancements in the near future. Meanwhile, I'm putting together …

Re: Wide tables in V4

2025-05-28 Thread Péter Váry
I would be happy to put together a proposal based on the inputs I got here. Thanks everyone for your thoughts! I will try to incorporate all of this. Thanks, Peter Daniel Weeks wrote (on Tue, May 27, 2025, 20:07): > I feel like we have two different issues we're talking about here that …

Re: Wide tables in V4

2025-05-27 Thread Daniel Weeks
I feel like we have two different issues we're talking about here that aren't necessarily tied (though solutions may address both): 1) wide tables, 2) adding columns. Wide tables are definitely a problem where Parquet has limitations. I'm optimistic about the ongoing work to help improve Parquet fo…

Re: Wide tables in V4

2025-05-27 Thread Selcuk Aya
Yes, having to rewrite the whole file is not ideal, but I believe most of the cost of rewriting a file comes from decompression, encoding, stats calculations, etc. If you are adding new values for some columns but are keeping the rest of the columns the same in the file, then a bunch of rewrite cost c…
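A toy model of Selcuk's observation (plain Python, not real Parquet I/O; `encode_chunk` is a stand-in for Parquet's encode-plus-compress path): if column chunks are independently encoded byte ranges, a rewrite that only adds a column can copy the untouched chunks verbatim and pay the encoding cost only for the new data.

```python
# Sketch under the assumption that a "file" is a dict of column name ->
# encoded chunk bytes. Real Parquet would also need footer surgery; this
# only illustrates where the rewrite cost can be skipped.
import zlib

def encode_chunk(values):
    """The expensive path: serialize and compress a column's values."""
    return zlib.compress(",".join(map(str, values)).encode())

def rewrite_with_new_column(old_file, new_name, new_values):
    """Existing columns take the cheap path: a raw byte copy, no re-encode."""
    new_file = dict(old_file)                 # copies chunk bytes as-is
    new_file[new_name] = encode_chunk(new_values)
    return new_file

old = {"a": encode_chunk(range(4)), "b": encode_chunk(range(4))}
new = rewrite_with_new_column(old, "c", [10, 20, 30, 40])

assert new["a"] is old["a"]                   # untouched chunk: zero encode work
assert zlib.decompress(new["c"]) == b"10,20,30,40"
```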

Re: Wide tables in V4

2025-05-27 Thread Russell Spitzer
Point definitely taken. We really should probably POC some of these ideas and see what we are actually dealing with. (He said without volunteering to do the work :P) On Tue, May 27, 2025 at 11:55 AM Selcuk Aya wrote: > Yes having to rewrite the whole file is not ideal but I believe most of > the

Re: Wide tables in V4

2025-05-27 Thread Russell Spitzer
I think that "after the fact" modification is one of the requirements here, IE: Updating a single column without rewriting the whole file. If we have to write new metadata for the file, aren't we in the same boat as having to rewrite the whole file? On Tue, May 27, 2025 at 11:27 AM Selcuk Aya wrote: …

Re: Wide tables in V4

2025-05-27 Thread Selcuk Aya
If files represent column projections of a table rather than all the columns in the table, then any read that reads across these files needs to identify what constitutes a row. Lance DB for example has vertical partitioning across columns but also horizontal partitioning across rows, such that in…
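A minimal sketch of the kind of layout Selcuk mentions (names and data are invented for illustration): columns split vertically into separate per-column files, rows split horizontally into fragments, and row identity defined positionally, so that offset i across all column files of one fragment is the same logical row and a cross-file read just zips the files.

```python
# fragment id -> {column name -> values, aligned by row offset}.
# Each inner list stands in for one per-column file of that fragment.
fragments = {
    0: {"id": [1, 2], "feature_x": [0.1, 0.2]},
    1: {"id": [3, 4], "feature_x": [0.3, 0.4]},
}

def read_rows(fragments, columns):
    """Reassemble rows by position: offset i in every column file of a
    fragment belongs to the same logical row, so no join key is needed."""
    for frag_id in sorted(fragments):
        files = fragments[frag_id]
        n_rows = len(files[columns[0]])       # all files of a fragment align
        for offset in range(n_rows):
            yield {col: files[col][offset] for col in columns}

rows = list(read_rows(fragments, ["id", "feature_x"]))
print(rows[0])  # {'id': 1, 'feature_x': 0.1}
```

The design choice being illustrated: row identity comes from (fragment, offset) rather than from a stored key, which is what makes the cross-file read cheap but also what forces writers to keep the horizontal boundaries of every column file aligned.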

Re: Wide tables in V4

2025-05-27 Thread Devin Smith
There's a `file_path` field in the parquet ColumnChunk structure, https://github.com/apache/parquet-format/blob/apache-parquet-format-2.11.0/src/main/thrift/parquet.thrift#L959-L962 I'm not sure what tooling actually supports this though. Could be interesting to see what the history of this is. ht…

Re: Wide tables in V4

2025-05-27 Thread Russell Spitzer
I have to agree that while there can be some fixes in Parquet, we fundamentally need a way to split a "row group" or something like that between separate files. If that's something we can do in the parquet project that would be great but it feels like we need to start exploring more drastic options

Re: Wide tables in V4

2025-05-26 Thread Gang Wu
I agree with Steven that there are limitations that Parquet cannot overcome. In addition to adding new columns by rewriting all files, files of wide tables may suffer from bad performance like below:
- Poor compression of row groups, because there are too many columns and even a small number of rows can …
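Gang's metadata concern is essentially multiplicative growth: the Parquet footer carries one column-chunk metadata entry per column per row group, so footer size scales with both. A rough sketch, where the 200-byte per-entry size is an assumed ballpark rather than a measured figure:

```python
# Back-of-the-envelope model of footer size for wide tables. The constant
# is an assumption for illustration, not a Parquet measurement.
BYTES_PER_CHUNK_META = 200   # assumed serialized size of one ColumnChunk entry

def footer_bytes(num_columns, num_row_groups):
    """One metadata entry per (column, row group) pair."""
    return num_columns * num_row_groups * BYTES_PER_CHUNK_META

for cols in (100, 1000, 10000):
    print(f"{cols:>6} columns, 50 row groups: {footer_bytes(cols, 50):>11} bytes")
```

At 10,000 columns and 50 row groups this assumed model already puts the footer around 100 MB, which is the kind of bloat the footer-rework proposals mentioned elsewhere in this thread are targeting.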

Re: Wide tables in V4

2025-05-26 Thread Steven Wu
The Parquet metadata proposal (linked by Fokko) mainly addresses the read performance issues due to bloated metadata. What Peter described in the description seems useful for some ML workloads of feature engineering. A new set of features/columns is added to the table. Currently, Iceberg would require …

Re: Wide tables in V4

2025-05-26 Thread Péter Váry
Do you have the link at hand for the thread where this was discussed on the Parquet list? The docs seem quite old and the PR stale, so I would like to understand the situation better. If it is possible to do this in Parquet, that would be great, but Avro and ORC would still suffer. Amogh Jahagirdar …

Re: Wide tables in V4

2025-05-26 Thread Amogh Jahagirdar
Hey Peter, Thanks for bringing this issue up. I think I agree with Fokko; the issue of wide tables leading to Parquet metadata bloat and poor Thrift deserialization performance is a long-standing issue that I believe there's motivation in the community to address. So to me it seems better to address …

Re: Wide tables in V4

2025-05-26 Thread Fokko Driesprong
Hi Peter, Thanks for bringing this up. Wouldn't it make more sense to fix this in Parquet itself? It has been a long-running issue on Parquet, and there is still active interest from the community. There is a PR to replace the footer with FlatBuffers, which dramatically improves performance

Re: Wide tables in V4

2025-05-26 Thread Pucheng Yang
Hi Peter, I am interested in this proposal. What's more, I am curious if there is a similar story on the write side as well (how to generate these split files) and, specifically, are you targeting feature backfill use cases in ML? On Mon, May 26, 2025 at 6:29 AM Péter Váry wrote: > Hi Team …

Re: Wide tables in V4

2025-05-26 Thread yun zou
+1, I am really interested in this topic. Performance has always been a problem when dealing with wide tables, not just read/write but also during compilation. Most of the ML use cases typically exhibit a vectorized read/write pattern. I am also wondering if there is any way at the metadata level …