Hi everyone,
I did some experiments with splitting up wide Parquet files into multiple
column families. You can check the PR here:
https://github.com/apache/iceberg/pull/13306.
What the test does:
- Creates tables with 100/1000/1 columns, where the column type is
double
- Generates
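For anyone who wants to reproduce a similar setup, here is a minimal sketch (this is not the code from the PR; the column count and naming are made up) of building such a wide, all-double Iceberg schema with the Java API:

import java.util.ArrayList;
import java.util.List;

import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

class WideSchemaSketch {
  // Builds a schema with `width` optional double columns named c0..c{width-1},
  // mirroring the shape of the benchmark tables described above.
  static Schema wideDoubleSchema(int width) {
    List<Types.NestedField> fields = new ArrayList<>();
    for (int i = 0; i < width; i++) {
      fields.add(Types.NestedField.optional(i + 1, "c" + i, Types.DoubleType.get()));
    }
    return new Schema(fields);
  }
}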
At a high level, we should probably work out whether supporting wide tables
with performant appends is something we want to invest effort into, and focus
on the lower-level questions once that is resolved. I think it would be great
to make this work; the main question is whether any PMC/community
Hi Peter
Thanks for your message. It's an interesting topic.
Wouldn't it be more of a data file/Parquet "issue"? Especially with the
data file API you are proposing, I think Iceberg should "delegate" to
the data file layer (Parquet here) so that Iceberg can stay "agnostic".
Regards
JB
On Mon, May 26
For the record, link from a user requesting this feature:
https://github.com/apache/iceberg/issues/11634
On Mon, Jun 2, 2025, 12:34 Péter Váry wrote:
Hi Bart,
Thanks for your answer!
I’ve pulled out some text from your thorough and well-organized response to
make it easier to highlight my comments.
> It would be quite possible to tune Parquet writers to write very large row
groups when a large string column dominates. [..]
What would you do, i
On Fri, May 30, 2025 at 8:35 PM Péter Váry
wrote:
Consider this example
Imagine a table with one large string column and many small numeric columns.
Scenario 1: Single File
- All columns are written into a single file.
- The RowGroup size is small due to the large string column dominating
the layout.
- The numeric columns are not com
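To make the constraint concrete: the Parquet row-group size in Iceberg is a single per-file property, so there is no way today to give the large string column and the small numeric columns different targets within the same file. A minimal sketch (the Table handle and the size value are illustrative):

import org.apache.iceberg.Table;

class RowGroupSizeSketch {
  // The only knob available applies to the whole file, not to individual
  // columns or column families.
  static void setRowGroupSize(Table table, long bytes) {
    table.updateProperties()
        .set("write.parquet.row-group-size-bytes", Long.toString(bytes))
        .commit();
  }
}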
> A larger problem for splitting columns across files is that there are a
lot of assumptions about how data is laid out in both readers and writers.
For example, aligning row groups and correctly handling split calculation
is very complicated if you're trying to split rows across files. Other
feat
On Fri, May 30, 2025 at 3:33 PM Péter Váry
wrote:
> One key advantage of introducing Physical Files is the flexibility to vary
> RowGroup sizes across columns. For instance, wide string columns could
> benefit from smaller RowGroups to reduce memory pressure, while numeric
> columns could use lar
IMO, the main drawback of the view solution is the complexity of
maintaining consistency across tables if we want to use features like time
travel, incremental scan, branch & tag, encryption, etc.
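For context, a hedged sketch of the view-based approach being weighed here (Spark SQL issued through the Java API; all table, view and column names are made up): each column family lives in its own Iceberg table and a view joins them back on a shared row key. Every feature listed above would then have to be kept consistent across all of the underlying tables.

import org.apache.spark.sql.SparkSession;

class ColumnFamilyViewSketch {
  // Stitches two hypothetical column-family tables back into one logical
  // wide table; time travel, incremental scans, branches/tags, etc. are not
  // coordinated across the joined tables by the view itself.
  static void createWideView(SparkSession spark) {
    spark.sql(
        "CREATE OR REPLACE VIEW db.wide_view AS "
            + "SELECT b.row_id, b.payload, f.feature_1, f.feature_2 "
            + "FROM db.base_family b "
            + "JOIN db.feature_family f ON b.row_id = f.row_id");
  }
}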
On Fri, May 30, 2025 at 12:55 PM Bryan Keller wrote:
Fewer commit conflicts meaning the tables representing column families are
updated independently, rather than having to serialize commits to a single
table. Perhaps with a wide table solution the commit logic could be enhanced to
support things like concurrent overwrites to independent column fa
Bryan, interesting approach to split vertically across multiple tables.
A few potential downsides:
* operational overhead: tables need to be managed consistently and probably
in some coordinated way
* complex reads
* maybe fragile to enforce correctness (during joins). It is robust to
enforce the
Hi everyone,
We have been investigating a wide table format internally for a similar use
case, i.e. we have wide ML tables with features generated by different
pipelines and teams but want a unified view of the data. We are comparing that
against separate tables joined together using a shuffle-
I received feedback from Alkis regarding their Parquet optimization work.
Their internal testing shows promising results for reducing metadata size
and improving parsing performance. They plan to formalize a proposal for
these Parquet enhancements in the near future.
Meanwhile, I'm putting together
I would be happy to put together a proposal based on the input I got here.
Thanks everyone for your thoughts!
I will try to incorporate all of this.
Thanks, Peter
Daniel Weeks wrote (on Tue, May 27, 2025, 20:07):
I feel like we have two different issues we're talking about here that
aren't necessarily tied (though solutions may address both): 1) wide
tables, 2) adding columns.
Wide tables are definitely a problem where Parquet has limitations. I'm
optimistic about the ongoing work to help improve Parquet fo
Yes, having to rewrite the whole file is not ideal, but I believe most of the
cost of rewriting a file comes from decompression, encoding, stats
calculations, etc. If you are adding new values for some columns but are
keeping the rest of the columns the same in the file, then a bunch of the
rewrite cost can be avoided.
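To make that point concrete, parquet-java already lets you copy existing row groups as raw bytes, which is exactly the part of a rewrite that skips decompression, re-encoding and stats recalculation. A rough sketch (file paths are placeholders; this only covers the untouched row groups, not writing the changed values):

import java.util.Collections;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.hadoop.util.HadoopOutputFile;
import org.apache.parquet.schema.MessageType;

class RawRowGroupCopySketch {
  // Copies every row group of src into dst as raw bytes: the copied columns
  // are never decompressed, decoded, re-encoded or re-analyzed for stats;
  // only the footer is rewritten.
  static void copy(String src, String dst) throws Exception {
    Configuration conf = new Configuration();
    HadoopInputFile in = HadoopInputFile.fromPath(new Path(src), conf);

    MessageType schema;
    try (ParquetFileReader reader = ParquetFileReader.open(in)) {
      schema = reader.getFooter().getFileMetaData().getSchema();
    }

    ParquetFileWriter writer =
        new ParquetFileWriter(
            HadoopOutputFile.fromPath(new Path(dst), conf),
            schema,
            ParquetFileWriter.Mode.CREATE,
            128L * 1024 * 1024,   // target row-group size
            8 * 1024 * 1024);     // max padding
    writer.start();
    writer.appendFile(in);        // raw copy of all row groups
    writer.end(Collections.emptyMap());
  }
}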
Point definitely taken. We really should probably POC some of these ideas
and see what we are actually dealing with. (He said without volunteering to
do the work :P)
On Tue, May 27, 2025 at 11:55 AM Selcuk Aya
wrote:
I think that "after the fact" modification is one of the requirements here,
IE: Updating a single column without rewriting the whole file.
If we have to write new metadata for the file aren't we in the same boat as
having to rewrite the whole file?
On Tue, May 27, 2025 at 11:27 AM Selcuk Aya
wrot
If files represent column projections of a table rather than all the
columns in the table, then any read that spans these files needs to
identify what constitutes a row. Lance DB, for example, has vertical
partitioning across columns but also horizontal partitioning across rows,
such that in
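As a purely illustrative sketch of that alignment requirement (these names are hypothetical, not taken from any existing spec or proposal): if one logical data file were split into per-column-family physical files, the metadata would have to pin them to the same ordered row range so a reader can zip the projections back into rows without a join.

import java.util.List;

// Each member file holds one column family for exactly the same rows, in the
// same order; rowCount must be identical across all members.
record ColumnFamilyFile(String path, List<String> columns) {}

record LogicalRowRange(
    long firstRowId,                 // position of the first row in the table
    long rowCount,                   // shared by every member file
    List<ColumnFamilyFile> members) {}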
There's a `file_path` field in the Parquet ColumnChunk structure:
https://github.com/apache/parquet-format/blob/apache-parquet-format-2.11.0/src/main/thrift/parquet.thrift#L959-L962
I'm not sure what tooling actually supports this, though. It could be
interesting to see what the history of this is.
ht
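For reference, a small sketch using the Thrift-generated classes from parquet-format-structures; it assumes a FileMetaData instance has already been deserialized from a file footer and simply prints the per-chunk file_path values. A null value means the chunk lives in the same file as the footer, which is what writers produce today as far as I know.

import org.apache.parquet.format.ColumnChunk;
import org.apache.parquet.format.FileMetaData;
import org.apache.parquet.format.RowGroup;

class ChunkPathSketch {
  // Walks the raw (Thrift) footer and prints ColumnChunk.file_path for every
  // column chunk in every row group.
  static void printChunkPaths(FileMetaData footer) {
    for (RowGroup rowGroup : footer.getRow_groups()) {
      for (ColumnChunk chunk : rowGroup.getColumns()) {
        System.out.println(chunk.getFile_path());
      }
    }
  }
}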
I have to agree that while there can be some fixes in Parquet, we
fundamentally need a way to split a "row group",
or something like that, between separate files. If that's something we can
do in the Parquet project, that would be great,
but it feels like we need to start exploring more drastic options.
I agree with Steven that there are limits to what Parquet can do.
Besides new columns requiring all files to be rewritten, the files of wide
tables may suffer from poor performance, for example:
- Poor compression of row groups, because there are too many columns and
even a small number of rows can
The Parquet metadata proposal (linked by Fokko) mainly addresses
the read-performance problems caused by bloated metadata.
What Peter described in the description seems useful for some ML feature
engineering workloads: a new set of features/columns is added to the
table. Currently, Iceberg would requi
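For clarity on the "adding columns" half: the schema change itself is already a metadata-only operation in Iceberg; what the thread is really about is populating values for rows that already exist, which today means rewriting data files. A minimal sketch of the cheap part (the column name is illustrative):

import org.apache.iceberg.Table;
import org.apache.iceberg.types.Types;

class AddFeatureColumnSketch {
  // Metadata-only: existing data files are untouched and the new column reads
  // as null until it is backfilled.
  static void addFeature(Table table) {
    table.updateSchema()
        .addColumn("new_feature", Types.DoubleType.get())
        .commit();
  }
}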
Do you have the link at hand for the thread where this was discussed on the
Parquet list?
The docs seem quite old, and the PR stale, so I would like to understand
the situation better.
If it is possible to do this in Parquet, that would be great, but Avro and
ORC would still suffer.
Amogh Jahagirdar
Hey Peter,
Thanks for bringing this issue up. I think I agree with Fokko; wide
tables leading to Parquet metadata bloat and poor Thrift
deserialization performance is a long-standing issue that I believe there's
motivation in the community to address. So to me it seems better to addre
Hi Peter,
Thanks for bringing this up. Wouldn't it make more sense to fix this in
Parquet itself? It has been a long-running issue on Parquet, and there is
still active interest from the community. There is a PR to replace the
footer with FlatBuffers, which dramatically improves performance
Hi Peter, I am interested in this proposal. What's more, I am curious
whether there is a similar story on the write side as well (how to generate
these split files), and specifically, are you targeting feature backfill use
cases in ML?
On Mon, May 26, 2025 at 6:29 AM Péter Váry
wrote:
> Hi Team
+1, I am really interested in this topic. Performance has always been a
problem when dealing with wide tables, not just read/write, but also during
compilation. Most ML use cases typically exhibit a vectorized
read/write pattern; I am also wondering if there is any way at the metadata
level