At a high level, we should probably work out whether supporting wide tables
with performant appends is something we want to invest effort into, and focus
on the lower-level questions once that is resolved.  I think it would be great
to make this work; the main question is whether any PMC/community members feel
it would introduce too much complexity to proceed with further
design/analysis.

Some more detailed replies to what has been discussed in the thread:

> I might be wrong, but page skipping relies on page headers which are stored
> in-line with the data itself. When downloading data from blob stores this
> could be less than ideal.


No, Parquet supports page indices
<https://parquet.apache.org/docs/file-format/pageindex/> [1].  I also think it
is reasonable to consider improvements to Parquet so that large blobs can be
handled better. There is general interest in evolving Parquet to better
support these use-cases, so if there are clear items that can be pushed down
to the file level, let's have those conversations.
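
For context, the column and offset indexes live near the footer rather than in
the in-line page headers, so a reader can plan page skips from a footer-ranged
read before fetching any data pages. Below is a minimal sketch of inspecting
them with parquet-java (assuming parquet-java 1.11+ and a local Hadoop Path;
the class name is made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.internal.column.columnindex.ColumnIndex;
import org.apache.parquet.internal.column.columnindex.OffsetIndex;

public class PageIndexProbe {
  public static void main(String[] args) throws Exception {
    // Path to an existing Parquet file written with page indexes enabled.
    Path path = new Path(args[0]);
    try (ParquetFileReader reader =
        ParquetFileReader.open(HadoopInputFile.fromPath(path, new Configuration()))) {
      for (BlockMetaData rowGroup : reader.getFooter().getBlocks()) {
        for (ColumnChunkMetaData column : rowGroup.getColumns()) {
          // Both indexes are read from the footer region, not from the data pages.
          ColumnIndex columnIndex = reader.readColumnIndex(column);   // per-page min/max/null info
          OffsetIndex offsetIndex = reader.readOffsetIndex(column);   // per-page offsets and sizes
          if (columnIndex != null && offsetIndex != null) {
            System.out.printf("%s: %d pages indexed%n",
                column.getPath(), offsetIndex.getPageCount());
          }
        }
      }
    }
  }
}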


> Would it not be more a data file/parquet "issue" ? Especially with the
> data file API you are proposing, I think Iceberg should "delegate" to
> the data file layer (Parquet here) and Iceberg could be "agnostic".


I think we should table the discussion on exactly what belongs in each layer
until we have more data.  Roughly, I think the concerns expressed boil down to
a few main buckets:

1.  Read vs. write amplification (I think one could run some rough
experiments with low-level Parquet APIs to see the impact of splitting columns
out into individual objects to answer both sides of this; a rough sketch of
such an experiment is included after this list).
     - For large blobs, I think memory pressure becomes a real concern here
as well.

2.  Complexity:
     - If multiple files are needed for performance, what advantages do we
gain from having effectively two-level manifests?  What does it do to Iceberg
metadata to have to track both?  (V4 is actually a great place to look at this
since it seems like we are looking at major metadata overhauls anyway; if it
is too ambitious we can perhaps postpone some of the work to V5.)
     - What are the implications for things like time-travel, maintenance,
etc. in these cases?  I would guess this probably needs a bit more detailed
design considering the two options (pushing some concerns down to Parquet vs.
handling everything in Iceberg metadata).
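
As a starting point for item 1, here is a rough sketch of such an experiment
using parquet-java's example Group API. All numbers, paths, and class names
below are made up; a real experiment would use far more columns and run
against an object store so that request latency and footer overhead actually
show up in the results, rather than a local filesystem as assumed here.

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.api.ReadSupport;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.hadoop.example.GroupReadSupport;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.Type;
import org.apache.parquet.schema.Types;
import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.INT64;

public class ColumnSplitExperiment {
  static final int ROWS = 100_000;
  static final int COLUMNS = 200;   // scale toward the thousands for a real test
  static final int PROJECTED = 5;   // columns touched by the "narrow" query

  static MessageType schema(List<String> columns) {
    List<Type> fields = new ArrayList<>();
    for (String name : columns) {
      fields.add(Types.required(INT64).named(name));
    }
    return new MessageType("bench", fields);
  }

  // Write ROWS rows of synthetic data for the given columns into a single file.
  static void write(Path file, List<String> columns) throws Exception {
    MessageType schema = schema(columns);
    SimpleGroupFactory factory = new SimpleGroupFactory(schema);
    try (ParquetWriter<Group> writer =
        ExampleParquetWriter.builder(file).withType(schema).build()) {
      for (long row = 0; row < ROWS; row++) {
        Group group = factory.newGroup();
        for (String name : columns) {
          group.add(name, row);
        }
        writer.write(group);
      }
    }
  }

  // Read only the projected columns from one file and return elapsed nanos.
  static long timedProjectedRead(Path file, List<String> projection) throws Exception {
    Configuration conf = new Configuration();
    conf.set(ReadSupport.PARQUET_READ_SCHEMA, schema(projection).toString());
    long start = System.nanoTime();
    try (ParquetReader<Group> reader =
        ParquetReader.builder(new GroupReadSupport(), file).withConf(conf).build()) {
      while (reader.read() != null) {
        // drain the projected rows
      }
    }
    return System.nanoTime() - start;
  }

  public static void main(String[] args) throws Exception {
    List<String> columns = new ArrayList<>();
    for (int i = 0; i < COLUMNS; i++) {
      columns.add("c" + i);
    }

    // Layout (a): all columns in one wide file; layout (b): one file per column.
    // Timing the two write loops gives a first look at write amplification.
    write(new Path("/tmp/wide.parquet"), columns);
    for (String name : columns) {
      write(new Path("/tmp/split_" + name + ".parquet"), List.of(name));
    }

    // Read amplification: project a few columns from the wide file vs. reading
    // the corresponding single-column files.
    List<String> projection = columns.subList(0, PROJECTED);
    long wideNanos = timedProjectedRead(new Path("/tmp/wide.parquet"), projection);
    long splitNanos = 0;
    for (String name : projection) {
      splitNanos += timedProjectedRead(new Path("/tmp/split_" + name + ".parquet"), List.of(name));
    }
    System.out.printf("wide read: %d ms, split read: %d ms%n",
        wideNanos / 1_000_000, splitNanos / 1_000_000);
  }
}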


Some of the complexity questions can be answered by prototyping the APIs
necessary to make this work. Specifically, I think we would at least need:

1.  A `newAppendColumns` API added to the Transaction interface
<https://iceberg.apache.org/javadoc/1.9.1/org/apache/iceberg/Transaction.html>
[2].
Lance's APIs [3] might provide some inspiration here.
2a.  New abstractions to handle columns for the same rows split across
files.
2b.  New file-level APIs for:
           - Appending columns
           - Deleting files (if we decide on multiple files for a row range
and this is pushed down to the file level, the deletion logic needs to be
delegated to the file level as well).

Items 2a/2b depend on the ultimate approach taken, but trying to sketch them
out, along with how they relate to the Transaction API, might help inform the
decision on complexity.
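
To make that concrete, here is a deliberately rough sketch of the shape item 1
could take. Every name below is hypothetical and nothing like it exists in
Iceberg today; only DataFile and SnapshotUpdate are existing Iceberg types.

import java.util.Set;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.SnapshotUpdate;

/**
 * Hypothetical snapshot update that attaches column-only files to rows that
 * already exist in the table. The Transaction interface [2] would gain a
 * matching factory method, e.g. AppendColumns newAppendColumns(), alongside
 * the existing newAppend().
 */
public interface AppendColumns extends SnapshotUpdate<AppendColumns> {

  /**
   * Register a file that stores only the given column ids for the same row
   * range as an existing data file. Whether the row range is identified by
   * data file, by row-id range, or by something else is exactly the kind of
   * question prototyping 2a/2b should answer.
   */
  AppendColumns appendColumns(DataFile existingFile, Set<Integer> columnIds, DataFile columnFile);
}

Even a tiny sketch like this forces the question of how new columns are
matched to existing rows, which is where most of the complexity in 2a/2b
lives.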

Other feature interactions probably need a more careful analysis when
proposing the spec changes.

Cheers,
Micah

[1] https://parquet.apache.org/docs/file-format/pageindex/
[2]
https://iceberg.apache.org/javadoc/1.9.1/org/apache/iceberg/Transaction.html
[3]
https://lancedb.github.io/lance/introduction/schema_evolution.html#adding-new-columns

On Fri, Jun 6, 2025 at 10:39 AM Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

> Hi Peter
>
> Thanks for your message. It's an interesting topic.
>
> Would it not be more a data file/parquet "issue" ? Especially with the
> data file API you are proposing, I think Iceberg should "delegate" to
> the data file layer (Parquet here) and Iceberg could be "agnostic".
>
> Regards
> JB
>
> On Mon, May 26, 2025 at 6:28 AM Péter Váry <peter.vary.apa...@gmail.com>
> wrote:
> >
> > Hi Team,
> >
> > In machine learning use-cases, it's common to encounter tables with a
> very high number of columns - sometimes even in the range of several
> thousand. I've seen cases with up to 15,000 columns. Storing such wide
> tables in a single Parquet file is often suboptimal, as Parquet can become
> a bottleneck, even when only a subset of columns is queried.
> >
> > A common approach to mitigate this is to split the data across multiple
> Parquet files. With the upcoming File Format API, we could introduce a
> layer that combines these files into a single iterator, enabling efficient
> reading of wide and very wide tables.
> >
> > To support this, we would need to revise the metadata specification.
> Instead of the current `_file` column, we could introduce a _files column
> containing:
> > - `_file_column_ids`: the column IDs present in each file
> > - `_file_path`: the path to the corresponding file
> >
> > Has there been any prior discussion around this idea?
> > Is anyone else interested in exploring this further?
> >
> > Best regards,
> > Peter
>
