At a high level, we should probably work out whether supporting wide tables with performant appends is something we want to invest effort into, and focus on the lower-level questions once that is resolved. I think it would be great to make this work; the main question in my mind is whether any PMC/community members feel it would introduce too much complexity to proceed with further design/analysis.
Some more detailed replies to what has been discussed in the thread:

> I might be wrong, but page skipping relies on page headers which are stored in-line with the data itself. When downloading data from blob stores this could be less than ideal.

No, Parquet supports page indices [1], which live outside the data pages and are referenced from the file footer. I think it is reasonable to also think about improvements to Parquet for large blobs so these can be handled better. There is also general interest in evolving Parquet to better support some of these use cases, so if there are clear items that can be pushed to the file level, let's have those conversations.

> Would it not be more a data file/parquet "issue" ? Especially with the data file API you are proposing, I think Iceberg should "delegate" to the data file layer (Parquet here) and Iceberg could be "agnostic".

I think we should maybe table the discussion on exactly what belongs in each layer until we have more data. Roughly, I think the concerns expressed so far boil down into a few main buckets:

1. Read vs. write amplification (one could run some rough experiments with low-level Parquet APIs to see the impact of splitting out columns into individual objects, to answer both sides of this).
   - For large blobs, I think memory pressure becomes a real concern here as well.
2. Complexity:
   - If multiple files are needed for performance, what advantages do we gain from having effectively two-level manifests? What does it do to Iceberg metadata to have to track both? (V4 is actually a great place to look at this, since it seems like we are looking at major metadata overhauls anyway; if it is too ambitious, we can perhaps postpone some of the work to V5.)
   - What are the implications for things like time travel, maintenance, etc. in these cases?

I would guess this probably needs a bit more detailed design considering the two options (pushing some concerns down to Parquet vs. handling everything in Iceberg metadata).

Some of the complexity questions can be answered by prototyping the APIs necessary to make this work. Specifically, I think we would at least need:

1. A `newAppendColumns` API added to Transaction [2]. Lance's APIs [3] might provide some inspiration here. (A very rough sketch of what this could look like is included after the references below.)
2a. New abstractions to handle columns for the same rows split across files.
2b. New file-level APIs for:
   - Append columns
   - Delete files (if we decide on multiple files for a row range and the split is pushed down to the file level, the deletion logic needs to be delegated to the file level as well).

Items 2a/2b depend on the ultimate approach taken, but trying to sketch these out, and how they relate to the transaction API, might help inform the decision on complexity.

Other feature interactions probably need a more careful analysis when proposing the spec changes.

Cheers,
Micah

[1] https://parquet.apache.org/docs/file-format/pageindex/
[2] https://iceberg.apache.org/javadoc/1.9.1/org/apache/iceberg/Transaction.html
[3] https://lancedb.github.io/lance/introduction/schema_evolution.html#adding-new-columns
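P.S. To make the `newAppendColumns` idea a bit more concrete, here is the kind of rough Java sketch I have in mind. To be clear, none of this exists today: `AppendColumns`, `forColumns`, `appendColumnFile`, and `newAppendColumns` are made-up names for illustration, and only `Transaction`, `PendingUpdate`, `Snapshot`, and `DataFile` are existing Iceberg types.

// Hypothetical sketch only -- this interface and its methods do not exist in Iceberg today.
import java.util.Set;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.PendingUpdate;
import org.apache.iceberg.Snapshot;

/**
 * Hypothetical operation that registers already-written files containing only
 * new columns for rows that are already in the table, committed like any other
 * snapshot-producing operation (apply/commit come from PendingUpdate).
 */
interface AppendColumns extends PendingUpdate<Snapshot> {

  /** Field IDs (already added to the table schema) that the new files carry. */
  AppendColumns forColumns(Set<Integer> fieldIds);

  /**
   * Pairs a file holding only the new columns with the existing data file whose
   * row range it covers. Whether this pairing is tracked in Iceberg metadata
   * (effectively two-level manifests) or pushed down into the file format is
   * exactly the open design question above.
   */
  AppendColumns appendColumnFile(DataFile existingFile, DataFile newColumnsFile);
}

// Usage would mirror the existing operations, e.g. (again, hypothetically):
//   Transaction txn = table.newTransaction();
//   txn.newAppendColumns()
//      .forColumns(newFieldIds)
//      .appendColumnFile(existingFile, newColumnsFile)
//      .commit();
//   txn.commitTransaction();

Even a throwaway prototype along these lines would force answers to the pairing/tracking question, which is why I think it would help inform the complexity discussion.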
On Fri, Jun 6, 2025 at 10:39 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> Hi Peter
>
> Thanks for your message. It's an interesting topic.
>
> Would it not be more a data file/parquet "issue" ? Especially with the data file API you are proposing, I think Iceberg should "delegate" to the data file layer (Parquet here) and Iceberg could be "agnostic".
>
> Regards
> JB
>
> On Mon, May 26, 2025 at 6:28 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
> >
> > Hi Team,
> >
> > In machine learning use-cases, it's common to encounter tables with a very high number of columns - sometimes even in the range of several thousand. I've seen cases with up to 15,000 columns. Storing such wide tables in a single Parquet file is often suboptimal, as Parquet can become a bottleneck, even when only a subset of columns is queried.
> >
> > A common approach to mitigate this is to split the data across multiple Parquet files. With the upcoming File Format API, we could introduce a layer that combines these files into a single iterator, enabling efficient reading of wide and very wide tables.
> >
> > To support this, we would need to revise the metadata specification. Instead of the current `_file` column, we could introduce a `_files` column containing:
> > - `_file_column_ids`: the column IDs present in each file
> > - `_file_path`: the path to the corresponding file
> >
> > Has there been any prior discussion around this idea?
> > Is anyone else interested in exploring this further?
> >
> > Best regards,
> > Peter