Hi Peter,

Thanks for your message. It's an interesting topic.
Would it not be more of a data file/Parquet "issue"? Especially with the data file API you are proposing, I think Iceberg should "delegate" to the data file layer (Parquet here) and Iceberg could stay "agnostic".

Regards
JB

On Mon, May 26, 2025 at 6:28 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>
> Hi Team,
>
> In machine learning use cases, it is common to encounter tables with a very
> high number of columns - sometimes even in the range of several thousand.
> I've seen cases with up to 15,000 columns. Storing such wide tables in a
> single Parquet file is often suboptimal, as Parquet can become a bottleneck
> even when only a subset of columns is queried.
>
> A common approach to mitigate this is to split the data across multiple
> Parquet files. With the upcoming File Format API, we could introduce a layer
> that combines these files into a single iterator, enabling efficient reading
> of wide and very wide tables.
>
> To support this, we would need to revise the metadata specification. Instead
> of the current `_file` column, we could introduce a `_files` column containing:
> - `_file_column_ids`: the column IDs present in each file
> - `_file_path`: the path to the corresponding file
>
> Has there been any prior discussion around this idea?
> Is anyone else interested in exploring this further?
>
> Best regards,
> Peter
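
To make the "combine these files into a single iterator" idea a bit more concrete, here is a minimal Java sketch of what such a merging layer could look like. It assumes every per-column-subset file holds the same rows in the same order; the class name `WideRowIterator` and the `Map<Integer, Object>` row representation are illustrative only and are not part of the Iceberg or Parquet APIs.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: merge row iterators from several column-subset files
// into a single iterator of "wide" rows keyed by column ID.
class WideRowIterator implements Iterator<Map<Integer, Object>> {
  // One iterator per physical file; each yields a map of columnId -> value
  // for the subset of columns stored in that file.
  private final List<Iterator<Map<Integer, Object>>> fileIterators;

  WideRowIterator(List<Iterator<Map<Integer, Object>>> fileIterators) {
    this.fileIterators = fileIterators;
  }

  @Override
  public boolean hasNext() {
    // All files are assumed to hold the same rows in the same order,
    // so any one iterator can answer hasNext() for the combined view.
    return !fileIterators.isEmpty() && fileIterators.get(0).hasNext();
  }

  @Override
  public Map<Integer, Object> next() {
    // Merge the column slices from each file into one wide row.
    Map<Integer, Object> combined = new HashMap<>();
    for (Iterator<Map<Integer, Object>> it : fileIterators) {
      combined.putAll(it.next());
    }
    return combined;
  }
}
```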
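
Along the same lines, a sketch of what a single entry of the proposed `_files` metadata column might carry, using the two fields named in the email. The `FileEntry` class and the example paths below are hypothetical and not part of the Iceberg spec.

```java
import java.util.List;

// Hypothetical sketch only: one entry of a `_files` metadata column.
class FileEntry {
  final String filePath;          // `_file_path`: path to one physical data file
  final List<Integer> columnIds;  // `_file_column_ids`: column IDs stored in that file

  FileEntry(String filePath, List<Integer> columnIds) {
    this.filePath = filePath;
    this.columnIds = columnIds;
  }
}

// A wide row group would then be described by a list of such entries, e.g.
// (paths are made up for illustration):
// List<FileEntry> files = List.of(
//     new FileEntry("s3://bucket/table/data/part-0-cols-a.parquet", colIdsA),
//     new FileEntry("s3://bucket/table/data/part-0-cols-b.parquet", colIdsB));
```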