Hi Team,

In machine learning use cases it is common to encounter tables with a very high number of columns, sometimes several thousand; I've seen cases with up to 15,000 columns. Storing such wide tables in a single Parquet file is often suboptimal: the per-file metadata grows with the column count, so reading can become a bottleneck even when only a subset of the columns is queried.
A common approach to mitigate this is to split the data across multiple Parquet files, each holding a subset of the columns. With the upcoming File Format API, we could introduce a layer that combines these files into a single iterator, enabling efficient reading of wide and very wide tables.

To support this, we would need to revise the metadata specification. Instead of the current `_file` column, we could introduce a `_files` column containing:
- `_file_column_ids`: the column IDs present in each file
- `_file_path`: the path to the corresponding file

Has there been any prior discussion around this idea? Is anyone else interested in exploring this further? I've put a rough sketch of what such a combining layer could look like below my signature.

Best regards,
Peter
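To make the combining layer a bit more concrete, here is a minimal PyArrow sketch. It is not based on any existing API; the function name `read_wide_table` and the `(path, column_names)` pairs are hypothetical stand-ins for the proposed `_file_path` / `_file_column_ids` metadata, and it assumes every file stores the same rows in the same order:

```python
import pyarrow.parquet as pq


def read_wide_table(files, projected_columns, batch_size=8192):
    """Read a logically wide table whose columns are split across several
    Parquet files and return it as a single stream of record batches.

    `files` is a list of (path, column_names) pairs -- a hypothetical
    stand-in for the proposed `_file_path` / `_file_column_ids` metadata.
    Assumes every file stores the same rows in the same order.
    """
    tables = []
    for path, columns in files:
        # File-level pruning: skip files that hold none of the projected
        # columns, and read only the needed columns from the files we open.
        needed = [c for c in projected_columns if c in columns]
        if needed:
            tables.append(pq.read_table(path, columns=needed))

    if not tables:
        return iter([])

    # Stitch the per-file column subsets back into one wide table.
    wide = tables[0]
    for table in tables[1:]:
        for name, column in zip(table.column_names, table.columns):
            wide = wide.append_column(name, column)

    return wide.to_batches(max_chunksize=batch_size)
```

For example, with `files = [("part-0.parquet", ["id", "f0", "f1"]), ("part-1.parquet", ["f2", "f3"])]`, calling `read_wide_table(files, ["id", "f3"])` opens both files but reads only two columns in total, while a projection of `["f0", "f1"]` would not touch the second file at all. A real implementation would stream row groups instead of materializing whole files and would track column IDs rather than names, but this file-level pruning is exactly what the proposed `_file_column_ids` metadata would enable.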