Hi Team,

In machine learning use cases, it's common to encounter tables with a very
high number of columns, sometimes several thousand; I've seen cases with up
to 15,000 columns. Storing such a wide table in a single Parquet file is
often suboptimal: the footer carries metadata for every column in every row
group, so readers pay a cost proportional to the total column count even
when only a small subset of columns is queried.
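
As a quick illustration (a standalone PyArrow sketch of my own, not tied to
any Iceberg or File Format API), the footer growth is easy to measure:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Write single-row files with increasing column counts and report how
    # large the serialized Thrift footer gets.
    for ncols in (100, 1_000, 10_000):
        table = pa.table({f"c{i}": [0] for i in range(ncols)})
        path = f"/tmp/wide_{ncols}.parquet"
        pq.write_table(table, path)
        footer = pq.read_metadata(path)
        print(f"{ncols:>6} columns -> {footer.serialized_size:>9} footer bytes")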

A common approach to mitigate this is to split the columns across multiple
Parquet files. With the upcoming File Format API, we could introduce a
layer that stitches these files back into a single iterator, enabling
efficient reads of wide and very wide tables.
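
Roughly, the read path could look like this (again a plain PyArrow sketch;
it assumes all files store rows in the same order with identical row-group
boundaries, and the function name and paths are hypothetical):

    import pyarrow as pa
    import pyarrow.parquet as pq

    def iter_wide_table(paths, columns=None, batch_size=8192):
        # Stream combined record batches from per-column-group files.
        iters = []
        for path in paths:
            f = pq.ParquetFile(path)
            # Prune to the requested columns; skip files with none of them.
            names = [n for n in f.schema_arrow.names
                     if columns is None or n in columns]
            if names:
                iters.append(f.iter_batches(batch_size=batch_size,
                                            columns=names))
        # Zip the per-file streams and glue the batches column-wise.
        for parts in zip(*iters):
            arrays, names = [], []
            for batch in parts:
                names.extend(batch.schema.names)
                arrays.extend(batch.columns)
            yield pa.RecordBatch.from_arrays(arrays, names=names)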

To support this, we would need to revise the metadata specification.
Instead of the current `_file` column, we could introduce a `_files` column
containing, per file (rough example below):
- `_file_column_ids`: the column IDs present in each file
- `_file_path`: the path to the corresponding file
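
Purely to make the shape concrete, one row's `_files` value might look like
this (field names from above; the paths are made up and the exact layout is
of course open for discussion):

    _files = [
        {"_file_column_ids": [1, 2, 3],
         "_file_path": "s3://bucket/table/data/part-0-colgroup-0.parquet"},
        {"_file_column_ids": [4, 5],
         "_file_path": "s3://bucket/table/data/part-0-colgroup-1.parquet"},
    ]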

Has there been any prior discussion around this idea?
Is anyone else interested in exploring this further?

Best regards,
Peter
