Hi Team,

In machine learning use cases, it's common to encounter tables with a very
high number of columns, sometimes several thousand; I've seen cases with up
to 15,000 columns. Storing such a wide table in a single Parquet file is
often suboptimal: the footer carries metadata for every column in every row
group, so readers pay a cost proportional to the total column count even
when only a small subset of columns is queried.
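
As a quick illustration (a standalone PyArrow sketch of my own, not tied to
any Iceberg or File Format API), the footer growth is easy to measure:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Write single-row files with increasing column counts and report how
    # large the serialized Thrift footer gets.
    for ncols in (100, 1_000, 10_000):
        table = pa.table({f"c{i}": [0] for i in range(ncols)})
        path = f"/tmp/wide_{ncols}.parquet"
        pq.write_table(table, path)
        footer = pq.read_metadata(path)
        print(f"{ncols:>6} columns -> {footer.serialized_size:>9} footer bytes")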

A common approach to mitigate this is to split the columns across multiple
Parquet files. With the upcoming File Format API, we could introduce a
layer that stitches these files back into a single iterator, enabling
efficient reads of wide and very wide tables.
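
Roughly, the read path could look like this (again a plain PyArrow sketch;
it assumes all files store rows in the same order with identical row-group
boundaries, and the function name and paths are hypothetical):

    import pyarrow as pa
    import pyarrow.parquet as pq

    def iter_wide_table(paths, columns=None, batch_size=8192):
        # Stream combined record batches from per-column-group files.
        iters = []
        for path in paths:
            f = pq.ParquetFile(path)
            # Prune to the requested columns; skip files with none of them.
            names = [n for n in f.schema_arrow.names
                     if columns is None or n in columns]
            if names:
                iters.append(f.iter_batches(batch_size=batch_size,
                                            columns=names))
        # Zip the per-file streams and glue the batches column-wise.
        for parts in zip(*iters):
            arrays, names = [], []
            for batch in parts:
                names.extend(batch.schema.names)
                arrays.extend(batch.columns)
            yield pa.RecordBatch.from_arrays(arrays, names=names)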

To support this, we would need to revise the metadata specification.
Instead of the current `_file` column, we could introduce a `_files` column
containing, per file (rough example below):
- `_file_column_ids`: the column IDs present in each file
- `_file_path`: the path to the corresponding file
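
Purely to make the shape concrete, one row's `_files` value might look like
this (field names from above; the paths are made up and the exact layout is
of course open for discussion):

    _files = [
        {"_file_column_ids": [1, 2, 3],
         "_file_path": "s3://bucket/table/data/part-0-colgroup-0.parquet"},
        {"_file_column_ids": [4, 5],
         "_file_path": "s3://bucket/table/data/part-0-colgroup-1.parquet"},
    ]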

Has there been any prior discussion around this idea?
Is anyone else interested in exploring this further?

Best regards,
Peter
