Hi Peter

Thanks for your message. It's an interesting topic.

Wouldn't this be more of a data file/Parquet "issue"? Especially with the
data file API you are proposing, I think Iceberg should "delegate" to
the data file layer (Parquet here) so that Iceberg can stay "agnostic".

Regards
JB

On Mon, May 26, 2025 at 6:28 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>
> Hi Team,
>
> In machine learning use-cases, it's common to encounter tables with a very 
> high number of columns - sometimes even in the range of several thousand. 
> I've seen cases with up to 15,000 columns. Storing such wide tables in a 
> single Parquet file is often suboptimal, as Parquet can become a bottleneck, 
> even when only a subset of columns is queried.
>
> A common approach to mitigate this is to split the data across multiple 
> Parquet files. With the upcoming File Format API, we could introduce a layer 
> that combines these files into a single iterator, enabling efficient reading 
> of wide and very wide tables.
>
> To support this, we would need to revise the metadata specification. Instead 
> of the current `_file` column, we could introduce a `_files` column containing:
> - `_file_column_ids`: the column IDs present in each file
> - `_file_path`: the path to the corresponding file
>
> Has there been any prior discussion around this idea?
> Is anyone else interested in exploring this further?
>
> Best regards,
> Peter
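For illustration only, here is a rough Java sketch of the kind of combining
layer described above: per-file readers that each cover a subset of the column
IDs, zipped positionally into one logical row iterator. The type and method
names (ColumnSliceReader, CombinedRowIterator) are made up for this sketch and
are not part of the Iceberg or Parquet APIs.

import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Reads rows for the subset of columns stored in one physical file. */
interface ColumnSliceReader {
  List<Integer> columnIds();              // column IDs present in this file
  Iterator<Map<Integer, Object>> rows();  // one map per row: column id -> value
}

/** Zips several column-slice readers into a single logical row iterator. */
final class CombinedRowIterator implements Iterator<Map<Integer, Object>> {
  private final List<Iterator<Map<Integer, Object>>> slices;

  CombinedRowIterator(List<ColumnSliceReader> readers) {
    this.slices = readers.stream().map(ColumnSliceReader::rows).toList();
  }

  @Override
  public boolean hasNext() {
    // All slices must stay in lock-step: same row count and same row order.
    return slices.stream().allMatch(Iterator::hasNext);
  }

  @Override
  public Map<Integer, Object> next() {
    Map<Integer, Object> row = new HashMap<>();
    for (Iterator<Map<Integer, Object>> slice : slices) {
      row.putAll(slice.next());  // merge the column subsets for the same row position
    }
    return row;
  }
}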

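Likewise, a minimal sketch of what a `_files` metadata entry could carry,
assuming the two fields from the mail above (`_file_column_ids`, `_file_path`).
The record names and example paths are purely hypothetical, not an agreed spec
change.

import java.util.List;

/** One physical file holding a subset of the table's columns. */
record DataFileSlice(List<Integer> fileColumnIds, String filePath) {}

/** A `_files` entry: several column-split files forming one logical data file. */
record DataFilesEntry(List<DataFileSlice> files) {}

class Example {
  // Purely illustrative column IDs and paths.
  static DataFilesEntry wideDataFile() {
    return new DataFilesEntry(List.of(
        new DataFileSlice(List.of(1, 2, 3), "s3://bucket/table/data/part-00000-cols-1-3.parquet"),
        new DataFileSlice(List.of(4, 5, 6), "s3://bucket/table/data/part-00000-cols-4-6.parquet")));
  }
}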