Hi Peter,

Thanks for bringing this up. Wouldn't it make more sense to fix this in
Parquet itself? It has been a long-running issue there, and there is still
active interest from the community in addressing it. There is a PR that
replaces the Thrift footer with FlatBuffers, which dramatically improves
performance <https://github.com/apache/arrow/pull/43793>. The underlying
proposal can be found here
<https://docs.google.com/document/d/1PQpY418LkIDHMFYCY8ne_G-CFpThK15LLpzWYbc7rFU/edit?tab=t.0#heading=h.atbrz9ch6nfa>.

Kind regards,
Fokko

On Mon, May 26, 2025 at 20:35, yun zou <yunzou.colost...@gmail.com> wrote:

> +1, I am really interested in this topic. Performance has always been a
> problem when dealing with wide tables, not just for reads and writes, but
> also during query compilation. Most ML use cases exhibit a vectorized
> read/write pattern, so I am also wondering whether there is any way to
> help the whole compilation and execution process at the metadata level. I
> do not have any answers for this yet, but I would be really interested in
> exploring this further.
>
> Best Regards,
> Yun
>
> On Mon, May 26, 2025 at 9:14 AM Pucheng Yang <py...@pinterest.com.invalid>
> wrote:
>
>> Hi Peter, I am interested in this proposal. What's more, I am curious
>> whether there is a similar story on the write side as well (how are these
>> split files generated?), and specifically, are you targeting
>> feature-backfill use cases in ML?
>>
>> On Mon, May 26, 2025 at 6:29 AM Péter Váry <peter.vary.apa...@gmail.com>
>> wrote:
>>
>>> Hi Team,
>>>
>>> In machine learning use cases, it's common to encounter tables with a
>>> very high number of columns, sometimes in the range of several thousand;
>>> I've seen cases with up to 15,000 columns. Storing such wide tables in a
>>> single Parquet file is often suboptimal, as the file itself can become a
>>> bottleneck, even when only a subset of the columns is queried.
>>>
>>> A common approach to mitigate this is to split the data across multiple
>>> Parquet files. With the upcoming File Format API, we could introduce a
>>> layer that combines these files into a single iterator, enabling efficient
>>> reading of wide and very wide tables.
>>>
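>>> To make this concrete, here is a rough sketch, in plain Java, of how
>>> such a layer could zip per-file row iterators into a single row stream.
>>> All type and class names below are placeholders of mine, not the actual
>>> File Format API, and it assumes the files are row-aligned (same row
>>> count and order):
>>>
>>>     import java.util.HashMap;
>>>     import java.util.Iterator;
>>>     import java.util.List;
>>>     import java.util.Map;
>>>
>>>     // Each "slice" iterates one narrow Parquet file, yielding
>>>     // column-id -> value maps for its subset of the columns.
>>>     class CombinedRowIterator implements Iterator<Map<Integer, Object>> {
>>>       private final List<Iterator<Map<Integer, Object>>> slices;
>>>
>>>       CombinedRowIterator(List<Iterator<Map<Integer, Object>>> slices) {
>>>         this.slices = slices;
>>>       }
>>>
>>>       @Override
>>>       public boolean hasNext() {
>>>         // Slices advance in lockstep; checking all of them also guards
>>>         // against mismatched row counts.
>>>         return slices.stream().allMatch(Iterator::hasNext);
>>>       }
>>>
>>>       @Override
>>>       public Map<Integer, Object> next() {
>>>         Map<Integer, Object> row = new HashMap<>();
>>>         for (Iterator<Map<Integer, Object>> slice : slices) {
>>>           row.putAll(slice.next()); // merge the column subsets into one row
>>>         }
>>>         return row;
>>>       }
>>>     }
>>>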
>>> To support this, we would need to revise the metadata specification.
>>> Instead of the current `_file` column, we could introduce a `_files`
>>> column whose entries contain (see the sketch below):
>>> - `_file_column_ids`: the column IDs present in each file
>>> - `_file_path`: the path to the corresponding file
>>>
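>>> For illustration only, an entry in this `_files` column could be shaped
>>> roughly like the following plain Java records; the field names mirror
>>> the list above, but none of this is meant as the actual spec:
>>>
>>>     import java.util.List;
>>>
>>>     // Illustrative shape of one entry in the proposed `_files` column.
>>>     record FileSlice(
>>>         List<Integer> fileColumnIds, // `_file_column_ids`
>>>         String filePath) {}          // `_file_path`
>>>
>>>     // A data row set would then reference a list of slices instead of
>>>     // a single `_file`.
>>>     record FilesEntry(List<FileSlice> files) {}
>>>
>>> A reader could then match the requested column IDs against
>>> `_file_column_ids` and open only the files it actually needs.
>>>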
>>> Has there been any prior discussion around this idea?
>>> Is anyone else interested in exploring this further?
>>>
>>> Best regards,
>>> Peter
>>>
>>
