Hi Peter,

Thanks for bringing this up. Wouldn't it make more sense to fix this in
Parquet itself? It has been a long-running issue in Parquet, and there is
still active interest from the community. There is a PR to replace the
footer with FlatBuffers, which dramatically improves metadata parsing
performance: <https://github.com/apache/arrow/pull/43793>. The underlying
proposal can be found here:
<https://docs.google.com/document/d/1PQpY418LkIDHMFYCY8ne_G-CFpThK15LLpzWYbc7rFU/edit?tab=t.0#heading=h.atbrz9ch6nfa>.
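To make that overhead concrete, here is a quick, throwaway pyarrow
experiment (timings are machine-dependent; it only illustrates that
decoding the Thrift footer scales with the number of columns):

    # Footer decode cost vs. schema width (illustrative only).
    import time
    import pyarrow as pa
    import pyarrow.parquet as pq

    for ncols in (100, 1_000, 10_000):
        path = f"/tmp/wide_{ncols}.parquet"
        pq.write_table(pa.table({f"c{i}": [0] for i in range(ncols)}), path)
        start = time.perf_counter()
        pq.read_metadata(path)  # parses the full Thrift FileMetaData
        print(f"{ncols:>6} columns: {time.perf_counter() - start:.4f}s")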
Kind regards,
Fokko

On Mon, May 26, 2025 at 20:35 yun zou <yunzou.colost...@gmail.com> wrote:

> +1, I am really interested in this topic. Performance has always been a
> problem when dealing with wide tables, not just for reads and writes,
> but also during compilation. Most ML use cases typically exhibit a
> vectorized read/write pattern; I am also wondering if there is any way,
> at the metadata level, to help the whole compilation and execution
> process. I do not have any answer for this yet, but I would be really
> interested in exploring this further.
>
> Best Regards,
> Yun
>
> On Mon, May 26, 2025 at 9:14 AM Pucheng Yang <py...@pinterest.com.invalid> wrote:
>
>> Hi Peter, I am interested in this proposal. What's more, I am curious
>> whether there is a similar story on the write side as well (how to
>> generate these split files), and specifically, are you targeting
>> feature backfill use cases in ML?
>>
>> On Mon, May 26, 2025 at 6:29 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>
>>> Hi Team,
>>>
>>> In machine learning use cases, it is common to encounter tables with
>>> a very high number of columns - sometimes even in the range of
>>> several thousand. I have seen cases with up to 15,000 columns.
>>> Storing such wide tables in a single Parquet file is often
>>> suboptimal, as Parquet can become a bottleneck, even when only a
>>> subset of columns is queried.
>>>
>>> A common approach to mitigate this is to split the data across
>>> multiple Parquet files. With the upcoming File Format API, we could
>>> introduce a layer that combines these files into a single iterator,
>>> enabling efficient reading of wide and very wide tables.
>>>
>>> To support this, we would need to revise the metadata specification.
>>> Instead of the current `_file` column, we could introduce a `_files`
>>> column containing:
>>> - `_file_column_ids`: the column IDs present in each file
>>> - `_file_path`: the path to the corresponding file
>>>
>>> Has there been any prior discussion around this idea?
>>> Is anyone else interested in exploring this further?
>>>
>>> Best regards,
>>> Peter
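P.S. For concreteness, a rough pyarrow sketch of the column-group
stitching idea from Péter's mail above. All names are illustrative
(nothing here is Iceberg or Parquet API), and it assumes the per-group
files share row count, row order, and aligned row-group boundaries:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # column_group_files: list of (file_path, column_names_in_that_file)
    # pairs, roughly the proposed (_file_path, _file_column_ids) metadata.
    def iter_wide_batches(column_group_files, projected_columns):
        readers = []
        for path, columns in column_group_files:
            wanted = [c for c in columns if c in projected_columns]
            if wanted:
                readers.append(
                    pq.ParquetFile(path).iter_batches(columns=wanted))
        # Zip batches across the column-group files and glue them
        # together side by side into one wide batch.
        for batches in zip(*readers):
            arrays = [col for b in batches for col in b.columns]
            names = [n for b in batches for n in b.schema.names]
            yield pa.RecordBatch.from_arrays(arrays, names=names)

A real reader layer would also have to re-slice batches when row groups
are not aligned, and deal with deletes; this sketch ignores both.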