[Parquet] How to write hive partitioning with partitioning keys in the file

2023-12-01 Thread Haocheng Liu
Hi community, Hope this email finds you well. Can folks guide me on how to write hive partitioning with the partitioning keys *in the file*? Right now only a subset of the data is written. Both Python pyarrow.dataset.write_dataset(...)
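
A minimal sketch of the behavior the question is about, assuming a hypothetical table with "dt"/"lang"/"value" columns (not from the original message): by default pyarrow.dataset.write_dataset encodes the partition keys only in the directory names and drops them from the data files, so one commonly suggested workaround is to partition on duplicated key columns so the originals remain in each file.

# Sketch, not the thread's answer: hive-partitioned write with pyarrow.
# The partition columns "dt" and "lang" end up in the directory names
# only and are removed from the parquet files themselves.
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({
    "dt": ["2019-10-31", "2018-10-31"],
    "lang": ["en", "fr"],
    "value": [1, 2],
})

part = ds.partitioning(
    pa.schema([("dt", pa.string()), ("lang", pa.string())]),
    flavor="hive",
)
ds.write_dataset(table, "myTable", format="parquet", partitioning=part)

# Workaround sketch: partition on duplicated key columns ("dt_key",
# "lang_key" are hypothetical names) so the original "dt"/"lang"
# columns stay inside each parquet file.
table2 = table.append_column("dt_key", table["dt"])
table2 = table2.append_column("lang_key", table["lang"])
part2 = ds.partitioning(
    pa.schema([("dt_key", pa.string()), ("lang_key", pa.string())]),
    flavor="hive",
)
ds.write_dataset(table2, "myTable_with_keys", format="parquet", partitioning=part2)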

[parquet][Iceberg] Should hive partition keys appear as corresponding columns in the file

2023-11-29 Thread Haocheng Liu
Hi community, I want to solicit people's thoughts on the differing toolchain behaviors around whether hive partition keys should appear as columns in the underlying parquet file. Say I have a data layout like: //myTable/dt=2019-10-31/lang=en/0.parquet //myTable/dt=2018-10-31/lang=fr/1.parquet IIRC
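
For context, a small sketch (the base directory and column names are assumed from the layout above): when the directory is read back with pyarrow.dataset using hive partitioning, the keys reappear as columns even if they are not stored inside the parquet files, which is one side of the behavior difference being asked about.

# Sketch: reading a hive-partitioned directory with pyarrow.
# The scanner reconstructs "dt" and "lang" from the directory names,
# whether or not those columns are physically present in the files.
import pyarrow.dataset as ds

dataset = ds.dataset("myTable", format="parquet", partitioning="hive")
print(dataset.schema)                    # includes dt and lang inferred from paths
print(dataset.to_table().column_names)   # dt and lang appear as table columns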

Re: Group rows in a stream of record batches by group id?

2023-06-14 Thread Haocheng Liu
Hi Jerry, I asked a similar question months ago about how to "write the data iteratively in smaller quantities over successive writes" as hive-partitioned parquet, and the reply from Weston was extremely helpful to me. Here are the related threads on how to use acero
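
As a rough illustration only (the column name "group_id" is hypothetical, and this uses plain pyarrow.compute rather than the acero approach discussed in the referenced threads): one way to group rows in a stream of record batches by a group id is to split each batch with a per-group boolean mask.

# Sketch: split each incoming record batch by a "group_id" column.
import pyarrow as pa
import pyarrow.compute as pc

def split_by_group(batches):
    for batch in batches:
        table = pa.Table.from_batches([batch])
        for gid in pc.unique(table["group_id"]).to_pylist():
            mask = pc.equal(table["group_id"], gid)
            # Yield the rows belonging to this group id.
            yield gid, table.filter(mask)

batches = [pa.RecordBatch.from_pydict({"group_id": [1, 1, 2], "value": [10, 20, 30]})]
for gid, group_table in split_by_group(batches):
    print(gid, group_table.num_rows)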

[DISCUSS][C++] How to run arrow-dataset-dataset-writer-test

2023-04-07 Thread Haocheng Liu
Hi, I'm new to Arrow development and would like some help with a newbie testing question. I have ARROW_TESTING set to TRUE and ARROW_TEST_DATA set properly on my macOS machine. However, when running tests via "ctest -R dataset-writer-test -V", 0 tests get run, though plenty are defined in this datas