Re: Fwd: PyArrow Using Parquet V2

2024-04-24 Thread Gang Wu
The format_version field was designed for this purpose, but unfortunately it is not honored for now. I don't think adding a new, similar tag/flag will solve the problem, because it does not fix the problem of legacy files and it would take a lot of effort for all implementations to adopt the new tag/flag. IMO, I

Re: Fwd: PyArrow Using Parquet V2

2024-04-24 Thread Prem Sahoo
Correct, parquet-mr hard-codes the format version to 1. Then how can we identify whether a Parquet file was written as V1 or V2? I have asked the same question, but according to you there is none. "As I have said in another thread, Parquet V2 is a concept which contains a lot of features. FWIW, what are d

Re: Fwd: PyArrow Using Parquet V2

2024-04-24 Thread Gang Wu
Spark leverages parquet writer from parquet-mr, which hard-codes the format version to 1 [1] even when v2 features are enabled. That's why I said in dev@parquet that we cannot really tell if a parquet file is v1 or v2 simply from the format version field. [1] https://github.com/apache/parquet-mr/b

Re: Fwd: PyArrow Using Parquet V2

2024-04-24 Thread Prem Sahoo
I tried this option, but Spark is not creating V2 Parquet, as I can still see "format_version: 1.0". I think it needs something else too. On Wed, Apr 24, 2024 at 12:33 PM Adam Lippai wrote: > It supports writing v2, but defaults to v1. > hadoopConfiguration.set("parquet.writer.version", "v2

Re: Fwd: PyArrow Using Parquet V2

2024-04-24 Thread Adam Lippai
It supports writing v2, but defaults to v1. hadoopConfiguration.set("parquet.writer.version", "v2") Best regards, Adam Lippai On Wed, Apr 24, 2024 at 11:40 Prem Sahoo wrote: > They do support Reading of Parquet V2 , but writing is not supported by > Spark for V2. > > On Wed, Apr 24, 2024 at 11

Re: Fwd: PyArrow Using Parquet V2

2024-04-24 Thread Jacob Wujciak
> Parquet "V2" (including the V2 data pages, and other details) and the 2.x.y releases of the format library artifact. They aren't the same unfortunately Oh wow, yeah that's really not clear. parquet.a.o doesn't have any structured version information as far as I could see. On Wed, Apr 24, 2024

Re: Fwd: PyArrow Using Parquet V2

2024-04-24 Thread Prem Sahoo
They do support reading Parquet V2, but Spark does not support writing V2. On Wed, Apr 24, 2024 at 11:10 AM Adam Lippai wrote: > Hi Wes, > > As far as I remember hive, spark, impala, duckdb or even proprietary > systems like hyper, Vertica all support reading data page v2 now. The mos

Re: Fwd: PyArrow Using Parquet V2

2024-04-24 Thread Adam Lippai
Hi Wes, As far as I remember, Hive, Spark, Impala, DuckDB, and even proprietary systems like Hyper and Vertica all support reading data page v2 now. The most recent column encodings (BYTE_STREAM_SPLIT) might be missing, but overall the support seems much better than a year or two ago. Best regards, Ada

Re: Fwd: PyArrow Using Parquet V2

2024-04-24 Thread Adam Lippai
As an outsider, I suspect the only reason for these “common beliefs” is that Spark simply doesn’t support some of the breaking features (e.g., the nanoseconds data type). Maybe closing the very few gaps would resolve the issue for good. Best regards, Adam Lippai On Wed, Apr 24, 2024 at 10:32 Weston P

Re: Fwd: PyArrow Using Parquet V2

2024-04-24 Thread Wes McKinney
I think there is confusion between Parquet "V2" (including the V2 data pages, and other details) and the 2.x.y releases of the format library artifact. They aren't the same, unfortunately. I don't think the V2 metadata structures (the data pages in particular, and the new column encodings) are widely ad

Re: Fwd: PyArrow Using Parquet V2

2024-04-24 Thread Weston Pace
> *As per Apache Parquet Community Parquet V2 is not final yet so it is not > official . They are advising not to use Parquet V2 for writing (though code > is available ) .* This would be news to me. Parquet releases are listed (by the Parquet community) at [1]. The vote to release parquet 2.10 i

Re: Fwd: PyArrow Using Parquet V2

2024-04-24 Thread Prem Sahoo
Hello Jacob, Thanks for the information, and my apologies for the weird format of my email. This is the email from the Parquet community. May I know why PyArrow is using Parquet V2, which is not official yet? My question arises because, according to the Parquet community, V2 is not final yet and so is not official. "Hi

Re: Fwd: PyArrow Using Parquet V2

2024-04-23 Thread Jacob Wujciak
Hello, First off, please try to clean up the formatting of emails so they are legible when forwarding/quoting previous messages multiple times, especially when most of the quotes do not contain any useful information. It makes it much easier to parse the message and thus quicker to answer. The short answ