Correct, parquet-mr hard-codes the format version to 1. How, then, can we identify whether a Parquet file was written as V1 or V2? I asked the same question before, and according to your reply there is no way to tell:

"As I have said in another thread, Parquet V2 is a concept which contains a lot of features. FWIW, what is defined in the specs [1] is finalized, and some of it has been implemented in various implementations. Any file that contains one or more of those features can be considered v2, but the community has never defined a formal approach to distinguish between v1 and v2. Parquet does have a field in the footer thrift definition to mark the file version [2]. However, not all implementations populate it correctly and some engines will even throw if the version is not 1. To avoid confusion, I strongly suggest not using any v2 feature in your case unless you are 100% sure that all your tools support the v2 feature set you have enabled.

[1] https://github.com/apache/parquet-format/blob/master/CHANGES.md
[2] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1111

Best,
Gang"

Here are my 2 cents:

We should have some flag or tag that differentiates Parquet files written in V1 from those written in V2. Then, if an engine doesn't support reading V2, we know for sure that we shouldn't feed it V2 Parquet.

A few tools/products now use Parquet V2 for both reading and writing, but *Apache Spark does not support writing with V2 encoding because, per the Parquet community, V2 is not final yet*. Is there a date by which the parquet-mr jar will have Parquet V2 writing functionality, so that Spark can adopt it?

*Or, if I add hadoopConfiguration.set("parquet.writer.version", "v2") while creating Parquet files, are those files V2 Parquet?* Please confirm.

On Wed, Apr 24, 2024 at 9:26 PM Gang Wu <ust...@gmail.com> wrote:

> Spark leverages parquet writer from parquet-mr, which hard-codes the
> format version to 1 [1] even when v2 features are enabled. That's why
> I said in dev@parquet that we cannot really tell if a parquet file is v1
> or v2 simply from the format version field.
>
> [1]
> https://github.com/apache/parquet-mr/blob/adb3e27c837f81fcef0fbefa8975eea202be693c/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L1863
>
> Best,
> Gang
>
> On Thu, Apr 25, 2024 at 3:51 AM Prem Sahoo <prem.re...@gmail.com> wrote:
>
> > I tried with this option but spark is not creating V2 parquet. as I can
> > still see "format_version: 1.0". I think it needs something else too.
> >
> > On Wed, Apr 24, 2024 at 12:33 PM Adam Lippai <a...@rigo.sk> wrote:
> >
> > > It supports writing v2, but defaults to v1.
> > > hadoopConfiguration.set("parquet.writer.version", "v2")
> > >
> > > Best regards,
> > > Adam Lippai
> > >
> > > On Wed, Apr 24, 2024 at 11:40 Prem Sahoo <prem.re...@gmail.com> wrote:
> > >
> > > > They do support Reading of Parquet V2, but writing is not supported by
> > > > Spark for V2.
> > > >
> > > > On Wed, Apr 24, 2024 at 11:10 AM Adam Lippai <a...@rigo.sk> wrote:
> > > >
> > > > > Hi Wes,
> > > > >
> > > > > As far as I remember hive, spark, impala, duckdb or even proprietary
> > > > > systems like hyper, Vertica all support reading data page v2 now. The most
> > > > > recent column encodings (BYTE_STREAM_SPLIT) might be missing, but overall
> > > > > the support seems much better than a year or two ago.
> > > > >
> > > > > Best regards,
> > > > > Adam Lippai
> > > > >
> > > > > On Wed, Apr 24, 2024 at 10:51 Wes McKinney <wesmck...@gmail.com> wrote:
> > > > >
> > > > > > I think there is confusion about the Parquet "V2" (including the V2 data
> > > > > > pages, and other details) and the 2.x.y releases of the format library
> > > > > > artifact. They aren't the same unfortunately. I don't think the V2 metadata
> > > > > > structures (the data pages in particular, and new column encoding) is
> > > > > > widely adopted / readable.
> > > > > >
> > > > > > On Wed, Apr 24, 2024 at 9:32 AM Weston Pace <weston.p...@gmail.com> wrote:
> > > > > >
> > > > > > > > *As per Apache Parquet Community Parquet V2 is not final yet so it is
> > > > > > > > not official. They are advising not to use Parquet V2 for writing
> > > > > > > > (though code is available).*
> > > > > > >
> > > > > > > This would be news to me. Parquet releases are listed (by the parquet
> > > > > > > community) at [1]
> > > > > > >
> > > > > > > The vote to release parquet 2.10 is here: [2]
> > > > > > >
> > > > > > > Neither of these links mention anything about this being an experimental,
> > > > > > > unofficial, or non-finalized release.
> > > > > > >
> > > > > > > I understand your concern. I believe your quotes are coming from your
> > > > > > > discussion on the parquet mailing list here [3]. This communication is
> > > > > > > unfortunate and confusing to me as well.
> > > > > > >
> > > > > > > [1] https://parquet.apache.org/blog/
> > > > > > > [2] https://lists.apache.org/thread/fdf1zz0f3xzz5zpvo6c811xjswhm1zy6
> > > > > > > [3] https://lists.apache.org/thread/4nzroc68czwxnp0ndqz15kp1vhcd7vg3
> > > > > > >
> > > > > > > On Wed, Apr 24, 2024 at 5:10 AM Prem Sahoo <prem.re...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hello Jacob,
> > > > > > > > Thanks for the information, and my apologies for the weird format of my
> > > > > > > > email.
> > > > > > > >
> > > > > > > > This is the email from the Parquet community. May I know why pyarrow is
> > > > > > > > using Parquet V2 which is not official yet?
> > > > > > > >
> > > > > > > > My question is from Parquet community V2 is not final yet so it is not
> > > > > > > > official yet.
> > > > > > > >
> > > > > > > > "Hi Prem - Maybe I can help clarify to the best of my knowledge. Parquet
> > > > > > > > V2 as a standard isn't finalized just yet. Meaning there is no formal,
> > > > > > > > *finalized* "contract" that specifies what it means to write data in the
> > > > > > > > V2 version. The discussions/conversations about what the final V2
> > > > > > > > standard may be are still in progress and are evolving.
> > > > > > > >
> > > > > > > > That being said, because V2 code does exist (though unfinalized), there
> > > > > > > > are clients / tools that are writing data in the un-finalized V2 format,
> > > > > > > > as seems to be the case with Dremio.
> > > > > > > >
> > > > > > > > Now, as that comment you quoted said, you can have Spark write V2 files,
> > > > > > > > but it's worth being mindful about the fact that V2 is a moving target
> > > > > > > > and can (and likely will) change. You can overwrite
> > > > > > > > parquet.writer.version to specify your desired version, but it can be
> > > > > > > > dangerous to produce data in a moving-target format. For example, let's
> > > > > > > > say you write a bunch of data in Parquet V2, and then the community
> > > > > > > > decides to make a breaking change (which is completely fine / allowed
> > > > > > > > since V2 isn't finalized). You are now left having to deal with a
> > > > > > > > potentially large and complicated file format update. That's why it's
> > > > > > > > not recommended to write files in parquet v2 just yet."
> > > > > > > >
> > > > > > > > *As per Apache Parquet Community Parquet V2 is not final yet so it is
> > > > > > > > not official. They are advising not to use Parquet V2 for writing
> > > > > > > > (though code is available).*
> > > > > > > >
> > > > > > > > *As per above Spark hasn't started using Parquet V2 for writing.*
> > > > > > > >
> > > > > > > > May I know how an unstable / unofficial version is being used in
> > > > > > > > pyarrow?
> > > > > > > >
> > > > > > > > On Wed, Apr 24, 2024 at 12:43 AM Jacob Wujciak <assignu...@apache.org> wrote:
> > > > > > > >
> > > > > > > > > Hello,
> > > > > > > > >
> > > > > > > > > First off, please try to clean up formatting of emails to be legible
> > > > > > > > > when forwarding/quoting previous messages multiple times, especially
> > > > > > > > > when most of the quotes do not contain any useful information. It
> > > > > > > > > makes it much easier to parse the message and thus quicker to answer.
> > > > > > > > >
> > > > > > > > > The short answer is that we switched to 2.4 and more recently to 2.6
> > > > > > > > > as the default to enable the usage of features these versions provide.
> > > > > > > > > As you have correctly quoted from the docs you can still write 1.0 if
> > > > > > > > > you want to ensure compatibility with systems that can not process the
> > > > > > > > > 'newer' versions yet (2.6 was released in 2018!).
> > > > > > > > >
> > > > > > > > > You can find the long form discussions about these changes here:
> > > > > > > > > https://issues.apache.org/jira/browse/ARROW-12203
> > > > > > > > > https://lists.apache.org/thread/027g366yr3m03hwtpst6sr58b3trwhsm
> > > > > > > > >
> > > > > > > > > Best
> > > > > > > > > Jacob
> > > > > > > > >
> > > > > > > > > On 2024/04/24 02:32:01 Prem Sahoo wrote:
> > > > > > > > > > Hello Team,
> > > > > > > > > > Could you please share your thoughts about below questions?
> > > > > > > > > > Sent from my iPhone
> > > > > > > > > >
> > > > > > > > > > Begin forwarded message:
> > > > > > > > > >
> > > > > > > > > > > From: Prem Sahoo <prem.re...@gmail.com>
> > > > > > > > > > > Date: April 23, 2024 at 11:03:48 AM EDT
> > > > > > > > > > > To: dev-ow...@arrow.apache.org
> > > > > > > > > > > Subject: Re: PyArrow Using Parquet V2
> > > > > > > > > > >
> > > > > > > > > > > dev@arrow.apache.org
> > > > > > > > > > > Sent from my iPhone
> > > > > > > > > > >
> > > > > > > > > > >>> On Apr 23, 2024, at 6:25 AM, Prem Sahoo <prem.re...@gmail.com> wrote:
> > > > > > > > > > >>>
> > > > > > > > > > >> Hello Team,
> > > > > > > > > > >> Could anyone please help me on below query?
> > > > > > > > > > >> Sent from my iPhone
> > > > > > > > > > >>
> > > > > > > > > > >>>> On Apr 22, 2024, at 10:01 PM, Prem Sahoo <prem.re...@gmail.com> wrote:
> > > > > > > > > > >>>>
> > > > > > > > > > >>> Sent from my iPhone
> > > > > > > > > > >>>
> > > > > > > > > > >>>>> On Apr 22, 2024, at 9:51 PM, Prem Sahoo <prem.re...@gmail.com> wrote:
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> Hello Team,
> > > > > > > > > > >>>>> I have a question regarding Parquet V2 writing thro pyarrow.
> > > > > > > > > > >>>>> As per below Pyarrow started writing Parquet in V2 encoding.
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow.parquet.write_table
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> version{"1.0", "2.4", "2.6"}, default "2.6"
> > > > > > > > > > >>>>> Determine which Parquet logical types are available for use,
> > > > > > > > > > >>>>> whether the reduced set from the Parquet 1.x.x format or the
> > > > > > > > > > >>>>> expanded logical types added in later format versions. Files
> > > > > > > > > > >>>>> written with version='2.4' or '2.6' may not be readable in all
> > > > > > > > > > >>>>> Parquet implementations, so version='1.0' is likely the choice
> > > > > > > > > > >>>>> that maximizes file compatibility. UINT32 and some logical types
> > > > > > > > > > >>>>> are only available with version '2.4'. Nanosecond timestamps are
> > > > > > > > > > >>>>> only available with version '2.6'. Other features such as
> > > > > > > > > > >>>>> compression algorithms or the new serialized data page format
> > > > > > > > > > >>>>> must be enabled separately (see 'compression' and
> > > > > > > > > > >>>>> 'data_page_version').
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> As per Apache Parquet Community Parquet V2 is not final yet so it
> > > > > > > > > > >>>>> is not official. They are advising not to use Parquet V2 for
> > > > > > > > > > >>>>> writing (though code is available).
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> As per above Spark hasn't started using Parquet V2 for writing.
> > > > > > > > > > >>>>> May I know how an unstable / unofficial version is being used in
> > > > > > > > > > >>>>> pyarrow?
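
The footer field mentioned at the top of this thread is field 1 (`version`, an i32) of the `FileMetaData` Thrift struct. The rough, stdlib-only sketch below shows what reading it directly could look like. It assumes an unencrypted file whose writer emitted `version` as the first footer field using the Thrift compact protocol; the function names are hypothetical, and the synthetic three-byte footer stub exists only to exercise the parser and is not a valid Parquet file.

```python
import struct

MAGIC = b"PAR1"


def read_varint(buf, pos):
    """Decode an unsigned varint (7 bits per byte, low bits first)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def raw_footer_version(data):
    """Return the integer stored in the footer's `version` field."""
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic bytes")
    # The last 8 bytes are a little-endian footer length plus the magic.
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    if footer_len < 2 or footer_len > len(data) - 12:
        raise ValueError("implausible footer length")
    footer = data[len(data) - 8 - footer_len:len(data) - 8]
    # Compact-protocol field header: the high nibble is the field-id delta,
    # the low nibble is the type. 0x15 means "field 1, type i32".
    if footer[0] != 0x15:
        raise ValueError("footer does not start with the version field")
    zigzag, _ = read_varint(footer, 1)
    return (zigzag >> 1) ^ -(zigzag & 1)  # undo zigzag encoding


def synthetic_file(version):
    """Build illustrative bytes: magic, filler, a footer stub, length, magic."""
    # Zigzag encoding of a small non-negative int is simply version * 2.
    footer = bytes([0x15, version << 1, 0x00])
    return MAGIC + b"pages" + footer + struct.pack("<I", len(footer)) + MAGIC


print(raw_footer_version(synthetic_file(1)))
print(raw_footer_version(synthetic_file(2)))
```

As discussed in the thread, a value of 1 here does not show that a file is free of V2 features, since parquet-mr writes 1 even when they are enabled.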