Hello Jacob,

Thanks for the information, and my apologies for the odd formatting of my email.
Below is the reply I received from the Parquet community. My question is why pyarrow writes Parquet V2 by default when, according to the Parquet community, V2 is not finalized and therefore not yet official.

"Hi Prem - Maybe I can help clarify to the best of my knowledge. Parquet V2 as a standard isn't finalized just yet. Meaning there is no formal, *finalized* "contract" that specifies what it means to write data in the V2 version. The discussions/conversations about what the final V2 standard may be are still in progress and are evolving.

That being said, because V2 code does exist (though unfinalized), there are clients / tools that are writing data in the un-finalized V2 format, as seems to be the case with Dremio.

Now, as that comment you quoted said, you can have Spark write V2 files, but it's worth being mindful about the fact that V2 is a moving target and can (and likely will) change. You can overwrite parquet.writer.version to specify your desired version, but it can be dangerous to produce data in a moving-target format. For example, let's say you write a bunch of data in Parquet V2, and then the community decides to make a breaking change (which is completely fine / allowed since V2 isn't finalized). You are now left having to deal with a potentially large and complicated file format update. That's why it's not recommended to write files in parquet v2 just yet."

*As per the Apache Parquet community, Parquet V2 is not final yet, so it is not official. They advise against using Parquet V2 for writing (though the code is available).*

*As per the above, Spark hasn't started using Parquet V2 for writing.*

May I know why an unstable/unofficial version is being used in pyarrow?

On Wed, Apr 24, 2024 at 12:43 AM Jacob Wujciak <assignu...@apache.org> wrote:

> Hello,
>
> First off, please try to clean up formatting of emails to be legible when
> forwarding/quoting previous messages multiple times, especially when most
> of the quotes do not contain any useful information.
> It makes it much easier to parse the message and thus quicker to answer.
>
> The short answer is that we switched to 2.4 and more recently to 2.6 as
> the default to enable the usage of features these versions provide. As you
> have correctly quoted from the docs you can still write 1.0 if you want to
> ensure compatibility with systems that cannot process the 'newer' versions
> yet (2.6 was released in 2018!).
>
> You can find the long form discussions about these changes here:
> https://issues.apache.org/jira/browse/ARROW-12203
> https://lists.apache.org/thread/027g366yr3m03hwtpst6sr58b3trwhsm
>
> Best
> Jacob
>
> On 2024/04/24 02:32:01 Prem Sahoo wrote:
> > Hello Team,
> > Could you please share your thoughts about below questions?
> > Sent from my iPhone
> >
> > Begin forwarded message:
> >
> > > From: Prem Sahoo <prem.re...@gmail.com>
> > > Date: April 23, 2024 at 11:03:48 AM EDT
> > > To: dev-ow...@arrow.apache.org
> > > Subject: Re: PyArrow Using Parquet V2
> > >
> > > dev@arrow.apache.org
> > > Sent from my iPhone
> > >
> > >>> On Apr 23, 2024, at 6:25 AM, Prem Sahoo <prem.re...@gmail.com> wrote:
> > >>>
> > >> Hello Team,
> > >> Could anyone please help me on below query?
> > >> Sent from my iPhone
> > >>
> > >>>> On Apr 22, 2024, at 10:01 PM, Prem Sahoo <prem.re...@gmail.com> wrote:
> > >>>>
> > >>> Sent from my iPhone
> > >>>
> > >>>>> On Apr 22, 2024, at 9:51 PM, Prem Sahoo <prem.re...@gmail.com> wrote:
> > >>>>>
> > >>>>> Hello Team,
> > >>>>> I have a question regarding Parquet V2 writing through pyarrow.
> > >>>>> As per below Pyarrow started writing Parquet in V2 encoding.
> > >>>>> https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow.parquet.write_table
> > >>>>>
> > >>>>> version{"1.0", "2.4", "2.6"}, default "2.6"
> > >>>>> Determine which Parquet logical types are available for use,
> > >>>>> whether the reduced set from the Parquet 1.x.x format or the
> > >>>>> expanded logical types added in later format versions. Files
> > >>>>> written with version='2.4' or '2.6' may not be readable in all
> > >>>>> Parquet implementations, so version='1.0' is likely the choice
> > >>>>> that maximizes file compatibility. UINT32 and some logical types
> > >>>>> are only available with version '2.4'. Nanosecond timestamps are
> > >>>>> only available with version '2.6'. Other features such as
> > >>>>> compression algorithms or the new serialized data page format
> > >>>>> must be enabled separately (see 'compression' and
> > >>>>> 'data_page_version').
> > >>>>>
> > >>>>> As per Apache Parquet Community Parquet V2 is not final yet so it
> > >>>>> is not official. They are advising not to use Parquet V2 for
> > >>>>> writing (though code is available).
> > >>>>>
> > >>>>> As per above Spark hasn't started using Parquet V2 for writing.
> > >>>>> May I know how an unstable/unofficial version is being used in
> > >>>>> pyarrow?
> > >>>>>