Re: Fwd: PyArrow Using Parquet V2

Prem Sahoo Wed, 24 Apr 2024 05:10:50 -0700

Hello Jacob,
Thanks for the information, and my apologies for the weird format of my
email.

Below is the reply I received from the Parquet community. May I know why
pyarrow is using Parquet V2, which is not official yet?

My question arises because, according to the Parquet community, V2 is not
final yet and so is not official:
"Hi Prem - Maybe I can help clarify to the best of my knowledge. Parquet V2
as a standard isn't finalized just yet. Meaning there is no formal,
*finalized* "contract" that specifies what it means to write data in the V2
version. The discussions/conversations about what the final V2 standard may
be are still in progress and are evolving.

That being said, because V2 code does exist (though unfinalized), there are
clients / tools that are writing data in the un-finalized V2 format, as
seems to be the case with Dremio.

Now, as that comment you quoted said, you can have Spark write V2 files,
but it's worth being mindful about the fact that V2 is a moving target and
can (and likely will) change. You can override parquet.writer.version to
specify your desired version, but it can be dangerous to produce data in a
moving-target format. For example, let's say you write a bunch of data in
Parquet V2, and then the community decides to make a breaking change (which
is completely fine / allowed since V2 isn't finalized). You are now left
having to deal with a potentially large and complicated file format update.
That's why it's not recommended to write files in parquet v2 just yet."

*As per the Apache Parquet community, Parquet V2 is not final yet, so it is
not official. They advise against using Parquet V2 for writing (though the
code is available).*

*As per the above, Spark hasn't started using Parquet V2 for writing.*

May I know how an unstable/unofficial version is being used in pyarrow?

On Wed, Apr 24, 2024 at 12:43 AM Jacob Wujciak <[email protected]>
wrote:

> Hello,
>
> First off, please try to clean up the formatting of emails to be legible when
> forwarding/quoting previous messages multiple times, especially when most
> of the quotes do not contain any useful information. It makes it much
> easier to parse the message and thus quicker to answer.
>
> The short answer is that we switched to 2.4 and more recently to 2.6 as
> the default to enable the usage of features these versions provide. As you
> have correctly quoted from the docs you can still write 1.0 if you want to
> ensure compatibility with systems that cannot process the 'newer' versions
> yet (2.6 was released in 2018!).
>
> You can find the long form discussions about these changes here:
> https://issues.apache.org/jira/browse/ARROW-12203
> https://lists.apache.org/thread/027g366yr3m03hwtpst6sr58b3trwhsm
>
> Best
> Jacob
>
> On 2024/04/24 02:32:01 Prem Sahoo wrote:
> > Hello Team,
> > Could you please share your thoughts about below questions?
> > Sent from my iPhone
> >
> > Begin forwarded message:
> >
> > > From: Prem Sahoo <[email protected]>
> > > Date: April 23, 2024 at 11:03:48 AM EDT
> > > To: [email protected]
> > > Subject: Re: PyArrow Using Parquet V2
> > >
> > > [email protected]
> > > Sent from my iPhone
> > >
> > >>> On Apr 23, 2024, at 6:25 AM, Prem Sahoo <[email protected]> wrote:
> > >>>
> > >> Hello Team,
> > >> Could anyone please help me on below query?
> > >> Sent from my iPhone
> > >>
> > >>>>> On Apr 22, 2024, at 9:51 PM, Prem Sahoo <[email protected]> wrote:
> > >>>>>
> > >>>>> Hello Team,
> > >>>>> I have a question regarding Parquet V2 writing through pyarrow.
> > >>>>> As per the docs below, pyarrow has started writing Parquet in V2 encoding.
> > >>>>>
> > >>>>> https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow.parquet.write_table
> > >>>>>
> > >>>>> version{"1.0", "2.4", "2.6"}, default "2.6"
> > >>>>> Determine which Parquet logical types are available for use,
> > >>>>> whether the reduced set from the Parquet 1.x.x format or the expanded
> > >>>>> logical types added in later format versions. Files written with
> > >>>>> version='2.4' or '2.6' may not be readable in all Parquet
> > >>>>> implementations, so version='1.0' is likely the choice that maximizes
> > >>>>> file compatibility. UINT32 and some logical types are only available
> > >>>>> with version '2.4'. Nanosecond timestamps are only available with
> > >>>>> version '2.6'. Other features such as compression algorithms or the
> > >>>>> new serialized data page format must be enabled separately (see
> > >>>>> 'compression' and 'data_page_version').
> > >>>>>
> > >>>>> As per the Apache Parquet community, Parquet V2 is not final yet, so
> > >>>>> it is not official. They advise against using Parquet V2 for writing
> > >>>>> (though the code is available).
> > >>>>>
> > >>>>> As per the above, Spark hasn't started using Parquet V2 for writing.
> > >>>>> May I know how an unstable/unofficial version is being used in
> > >>>>> pyarrow?
> > >>>>>
> >
>
