Re: Fwd: PyArrow Using Parquet V2

Prem Sahoo Wed, 24 Apr 2024 19:12:15 -0700

Correct, parquet-mr hard-codes the format version to 1. How, then, can we
identify whether a Parquet file was written as V1 or V2?
I asked the same question before, but according to you there is no way.


"As I have said in another thread, Parquet V2 is a concept which contains
a lot of features. FWIW, what are defined in the specs [1] are finalized and
some of them have been implemented in various implementations. Any file
that contains one or more of those features can be considered v2 but the
community has never defined a formal approach to distinguish between
v1 and v2. Parquet does have a field in the footer thrift definition to mark
the file version [2]. However, not all implementations populate it correctly
and some engines will even throw if the version is not 1. To avoid confusion, I
strongly suggest not using any v2 feature in your case unless you are 100%
sure that all your tools support the v2 feature set you have enabled.

[1] https://github.com/apache/parquet-format/blob/master/CHANGES.md
[2] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1111

Best,
Gang"

Here are my 2 cents.

We should have some flag or tag that differentiates Parquet files written in
V1 from those written in V2. Then, if an engine doesn't support reading V2, we
can be sure not to feed it V2 Parquet.

A few tools/products now use Parquet V2 for both reading and writing, but
*Apache Spark does not support writing with V2 encoding because, as per the
Parquet community, V2 is not final yet*.

Is there a date by which the parquet-mr jar will have Parquet V2 writing
functionality, so that Spark can adopt it?

*Or, if I add hadoopConfiguration.set("parquet.writer.version", "v2") when
creating Parquet files, are those files V2 Parquet?*
Please confirm.
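
For what it's worth, here is a rough sketch of how one might check a written
file from Python, since the footer version field alone is not reliable. It is
only a heuristic under my own assumptions: the set V2_ENCODINGS and the helper
names v2_encodings_used and summarize_parquet_file are made up for
illustration, and treating these encodings as a "V2" signal is not an official
rule. The second helper assumes pyarrow is installed. As far as I know the
data page type is not visible in this metadata, so a file could still contain
V2 data pages while reporting only older encodings.

```python
# Encodings commonly associated with the "v2" feature set of the Parquet spec.
# Assumption: using them as a v2 signal is a heuristic, not an official rule.
V2_ENCODINGS = frozenset({
    "RLE_DICTIONARY",
    "DELTA_BINARY_PACKED",
    "DELTA_LENGTH_BYTE_ARRAY",
    "DELTA_BYTE_ARRAY",
    "BYTE_STREAM_SPLIT",
})


def v2_encodings_used(encodings):
    """Return the sorted subset of encoding names that are in V2_ENCODINGS."""
    return sorted(V2_ENCODINGS.intersection(encodings))


def summarize_parquet_file(path):
    """Collect the footer version, the writer string, and any v2 encodings."""
    # Imported lazily so the pure helper above works without pyarrow.
    import pyarrow.parquet as pq

    meta = pq.ParquetFile(path).metadata
    encodings = set()
    # Walk every column chunk of every row group and gather its encodings.
    for rg_index in range(meta.num_row_groups):
        row_group = meta.row_group(rg_index)
        for col_index in range(row_group.num_columns):
            encodings.update(row_group.column(col_index).encodings)
    return {
        "format_version": meta.format_version,  # may read 1.0 even with v2 features
        "created_by": meta.created_by,          # identifies the writing library
        "v2_encodings": v2_encodings_used(encodings),
    }
```

Calling summarize_parquet_file on a file written by Spark with
parquet.writer.version set to v2 would show whether any of these encodings
appear, even when format_version is still reported as 1.0.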



On Wed, Apr 24, 2024 at 9:26 PM Gang Wu <[email protected]> wrote:

> Spark leverages parquet writer from parquet-mr, which hard-codes the
> format version to 1 [1] even when v2 features are enabled. That's why
> I said in dev@parquet that we cannot really tell if a parquet file is v1 or
> v2 simply from the format version field.
>
> [1] https://github.com/apache/parquet-mr/blob/adb3e27c837f81fcef0fbefa8975eea202be693c/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L1863
>
> Best,
> Gang
>
>
> On Thu, Apr 25, 2024 at 3:51 AM Prem Sahoo <[email protected]> wrote:
>
> > I tried this option, but Spark is not creating V2 Parquet, as I can
> > still see "format_version: 1.0". I think it needs something else too.
> >
> > On Wed, Apr 24, 2024 at 12:33 PM Adam Lippai <[email protected]> wrote:
> >
> > > It supports writing v2, but defaults to v1.
> > > hadoopConfiguration.set("parquet.writer.version", "v2")
> > >
> > > Best regards,
> > > Adam Lippai
> > >
> > >
> > > On Wed, Apr 24, 2024 at 11:40 Prem Sahoo <[email protected]> wrote:
> > >
> > > > They do support reading Parquet V2, but writing V2 is not supported by
> > > > Spark.
> > > >
> > > > On Wed, Apr 24, 2024 at 11:10 AM Adam Lippai <[email protected]> wrote:
> > > >
> > > > > Hi Wes,
> > > > >
> > > > > > As far as I remember hive, spark, impala, duckdb or even proprietary
> > > > > > systems like hyper, Vertica all support reading data page v2 now. The most
> > > > > > recent column encodings (BYTE_STREAM_SPLIT) might be missing, but overall
> > > > > > the support seems much better than a year or two ago.
> > > > >
> > > > > Best regards,
> > > > > Adam Lippai
> > > > >
> > > > > > On Wed, Apr 24, 2024 at 10:51 Wes McKinney <[email protected]> wrote:
> > > > >
> > > > > > > I think there is confusion about the Parquet "V2" (including the V2 data
> > > > > > > pages, and other details) and the 2.x.y releases of the format library
> > > > > > > artifact. They aren't the same unfortunately. I don't think the V2 metadata
> > > > > > > structures (the data pages in particular, and new column encoding) are
> > > > > > > widely adopted / readable.
> > > > > >
> > > > > > > On Wed, Apr 24, 2024 at 9:32 AM Weston Pace <[email protected]> wrote:
> > > > > >
> > > > > > > > > *As per Apache Parquet Community Parquet V2 is not final yet so it is not
> > > > > > > > > official. They are advising not to use Parquet V2 for writing (though code
> > > > > > > > > is available).*
> > > > > > >
> > > > > > > > This would be news to me.  Parquet releases are listed (by the parquet
> > > > > > > > community) at [1]
> > > > > > > >
> > > > > > > > The vote to release parquet 2.10 is here: [2]
> > > > > > > >
> > > > > > > > Neither of these links mention anything about this being an experimental,
> > > > > > > > unofficial, or non-finalized release.
> > > > > > > >
> > > > > > > > I understand your concern.  I believe your quotes are coming from your
> > > > > > > > discussion on the parquet mailing list here [3].  This communication is
> > > > > > > > unfortunate and confusing to me as well.
> > > > > > >
> > > > > > > [1] https://parquet.apache.org/blog/
> > > > > > > > [2] https://lists.apache.org/thread/fdf1zz0f3xzz5zpvo6c811xjswhm1zy6
> > > > > > > > [3] https://lists.apache.org/thread/4nzroc68czwxnp0ndqz15kp1vhcd7vg3
> > > > > > >
> > > > > > >
> > > > > > > > On Wed, Apr 24, 2024 at 5:10 AM Prem Sahoo <[email protected]> wrote:
> > > > > > >
> > > > > > > > Hello Jacob,
> > > > > > > > > Thanks for the information, and my apologies for the weird format of my
> > > > > > > > > email.
> > > > > > > > >
> > > > > > > > > This is the email from the Parquet community. May I know why pyarrow is
> > > > > > > > > using Parquet V2, which is not official yet?
> > > > > > > > >
> > > > > > > > > My question is: according to the Parquet community, V2 is not final yet, so
> > > > > > > > > it is not official yet.
> > > > > > > > > "Hi Prem - Maybe I can help clarify to the best of my knowledge. Parquet V2
> > > > > > > > > as a standard isn't finalized just yet. Meaning there is no formal,
> > > > > > > > > *finalized* "contract" that specifies what it means to write data in the V2
> > > > > > > > > version. The discussions/conversations about what the final V2 standard may
> > > > > > > > > be are still in progress and are evolving.
> > > > > > > > >
> > > > > > > > > That being said, because V2 code does exist (though unfinalized), there are
> > > > > > > > > clients / tools that are writing data in the un-finalized V2 format, as
> > > > > > > > > seems to be the case with Dremio.
> > > > > > > > >
> > > > > > > > > Now, as that comment you quoted said, you can have Spark write V2 files,
> > > > > > > > > but it's worth being mindful about the fact that V2 is a moving target and
> > > > > > > > > can (and likely will) change. You can overwrite parquet.writer.version to
> > > > > > > > > specify your desired version, but it can be dangerous to produce data in a
> > > > > > > > > moving-target format. For example, let's say you write a bunch of data in
> > > > > > > > > Parquet V2, and then the community decides to make a breaking change (which
> > > > > > > > > is completely fine / allowed since V2 isn't finalized). You are now left
> > > > > > > > > having to deal with a potentially large and complicated file format update.
> > > > > > > > > That's why it's not recommended to write files in parquet v2 just yet."
> > > > > > > >
> > > > > > > >
> > > > > > > > > *As per Apache Parquet Community Parquet V2 is not final yet so it is not
> > > > > > > > > official. They are advising not to use Parquet V2 for writing (though code
> > > > > > > > > is available).*
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > *As per above Spark hasn't started using Parquet V2 for writing*.
> > > > > > > > >
> > > > > > > > > May I know how an unstable/unofficial version is being used in pyarrow?
> > > > > > > >
> > > > > > > > > On Wed, Apr 24, 2024 at 12:43 AM Jacob Wujciak <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Hello,
> > > > > > > > >
> > > > > > > > > > First off, please try to clean up formatting of emails to be legible when
> > > > > > > > > > forwarding/quoting previous messages multiple times, especially when most
> > > > > > > > > > of the quotes do not contain any useful information. It makes it much
> > > > > > > > > > easier to parse the message and thus quicker to answer.
> > > > > > > > > >
> > > > > > > > > > The short answer is that we switched to 2.4 and more recently to 2.6 as
> > > > > > > > > > the default to enable the usage of features these versions provide. As you
> > > > > > > > > > have correctly quoted from the docs you can still write 1.0 if you want to
> > > > > > > > > > ensure compatibility with systems that cannot process the 'newer' versions
> > > > > > > > > > yet (2.6 was released in 2018!).
> > > > > > > > >
> > > > > > > > > > You can find the long form discussions about these changes here:
> > > > > > > > > > https://issues.apache.org/jira/browse/ARROW-12203
> > > > > > > > > > https://lists.apache.org/thread/027g366yr3m03hwtpst6sr58b3trwhsm
> > > > > > > > >
> > > > > > > > > Best
> > > > > > > > > Jacob
> > > > > > > > >
> > > > > > > > > On 2024/04/24 02:32:01 Prem Sahoo wrote:
> > > > > > > > > > Hello Team,
> > > > > > > > > > > Could you please share your thoughts about below questions?
> > > > > > > > > > Sent from my iPhone
> > > > > > > > > >
> > > > > > > > > > Begin forwarded message:
> > > > > > > > > >
> > > > > > > > > > > From: Prem Sahoo <[email protected]>
> > > > > > > > > > > Date: April 23, 2024 at 11:03:48 AM EDT
> > > > > > > > > > > To: [email protected]
> > > > > > > > > > > Subject: Re: PyArrow Using Parquet V2
> > > > > > > > > > >
> > > > > > > > > > > [email protected]
> > > > > > > > > > > Sent from my iPhone
> > > > > > > > > > >
> > > > > > > > > > > >>> On Apr 23, 2024, at 6:25 AM, Prem Sahoo <[email protected]> wrote:
> > > > > > > > > > >>>
> > > > > > > > > > >> Hello Team,
> > > > > > > > > > >> Could anyone please help me on below query?
> > > > > > > > > > >> Sent from my iPhone
> > > > > > > > > > >>
> > > > > > > > > > > >>>> On Apr 22, 2024, at 10:01 PM, Prem Sahoo <[email protected]> wrote:
> > > > > > > > > > >>>>
> > > > > > > > > > >>> 
> > > > > > > > > > >>> Sent from my iPhone
> > > > > > > > > > >>>
> > > > > > > > > > > >>>>> On Apr 22, 2024, at 9:51 PM, Prem Sahoo <[email protected]> wrote:
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>> 
> > > > > > > > > > >>>>
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> 
> > > > > > > > > > >>>>> Hello Team,
> > > > > > > > > > > >>>>> I have a question regarding Parquet V2 writing through pyarrow.
> > > > > > > > > > > >>>>> As per below, Pyarrow started writing Parquet in V2 encoding.
> > > > > > > > > > >>>>>
> > > > > > > > > > > >>>>> https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow.parquet.write_table
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> version{“1.0”, “2.4”, “2.6”}, default “2.6”
> > > > > > > > > > > >>>>> Determine which Parquet logical types are available for use, whether the
> > > > > > > > > > > >>>>> reduced set from the Parquet 1.x.x format or the expanded logical types
> > > > > > > > > > > >>>>> added in later format versions. Files written with version='2.4' or '2.6'
> > > > > > > > > > > >>>>> may not be readable in all Parquet implementations, so version='1.0' is
> > > > > > > > > > > >>>>> likely the choice that maximizes file compatibility. UINT32 and some
> > > > > > > > > > > >>>>> logical types are only available with version '2.4'. Nanosecond
> > > > > > > > > > > >>>>> timestamps are only available with version '2.6'. Other features such as
> > > > > > > > > > > >>>>> compression algorithms or the new serialized data page format must be
> > > > > > > > > > > >>>>> enabled separately (see 'compression' and 'data_page_version').
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>>
> > > > > > > > > > > >>>>> As per Apache Parquet Community Parquet V2 is not final yet so it is not
> > > > > > > > > > > >>>>> official. They are advising not to use Parquet V2 for writing (though
> > > > > > > > > > > >>>>> code is available).
> > > > > > > > > > > >>>>>
> > > > > > > > > > > >>>>> As per above Spark hasn't started using Parquet V2 for writing.
> > > > > > > > > > > >>>>> May I know how an unstable/unofficial version is being used in pyarrow?
> > > > > > > > > > >>>>>
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
