Re: Fwd: PyArrow Using Parquet V2

Gang Wu Wed, 24 Apr 2024 20:00:58 -0700

The format_version field was designed for this purpose, but unfortunately it
is not honored for now. I don't think adding a new, similar tag/flag will
solve the problem, because it does not fix the problem of legacy files and it
would take a lot of effort for all implementations to adopt the new tag/flag.


IMO, the confusion and the current status will not change in the near
future, given the current level of activity in the Parquet community. Parquet
has many implementations and has been adopted in various engines and
frameworks, and it takes a lot of effort to coordinate with these projects so
that everyone is on the same page. Different people may have different
understandings of what Parquet V2 means. There was an effort to define this:
https://github.com/apache/parquet-format/pull/164. It would be helpful if
someone could revive the discussion.

I think setting parquet.writer.version does produce V2 files to some extent,
and that much you can trust. But you need to be aware of which V2 features
are enabled under the hood. The good thing is that all the projects you have
mentioned are open source, so you may want to read the code to find the
answer to your question. Developers from one project may not be familiar with
the others, which is why the more you ask, the more people get confused.
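
As a rough illustration of checking which V2 features a given file actually
uses, here is a minimal Python sketch. It assumes pyarrow is installed. The
helper names (v2_era_encodings, inspect_file) and the set of encodings treated
as "V2-era" are my own illustrative choices, not an official definition of
Parquet V2. The footer's format_version is returned only for reference, since
it cannot be trusted on its own.

```python
# Encodings treated here as "V2-era". This set is an illustrative
# assumption, not an official definition of Parquet V2.
V2_ERA_ENCODINGS = frozenset({
    "RLE_DICTIONARY",
    "DELTA_BINARY_PACKED",
    "DELTA_LENGTH_BYTE_ARRAY",
    "DELTA_BYTE_ARRAY",
    "BYTE_STREAM_SPLIT",
})


def v2_era_encodings(encodings_by_column):
    """Return {column path: sorted V2-era encodings} for columns that use any."""
    found = {}
    for column, encodings in encodings_by_column.items():
        hits = sorted(set(encodings) & V2_ERA_ENCODINGS)
        if hits:
            found[column] = hits
    return found


def inspect_file(path):
    """Read a Parquet footer and return (format_version, V2-era encodings)."""
    # Imported lazily so the helper above works without pyarrow installed.
    import pyarrow.parquet as pq

    meta = pq.ParquetFile(path).metadata
    by_column = {}
    for rg in range(meta.num_row_groups):
        row_group = meta.row_group(rg)
        for col in range(row_group.num_columns):
            chunk = row_group.column(col)
            by_column.setdefault(chunk.path_in_schema, set()).update(chunk.encodings)
    # format_version is reported as-is; some writers hard-code it.
    return meta.format_version, v2_era_encodings(by_column)
```

Calling inspect_file on a file written by Spark and on one written by pyarrow
should show which of these encodings each writer produced, regardless of what
the version field says. Note that this only looks at the column encodings
recorded in the footer; as far as I know it does not tell you whether the
pages themselves use the V2 data page header.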

On Thu, Apr 25, 2024 at 10:12 AM Prem Sahoo <[email protected]> wrote:

> Correct, parquet-mr hard-codes the format version to 1. Then how can we
> identify whether a Parquet file was written as V1 or V2?
> I have asked the same question, but according to you there is no way.
>
> "As I have said in another thread, Parquet V2 is a concept which contains
> a lot of features. FWIW, what are defined in the specs [1] are finalized
> and some of them have been implemented in various implementations. Any file
> that contains one or more of those features can be considered v2, but the
> community has never defined a formal approach to distinguish between
> v1 and v2. Parquet does have a field in the footer thrift definition to
> mark the file version [2]. However, not all implementations populate it
> correctly and some engines will even throw if the version is not 1. To
> avoid confusion, I strongly suggest not using any v2 feature in your case
> unless you are 100% sure that all your tools support the v2 feature set you
> have enabled.
>
> [1] https://github.com/apache/parquet-format/blob/master/CHANGES.md
> [2] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1111
>
> Best,
> Gang"
>
> Here are my 2 cents
>
> We should have some flag or tag that differentiates Parquet files written
> as V1 from those written as V2. Then, if an engine doesn't support reading
> V2, we know for sure that we shouldn't feed it V2 Parquet.
>
> A few tools/products are now using Parquet V2 for both reading and writing,
> but *Apache Spark does not support writing with V2 encoding because, as per
> the Parquet community, V2 is not final yet*.
>
> Do we have any date for when the parquet-mr jar will have Parquet V2
> writing functionality, so that Spark can adhere to it?
>
> *Or, if I add hadoopConfiguration.set("parquet.writer.version", "v2")
> while creating Parquet files, will those be V2 Parquet?*
> Please confirm.
>
>
>
> On Wed, Apr 24, 2024 at 9:26 PM Gang Wu <[email protected]> wrote:
>
> > Spark leverages parquet writer from parquet-mr, which hard-codes the
> > format version to 1 [1] even when v2 features are enabled. That's why
> > I said in dev@parquet that we cannot really tell if a parquet file is v1
> > or v2 simply from the format version field.
> >
> > [1] https://github.com/apache/parquet-mr/blob/adb3e27c837f81fcef0fbefa8975eea202be693c/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L1863
> >
> > Best,
> > Gang
> >
> >
> > On Thu, Apr 25, 2024 at 3:51 AM Prem Sahoo <[email protected]> wrote:
> >
> > > I tried this option, but Spark is not creating V2 Parquet, as I can
> > > still see "format_version: 1.0". I think it needs something else too.
> > >
> > > On Wed, Apr 24, 2024 at 12:33 PM Adam Lippai <[email protected]> wrote:
> > >
> > > > It supports writing v2, but defaults to v1.
> > > > hadoopConfiguration.set("parquet.writer.version", "v2")
> > > >
> > > > Best regards,
> > > > Adam Lippai
> > > >
> > > >
> > > > On Wed, Apr 24, 2024 at 11:40 Prem Sahoo <[email protected]> wrote:
> > > >
> > > > > They do support reading of Parquet V2, but writing is not supported
> > > > > by Spark for V2.
> > > > >
> > > > > On Wed, Apr 24, 2024 at 11:10 AM Adam Lippai <[email protected]> wrote:
> > > > >
> > > > > > Hi Wes,
> > > > > >
> > > > > > As far as I remember hive, spark, impala, duckdb or even proprietary
> > > > > > systems like hyper, Vertica all support reading data page v2 now. The
> > > > > > most recent column encodings (BYTE_STREAM_SPLIT) might be missing, but
> > > > > > overall the support seems much better than a year or two ago.
> > > > > >
> > > > > > Best regards,
> > > > > > Adam Lippai
> > > > > >
> > > > > > On Wed, Apr 24, 2024 at 10:51 Wes McKinney <[email protected]> wrote:
> > > > > >
> > > > > > > I think there is confusion about the Parquet "V2" (including the V2
> > > > > > > data pages, and other details) and the 2.x.y releases of the format
> > > > > > > library artifact. They aren't the same, unfortunately. I don't think
> > > > > > > the V2 metadata structures (the data pages in particular, and new
> > > > > > > column encoding) are widely adopted / readable.
> > > > > > >
> > > > > > > On Wed, Apr 24, 2024 at 9:32 AM Weston Pace <[email protected]> wrote:
> > > > > > >
> > > > > > > > > *As per Apache Parquet Community Parquet V2 is not final yet so
> > > > > > > > > it is not official. They are advising not to use Parquet V2 for
> > > > > > > > > writing (though code is available).*
> > > > > > > >
> > > > > > > > This would be news to me.  Parquet releases are listed (by the
> > > > > > > > parquet community) at [1]
> > > > > > > >
> > > > > > > > The vote to release parquet 2.10 is here: [2]
> > > > > > > >
> > > > > > > > Neither of these links mentions anything about this being an
> > > > > > > > experimental, unofficial, or non-finalized release.
> > > > > > > >
> > > > > > > > I understand your concern.  I believe your quotes are coming from
> > > > > > > > your discussion on the parquet mailing list here [3].  This
> > > > > > > > communication is unfortunate and confusing to me as well.
> > > > > > > >
> > > > > > > > [1] https://parquet.apache.org/blog/
> > > > > > > > [2] https://lists.apache.org/thread/fdf1zz0f3xzz5zpvo6c811xjswhm1zy6
> > > > > > > > [3] https://lists.apache.org/thread/4nzroc68czwxnp0ndqz15kp1vhcd7vg3
> > > > > > > >
> > > > > > > >
> > > > > > > > On Wed, Apr 24, 2024 at 5:10 AM Prem Sahoo <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Hello Jacob,
> > > > > > > > > Thanks for the information, and my apologies for the weird format of
> > > > > > > > > my email.
> > > > > > > > >
> > > > > > > > > This is the email from the Parquet community. May I know why pyarrow
> > > > > > > > > is using Parquet V2, which is not official yet?
> > > > > > > > >
> > > > > > > > > My question is: according to the Parquet community, V2 is not final
> > > > > > > > > yet, so it is not official yet.
> > > > > > > > > "Hi Prem - Maybe I can help clarify to the best of my knowledge.
> > > > > > > > > Parquet V2 as a standard isn't finalized just yet. Meaning there is no
> > > > > > > > > formal, *finalized* "contract" that specifies what it means to write
> > > > > > > > > data in the V2 version. The discussions/conversations about what the
> > > > > > > > > final V2 standard may be are still in progress and are evolving.
> > > > > > > > >
> > > > > > > > > That being said, because V2 code does exist (though unfinalized), there
> > > > > > > > > are clients / tools that are writing data in the un-finalized V2
> > > > > > > > > format, as seems to be the case with Dremio.
> > > > > > > > >
> > > > > > > > > Now, as that comment you quoted said, you can have Spark write V2
> > > > > > > > > files, but it's worth being mindful about the fact that V2 is a moving
> > > > > > > > > target and can (and likely will) change. You can overwrite
> > > > > > > > > parquet.writer.version to specify your desired version, but it can be
> > > > > > > > > dangerous to produce data in a moving-target format. For example, let's
> > > > > > > > > say you write a bunch of data in Parquet V2, and then the community
> > > > > > > > > decides to make a breaking change (which is completely fine / allowed
> > > > > > > > > since V2 isn't finalized). You are now left having to deal with a
> > > > > > > > > potentially large and complicated file format update. That's why it's
> > > > > > > > > not recommended to write files in parquet v2 just yet."
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > *As per Apache Parquet Community Parquet V2 is not final yet so it is
> > > > > > > > > not official. They are advising not to use Parquet V2 for writing
> > > > > > > > > (though code is available).*
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > *As per above Spark hasn't started using Parquet V2 for writing.*
> > > > > > > > >
> > > > > > > > > May I know how an unstable/unofficial version is being used in pyarrow?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Wed, Apr 24, 2024 at 12:43 AM Jacob Wujciak <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > > Hello,
> > > > > > > > > >
> > > > > > > > > > First off, please try to clean up formatting of emails to be legible
> > > > > > > > > > when forwarding/quoting previous messages multiple times, especially
> > > > > > > > > > when most of the quotes do not contain any useful information. It
> > > > > > > > > > makes it much easier to parse the message and thus quicker to answer.
> > > > > > > > > >
> > > > > > > > > > The short answer is that we switched to 2.4 and more recently to 2.6
> > > > > > > > > > as the default to enable the usage of features these versions provide.
> > > > > > > > > > As you have correctly quoted from the docs you can still write 1.0 if
> > > > > > > > > > you want to ensure compatibility with systems that can not process the
> > > > > > > > > > 'newer' versions yet (2.6 was released in 2018!).
> > > > > > > > > >
> > > > > > > > > > You can find the long form discussions about these changes here:
> > > > > > > > > > https://issues.apache.org/jira/browse/ARROW-12203
> > > > > > > > > > https://lists.apache.org/thread/027g366yr3m03hwtpst6sr58b3trwhsm
> > > > > > > > > >
> > > > > > > > > > Best
> > > > > > > > > > Jacob
> > > > > > > > > >
> > > > > > > > > > On 2024/04/24 02:32:01 Prem Sahoo wrote:
> > > > > > > > > > > Hello Team,
> > > > > > > > > > > Could you please share your thoughts about the questions below?
> > > > > > > > > > > Sent from my iPhone
> > > > > > > > > > >
> > > > > > > > > > > Begin forwarded message:
> > > > > > > > > > >
> > > > > > > > > > > > From: Prem Sahoo <[email protected]>
> > > > > > > > > > > > Date: April 23, 2024 at 11:03:48 AM EDT
> > > > > > > > > > > > To: [email protected]
> > > > > > > > > > > > Subject: Re: PyArrow Using Parquet V2
> > > > > > > > > > > >
> > > > > > > > > > > > [email protected]
> > > > > > > > > > > > Sent from my iPhone
> > > > > > > > > > > >
> > > > > > > > > > > > >>> On Apr 23, 2024, at 6:25 AM, Prem Sahoo <[email protected]> wrote:
> > > > > > > > > > > >>>
> > > > > > > > > > > >> Hello Team,
> > > > > > > > > > > >> Could anyone please help me on below query?
> > > > > > > > > > > >> Sent from my iPhone
> > > > > > > > > > > >>
> > > > > > > > > > > > >>>> On Apr 22, 2024, at 10:01 PM, Prem Sahoo <[email protected]> wrote:
> > > > > > > > > > > > >>>>
> > > > > > > > > > > > >>> Sent from my iPhone
> > > > > > > > > > > > >>>
> > > > > > > > > > > > >>>>> On Apr 22, 2024, at 9:51 PM, Prem Sahoo <[email protected]> wrote:
> > > > > > > > > > > > >>>>>
> > > > > > > > > > > >>>>> Hello Team,
> > > > > > > > > > > > >>>>> I have a question regarding Parquet V2 writing through pyarrow.
> > > > > > > > > > > > >>>>> As per below, pyarrow started writing Parquet in V2 encoding.
> > > > > > > > > > > >>>>>
> > > > > > > > > > > > >>>>> https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow.parquet.write_table
> > > > > > > > > > > >>>>>
> > > > > > > > > > > > >>>>> version{"1.0", "2.4", "2.6"}, default "2.6"
> > > > > > > > > > > > >>>>> Determine which Parquet logical types are available for use, whether
> > > > > > > > > > > > >>>>> the reduced set from the Parquet 1.x.x format or the expanded logical
> > > > > > > > > > > > >>>>> types added in later format versions. Files written with version='2.4'
> > > > > > > > > > > > >>>>> or '2.6' may not be readable in all Parquet implementations, so
> > > > > > > > > > > > >>>>> version='1.0' is likely the choice that maximizes file compatibility.
> > > > > > > > > > > > >>>>> UINT32 and some logical types are only available with version '2.4'.
> > > > > > > > > > > > >>>>> Nanosecond timestamps are only available with version '2.6'. Other
> > > > > > > > > > > > >>>>> features such as compression algorithms or the new serialized data page
> > > > > > > > > > > > >>>>> format must be enabled separately (see 'compression' and
> > > > > > > > > > > > >>>>> 'data_page_version').
> > > > > > > > > > > >>>>>
> > > > > > > > > > > >>>>>
> > > > > > > > > > > > >>>>> As per Apache Parquet Community Parquet V2 is not final yet so it is
> > > > > > > > > > > > >>>>> not official. They are advising not to use Parquet V2 for writing
> > > > > > > > > > > > >>>>> (though code is available).
> > > > > > > > > > > > >>>>>
> > > > > > > > > > > > >>>>> As per above Spark hasn't started using Parquet V2 for writing.
> > > > > > > > > > > > >>>>> May I know how an unstable/unofficial version is being used in pyarrow?
