Re: Fwd: PyArrow Using Parquet V2

Jacob Wujciak Wed, 24 Apr 2024 09:06:31 -0700

> Parquet "V2" (including the V2 data pages, and other details) and the
> 2.x.y releases of the format library artifact. They aren't the same
> unfortunately


Oh wow, yeah that's really not clear. parquet.a.o doesn't have any
structured version information as far as I could see.

On Wed, Apr 24, 2024 at 17:38, Prem Sahoo <prem.re...@gmail.com> wrote:

> They do support reading Parquet V2, but Spark does not support writing
> V2.
>
> On Wed, Apr 24, 2024 at 11:10 AM Adam Lippai <a...@rigo.sk> wrote:
>
> > Hi Wes,
> >
> > As far as I remember, Hive, Spark, Impala, DuckDB, and even proprietary
> > systems like Hyper and Vertica all support reading data page v2 now. The
> > most recent column encodings (BYTE_STREAM_SPLIT) might be missing, but
> > overall the support seems much better than a year or two ago.
> >
> > Best regards,
> > Adam Lippai
> >
> > On Wed, Apr 24, 2024 at 10:51 Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > > I think there is confusion about the Parquet "V2" (including the V2
> > > data pages, and other details) and the 2.x.y releases of the format
> > > library artifact. They aren't the same unfortunately. I don't think the
> > > V2 metadata structures (the data pages in particular, and new column
> > > encoding) are widely adopted / readable.
> > >
> > > On Wed, Apr 24, 2024 at 9:32 AM Weston Pace <weston.p...@gmail.com> wrote:
> > >
> > > > > *As per Apache Parquet Community, Parquet V2 is not final yet, so it
> > > > > is not official. They are advising not to use Parquet V2 for writing
> > > > > (though code is available).*
> > > >
> > > > This would be news to me.  Parquet releases are listed (by the parquet
> > > > community) at [1]
> > > >
> > > > The vote to release parquet 2.10 is here: [2]
> > > >
> > > > Neither of these links mentions anything about this being an
> > > > experimental, unofficial, or non-finalized release.
> > > >
> > > > I understand your concern.  I believe your quotes are coming from your
> > > > discussion on the parquet mailing list here [3].  This communication is
> > > > unfortunate and confusing to me as well.
> > > >
> > > > [1] https://parquet.apache.org/blog/
> > > > [2] https://lists.apache.org/thread/fdf1zz0f3xzz5zpvo6c811xjswhm1zy6
> > > > [3] https://lists.apache.org/thread/4nzroc68czwxnp0ndqz15kp1vhcd7vg3
> > > >
> > > >
> > > > On Wed, Apr 24, 2024 at 5:10 AM Prem Sahoo <prem.re...@gmail.com> wrote:
> > > >
> > > > > Hello Jacob,
> > > > > Thanks for the information, and my apologies for the weird format of
> > > > > my email.
> > > > >
> > > > > This is the email from the Parquet community. May I know why pyarrow
> > > > > is using Parquet V2, which is not official yet?
> > > > >
> > > > > My question arises because, according to the Parquet community, V2 is
> > > > > not final yet, so it is not official yet.
> > > > > "Hi Prem - Maybe I can help clarify to the best of my knowledge.
> > > > > Parquet V2 as a standard isn't finalized just yet. Meaning there is
> > > > > no formal, *finalized* "contract" that specifies what it means to
> > > > > write data in the V2 version. The discussions/conversations about
> > > > > what the final V2 standard may be are still in progress and are
> > > > > evolving.
> > > > >
> > > > > That being said, because V2 code does exist (though unfinalized),
> > > > > there are clients / tools that are writing data in the un-finalized
> > > > > V2 format, as seems to be the case with Dremio.
> > > > >
> > > > > Now, as that comment you quoted said, you can have Spark write V2
> > > > > files, but it's worth being mindful about the fact that V2 is a
> > > > > moving target and can (and likely will) change. You can overwrite
> > > > > parquet.writer.version to specify your desired version, but it can
> > > > > be dangerous to produce data in a moving-target format. For example,
> > > > > let's say you write a bunch of data in Parquet V2, and then the
> > > > > community decides to make a breaking change (which is completely
> > > > > fine / allowed since V2 isn't finalized). You are now left having to
> > > > > deal with a potentially large and complicated file format update.
> > > > > That's why it's not recommended to write files in parquet v2 just
> > > > > yet."
> > > > >
> > > > >
> > > > > *As per Apache Parquet Community, Parquet V2 is not final yet, so it
> > > > > is not official. They are advising not to use Parquet V2 for writing
> > > > > (though code is available).*
> > > > >
> > > > >
> > > > > *As per above, Spark hasn't started using Parquet V2 for writing.*
> > > > >
> > > > > May I know how an unstable/unofficial version is being used in
> > > > > pyarrow?
> > > > >
> > > > >
> > > > > On Wed, Apr 24, 2024 at 12:43 AM Jacob Wujciak <assignu...@apache.org> wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > First off, please try to clean up formatting of emails to be legible
> > > > > > when forwarding/quoting previous messages multiple times, especially
> > > > > > when most of the quotes do not contain any useful information. It
> > > > > > makes it much easier to parse the message and thus quicker to answer.
> > > > > >
> > > > > > The short answer is that we switched to 2.4 and more recently to 2.6
> > > > > > as the default to enable the usage of features these versions
> > > > > > provide. As you have correctly quoted from the docs you can still
> > > > > > write 1.0 if you want to ensure compatibility with systems that
> > > > > > cannot process the 'newer' versions yet (2.6 was released in 2018!).
> > > > > >
> > > > > > You can find the long form discussions about these changes here:
> > > > > > https://issues.apache.org/jira/browse/ARROW-12203
> > > > > > https://lists.apache.org/thread/027g366yr3m03hwtpst6sr58b3trwhsm
> > > > > >
> > > > > > Best
> > > > > > Jacob
> > > > > >
> > > > > > On 2024/04/24 02:32:01 Prem Sahoo wrote:
> > > > > > > Hello Team,
> > > > > > > Could you please share your thoughts about below questions?
> > > > > > > Sent from my iPhone
> > > > > > >
> > > > > > > Begin forwarded message:
> > > > > > >
> > > > > > > > From: Prem Sahoo <prem.re...@gmail.com>
> > > > > > > > Date: April 23, 2024 at 11:03:48 AM EDT
> > > > > > > > To: dev-ow...@arrow.apache.org
> > > > > > > > Subject: Re: PyArrow Using Parquet V2
> > > > > > > >
> > > > > > > > dev@arrow.apache.org
> > > > > > > > Sent from my iPhone
> > > > > > > >
> > > > > > > >>> On Apr 23, 2024, at 6:25 AM, Prem Sahoo <prem.re...@gmail.com> wrote:
> > > > > > > >>>
> > > > > > > >> Hello Team,
> > > > > > > >> Could anyone please help me on below query?
> > > > > > > >> Sent from my iPhone
> > > > > > > >>
> > > > > > > >>>> On Apr 22, 2024, at 10:01 PM, Prem Sahoo <prem.re...@gmail.com> wrote:
> > > > > > > >>>>
> > > > > > > >>> Sent from my iPhone
> > > > > > > >>>
> > > > > > > >>>>> On Apr 22, 2024, at 9:51 PM, Prem Sahoo <prem.re...@gmail.com> wrote:
> > > > > > > >>>>>
> > > > > > > >>>>> Hello Team,
> > > > > > > >>>>> I have a question regarding Parquet V2 writing through pyarrow.
> > > > > > > >>>>> As per below, PyArrow started writing Parquet in V2 encoding.
> > > > > > > >>>>>
> > > > > > > >>>>> https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow.parquet.write_table
> > > > > > > >>>>>
> > > > > > > >>>>> version{“1.0”, “2.4”, “2.6”}, default “2.6”
> > > > > > > >>>>> Determine which Parquet logical types are available for use,
> > > > > > > >>>>> whether the reduced set from the Parquet 1.x.x format or the
> > > > > > > >>>>> expanded logical types added in later format versions. Files
> > > > > > > >>>>> written with version=’2.4’ or ‘2.6’ may not be readable in all
> > > > > > > >>>>> Parquet implementations, so version=’1.0’ is likely the choice
> > > > > > > >>>>> that maximizes file compatibility. UINT32 and some logical
> > > > > > > >>>>> types are only available with version ‘2.4’. Nanosecond
> > > > > > > >>>>> timestamps are only available with version ‘2.6’. Other
> > > > > > > >>>>> features such as compression algorithms or the new serialized
> > > > > > > >>>>> data page format must be enabled separately (see ‘compression’
> > > > > > > >>>>> and ‘data_page_version’).
> > > > > > > >>>>>
> > > > > > > >>>>>
> > > > > > > >>>>> As per Apache Parquet Community, Parquet V2 is not final yet,
> > > > > > > >>>>> so it is not official. They are advising not to use Parquet V2
> > > > > > > >>>>> for writing (though code is available).
> > > > > > > >>>>>
> > > > > > > >>>>> As per above, Spark hasn't started using Parquet V2 for writing.
> > > > > > > >>>>> May I know how an unstable/unofficial version is being used in
> > > > > > > >>>>> pyarrow?
> > > > > > > >>>>>
