The format_version field was designed for this purpose, but unfortunately it
is not honored at the moment. I don't think adding a new similar tag/flag
will solve the problem, because it does not fix legacy files and it takes a
lot of effort for all implementations to adopt a new tag/flag.
IMO, if parquet-mr hard-codes the format version to 1, then how can we
identify whether a Parquet file that was written is V1 or V2?
I have asked the same question, but according to you there is no way.
"As I have said in another thread, Parquet V2 is a concept which contains
a lot of features. FWIW, what are d
Spark leverages the Parquet writer from parquet-mr, which hard-codes the
format version to 1 [1] even when v2 features are enabled. That's why
I said on dev@parquet that we cannot really tell whether a Parquet file is
v1 or v2 simply from the format version field.
[1]
https://github.com/apache/parquet-mr/b
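To make this concrete, here is a minimal pyarrow sketch (the file path is
made up; version and data_page_version are pyarrow's documented write_table
parameters) showing that the footer's format_version says nothing about the
data page version actually used:

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"x": [1, 2, 3]})

    # Write V2 data pages while keeping the footer format version at 1.0.
    pq.write_table(table, "/tmp/pages_v2.parquet",
                   version="1.0", data_page_version="2.0")

    # The footer still reports 1.0 even though the pages are V2.
    print(pq.ParquetFile("/tmp/pages_v2.parquet").metadata.format_version)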
I tried this option, but Spark is not creating V2 Parquet, as I can still
see "format_version: 1.0". I think it needs something else too.
On Wed, Apr 24, 2024 at 12:33 PM Adam Lippai wrote:
> It supports writing v2, but defaults to v1.
> hadoopConfiguration.set("parquet.writer.version", "v2")
It supports writing v2, but defaults to v1.
hadoopConfiguration.set("parquet.writer.version", "v2")
Best regards,
Adam Lippai
On Wed, Apr 24, 2024 at 11:40 Prem Sahoo wrote:
> They do support reading of Parquet V2, but writing is not supported by
> Spark for V2.
>
> On Wed, Apr 24, 2024 at 11
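For reference, a PySpark sketch of the same setting (the spark.hadoop.*
prefix is Spark's standard way to forward a Hadoop property such as
parquet.writer.version down to parquet-mr; the output path is made up):

    from pyspark.sql import SparkSession

    # Forward the Hadoop property parquet.writer.version to parquet-mr.
    spark = (SparkSession.builder
             .appName("parquet-v2-demo")
             .config("spark.hadoop.parquet.writer.version", "v2")
             .getOrCreate())

    # Written with V2 data pages; the footer format_version still reads
    # 1.0 because parquet-mr hard-codes it.
    spark.range(10).write.mode("overwrite").parquet("/tmp/spark_v2_demo")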
> Parquet "V2" (including the V2 data pages, and other details) and the
2.x.y releases of the format library artifact. They aren't the same
unfortunately
Oh wow, yeah that's really not clear. parquet.a.o doesn't have any
structured version information as far as I could see.
On Wed, Apr 24, 2024
They do support reading of Parquet V2, but writing is not supported by
Spark for V2.
On Wed, Apr 24, 2024 at 11:10 AM Adam Lippai wrote:
> Hi Wes,
>
> As far as I remember, Hive, Spark, Impala, DuckDB, or even proprietary
> systems like Hyper and Vertica all support reading data page v2 now. The most [...]
Hi Wes,
As far as I remember, Hive, Spark, Impala, DuckDB, or even proprietary
systems like Hyper and Vertica all support reading data page v2 now. The most
recent column encodings (BYTE_STREAM_SPLIT) might be missing, but overall
the support seems much better than a year or two ago.
Best regards,
Adam Lippai
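A quick pyarrow sketch for checking that encoding (the path is made up;
use_byte_stream_split and use_dictionary are documented write_table options,
and BYTE_STREAM_SPLIT applies only to floating-point columns):

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"f": pa.array([1.0, 2.0, 3.0], type=pa.float32())})

    # Dictionary encoding takes priority, so disable it for the column.
    pq.write_table(table, "/tmp/bss.parquet",
                   use_dictionary=False, use_byte_stream_split=True)

    # The column chunk metadata lists the encodings actually used.
    md = pq.ParquetFile("/tmp/bss.parquet").metadata
    print(md.row_group(0).column(0).encodings)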
As an outsider, I suspect the only reason for these "common beliefs" is that
Spark simply doesn't support some of the breaking features (e.g. the
nanosecond data type). Maybe closing the very few gaps would resolve the
issue for good.
Best regards,
Adam Lippai
On Wed, Apr 24, 2024 at 10:32 Weston P
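The nanosecond gap is easy to see with pyarrow (a sketch assuming its
documented version parameter; file paths are made up):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # 1,000,000 ns = 1 ms, so no precision is lost when coercing below.
    table = pa.table({"ts": pa.array([1_000_000], type=pa.timestamp("ns"))})

    # Format version 2.6 added the nanosecond TIMESTAMP unit.
    pq.write_table(table, "/tmp/ts_ns.parquet", version="2.6")
    print(pq.read_table("/tmp/ts_ns.parquet").schema)  # ts: timestamp[ns]

    # With version 1.0 the column is coerced to microseconds instead.
    pq.write_table(table, "/tmp/ts_us.parquet", version="1.0")
    print(pq.read_table("/tmp/ts_us.parquet").schema)  # ts: timestamp[us]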
I think there is confusion between the Parquet "V2" (including the V2 data
pages, and other details) and the 2.x.y releases of the format library
artifact. They aren't the same, unfortunately. I don't think the V2 metadata
structures (the data pages in particular, and the new column encodings) are
widely adopted [...]
> *As per the Apache Parquet community, Parquet V2 is not final yet, so it is
> not official. They are advising not to use Parquet V2 for writing (though
> code is available).*
This would be news to me. Parquet releases are listed (by the Parquet
community) at [1].
The vote to release parquet 2.10 [...]
Hello Jacob,
Thanks for the information, and my apologies for the weird format of my
email.
This is the email from the Parquet community. May I know why pyarrow is
using Parquet V2, which is not official yet?
My question is: per the Parquet community, V2 is not final yet, so it is
not official yet.
"Hi [...]
Hello,
First off, please try to clean up the formatting of emails so they are
legible when forwarding/quoting previous messages multiple times, especially
when most of the quotes do not contain any useful information. That makes
the message much easier to parse and thus quicker to answer.
The short answer [...]