What should be done if a system doesn't have a record batch concept? For
example, if I remember correctly, Velox works this way and only has a "row
vector" (struct array) but no equivalent to record batch. Should these
systems reject a record batch or should they just accept it as a struct
array?
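As a minimal sketch of that equivalence (pyarrow used purely for illustration; the data and names here are made up), a record batch can be viewed as a struct array with one field per column, and the mapping is reversible:

    import pyarrow as pa

    batch = pa.record_batch({"x": [1, 2], "y": ["a", "b"]})

    # View the batch as a struct array with one field per column:
    as_struct = pa.StructArray.from_arrays(batch.columns, names=batch.schema.names)

    # The reverse mapping recovers the batch, so a system with only
    # "row vectors" could accept batches through this equivalence:
    assert pa.RecordBatch.from_struct_array(as_struct).equals(batch)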
The format_version field was designed for this purpose, but it is
unfortunately not honored for now. I don't think adding a new, similar
tag/flag will solve the problem: it does not fix legacy files, and it would
take a lot of effort for all implementations to adopt the new tag/flag.
IMO, I…
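For reference, the field in question is readable from the file footer; a minimal pyarrow sketch (the file name is hypothetical):

    import pyarrow.parquet as pq

    # The footer carries a format_version string, but as noted above many
    # writers don't honor it, so it can't reliably distinguish v1 from v2:
    md = pq.ParquetFile("example.parquet").metadata
    print(md.format_version)  # e.g. "1.0"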
Correct: parquet-mr hard-codes the format version to 1, so how can we
identify whether a written Parquet file is V1 or V2?
I have asked the same question, but according to you there is none.
"As I have said in another thread, Parquet V2 is a concept which contains
a lot of features. FWIW, what are d
Spark leverages the Parquet writer from parquet-mr, which hard-codes the
format version to 1 [1] even when v2 features are enabled. That's why
I said on dev@parquet that we cannot really tell whether a Parquet file is
v1 or v2 simply from the format_version field.
[1]
https://github.com/apache/parquet-mr/b
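Since the footer field is unreliable, one heuristic (a sketch only; the file name is hypothetical) is to inspect the column chunk encodings, where v2-only encodings hint at a v2 writer:

    import pyarrow.parquet as pq

    md = pq.ParquetFile("spark_output.parquet").metadata
    # The footer still reports "1.0" even when v2 features were enabled...
    print(md.format_version)
    # ...so look at the per-column encodings instead; v2-only ones such as
    # DELTA_BINARY_PACKED suggest a v2 writer:
    print(md.row_group(0).column(0).encodings)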
> I believe several array implementations (e.g., numpy, R) are able to
> broadcast/recycle a length-1 array. Run-end-encoding is also an option that
> would make that broadcast explicit without expanding the scalar.
Some libraries behave this way, e.g. Polars, but others like Pandas and
cuDF only broa…
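To make the run-end-encoding idea above concrete, a small sketch (values are made up; requires a pyarrow recent enough to support run-end encoding); the broadcast stays implicit until a consumer decodes it:

    import pyarrow as pa
    import pyarrow.compute as pc

    # A length-5 broadcast of a single scalar, stored as one run:
    ree = pa.RunEndEncodedArray.from_arrays([5], [42])
    print(len(ree))  # 5 logical elements, one stored value

    # Expansion only happens when explicitly requested:
    print(pc.run_end_decode(ree).to_pylist())  # [42, 42, 42, 42, 42]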
I tried this option, but Spark is not creating V2 Parquet, as I can still
see "format_version: 1.0". I think it needs something else too.
On Wed, Apr 24, 2024 at 12:33 PM Adam Lippai wrote:
> It supports writing v2, but defaults to v1.
> hadoopConfiguration.set(“parquet.writer.version”, “v2”)
It supports writing v2, but defaults to v1.
hadoopConfiguration.set(“parquet.writer.version”, “v2”)
Best regards,
Adam Lippai
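For anyone following along in PySpark, an equivalent sketch (the output path is hypothetical); note that, per the messages above, the footer will still say format_version 1.0 even though v2 data pages are written:

    from pyspark.sql import SparkSession

    # The "spark.hadoop." prefix forwards the property into the Hadoop
    # configuration that parquet-mr reads:
    spark = (
        SparkSession.builder
        .config("spark.hadoop.parquet.writer.version", "v2")
        .getOrCreate()
    )
    spark.range(10).write.mode("overwrite").parquet("/tmp/v2_test")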
On Wed, Apr 24, 2024 at 11:40 Prem Sahoo wrote:
> They do support reading Parquet V2, but writing V2 is not supported by
> Spark.
>
> On Wed, Apr 24, 2024 at 11…
> Parquet "V2" (including the V2 data pages, and other details) and the
2.x.y releases of the format library artifact. They aren't the same
unfortunately
Oh wow, yeah that's really not clear. parquet.a.o doesn't have any
structured version information as far as I could see.
On Wed, Apr 24, 2024…
They do support reading Parquet V2, but writing V2 is not supported by
Spark.
On Wed, Apr 24, 2024 at 11:10 AM Adam Lippai wrote:
> Hi Wes,
>
> As far as I remember, Hive, Spark, Impala, DuckDB, or even proprietary
> systems like Hyper and Vertica all support reading data page v2 now. The most…
Hi Wes,
As far as I remember, Hive, Spark, Impala, DuckDB, or even proprietary
systems like Hyper and Vertica all support reading data page v2 now. The most
recent column encodings (BYTE_STREAM_SPLIT) might be missing, but overall
the support seems much better than a year or two ago.
Best regards,
Adam Lippai
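As a concrete illustration of those two features together, a pyarrow sketch (the file name is hypothetical) that writes v2 data pages with the BYTE_STREAM_SPLIT encoding:

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"f": pa.array([1.0, 2.5, 3.25], type=pa.float32())})

    pq.write_table(
        table,
        "bss.parquet",
        data_page_version="2.0",      # v2 data pages
        use_dictionary=False,         # required to force a column encoding
        column_encoding={"f": "BYTE_STREAM_SPLIT"},
    )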
As an outsider, I suspect the only reason for these "common beliefs" is that
Spark simply doesn't support some of the breaking features (e.g. the
nanoseconds data type). Maybe closing the very few gaps would resolve the
issue for good.
Best regards,
Adam Lippai
On Wed, Apr 24, 2024 at 10:32 Weston Pace wrote:
I think there is confusion between the Parquet "V2" (including the V2 data
pages, and other details) and the 2.x.y releases of the format library
artifact. They aren't the same, unfortunately. I don't think the V2 metadata
structures (the data pages in particular, and the new column encodings) are
widely ad…
> As per the Apache Parquet community, Parquet V2 is not final yet, so it is
> not official. They are advising not to use Parquet V2 for writing (though
> code is available).
This would be news to me. Parquet releases are listed (by the Parquet
community) at [1].
The vote to release parquet 2.10 i…
Due to unexpected issues (my computer died) I'll move the feature freeze to
early next week.
Thanks
Raúl
On Thu, Apr 18, 2024, 17:01, Raúl Cumplido wrote:
> Hi,
>
> As discussed on the mailing list [1], I plan to generate a new MINOR
> release, 16.1.0, to accommodate some features that missed t…
I definitely see the problem here: we don't currently provide a way
for something like Microsoft Excel, PowerBI, or Tableau to use ADBC
drivers without bundling all of the ones they want to support, or without
requiring/embedding Python or R. I also see how this is a particular
problem for Windows and Ma…
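One existing building block here is the ADBC driver manager, which loads driver shared libraries at runtime instead of bundling them. A Python sketch for illustration only (the problem above is precisely about host applications that cannot embed Python), assuming the SQLite ADBC driver is installed on the system:

    import adbc_driver_manager.dbapi

    # The driver manager resolves and loads the shared library by name:
    conn = adbc_driver_manager.dbapi.connect(driver="adbc_driver_sqlite")
    cur = conn.cursor()
    cur.execute("SELECT 1")
    print(cur.fetchone())
    cur.close()
    conn.close()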
Hello Jacob,
Thanks for the information, and my apologies for the weird format of my
email.
This is the email from the Parquet community. May I know why pyarrow is
using Parquet V2, which is not official yet?
My question is: according to the Parquet community, V2 is not final yet, so
it is not official yet.
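For context on the pyarrow side, the writer exposes the format version as a flag; a minimal sketch (file names hypothetical), noting that recent pyarrow releases default to "2.6":

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"x": [1, 2, 3]})

    # "1.0" restricts output to v1-era types and encodings; "2.6" (the
    # current default) permits the newer logical types:
    pq.write_table(table, "compat_v1.parquet", version="1.0")
    pq.write_table(table, "v26.parquet", version="2.6")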
"Hi
16 matches
Mail list logo