Re: [DISCUSSION] New Flags for Arrow C Interface Schema

2024-04-24 Thread Weston Pace
What should be done if a system doesn't have a record batch concept? For example, if I remember correctly, Velox works this way and only has a "row vector" (struct array) but no equivalent to record batch. Should these systems reject a record batch or should they just accept it as a struct array?

Re: Fwd: PyArrow Using Parquet V2

2024-04-24 Thread Gang Wu
The format_version field was designed for this purpose, but it is unfortunately not honored for now. I don't think adding a new similar tag/flag will solve the problem, because it does not fix the problem of legacy files, and it would take a lot of effort for all implementations to adopt the new tag/flag. IMO, I

Re: Fwd: PyArrow Using Parquet V2

2024-04-24 Thread Prem Sahoo
Correct, parquet-mr hard-codes the format version to 1, so how can we identify whether a written Parquet file is V1 or V2? I have asked the same question, but according to you there is none. "As I have said in another thread, Parquet V2 is a concept which contains a lot of features. FWIW, what are d

Re: Fwd: PyArrow Using Parquet V2

2024-04-24 Thread Gang Wu
Spark leverages parquet writer from parquet-mr, which hard-codes the format version to 1 [1] even when v2 features are enabled. That's why I said in dev@parquet that we cannot really tell if a parquet file is v1 or v2 simply from the format version field. [1] https://github.com/apache/parquet-mr/b

Re: [DISCUSSION] New Flags for Arrow C Interface Schema

2024-04-24 Thread Keith Kraus
> I believe several array implementations (e.g., numpy, R) are able to broadcast/recycle a length-1 array. Run-end encoding is also an option that would make that broadcast explicit without expanding the scalar. Some libraries behave this way, e.g. Polars, but others like Pandas and cuDF only broa

Re: Fwd: PyArrow Using Parquet V2

2024-04-24 Thread Prem Sahoo
I tried this option, but Spark is not creating V2 Parquet, as I can still see "format_version: 1.0". I think it needs something else too. On Wed, Apr 24, 2024 at 12:33 PM Adam Lippai wrote: > It supports writing v2, but defaults to v1. > hadoopConfiguration.set(“parquet.writer.version”, “v2

Re: Fwd: PyArrow Using Parquet V2

2024-04-24 Thread Adam Lippai
It supports writing v2, but defaults to v1. hadoopConfiguration.set(“parquet.writer.version”, “v2”) Best regards, Adam Lippai On Wed, Apr 24, 2024 at 11:40 Prem Sahoo wrote: > They do support Reading of Parquet V2 , but writing is not supported by > Spark for V2. > > On Wed, Apr 24, 2024 at 11
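[Editor's note: Adam's one-liner, routed through PySpark's standard `spark.hadoop.*` passthrough for Hadoop configuration. A configuration sketch only, untested against a live cluster; the output path is illustrative, and per Gang Wu's reply the footer will still read format_version 1.0 even with v2 pages enabled:]

```python
from pyspark.sql import SparkSession

# "spark.hadoop." prefixes a property into the Hadoop configuration
# that parquet-mr reads; "v2" requests the V2 writer/data pages.
spark = (
    SparkSession.builder
    .config("spark.hadoop.parquet.writer.version", "v2")
    .getOrCreate()
)

spark.range(10).write.mode("overwrite").parquet("/tmp/v2_output")
```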

Re: Fwd: PyArrow Using Parquet V2

2024-04-24 Thread Jacob Wujciak
> Parquet "V2" (including the V2 data pages, and other details) and the 2.x.y releases of the format library artifact. They aren't the same unfortunately Oh wow, yeah that's really not clear. parquet.a.o doesn't have any structured version information as far as I could see. Am Mi., 24. Apr. 2024

Re: Fwd: PyArrow Using Parquet V2

2024-04-24 Thread Prem Sahoo
They do support reading Parquet V2, but writing V2 is not supported by Spark. On Wed, Apr 24, 2024 at 11:10 AM Adam Lippai wrote: > Hi Wes, > > As far as I remember hive, spark, impala, duckdb or even proprietary > systems like hyper, Vertica all support reading data page v2 now. The mos

Re: Fwd: PyArrow Using Parquet V2

2024-04-24 Thread Adam Lippai
Hi Wes, As far as I remember, Hive, Spark, Impala, DuckDB, or even proprietary systems like Hyper and Vertica all support reading data page v2 now. The most recent column encodings (BYTE_STREAM_SPLIT) might be missing, but overall the support seems much better than a year or two ago. Best regards, Ada

Re: Fwd: PyArrow Using Parquet V2

2024-04-24 Thread Adam Lippai
As an outsider I suspect the only reason for these “common beliefs” is that Spark simply doesn’t support some of the breaking features (e.g. the nanoseconds data type). Maybe closing the very few gaps would resolve the issue for good. Best regards, Adam Lippai On Wed, Apr 24, 2024 at 10:32 Weston P

Re: Fwd: PyArrow Using Parquet V2

2024-04-24 Thread Wes McKinney
I think there is confusion between the Parquet "V2" format features (including the V2 data pages, and other details) and the 2.x.y releases of the format library artifact. They aren't the same, unfortunately. I don't think the V2 metadata structures (the data pages in particular, and the new column encodings) are widely ad

Re: Fwd: PyArrow Using Parquet V2

2024-04-24 Thread Weston Pace
> *As per Apache Parquet Community Parquet V2 is not final yet so it is not > official. They are advising not to use Parquet V2 for writing (though code > is available).* This would be news to me. Parquet releases are listed (by the Parquet community) at [1]. The vote to release parquet 2.10 i

Re: [INFO] Arrow 16.1.0 - MINOR release feature freeze 25th of April

2024-04-24 Thread Raúl Cumplido
Due to unexpected issues (my computer died) I'll move the feature freeze to early next week. Thanks Raúl El jue, 18 abr 2024, 17:01, Raúl Cumplido escribió: > Hi, > > As discussed on the mailing list [1] I plan to generate a new MINOR > release 16.1.0 to accommodate some features that missed t

Re: ADBC - OS-level driver manager

2024-04-24 Thread Dewey Dunnington
I definitely see the problem here: we don't currently provide a way for something like a Microsoft Excel or PowerBI or Tableau to use ADBC drivers without bundling all of the ones they want to support or requiring/embedding Python or R. I also see how this is a particular problem for Windows and Ma

Re: Fwd: PyArrow Using Parquet V2

2024-04-24 Thread Prem Sahoo
Hello Jacob, Thanks for the information, and my apologies for the weird format of my email. This is the email from the Parquet community. May I know why pyarrow is using Parquet V2, which is not official yet? My understanding from the Parquet community is that V2 is not final yet, so it is not official. "Hi