Re: [VOTE] Release Apache Arrow nanoarrow 0.5.0

2024-05-23 Thread Raúl Cumplido
+1 (binding) I've tested successfully on Ubuntu 22.04 without R. TEST_R=0 ./verify-release-candidate.sh 0.5.0 0 Regards, Raúl On Thu, May 23, 2024 at 6:49, David Li wrote: > > +1 (binding) > > Tested on Debian 12 'bookworm' > > On Thu, May 23, 2024, at 11:03, Sutou Kouhei wrote: > > +1

Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Dewey Dunnington
Thank you for the background! I understand that these statistics are important for query planning; however, I am not sure that I follow why we are constrained to the ArrowSchema to represent them. The examples given seem to be going through Python...would it be easier to request statistics at a higher

Re: [VOTE] Release Apache Arrow ADBC 12 - RC4

2024-05-23 Thread Dewey Dunnington
The adbcdrivermanager, adbcsqlite, and adbcpostgresql packages are all updated on CRAN! On Tue, May 21, 2024 at 10:41 PM David Li wrote: > > [x] Close the GitHub milestone/project > [x] Add the new release to the Apache Reporter System > [x] Upload source release artifacts to Subversion > [x] Cre

Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Curt Hagenlocher
> would it be easier to request statistics at a higher level of abstraction? What if there were a "single table provider" level of abstraction between ADBC and ArrowArrayStream as a C API; something that can report statistics and apply simple predicates? On Thu, May 23, 2024 at 5:57 AM Dewey Dun

Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Felipe Oliveira Carvalho
I want to +1 what Dewey is saying here and add some comments. Sutou Kouhei wrote: > ADBC may be a bit larger to use only for transmitting statistics. ADBC has > statistics related APIs but it has more other APIs. It's impossible to keep the responsibility of communication protocols cleanly separa

Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Antoine Pitrou
On 23/05/2024 at 16:09, Felipe Oliveira Carvalho wrote: Protocols that produce/consume statistics might want to use the C Data Interface as a primitive for passing Arrow arrays of statistics. This is also my opinion. I think what we are slowly converging on is the need for a spec to desc

Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Shoumyo Chakravorti (BLOOMBERG/ 120 PARK)
This is a really exciting development, thank you for putting together this proposal! It looks like this thread and the linked GitHub issue have lots of input from folks who work with Arrow at a low level and have better familiarity with the Arrow specifications than I do, so I'll refrain from co

Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Antoine Pitrou
Hi Shoumyo, The problem with communicating data statistics through schema metadata is that it's not compatible with use cases where you want to know the schema *before* the data is produced. Regards Antoine. On Thu, 23 May 2024 14:28:43 - "Shoumyo Chakravorti (BLOOMBERG/ 120 PARK)" wrot

Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Dewey Dunnington
Thanks Shoumyo for bringing this up! Using a schema to transmit statistical/data-dependent values is also something we do in GeoParquet (whose schema also finds its way into pyarrow and the C data interface when reading). It is usually fine but occasionally ends up with schema metadata that is lyin

[C++] Thread deadlock in ObjectOutputStream

2024-05-23 Thread Li Jin
Hello, I am seeing a deadlock when destructing an ObjectOutputStream. I have attached the stack trace. I did some debugging and found that the issue seems to be that the mutex in question is already held by this thread (I checked the __owner field in the pthread_mutex_t which points to the hangin

Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Shoumyo Chakravorti (BLOOMBERG/ 120 PARK)
Appreciate the additional context! > use cases where you want to know the schema *before* > the data is produced I think my understanding aligns with Dewey's on this point. I guess I'm struggling to imagine a scenario where a query planner would want the schema but not the statistics. Because by

Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Aldrin
For what it's worth, DuckDB accesses Arrow data via IPC in an extension, then exports to the C data interface to call into code in its core. Also, assumptions about when query optimization occurs relative to data access potentially break down in scenarios involving: views, distributed tables, substr

Re: [VOTE] Release Apache Arrow nanoarrow 0.5.0

2024-05-23 Thread Vibhatha Abeykoon
+1 (non-binding) I have tested on Ubuntu 22.04 ./verify-release-candidate.sh 0.5.0 0 With Regards, Vibhatha Abeykoon On Thu, May 23, 2024 at 3:21 PM Raúl Cumplido wrote: > +1 (binding) > > I've tested successfully on Ubuntu 22.04 without R. > > TEST_R=0 ./verify-release-candidate.sh 0.5.0 0