Re: [DISCUSS] Statistics through the C data interface

2024-10-22 Thread Sutou Kouhei
Hi, It seems that building the statistics schema isn't so complex. I'll proceed with this approach. Thanks, -- kou In <20241016.150837.525206251920899606@clear-code.com> "Re: [DISCUSS] Statistics through the C data interface" on Wed, 16 Oct 2024 15:08:37 +

Re: [DISCUSS] Statistics through the C data interface

2024-10-15 Thread Sutou Kouhei
a bit complex to build. What do you think about this? Thanks, -- kou In <20240805.183331.1066091419162501890@clear-code.com> "Re: [DISCUSS] Statistics through the C data interface" on Mon, 05 Aug 2024 18:33:31 +0900 (JST), Sutou Kouhei wrote: > Hi, > > I

Re: [DISCUSS] Statistics through the C data interface

2024-08-05 Thread Sutou Kouhei
Hi, I've opened a PR for documentation: https://github.com/apache/arrow/pull/43553 The "Example" section isn't written yet but suggestions are very welcome. Thanks, -- kou In <20240725.143518.421507820763165665....@clear-code.com> "Re: [DISCUSS] Statistics t

Re: [DISCUSS] Statistics through the C data interface

2024-07-24 Thread Sutou Kouhei
when I complete them. We can start a vote for this after we review the PR. Thanks, -- kou In <20240712.151536.312169170508271330@clear-code.com> "Re: [DISCUSS] Statistics through the C data interface" on Fri, 12 Jul 2024 15:15:36 +0900 (JST), Sutou Kouhei wrote: > H

Re: [DISCUSS] Statistics through the C data interface

2024-07-11 Thread Sutou Kouhei
atistics, I think that map or map, ...> is better as Antoine said. Because they are simpler than int32+utf8. Thanks, -- kou In "Re: [DISCUSS] Statistics through the C data interface" on Thu, 11 Jul 2024 14:17:46 -0300, Felipe Oliveira Carvalho wrote: > On Thu, Jul 11, 2024

Re: [DISCUSS] Statistics through the C data interface

2024-07-11 Thread Felipe Oliveira Carvalho
o leverage the fact that libraries handle unions gracefully, this could be: map, dense_union<...needed types based on stat kinds in the keys...>> X is either sparse or dense. A possible alternative is to use a custom struct instead of map and reduce the levels of nesting: struct> --

Re: [DISCUSS] Statistics through the C data interface

2024-07-11 Thread Sutou Kouhei
ase? map, dense_union<...needed types based on stat kinds in the keys...>> Thanks, -- kou In "Re: [DISCUSS] Statistics through the C data interface" on Mon, 1 Jul 2024 11:58:44 -0300, Felipe Oliveira Carvalho wrote: > Hi, > > You can promise that well-kno

Re: [DISCUSS] Statistics through the C data interface

2024-07-01 Thread Antoine Pitrou
e can use int32 for statistic keys. 2. We can use the special statistic key + a string identifier for non-standard statistic keys. What do you think about this? Thanks, -- kou In <20240606.182727.1004633558059795207@clear-code.com> "Re: [DISCUSS] Statistics through the C da
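
The two points quoted above (int32 keys for standard statistics, plus one special key paired with a string identifier for non-standard ones) could look roughly like the sketch below. This is only an illustration: the numeric values, the `CUSTOM_KEY` sentinel, and the `encode_key`/`decode_key` helpers are hypothetical, and the thread does not fix any of them.

```python
from typing import Optional, Tuple

# Hypothetical numeric IDs; the thread only says standard keys could be int32.
STANDARD_KEYS = {"max": 1, "min": 2, "distinct_count": 3}
# Hypothetical sentinel meaning "read the string identifier instead".
CUSTOM_KEY = 0


def encode_key(name: str) -> Tuple[int, Optional[str]]:
    """Return (int32 key, string identifier or None) for a statistic name."""
    if name in STANDARD_KEYS:
        return STANDARD_KEYS[name], None
    # Non-standard statistic: special key plus the name as identifier.
    return CUSTOM_KEY, name


def decode_key(key: int, identifier: Optional[str]) -> str:
    """Inverse of encode_key."""
    if key == CUSTOM_KEY:
        if identifier is None:
            raise ValueError("the special key requires a string identifier")
        return identifier
    for name, value in STANDARD_KEYS.items():
        if value == key:
            return name
    raise KeyError(key)
```

With this layout a consumer can match standard statistics by integer and only needs string comparison for vendor-specific ones such as a "vendor1:"-prefixed name.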

Re: [DISCUSS] Statistics through the C data interface

2024-07-01 Thread Felipe Oliveira Carvalho
statistic keys or use prefix such as >"vendor1:"/"vendor1." for non-standard statistic keys. > > Here is Felipe's idea: > https://lists.apache.org/thread/gr2nmlrwr7d5wkz3zgq6vy5q0ow8xof2 > > 1. We can use int32 for statistic keys. > 2. We can use t

Re: [DISCUSS] Statistics through the C data interface

2024-06-20 Thread Sutou Kouhei
for non-standard statistic keys. What do you think about this? Thanks, -- kou In <20240606.182727.1004633558059795207....@clear-code.com> "Re: [DISCUSS] Statistics through the C data interface" on Thu, 06 Jun 2024 18:27:27 +0900 (JST), Sutou Kouhei wrote: > Hi, >

Re: [DISCUSS] Statistics through the C data interface

2024-06-14 Thread Felipe Oliveira Carvalho
On Sun, Jun 9, 2024 at 7:53 PM Sutou Kouhei wrote: > > Hi, > > In > "Re: [DISCUSS] Statistics through the C data interface" on Sun, 9 Jun 2024 > 22:11:54 +0200, > Antoine Pitrou wrote: > > >>>> Fields: > >

Re: [DISCUSS] Statistics through the C data interface

2024-06-11 Thread Sutou Kouhei
terchanged through the C data interface. Do you want to use it to determine whether pushdown is used or not? Thanks, -- kou In "Re: [DISCUSS] Statistics through the C data interface" on Sun, 9 Jun 2024 18:29:19 -0400, Adam Lippai wrote: > It’s not strictly statistics, bu

Re: [DISCUSS] Statistics through the C data interface

2024-06-10 Thread Sutou Kouhei
Hi, In "Re: [DISCUSS] Statistics through the C data interface" on Sun, 9 Jun 2024 22:07:01 +0200, Antoine Pitrou wrote: >> How about reserving a specific range (e.g. 1-2) for >> vendor-specific statistics? > > This would be quite annoying to work

Re: [DISCUSS] Statistics through the C data interface

2024-06-09 Thread Adam Lippai
considering it for a minute. Best regards, Adam Lippai On Sun, Jun 9, 2024 at 17:36 Sutou Kouhei wrote: > Hi, > > In > "Re: [DISCUSS] Statistics through the C data interface" on Sun, 9 Jun > 2024 22:11:54 +0200, > Antoine Pitrou wrote: > > >>

Re: [DISCUSS] Statistics through the C data interface

2024-06-09 Thread Sutou Kouhei
Hi, In "Re: [DISCUSS] Statistics through the C data interface" on Sun, 9 Jun 2024 22:11:54 +0200, Antoine Pitrou wrote: >>>> Fields: >>>> | Name | Type | Comments | >>>> ||--

Re: [DISCUSS] Statistics through the C data interface

2024-06-09 Thread Antoine Pitrou
On 09/06/2024 at 08:33, Sutou Kouhei wrote: Fields: | Name | Type | Comments | ||---| | | column | utf8 | (2) | | key | utf8 not null | (3) | 1. Should the key be

Re: [DISCUSS] Statistics through the C data interface

2024-06-09 Thread Antoine Pitrou
On 09/06/2024 at 09:01, Sutou Kouhei wrote: Hi, One thing that a plain integer makes more difficult is representing non-standard statistics. For example some engine might want to expose elaborate quantile-based statistics even if it is not officially defined here. With a `utf8` or `dictionary(

Re: [DISCUSS] Statistics through the C data interface

2024-06-09 Thread Sutou Kouhei
e range aren't globally unique but global uniqueness may not be needed in the specific producer-consumer communication. Thanks, -- kou In "Re: [DISCUSS] Statistics through the C data interface" on Fri, 7 Jun 2024 10:05:48 +0200, Antoine Pitrou wrote: > > On 07/06/2024

Re: [DISCUSS] Statistics through the C data interface

2024-06-08 Thread Sutou Kouhei
hink that we can't mix map< // the column index or null if the statistics refer to whole table or batch column: int32, map> > and map< // the column indexes or empty if the statistics refer to whole table or batch column: list, map> > . Thanks, -- kou In

Re: [DISCUSS] Statistics through the C data interface

2024-06-08 Thread Sutou Kouhei
ld also be used for the other > statistics in addition to a row count if the array is not a struct > array? I didn't think of it. Thanks. It makes sense. Thanks, -- kou In "Re: [DISCUSS] Statistics through the C data interface" on Thu, 6 Jun 2024 22:06:41 -0300, Dewe

Re: [DISCUSS] Statistics through the C data interface

2024-06-08 Thread Sutou Kouhei
Hi, We can use 4. for per-batch statistics, because 4. uses a separate API call. Users can design a separate API call for per-batch statistics. Thanks, -- kou In "Re: [DISCUSS] Statistics through the C data interface" on Thu, 6 Jun 2024 13:14:08 +0200, Alessandro Molina w

Re: [DISCUSS] Statistics through the C data interface

2024-06-08 Thread Sutou Kouhei
9 | {Chicago, IL} | {f,f} | 0.001333 |5.1e-05 >... > (99 rows) It may be complex to support full multi-column statistics use cases. How about standardizing this without multi-column statistics support for the first version? We can add support for multi-column sta

Re: [DISCUSS] Statistics through the C data interface

2024-06-08 Thread Felipe Oliveira Carvalho
> I just used quantiles as an example of a statistic that's not in the current > proposed spec, but that some engines would like to expose. All statistics are optional so we can always add more to the spec. > In other words, a plain integer makes extensibility more difficult than a > string. O

Re: [DISCUSS] Statistics through the C data interface

2024-06-08 Thread Antoine Pitrou
On 07/06/2024 at 18:30, Felipe Oliveira Carvalho wrote: On Fri, Jun 7, 2024 at 6:24 AM Antoine Pitrou wrote: On 07/06/2024 at 04:27, Felipe Oliveira Carvalho wrote: I've been thinking about how to encode statistics on Arrow arrays and how to keep the set of statistics known by both pr

Re: [DISCUSS] Statistics through the C data interface

2024-06-07 Thread Felipe Oliveira Carvalho
On Fri, Jun 7, 2024 at 6:24 AM Antoine Pitrou wrote: > > > On 07/06/2024 at 04:27, Felipe Oliveira Carvalho wrote: > > I've been thinking about how to encode statistics on Arrow arrays and > > how to keep the set of statistics known by both producers and > > consumers (i.e. standardized). > > >

Re: [DISCUSS] Statistics through the C data interface

2024-06-07 Thread Antoine Pitrou
On 07/06/2024 at 04:27, Felipe Oliveira Carvalho wrote: I've been thinking about how to encode statistics on Arrow arrays and how to keep the set of statistics known by both producers and consumers (i.e. standardized). The statistics array(s) could be a map< // the column index or n

Re: [DISCUSS] Statistics through the C data interface

2024-06-06 Thread Felipe Oliveira Carvalho
I've been thinking about how to encode statistics on Arrow arrays and how to keep the set of statistics known by both producers and consumers (i.e. standardized). The statistics array(s) could be a map< // the column index or null if the statistics refer to whole table or batch column:

Re: [DISCUSS] Statistics through the C data interface

2024-06-06 Thread Dewey Dunnington
Thank you for collecting all of our opinions on this! I also agree that (4) is the best option. > Fields: > > | Name | Type | Comments | > ||---| | > | column | utf8 | (2) | The utf8 type would p

Re: [DISCUSS] Statistics through the C data interface

2024-06-06 Thread Alessandro Molina
I brought it up on GitHub, but writing here too to avoid spawning too many threads. https://github.com/apache/arrow/issues/38837#issuecomment-2145343755 It's not something we have to address now, but it would be great if we could design a solution that can be extended in the future to add Per-Batc

Re: [DISCUSS] Statistics through the C data interface

2024-06-06 Thread Antoine Pitrou
Hi Kou, Thanks for pushing for this! On 06/06/2024 at 11:27, Sutou Kouhei wrote: 4. Standardize Apache Arrow schema for statistics and transmit statistics via a separate API call that uses the C data interface [...] I think that 4. is the best approach among these candidates. I agr

Re: [DISCUSS] Statistics through the C data interface

2024-06-06 Thread Sutou Kouhei
Hi, Thanks for sharing your comments. Here is a summary so far: Use cases: * Optimize query plan: e.g. JOIN for DuckDB Out of scope: * Transmit statistics through something other than the C data interface Examples: * Transmit statistics through Apache Arrow IPC file * Transmit statistics through Ap

Re: [DISCUSS] Statistics through the C data interface

2024-05-31 Thread Sutou Kouhei
cuss this use case in a separate thread. It may be better that we start a new thread for it after we complete this thread. In the separate thread, we will use the conclusion of this thread. Thanks, -- kou In "Re: [DISCUSS] Statistics through the C data interface" on Fri, 31 May 2024

Re: [DISCUSS] Statistics through the C data interface

2024-05-31 Thread Raphael Taylor-Davies
ata interface to also depend on the IPC format. JSON sounds >more reasonable in that case. > >Shoumyo > >From: dev@arrow.apache.org At: 05/29/24 02:02:23 UTC-4:00 To: >dev@arrow.apache.org >Subject: Re: [DISCUSS] Statistics through the C data interface > >>Hi, > >

Re: [DISCUSS] Statistics through the C data interface

2024-05-31 Thread Shoumyo Chakravorti (BLOOMBERG/ 120 PARK)
, \"value\": 29, >\"value_type\": \"uint64\", \"is_approximate\": false } ]" >> } >> } >> >> It's more verbose, but more closely mirrors the Arrow array >> schema defined for statistics getter APIs. This co

Re: [DISCUSS] Statistics through the C data interface

2024-05-28 Thread Sutou Kouhei
re closely mirrors the Arrow array > schema defined for statistics getter APIs. This could make it > easier to translate between the two. Thanks. I didn't think of it. It makes sense. Thanks, -- kou In <665673b500015f5808ce0...@message.bloomberg.net> "Re: [DISCUSS] Statis

Re: [DISCUSS] Statistics through the C data interface

2024-05-28 Thread Shoumyo Chakravorti (BLOOMBERG/ 120 PARK)
verbose, but more closely mirrors the Arrow array schema defined for statistics getter APIs. This could make it easier to translate between the two. Thanks, Shoumyo From: dev@arrow.apache.org At: 05/26/24 21:48:52 UTC-4:00 To: dev@arrow.apache.org Subject: Re: [DISCUSS] Statistics through the C data in

Re: [DISCUSS] Statistics through the C data interface

2024-05-26 Thread Sutou Kouhei
" metadata when we use "ARROW:statistics" for statistics. Thanks, -- kou In "Re: [DISCUSS] Statistics through the C data interface" on Thu, 23 May 2024 15:14:49 -0300, Dewey Dunnington wrote: > Thanks Shoumyo for bringing this up! > > Using a schema to transmit s

Re: [DISCUSS] Statistics through the C data interface

2024-05-26 Thread Sutou Kouhei
> > |--|---| | >> > | column_name | utf8 | (1) | >> > | statistic_key| utf8 not null | (2) | >> > | statistic_value | VALUE_SCHEMA not null |

Re: [DISCUSS] Statistics through the C data interface

2024-05-26 Thread Sutou Kouhei
s "max", "min", >> >"byte_width" and "distinct_count" but users can also use >> >application specific keys. >> > 3. If true, then the value is approximate or best-effort. >> > >> > VALUE_SCHEMA is a dense un

Re: [DISCUSS] Statistics through the C data interface

2024-05-24 Thread Weston Pace
tistics()? Or does it propose that > we define one more Arrow C XXX interface that wraps > ArrowArrayStream like ArrowDeviceArray wraps ArrowArray? > > ArrowDeviceArray: > https://arrow.apache.org/docs/format/CDeviceDataInterface.html > > > Thanks, > -- > kou >

Re: [DISCUSS] Statistics through the C data interface

2024-05-24 Thread Sutou Kouhei
? ArrowDeviceArray: https://arrow.apache.org/docs/format/CDeviceDataInterface.html Thanks, -- kou In "Re: [DISCUSS] Statistics through the C data interface" on Thu, 23 May 2024 06:55:40 -0700, Curt Hagenlocher wrote: >> would it be easier to request statistics at a higher level

Re: [DISCUSS] Statistics through the C data interface

2024-05-24 Thread Sutou Kouhei
ch for statistics key. It requires an additional ID-to-name mapping for application-specific statistics keys. We can use just names for them. See also the related discussion on the issue: https://github.com/apache/arrow/issues/38837#issuecomment-2108895904 Thanks, -- kou In "Re: [DISCUSS] Statistic

Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Aldrin
e to standardize it as canonical metadata. > > > > I say this all as a happy user of DuckDB's Arrow scan functionality that is excited to see better query optimization capabilities. It's just that, in its current form, the changes in this proposal are not something I could f

Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Shoumyo Chakravorti (BLOOMBERG/ 120 PARK)
eemed like one approach that would achieve this goal. Best, Shoumyo From: dev@arrow.apache.org At: 05/23/24 14:16:32 UTC-4:00 To: dev@arrow.apache.org Subject: Re: [DISCUSS] Statistics through the C data interface Thanks Shoumyo for bringing this up! Using a schema to transmit statistica/data dependent val

Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Dewey Dunnington
all as a happy user of DuckDB's Arrow scan functionality that is > > excited to see better query optimization capabilities. It's just that, in > > its current form, the changes in this proposal are not something I could > > foreseeably integrate with. > > > >

Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Antoine Pitrou
timization capabilities. It's just that, in its > current form, the changes in this proposal are not something I could > foreseeably integrate with. > > Best, > Shoumyo > > [1]: > https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata > > F

Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Shoumyo Chakravorti (BLOOMBERG/ 120 PARK)
4:00 To: dev@arrow.apache.org Subject: Re: [DISCUSS] Statistics through the C data interface I want to +1 on what Dewey is saying here and some comments. Sutou Kouhei wrote: > ADBC may be a bit larger to use only for transmitting statistics. ADBC has statistics related APIs but it has more other A

Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Antoine Pitrou
On 23/05/2024 at 16:09, Felipe Oliveira Carvalho wrote: Protocols that produce/consume statistics might want to use the C Data Interface as a primitive for passing Arrow arrays of statistics. This is also my opinion. I think what we are slowly converging on is the need for a spec to desc

Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Felipe Oliveira Carvalho
is a dense union with members: > > > > | Field Name | Field Type | > > |--------|----| > > | int64 | int64 | > > | uint64 | uint64 | > > | float64| float64| > > | binary | binary | > > > > If a co

Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Curt Hagenlocher
an also use > >application specific keys. > > 3. If true, then the value is approximate or best-effort. > > > > VALUE_SCHEMA is a dense union with members: > > > > | Field Name | Field Type | > > |--------|----| > > | int64 | int64
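
One statistics row, as described in the quoted proposal (a column name, a non-null key, a non-null value drawn from the dense union, and an approximate flag), can be modelled in plain Python as a rough sketch. The class name `StatisticRow` and the `is_well_known` helper are hypothetical. The `is_approximate` field name is borrowed from the JSON example elsewhere in this thread, and a Python `Union` only stands in for the Arrow dense union.

```python
from dataclasses import dataclass
from typing import Optional, Union

# Key names mentioned in the thread; application-specific keys are also allowed.
WELL_KNOWN_KEYS = {"max", "min", "byte_width", "distinct_count"}


@dataclass(frozen=True)
class StatisticRow:
    column_name: Optional[str]                  # nullable in the proposal
    statistic_key: str                          # "utf8 not null"
    statistic_value: Union[int, float, bytes]   # stand-in for int64/uint64/float64/binary
    is_approximate: bool = False                # True means approximate or best-effort

    def __post_init__(self) -> None:
        # Enforce the "not null" constraint on the key.
        if not self.statistic_key:
            raise ValueError("statistic_key must not be null or empty")

    @property
    def is_well_known(self) -> bool:
        """True when the key is one of the names listed in the thread."""
        return self.statistic_key in WELL_KNOWN_KEYS
```

For example, `StatisticRow("price", "max", 29)` describes an exact maximum, while `StatisticRow("price", "vendor1:p99", 27.5, is_approximate=True)` carries an application-specific, best-effort value.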

Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Dewey Dunnington
4 | > | float64| float64| > | binary | binary | > > If a column is an int32 column, it uses int64 for > "max"/"min". We don't provide all types here. Users should > use a compatible type (int64 for an int32 column) instead. > > > Th

Re: [DISCUSS] Statistics through the C data interface

2024-05-22 Thread Sutou Kouhei
| binary | If a column is an int32 column, it uses int64 for "max"/"min". We don't provide all types here. Users should use a compatible type (int64 for an int32 column) instead. Thanks, -- kou In "Re: [DISCUSS] Statistics through the C data interface" on
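
The "use a compatible type" rule above can be sketched as a small lookup from a column's type to the union member that would carry its "max"/"min". Only the int32-to-int64 case comes from the thread; the rest of the table, the type-name strings, and the `union_member_for` helper are assumptions made for illustration.

```python
# Assumed widening table: only "int64 for an int32 column" is stated in the thread.
_UNION_MEMBER = {
    **{t: "int64" for t in ("int8", "int16", "int32", "int64")},
    **{t: "uint64" for t in ("uint8", "uint16", "uint32", "uint64")},
    **{t: "float64" for t in ("float16", "float32", "float64")},
    "binary": "binary",
}


def union_member_for(column_type: str) -> str:
    """Pick the statistics union member that can hold values of column_type."""
    try:
        return _UNION_MEMBER[column_type]
    except KeyError:
        raise ValueError(
            f"no compatible statistics value type for {column_type!r}"
        ) from None
```

Widening to the largest member of each family keeps the union small, at the cost of the consumer having to narrow the value back to the column's own type.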

Re: [DISCUSS] Statistics through the C data interface

2024-05-22 Thread Sutou Kouhei
ing a schema for statistics ArrowArray? It doesn't define how to get the statistics ArrowArray like the Arrow C data interface but defining a schema for statistics ArrowArray will improve how to transmit statistics. Thanks, -- kou In "Re: [DISCUSS] Statistics through the C data int

Re: [DISCUSS] Statistics through the C data interface

2024-05-22 Thread Sutou Kouhei
tistics [1], Thanks for sharing this. It seems that DataFusion supports only pre-defined statistics. Does DataFusion ever require application-specific statistics? I'm not sure whether we should support application-specific statistics or not. Thanks, -- kou In "Re: [DISCUSS] Sta

Re: [DISCUSS] Statistics through the C data interface

2024-05-22 Thread Antoine Pitrou
Hi Kou, I agree with Dewey that this is overstretching the capabilities of the C Data Interface. In particular, stuffing a pointer as metadata value and decreeing it immortal doesn't sound like a good design decision. Why not simply pass the statistics ArrowArray separately in your produce

Re: [DISCUSS] Statistics through the C data interface

2024-05-22 Thread Dewey Dunnington
I am definitely in favor of adding (or adopting an existing) ABI-stable way to transmit statistics (the one that comes up most frequently for me is just the number of values that are about to show up in an ArrowArrayStream, since the producer often knows this and the consumer often would like to pr

Re: [DISCUSS] Statistics through the C data interface

2024-05-22 Thread Raphael Taylor-Davies
Hi, One potential challenge with encoding statistics in the schema metadata is that some systems may consider this metadata as part of assessing schema equivalence. However, I think the bigger question is what the intended use-case for these statistics is? Often query engines want to collect