Hi,
It seems that building the statistics schema isn't so
complex. I'll proceed with this approach.
Thanks,
--
kou
In <20241016.150837.525206251920899606@clear-code.com>
"Re: [DISCUSS] Statistics through the C data interface" on Wed, 16 Oct 2024
15:08:37 +
a bit complex to build.
What do you think about this?
Thanks,
--
kou
In <20240805.183331.1066091419162501890@clear-code.com>
"Re: [DISCUSS] Statistics through the C data interface" on Mon, 05 Aug 2024
18:33:31 +0900 (JST),
Sutou Kouhei wrote:
> Hi,
>
> I
Hi,
I've opened a PR for documentation:
https://github.com/apache/arrow/pull/43553
The "Example" section isn't written yet but suggestions are
very welcome.
Thanks,
--
kou
In <20240725.143518.421507820763165665....@clear-code.com>
"Re: [DISCUSS] Statistics t
when I
complete them. We can start a vote for this after we review
the PR.
Thanks,
--
kou
In <20240712.151536.312169170508271330@clear-code.com>
"Re: [DISCUSS] Statistics through the C data interface" on Fri, 12 Jul 2024
15:15:36 +0900 (JST),
Sutou Kouhei wrote:
> H
atistics, I think that
map or map, ...> is
better as Antoine said. Because they are simpler than
int32+utf8.
Thanks,
--
kou
In
"Re: [DISCUSS] Statistics through the C data interface" on Thu, 11 Jul 2024
14:17:46 -0300,
Felipe Oliveira Carvalho wrote:
> On Thu, Jul 11, 2024
o leverage the fact that libraries handle
unions gracefully, this could be:
map, dense_union<...needed types based on stat kinds
in the keys...>>
X is either sparse or dense.
A possible alternative is to use a custom struct instead of map and reduce
the levels of nesting:
struct>
--
ase?
map,
dense_union<...needed types based on stat kinds in the keys...>>
Thanks,
--
kou
In
"Re: [DISCUSS] Statistics through the C data interface" on Mon, 1 Jul 2024
11:58:44 -0300,
Felipe Oliveira Carvalho wrote:
> Hi,
>
> You can promise that well-kno
e can use int32 for statistic keys.
2. We can use the special statistic key + a string identifier
for non-standard statistic keys.
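
The two points above can be sketched in a few lines of Python. This is only an illustration, not the schema the thread settled on: the integer values in STANDARD_KEYS, the CUSTOM_KEY sentinel, and the helper names are all hypothetical. Only the statistic names ("max", "min", "distinct_count") and the "vendor1:" prefix style appear elsewhere in this thread.

```python
from typing import NamedTuple, Optional

# Hypothetical numbering: the thread does not assign concrete values.
STANDARD_KEYS = {"max": 0, "min": 1, "distinct_count": 2}
# Hypothetical "special" key that marks a non-standard statistic.
CUSTOM_KEY = -1


class StatisticKey(NamedTuple):
    key: int              # int32 statistic key
    name: Optional[str]   # string identifier, set only for non-standard keys


def encode_key(name: str) -> StatisticKey:
    """Map a statistic name to (int32 key, optional string identifier)."""
    if name in STANDARD_KEYS:
        return StatisticKey(STANDARD_KEYS[name], None)
    return StatisticKey(CUSTOM_KEY, name)


def decode_key(encoded: StatisticKey) -> str:
    """Recover the statistic name from an encoded key."""
    if encoded.key == CUSTOM_KEY:
        if encoded.name is None:
            raise ValueError("non-standard key needs a string identifier")
        return encoded.name
    for name, key in STANDARD_KEYS.items():
        if key == encoded.key:
            return name
    raise ValueError(f"unknown statistic key: {encoded.key}")
```

With this layout, consumers can switch on the integer for standard statistics and only look at the string when they see the special key.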
What do you think about this?
Thanks,
--
kou
In <20240606.182727.1004633558059795207@clear-code.com>
"Re: [DISCUSS] Statistics through the C da
statistic keys or use prefix such as
>"vendor1:"/"vendor1." for non-standard statistic keys.
>
> Here is Felipe's idea:
> https://lists.apache.org/thread/gr2nmlrwr7d5wkz3zgq6vy5q0ow8xof2
>
> 1. We can use int32 for statistic keys.
> 2. We can use t
for non-standard statistic keys.
What do you think about this?
Thanks,
--
kou
In <20240606.182727.1004633558059795207....@clear-code.com>
"Re: [DISCUSS] Statistics through the C data interface" on Thu, 06 Jun 2024
18:27:27 +0900 (JST),
Sutou Kouhei wrote:
> Hi,
>
On Sun, Jun 9, 2024 at 7:53 PM Sutou Kouhei wrote:
>
> Hi,
>
> In
> "Re: [DISCUSS] Statistics through the C data interface" on Sun, 9 Jun 2024
> 22:11:54 +0200,
> Antoine Pitrou wrote:
>
> >>>> Fields:
> >
terchanged through the C
data interface. Do you want to use it to determine whether
pushdown is used or not?
Thanks,
--
kou
In
"Re: [DISCUSS] Statistics through the C data interface" on Sun, 9 Jun 2024
18:29:19 -0400,
Adam Lippai wrote:
> It’s not strictly statistics, bu
Hi,
In
"Re: [DISCUSS] Statistics through the C data interface" on Sun, 9 Jun 2024
22:07:01 +0200,
Antoine Pitrou wrote:
>> How about reserving a specific range (e.g. 1-2) for
>> vendor-specific statistics?
>
> This would be quite annoying to work
considering it for a minute.
Best regards,
Adam Lippai
On Sun, Jun 9, 2024 at 17:36 Sutou Kouhei wrote:
> Hi,
>
> In
> "Re: [DISCUSS] Statistics through the C data interface" on Sun, 9 Jun
> 2024 22:11:54 +0200,
> Antoine Pitrou wrote:
>
> >>>
Hi,
In
"Re: [DISCUSS] Statistics through the C data interface" on Sun, 9 Jun 2024
22:11:54 +0200,
Antoine Pitrou wrote:
>>>> Fields:
>>>> | Name | Type | Comments |
>>>> ||--
Le 09/06/2024 à 08:33, Sutou Kouhei a écrit :
Fields:
| Name   | Type          | Comments |
|--------|---------------|----------|
| column | utf8          | (2)      |
| key    | utf8 not null | (3)      |
1. Should the key be
Le 09/06/2024 à 09:01, Sutou Kouhei a écrit :
Hi,
One thing that a plain integer makes more difficult is representing
non-standard statistics. For example some engine might want to expose
elaborate quantile-based statistics even if it is not officially defined
here. With a `utf8` or `dictionary(
e range aren't
globally unique but global uniqueness may not be needed in the
specific producer-consumer communication.
Thanks,
--
kou
In
"Re: [DISCUSS] Statistics through the C data interface" on Fri, 7 Jun 2024
10:05:48 +0200,
Antoine Pitrou wrote:
>
> Le 07/06/2024
hink that we can't mix
map<
// the column index or null if the statistics refer to whole table or batch
column: int32,
map>
>
and
map<
// the column indexes or empty if the statistics refer to whole table or batch
column: list,
map>
>
.
Thanks,
--
kou
In
ld also be used for the other
> statistics in addition to a row count if the array is not a struct
> array?
I didn't think of it. Thanks. It makes sense.
Thanks,
--
kou
In
"Re: [DISCUSS] Statistics through the C data interface" on Thu, 6 Jun 2024
22:06:41 -0300,
Dewe
Hi,
We can use 4. for per-batch statistics because 4. uses a
separate API call. Users can design the separate API call
for per-batch statistics.
Thanks,
--
kou
In
"Re: [DISCUSS] Statistics through the C data interface" on Thu, 6 Jun 2024
13:14:08 +0200,
Alessandro Molina w
9 | {Chicago, IL} | {f,f} | 0.001333 |5.1e-05
>...
> (99 rows)
It may be complex to support full multi-column statistics
use cases. How about standardizing this without
multi-column statistics support for the first version? We
can add support for multi-column sta
> I just used quantiles as an example of a statistic that's not in the current
> proposed spec, but that some engines would like to expose.
All statistics are optional so we can always add more to the spec.
> In other words, a plain integer makes extensibility more difficult than a
> string.
O
Le 07/06/2024 à 18:30, Felipe Oliveira Carvalho a écrit :
On Fri, Jun 7, 2024 at 6:24 AM Antoine Pitrou wrote:
Le 07/06/2024 à 04:27, Felipe Oliveira Carvalho a écrit :
I've been thinking about how to encode statistics on Arrow arrays and
how to keep the set of statistics known by both pr
On Fri, Jun 7, 2024 at 6:24 AM Antoine Pitrou wrote:
>
>
> Le 07/06/2024 à 04:27, Felipe Oliveira Carvalho a écrit :
> > I've been thinking about how to encode statistics on Arrow arrays and
> > how to keep the set of statistics known by both producers and
> > consumers (i.e. standardized).
> >
>
Le 07/06/2024 à 04:27, Felipe Oliveira Carvalho a écrit :
I've been thinking about how to encode statistics on Arrow arrays and
how to keep the set of statistics known by both producers and
consumers (i.e. standardized).
The statistics array(s) could be a
map<
// the column index or n
I've been thinking about how to encode statistics on Arrow arrays and
how to keep the set of statistics known by both producers and
consumers (i.e. standardized).
The statistics array(s) could be a
map<
// the column index or null if the statistics refer to whole table or batch
column:
Thank you for collecting all of our opinions on this! I also agree
that (4) is the best option.
> Fields:
>
> | Name   | Type | Comments |
> |--------|------|----------|
> | column | utf8 | (2)      |
The utf8 type would p
I brought it up on Github, but writing here too to avoid spawning too many
threads.
https://github.com/apache/arrow/issues/38837#issuecomment-2145343755
It's not something we have to address now, but it would be great if we
could design a solution that can be extended in the future to add Par-Batc
Hi Kou,
Thanks for pushing for this!
Le 06/06/2024 à 11:27, Sutou Kouhei a écrit :
4. Standardize Apache Arrow schema for statistics and
transmit statistics via separated API call that uses the
C data interface
[...]
I think that 4. is the best approach in these candidates.
I agr
Hi,
Thanks for sharing your comments. Here is a summary so far:
Use cases:
* Optimize query plan: e.g. JOIN for DuckDB
Out of scope:
* Transmit statistics through means other than the C data interface
Examples:
* Transmit statistics through Apache Arrow IPC file
* Transmit statistics through Ap
cuss this use case in a separate thread. It may be
better that we start a new thread for it after we complete
this thread. In the separate thread, we will use the
conclusion of this thread.
Thanks,
--
kou
In
"Re: [DISCUSS] Statistics through the C data interface" on Fri, 31 May 2024
ata interface to also depend on the IPC format. JSON sounds
>more reasonable in that case.
>
>Shoumyo
>
>From: dev@arrow.apache.org At: 05/29/24 02:02:23 UTC-4:00 To:
>dev@arrow.apache.org
>Subject: Re: [DISCUSS] Statistics through the C data interface
>
>>Hi,
>
>
", \"value\": 29,
>\"value_type\": \"uint64\", \"is_approximate\": false } ]"
>> }
>> }
>>
>> It's more verbose, but more closely mirrors the Arrow array
>> schema defined for statistics getter APIs. This co
re closely mirrors the Arrow array
> schema defined for statistics getter APIs. This could make it
> easier to translate between the two.
Thanks. I didn't think of it.
It makes sense.
Thanks,
--
kou
In <665673b500015f5808ce0...@message.bloomberg.net>
"Re: [DISCUSS] Statis
verbose, but more closely mirrors the Arrow array
schema defined for statistics getter APIs. This could make it
easier to translate between the two.
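
A minimal sketch of what such a JSON-encoded metadata value could look like, assuming the "ARROW:statistics" metadata key mentioned elsewhere in this thread. The "value", "value_type" and "is_approximate" fields come from the quoted example. The "key" field name and the pairing with "distinct_count" are assumptions made for illustration, and the metadata is modeled as a plain dict rather than a real Arrow schema.

```python
import json

# One entry per statistic. "key" is a hypothetical field name; the other
# three fields mirror the JSON example quoted in this thread.
stats = [
    {
        "key": "distinct_count",
        "value": 29,
        "value_type": "uint64",
        "is_approximate": False,
    }
]

# Producer side: metadata values are strings, so the list is serialized.
metadata = {"ARROW:statistics": json.dumps(stats)}

# Consumer side: parse the string back into records.
decoded = json.loads(metadata["ARROW:statistics"])
for entry in decoded:
    print(entry["key"], entry["value"], entry["value_type"], entry["is_approximate"])
```

Because each record carries the same fields as a row of the statistics array, converting between the two forms is a field-by-field copy.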
Thanks,
Shoumyo
From: dev@arrow.apache.org At: 05/26/24 21:48:52 UTC-4:00 To:
dev@arrow.apache.org
Subject: Re: [DISCUSS] Statistics through the C data in
"
metadata when we use "ARROW:statistics" for statistics.
Thanks,
--
kou
In
"Re: [DISCUSS] Statistics through the C data interface" on Thu, 23 May 2024
15:14:49 -0300,
Dewey Dunnington wrote:
> Thanks Shoumyo for bringing this up!
>
> Using a schema to transmit s
> > |--|---| |
>> > | column_name | utf8 | (1) |
>> > | statistic_key| utf8 not null | (2) |
>> > | statistic_value | VALUE_SCHEMA not null |
s "max", "min",
>> >"byte_width" and "distinct_count" but users can also use
>> >application specific keys.
>> > 3. If true, then the value is approximate or best-effort.
>> >
>> > VALUE_SCHEMA is a dense un
tistics()? Or does it propose that
> we define one more Arrow C XXX interface that wraps
> ArrowArrayStream like ArrowDeviceArray wraps ArrowArray?
>
> ArrowDeviceArray:
> https://arrow.apache.org/docs/format/CDeviceDataInterface.html
>
>
> Thanks,
> --
> kou
>
?
ArrowDeviceArray:
https://arrow.apache.org/docs/format/CDeviceDataInterface.html
Thanks,
--
kou
In
"Re: [DISCUSS] Statistics through the C data interface" on Thu, 23 May 2024
06:55:40 -0700,
Curt Hagenlocher wrote:
>> would it be easier to request statistics at a higher level
ch for statistics
key. It requires an additional ID and name mapping for
application-specific statistics keys. We can use just the name
for them.
See also the related discussion on the issue:
https://github.com/apache/arrow/issues/38837#issuecomment-2108895904
Thanks,
--
kou
In
"Re: [DISCUSS] Statistic
e to standardize it as canonical metadata.
> >
> > I say this all as a happy user of DuckDB's Arrow scan functionality
that is
excited to see better query optimization capabilities. It's just that, in its
current form, the changes in this proposal are not something I could
f
eemed like
one approach that would achieve this goal.
Best,
Shoumyo
From: dev@arrow.apache.org At: 05/23/24 14:16:32 UTC-4:00 To:
dev@arrow.apache.org
Subject: Re: [DISCUSS] Statistics through the C data interface
Thanks Shoumyo for bringing this up!
Using a schema to transmit statistics/data dependent val
all as a happy user of DuckDB's Arrow scan functionality that is
> > excited to see better query optimization capabilities. It's just that, in
> > its current form, the changes in this proposal are not something I could
> > foreseeably integrate with.
> >
> >
timization capabilities. It's just that, in its
> current form, the changes in this proposal are not something I could
> foreseeably integrate with.
>
> Best,
> Shoumyo
>
> [1]:
> https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
>
> F
4:00To:
dev@arrow.apache.org
Subject: Re: [DISCUSS] Statistics through the C data interface
I want to +1 on what Dewey is saying here and some comments.
Sutou Kouhei wrote:
> ADBC may be a bit larger to use only for transmitting statistics. ADBC has
statistics related APIs but it has more other A
Le 23/05/2024 à 16:09, Felipe Oliveira Carvalho a écrit :
Protocols that produce/consume statistics might want to use the C Data
Interface as a primitive for passing Arrow arrays of statistics.
This is also my opinion.
I think what we are slowly converging on is the need for a spec to
desc
is a dense union with members:
> >
> > | Field Name | Field Type |
> > |--------|----|
> > | int64 | int64 |
> > | uint64 | uint64 |
> > | float64| float64|
> > | binary | binary |
> >
> > If a co
an also use
> >application specific keys.
> > 3. If true, then the value is approximate or best-effort.
> >
> > VALUE_SCHEMA is a dense union with members:
> >
> > | Field Name | Field Type |
> > |--------|----|
> > | int64 | int64
4 |
> | float64| float64|
> | binary | binary |
>
> If a column is an int32 column, it uses int64 for
> "max"/"min". We don't provide all types here. Users should
> use a compatible type (int64 for an int32 column) instead.
>
>
> Th
| binary |
If a column is an int32 column, it uses int64 for
"max"/"min". We don't provide all types here. Users should
use a compatible type (int64 for an int32 column) instead.
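
As a rough sketch of the "compatible type" rule, the function below picks one of the four union members (int64, uint64, float64, binary) for a column type name. Only the int32 to int64 case is stated in this thread. The handling of the other integer and float widths, and the fallback to binary, are assumptions for illustration, as is the function name.

```python
# Column type names grouped by the union member that can hold their values.
SIGNED = {"int8", "int16", "int32", "int64"}
UNSIGNED = {"uint8", "uint16", "uint32", "uint64"}
FLOATS = {"float16", "float32", "float64"}


def value_member_for(column_type: str) -> str:
    """Return the union member used for "max"/"min" of a column."""
    if column_type in SIGNED:
        return "int64"    # stated in the thread for int32 columns
    if column_type in UNSIGNED:
        return "uint64"   # assumption: widen unsigned integers the same way
    if column_type in FLOATS:
        return "float64"  # assumption: widen floats the same way
    return "binary"       # assumption: everything else is carried as bytes


print(value_member_for("int32"))  # prints "int64"
```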
Thanks,
--
kou
In
"Re: [DISCUSS] Statistics through the C data interface" on
ing a schema for statistics ArrowArray?
It doesn't define how to get the statistics ArrowArray like
the Arrow C data interface but defining a schema for
statistics ArrowArray will improve how to transmit
statistics.
Thanks,
--
kou
In
"Re: [DISCUSS] Statistics through the C data int
tistics [1],
Thanks for sharing this. It seems that DataFusion supports
only pre-defined statistics. Does DataFusion ever require
application-specific statistics? I'm not sure whether we
should support application-specific statistics or not.
Thanks,
--
kou
In
"Re: [DISCUSS] Sta
Hi Kou,
I agree with Dewey that this is overstretching the capabilities of the C
Data Interface. In particular, stuffing a pointer as metadata value and
decreeing it immortal doesn't sound like a good design decision.
Why not simply pass the statistics ArrowArray separately in your
produce
I am definitely in favor of adding (or adopting an existing)
ABI-stable way to transmit statistics (the one that comes up most
frequently for me is just the number of values that are about to show
up in an ArrowArrayStream, since the producer often knows this and the
consumer often would like to pr
Hi,
One potential challenge with encoding statistics in the schema metadata
is that some systems may consider this metadata as part of assessing
schema equivalence.
However, I think the bigger question is what the intended use-case for
these statistics is? Often query engines want to collect