Hi,

It seems that building the statistics schema isn't so complex after all. I'll proceed with this approach.
Thanks, -- kou In <20241016.150837.525206251920899606....@clear-code.com> "Re: [DISCUSS] Statistics through the C data interface" on Wed, 16 Oct 2024 15:08:37 +0900 (JST), Sutou Kouhei <k...@clear-code.com> wrote: > Hi, > > I'm implementing C++ producer for statistics array: > https://github.com/apache/arrow/pull/44252 > > Our discussed statistics schema is compact: > https://github.com/apache/arrow/pull/43553/files#diff-f3758fb6986ea8d24bb2e13c2feb625b68bbd6b93b3fbafd3e2a03dcdc7ba263R77-R145 > > But it may be a bit complex to build. > > What do you think about this? > > > Thanks, > -- > kou > > In <20240805.183331.1066091419162501890....@clear-code.com> > "Re: [DISCUSS] Statistics through the C data interface" on Mon, 05 Aug 2024 > 18:33:31 +0900 (JST), > Sutou Kouhei <k...@clear-code.com> wrote: > >> Hi, >> >> I've opened a PR for documentation: >> https://github.com/apache/arrow/pull/43553 >> >> The "Example" section isn't written yet but suggestions are >> very welcome. >> >> Thanks, >> -- >> kou >> >> In <20240725.143518.421507820763165665....@clear-code.com> >> "Re: [DISCUSS] Statistics through the C data interface" on Thu, 25 Jul >> 2024 14:35:18 +0900 (JST), >> Sutou Kouhei <k...@clear-code.com> wrote: >> >>> Hi, >>> >>> If nobody objects using utf8 or dictionary<int32, utf8> for >>> statistics key, let's use dictionary<int32, utf8>. Because >>> dictionary<int32, utf8> will be more effective than utf8 >>> when there are many columns. >>> >>> I'll start writing a documentation for this and implementing >>> this for C++ next week. I'll share a PR for them when I >>> complete them. We can start a vote for this after we review >>> the PR. 
>>> >>> >>> Thanks, >>> -- >>> kou >>> >>> In <20240712.151536.312169170508271330....@clear-code.com> >>> "Re: [DISCUSS] Statistics through the C data interface" on Fri, 12 Jul >>> 2024 15:15:36 +0900 (JST), >>> Sutou Kouhei <k...@clear-code.com> wrote: >>> >>>> Hi, >>>> >>>>>> map<struct<int32, utf8>, >>>>>> dense_union<...needed types based on stat kinds in the keys...>> >>>>>> >>>>> >>>>> Yes. That's my suggestion. And to leverage the fact that libraries handles >>>>> unions gracefully, this could be: >>>>> >>>>> map<X_union<int32, utf8>, dense_union<...needed types based on stat kinds >>>>> in the keys...>> >>>>> >>>>> X is either sparse or dense. >>>>> >>>>> A possible alternative is to use a custom struct instead of map and reduce >>>>> the levels of nesting: >>>>> >>>>> struct<int32, utf8, dense_union<...needed types base on the keys...>> >>>> >>>> Thanks for clarifying your suggestion. >>>> >>>> If we need utf8 for non-standard statistics, I think that >>>> map<utf8, ...> or map<dictionary<int32, utf8>, ...> is >>>> better as Antoine said. Because they are simpler than >>>> int32+utf8. >>>> >>>> >>>> Thanks, >>>> -- >>>> kou >>>> >>>> In <caoc8yxy_e9w5sky572phktt-tdguc9v3xpxiom1xohp8rq7...@mail.gmail.com> >>>> "Re: [DISCUSS] Statistics through the C data interface" on Thu, 11 Jul >>>> 2024 14:17:46 -0300, >>>> Felipe Oliveira Carvalho <felipe...@gmail.com> wrote: >>>> >>>>> On Thu, Jul 11, 2024 at 5:04 AM Sutou Kouhei <k...@clear-code.com> wrote: >>>>> >>>>>> Hi, >>>>>> >>>>>> > for non-standard statistics from open-source products the >>>>>> key=0 >>>>>> > combined with string label is the way to go >>>>>> >>>>>> Where do we store the string label? >>>>>> >>>>>> I think that we're considering the following schema: >>>>>> >>>>>> >> map< >>>>>> >> // The column index or null if the statistics refer to whole table >>>>>> >> or >>>>>> batch. >>>>>> >> column: int32, >>>>>> >> // Statistics key is int32. 
>>>>>> >> // Different keys are assigned for exact value and >>>>>> >> // approximate value. >>>>>> >> map<int32, dense_union<...needed types based on stat kinds in the >>>>>> keys...>> >>>>>> >> > >>>>>> >>>>>> Are you considering the following schema for key=0 case? >>>>>> >>>>>> map<struct<int32, utf8>, >>>>>> dense_union<...needed types based on stat kinds in the keys...>> >>>>>> >>>>> >>>>> Yes. That's my suggestion. And to leverage the fact that libraries handles >>>>> unions gracefully, this could be: >>>>> >>>>> map<X_union<int32, utf8>, dense_union<...needed types based on stat kinds >>>>> in the keys...>> >>>>> >>>>> X is either sparse or dense. >>>>> >>>>> A possible alternative is to use a custom struct instead of map and reduce >>>>> the levels of nesting: >>>>> >>>>> struct<int32, utf8, dense_union<...needed types base on the keys...>> >>>>> >>>>> -- >>>>> Felipe >>>>> >>>>> >>>>>> >>>>>> Thanks, >>>>>> -- >>>>>> kou >>>>>> >>>>>> In <CAOC8YXYnePq=qfwvzhfqmoxgcubogbhb2gtmabmc7v-x2ap...@mail.gmail.com> >>>>>> "Re: [DISCUSS] Statistics through the C data interface" on Mon, 1 Jul >>>>>> 2024 11:58:44 -0300, >>>>>> Felipe Oliveira Carvalho <felipe...@gmail.com> wrote: >>>>>> >>>>>> > Hi, >>>>>> > >>>>>> > You can promise that well-known int32 statistic keys won't ever be >>>>>> > higher >>>>>> > than a certain value (2^18) [1] like TCP IP ports (well-known ports in >>>>>> [0, >>>>>> > 2^10)) but for non-standard statistics from open-source products the >>>>>> key=0 >>>>>> > combined with string label is the way to go, otherwise collisions would >>>>>> be >>>>>> > inevitable and everyone would hate us for having integer keys. >>>>>> > >>>>>> > This is not a very weird proposal from my part because integer keys >>>>>> > representing labels are common in most low-level standardized C APIs >>>>>> (e.g. >>>>>> > linux syscalls, ioctls, OpenGL, Vulcan...). 
I expect higher level APIs >>>>>> > to
>>>>>> > map all these keys to strings, but with them we keep the "C Data
>>>>>> > Interface" low-level and portable as it should be.
>>>>>> >
>>>>>> > --
>>>>>> > Felipe
>>>>>> >
>>>>>> > [1] 2^16 is too small. For instance, OpenGL constants can't be enums
>>>>>> > because C limits enum to 2^16 and that is *not enough*.
>>>>>> >
>>>>>> > On Thu, Jun 20, 2024 at 7:43 AM Sutou Kouhei <k...@clear-code.com>
>>>>>> > wrote:
>>>>>> >
>>>>>> >> Hi,
>>>>>> >>
>>>>>> >> Here is an updated summary so far:
>>>>>> >>
>>>>>> >> ----
>>>>>> >> Use cases:
>>>>>> >>
>>>>>> >> * Optimize query plan: e.g. JOIN for DuckDB
>>>>>> >>
>>>>>> >> Out of scope:
>>>>>> >>
>>>>>> >> * Transmit statistics through something other than the C data interface
>>>>>> >>   Examples:
>>>>>> >>   * Transmit statistics through Apache Arrow IPC file
>>>>>> >>   * Transmit statistics through Apache Arrow Flight
>>>>>> >> * Multi-column statistics
>>>>>> >> * Constraints information
>>>>>> >> * Indexes information
>>>>>> >>
>>>>>> >> Approach under discussion:
>>>>>> >>
>>>>>> >> Standardize an Apache Arrow schema for statistics and transmit
>>>>>> >> statistics via a separate API call that uses the C data
>>>>>> >> interface.
>>>>>> >>
>>>>>> >> This also works for per-batch statistics.
>>>>>> >>
>>>>>> >> Candidate schema:
>>>>>> >>
>>>>>> >> map<
>>>>>> >>   // The column index, or null if the statistics refer to the
>>>>>> >>   // whole table or batch.
>>>>>> >>   column: int32,
>>>>>> >>   // Statistics key is int32.
>>>>>> >>   // Different keys are assigned for exact value and
>>>>>> >>   // approximate value.
>>>>>> >>   map<int32, dense_union<...needed types based on stat kinds in the keys...>>
>>>>>> >> >
>>>>>> >>
>>>>>> >> Discussions:
>>>>>> >>
>>>>>> >> 1. Can we use int32 for statistic keys?
>>>>>> >>    Should we use utf8 (or dictionary<int32, utf8>) for
>>>>>> >>    statistic keys?
>>>>>> >> 2. How to support non-standard (vendor-specific)
>>>>>> >>    statistic keys?
>>>>>> >> ---- >>>>>> >> >>>>>> >> Here is my idea: >>>>>> >> >>>>>> >> 1. We can use int32 for statistic keys. >>>>>> >> 2. We can reserve a specific range for non-standard >>>>>> >> statistic keys. Prerequisites of this: >>>>>> >> * There is no use case to merge some statistics for >>>>>> >> the same data. >>>>>> >> * We can't merge statistics for different data. >>>>>> >> >>>>>> >> If the prerequisites aren't satisfied: >>>>>> >> >>>>>> >> 1. We should use utf8 (or dictionary<int32, utf8>) for >>>>>> >> statistic keys? >>>>>> >> 2. We can use reserved prefix such as "ARROW:"/"arrow." for >>>>>> >> standard statistic keys or use prefix such as >>>>>> >> "vendor1:"/"vendor1." for non-standard statistic keys. >>>>>> >> >>>>>> >> Here is Felipe's idea: >>>>>> >> https://lists.apache.org/thread/gr2nmlrwr7d5wkz3zgq6vy5q0ow8xof2 >>>>>> >> >>>>>> >> 1. We can use int32 for statistic keys. >>>>>> >> 2. We can use the special statistic key + a string identifier >>>>>> >> for non-standard statistic keys. >>>>>> >> >>>>>> >> >>>>>> >> What do you think about this? >>>>>> >> >>>>>> >> >>>>>> >> Thanks, >>>>>> >> -- >>>>>> >> kou >>>>>> >> >>>>>> >> In <20240606.182727.1004633558059795207....@clear-code.com> >>>>>> >> "Re: [DISCUSS] Statistics through the C data interface" on Thu, 06 >>>>>> >> Jun >>>>>> >> 2024 18:27:27 +0900 (JST), >>>>>> >> Sutou Kouhei <k...@clear-code.com> wrote: >>>>>> >> >>>>>> >> > Hi, >>>>>> >> > >>>>>> >> > Thanks for sharing your comments. Here is a summary so far: >>>>>> >> > >>>>>> >> > ---- >>>>>> >> > >>>>>> >> > Use cases: >>>>>> >> > >>>>>> >> > * Optimize query plan: e.g. JOIN for DuckDB >>>>>> >> > >>>>>> >> > Out of scope: >>>>>> >> > >>>>>> >> > * Transmit statistics through not the C data interface >>>>>> >> > Examples: >>>>>> >> > * Transmit statistics through Apache Arrow IPC file >>>>>> >> > * Transmit statistics through Apache Arrow Flight >>>>>> >> > >>>>>> >> > Candidate approaches: >>>>>> >> > >>>>>> >> > 1. 
Pass statistics (encoded as an Apache Arrow data) via >>>>>> >> > ArrowSchema metadata >>>>>> >> > * This embeds statistics address into metadata >>>>>> >> > * It's for avoiding using Apache Arrow IPC format with >>>>>> >> > the C data interface >>>>>> >> > 2. Embed statistics (encoded as an Apache Arrow data) into >>>>>> >> > ArrowSchema metadata >>>>>> >> > * This adds statistics to metadata in Apache Arrow IPC >>>>>> >> > format >>>>>> >> > 3. Embed statistics (encoded as JSON) into ArrowArray >>>>>> >> > metadata >>>>>> >> > 4. Standardize Apache Arrow schema for statistics and >>>>>> >> > transmit statistics via separated API call that uses the >>>>>> >> > C data interface >>>>>> >> > 5. Use ADBC >>>>>> >> > >>>>>> >> > ---- >>>>>> >> > >>>>>> >> > I think that 4. is the best approach in these candidates. >>>>>> >> > >>>>>> >> > 1. Embedding statistics address is tricky. >>>>>> >> > 2. Consumers need to parse Apache Arrow IPC format data. >>>>>> >> > (The C data interface consumers may not have the >>>>>> >> > feature.) >>>>>> >> > 3. This will work but 4. is more generic. >>>>>> >> > 5. ADBC is too large to use only for statistics. >>>>>> >> > >>>>>> >> > What do you think about this? >>>>>> >> > >>>>>> >> > >>>>>> >> > If we select 4., we need to standardize Apache Arrow schema >>>>>> >> > for statistics. How about the following schema? >>>>>> >> > >>>>>> >> > ---- >>>>>> >> > Metadata: >>>>>> >> > >>>>>> >> > | Name | Value | Comments | >>>>>> >> > |----------------------------|-------|--------- | >>>>>> >> > | ARROW::statistics::version | 1.0.0 | (1) | >>>>>> >> > >>>>>> >> > (1) This follows semantic versioning. 
>>>>>> >> >
>>>>>> >> > Fields:
>>>>>> >> >
>>>>>> >> > | Name           | Type                  | Comments |
>>>>>> >> > |----------------|-----------------------|----------|
>>>>>> >> > | column         | utf8                  | (2)      |
>>>>>> >> > | key            | utf8 not null         | (3)      |
>>>>>> >> > | value          | VALUE_SCHEMA not null |          |
>>>>>> >> > | is_approximate | bool not null         | (4)      |
>>>>>> >> >
>>>>>> >> > (2) If null, then the statistic applies to the entire table.
>>>>>> >> >     It's for "row_count".
>>>>>> >> > (3) We'll provide pre-defined keys such as "max", "min",
>>>>>> >> >     "byte_width" and "distinct_count", but users can also use
>>>>>> >> >     application-specific keys.
>>>>>> >> > (4) If true, then the value is approximate or best-effort.
>>>>>> >> >
>>>>>> >> > VALUE_SCHEMA is a dense union with members:
>>>>>> >> >
>>>>>> >> > | Name    | Type    |
>>>>>> >> > |---------|---------|
>>>>>> >> > | int64   | int64   |
>>>>>> >> > | uint64  | uint64  |
>>>>>> >> > | float64 | float64 |
>>>>>> >> > | binary  | binary  |
>>>>>> >> >
>>>>>> >> > If a column is an int32 column, it uses int64 for
>>>>>> >> > "max"/"min". We don't provide all types here. Users should
>>>>>> >> > use a compatible type (int64 for an int32 column) instead.
>>>>>> >> > ----
>>>>>> >> >
>>>>>> >> >
>>>>>> >> > Thanks,
>>>>>> >> > --
>>>>>> >> > kou
>>>>>> >> >
>>>>>> >> >
>>>>>> >> > In <20240522.113708.2023905028549001143....@clear-code.com>
>>>>>> >> >   "[DISCUSS] Statistics through the C data interface" on Wed, 22 May 2024 11:37:08 +0900 (JST),
>>>>>> >> >   Sutou Kouhei <k...@clear-code.com> wrote:
>>>>>> >> >
>>>>>> >> >> Hi,
>>>>>> >> >>
>>>>>> >> >> We're discussing how to provide statistics through the C
>>>>>> >> >> data interface at:
>>>>>> >> >> https://github.com/apache/arrow/issues/38837
>>>>>> >> >>
>>>>>> >> >> If you're interested in this feature, could you share your
>>>>>> >> >> comments?
>>>>>> >> >> >>>>>> >> >> >>>>>> >> >> Motivation: >>>>>> >> >> >>>>>> >> >> We can interchange Apache Arrow data by the C data interface >>>>>> >> >> in the same process. For example, we can pass Apache Arrow >>>>>> >> >> data read by Apache Arrow C++ (provider) to DuckDB >>>>>> >> >> (consumer) through the C data interface. >>>>>> >> >> >>>>>> >> >> A provider may know Apache Arrow data statistics. For >>>>>> >> >> example, a provider can know statistics when it reads Apache >>>>>> >> >> Parquet data because Apache Parquet may provide statistics. >>>>>> >> >> >>>>>> >> >> But a consumer can't know statistics that are known by a >>>>>> >> >> producer. Because there isn't a standard way to provide >>>>>> >> >> statistics through the C data interface. If a consumer can >>>>>> >> >> know statistics, it can process Apache Arrow data faster >>>>>> >> >> based on statistics. >>>>>> >> >> >>>>>> >> >> >>>>>> >> >> Proposal: >>>>>> >> >> >>>>>> >> >> https://github.com/apache/arrow/issues/38837#issuecomment-2123728784 >>>>>> >> >> >>>>>> >> >> How about providing statistics as a metadata in ArrowSchema? >>>>>> >> >> >>>>>> >> >> We reserve "ARROW" namespace for internal Apache Arrow use: >>>>>> >> >> >>>>>> >> >> >>>>>> >> >>>>>> https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata >>>>>> >> >> >>>>>> >> >>> The ARROW pattern is a reserved namespace for internal >>>>>> >> >>> Arrow use in the custom_metadata fields. For example, >>>>>> >> >>> ARROW:extension:name. >>>>>> >> >> >>>>>> >> >> So we can use "ARROW:statistics" for the metadata key. >>>>>> >> >> >>>>>> >> >> We can represent statistics as a ArrowArray like ADBC does. 
>>>>>> >> >>
>>>>>> >> >> Here is an example ArrowSchema for a record batch
>>>>>> >> >> that has "int32 column1" and "string column2":
>>>>>> >> >>
>>>>>> >> >> ArrowSchema {
>>>>>> >> >>   .format = "+siu",
>>>>>> >> >>   .metadata = {
>>>>>> >> >>     "ARROW:statistics" => ArrowArray*, /* table-level statistics such as row count */
>>>>>> >> >>   },
>>>>>> >> >>   .children = {
>>>>>> >> >>     ArrowSchema {
>>>>>> >> >>       .name = "column1",
>>>>>> >> >>       .format = "i",
>>>>>> >> >>       .metadata = {
>>>>>> >> >>         "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
>>>>>> >> >>       },
>>>>>> >> >>     },
>>>>>> >> >>     ArrowSchema {
>>>>>> >> >>       .name = "column2",
>>>>>> >> >>       .format = "u",
>>>>>> >> >>       .metadata = {
>>>>>> >> >>         "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
>>>>>> >> >>       },
>>>>>> >> >>     },
>>>>>> >> >>   },
>>>>>> >> >> }
>>>>>> >> >>
>>>>>> >> >> The metadata value (the ArrowArray* part) of '"ARROW:statistics"
>>>>>> >> >> => ArrowArray*' is a base-10 string of the address of the
>>>>>> >> >> ArrowArray, because metadata values can only be strings.
>>>>>> >> >> You can't release the statistics ArrowArray*. (Its
>>>>>> >> >> release is a no-op function.) It follows
>>>>>> >> >> https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation
>>>>>> >> >> semantics. (The base ArrowSchema owns the statistics
>>>>>> >> >> ArrowArray*.)
>>>>>> >> >>
>>>>>> >> >> The ArrowArray* for statistics uses the following schema:
>>>>>> >> >>
>>>>>> >> >> | Field Name     | Field Type              | Comments |
>>>>>> >> >> |----------------|-------------------------|----------|
>>>>>> >> >> | key            | string not null         | (1)      |
>>>>>> >> >> | value          | `VALUE_SCHEMA` not null |          |
>>>>>> >> >> | is_approximate | bool not null           | (2)      |
>>>>>> >> >>
>>>>>> >> >> 1. We'll provide pre-defined keys such as "max", "min",
>>>>>> >> >>    "byte_width" and "distinct_count", but users can also use
>>>>>> >> >>    application-specific keys.
>>>>>> >> >>
>>>>>> >> >> 2. If true, then the value is approximate or best-effort.
>>>>>> >> >>
>>>>>> >> >> VALUE_SCHEMA is a dense union with members:
>>>>>> >> >>
>>>>>> >> >> | Field Name | Field Type                       | Comments |
>>>>>> >> >> |------------|----------------------------------|----------|
>>>>>> >> >> | int64      | int64                            |          |
>>>>>> >> >> | uint64     | uint64                           |          |
>>>>>> >> >> | float64    | float64                          |          |
>>>>>> >> >> | value      | The same type as the ArrowSchema | (3)      |
>>>>>> >> >> |            | it belongs to.                   |          |
>>>>>> >> >>
>>>>>> >> >> 3. If the ArrowSchema's type is string, this type is also string.
>>>>>> >> >>
>>>>>> >> >> TODO: Is "value" a good name? If we refer to it from the
>>>>>> >> >> top-level statistics schema, we need to use
>>>>>> >> >> "value.value". It's a bit strange...
>>>>>> >> >>
>>>>>> >> >> What do you think about this proposal? Could you share your
>>>>>> >> >> comments?
>>>>>> >> >>
>>>>>> >> >>
>>>>>> >> >> Thanks,
>>>>>> >> >> --
>>>>>> >> >> kou