Hi,

It seems that building the statistics schema isn't so complex after all. I'll proceed with this approach.
Thanks, -- kou In <20241016.150837.525206251920899606....@clear-code.com> "Re: [DISCUSS] Statistics through the C data interface" on Wed, 16 Oct 2024 15:08:37 +0900 (JST), Sutou Kouhei <k...@clear-code.com> wrote: > Hi, > > I'm implementing C++ producer for statistics array: > https://github.com/apache/arrow/pull/44252 > > Our discussed statistics schema is compact: > https://github.com/apache/arrow/pull/43553/files#diff-f3758fb6986ea8d24bb2e13c2feb625b68bbd6b93b3fbafd3e2a03dcdc7ba263R77-R145 > > But it may be a bit complex to build. > > What do you think about this? > > > Thanks, > -- > kou > > In <20240805.183331.1066091419162501890....@clear-code.com> > "Re: [DISCUSS] Statistics through the C data interface" on Mon, 05 Aug 2024 > 18:33:31 +0900 (JST), > Sutou Kouhei <k...@clear-code.com> wrote: > >> Hi, >> >> I've opened a PR for documentation: >> https://github.com/apache/arrow/pull/43553 >> >> The "Example" section isn't written yet but suggestions are >> very welcome. >> >> Thanks, >> -- >> kou >> >> In <20240725.143518.421507820763165665....@clear-code.com> >> "Re: [DISCUSS] Statistics through the C data interface" on Thu, 25 Jul >> 2024 14:35:18 +0900 (JST), >> Sutou Kouhei <k...@clear-code.com> wrote: >> >>> Hi, >>> >>> If nobody objects using utf8 or dictionary<int32, utf8> for >>> statistics key, let's use dictionary<int32, utf8>. Because >>> dictionary<int32, utf8> will be more effective than utf8 >>> when there are many columns. >>> >>> I'll start writing a documentation for this and implementing >>> this for C++ next week. I'll share a PR for them when I >>> complete them. We can start a vote for this after we review >>> the PR. 
>>> >>> >>> Thanks, >>> -- >>> kou >>> >>> In <20240712.151536.312169170508271330....@clear-code.com> >>> "Re: [DISCUSS] Statistics through the C data interface" on Fri, 12 Jul >>> 2024 15:15:36 +0900 (JST), >>> Sutou Kouhei <k...@clear-code.com> wrote: >>> >>>> Hi, >>>> >>>>>> map<struct<int32, utf8>, >>>>>> dense_union<...needed types based on stat kinds in the keys...>> >>>>>> >>>>> >>>>> Yes. That's my suggestion. And to leverage the fact that libraries handles >>>>> unions gracefully, this could be: >>>>> >>>>> map<X_union<int32, utf8>, dense_union<...needed types based on stat kinds >>>>> in the keys...>> >>>>> >>>>> X is either sparse or dense. >>>>> >>>>> A possible alternative is to use a custom struct instead of map and reduce >>>>> the levels of nesting: >>>>> >>>>> struct<int32, utf8, dense_union<...needed types base on the keys...>> >>>> >>>> Thanks for clarifying your suggestion. >>>> >>>> If we need utf8 for non-standard statistics, I think that >>>> map<utf8, ...> or map<dictionary<int32, utf8>, ...> is >>>> better as Antoine said. Because they are simpler than >>>> int32+utf8. >>>> >>>> >>>> Thanks, >>>> -- >>>> kou >>>> >>>> In <caoc8yxy_e9w5sky572phktt-tdguc9v3xpxiom1xohp8rq7...@mail.gmail.com> >>>> "Re: [DISCUSS] Statistics through the C data interface" on Thu, 11 Jul >>>> 2024 14:17:46 -0300, >>>> Felipe Oliveira Carvalho <felipe...@gmail.com> wrote: >>>> >>>>> On Thu, Jul 11, 2024 at 5:04 AM Sutou Kouhei <k...@clear-code.com> wrote: >>>>> >>>>>> Hi, >>>>>> >>>>>> > for non-standard statistics from open-source products the >>>>>> key=0 >>>>>> > combined with string label is the way to go >>>>>> >>>>>> Where do we store the string label? >>>>>> >>>>>> I think that we're considering the following schema: >>>>>> >>>>>> >> map< >>>>>> >> // The column index or null if the statistics refer to whole table >>>>>> >> or >>>>>> batch. >>>>>> >> column: int32, >>>>>> >> // Statistics key is int32. 
>>>>>> >> // Different keys are assigned for exact value and >>>>>> >> // approximate value. >>>>>> >> map<int32, dense_union<...needed types based on stat kinds in the >>>>>> keys...>> >>>>>> >> > >>>>>> >>>>>> Are you considering the following schema for key=0 case? >>>>>> >>>>>> map<struct<int32, utf8>, >>>>>> dense_union<...needed types based on stat kinds in the keys...>> >>>>>> >>>>> >>>>> Yes. That's my suggestion. And to leverage the fact that libraries handles >>>>> unions gracefully, this could be: >>>>> >>>>> map<X_union<int32, utf8>, dense_union<...needed types based on stat kinds >>>>> in the keys...>> >>>>> >>>>> X is either sparse or dense. >>>>> >>>>> A possible alternative is to use a custom struct instead of map and reduce >>>>> the levels of nesting: >>>>> >>>>> struct<int32, utf8, dense_union<...needed types base on the keys...>> >>>>> >>>>> -- >>>>> Felipe >>>>> >>>>> >>>>>> >>>>>> Thanks, >>>>>> -- >>>>>> kou >>>>>> >>>>>> In <CAOC8YXYnePq=qfwvzhfqmoxgcubogbhb2gtmabmc7v-x2ap...@mail.gmail.com> >>>>>> "Re: [DISCUSS] Statistics through the C data interface" on Mon, 1 Jul >>>>>> 2024 11:58:44 -0300, >>>>>> Felipe Oliveira Carvalho <felipe...@gmail.com> wrote: >>>>>> >>>>>> > Hi, >>>>>> > >>>>>> > You can promise that well-known int32 statistic keys won't ever be >>>>>> > higher >>>>>> > than a certain value (2^18) [1] like TCP IP ports (well-known ports in >>>>>> [0, >>>>>> > 2^10)) but for non-standard statistics from open-source products the >>>>>> key=0 >>>>>> > combined with string label is the way to go, otherwise collisions would >>>>>> be >>>>>> > inevitable and everyone would hate us for having integer keys. >>>>>> > >>>>>> > This is not a very weird proposal from my part because integer keys >>>>>> > representing labels are common in most low-level standardized C APIs >>>>>> (e.g. >>>>>> > linux syscalls, ioctls, OpenGL, Vulcan...). 
I expect higher level APIs >>>>>> > to
>>>>>> > map all these keys to strings, but with them we keep the "C Data
>>>>>> > Interface" low-level and portable as it should be.
>>>>>> >
>>>>>> > --
>>>>>> > Felipe
>>>>>> >
>>>>>> > [1] 2^16 is too small. For instance, OpenGL constants can't be enums
>>>>>> > because C limits enum to 2^16 and that is *not enough*.
>>>>>> >
>>>>>> > On Thu, Jun 20, 2024 at 7:43 AM Sutou Kouhei <k...@clear-code.com>
>>>>>> > wrote:
>>>>>> >
>>>>>> >> Hi,
>>>>>> >>
>>>>>> >> Here is an updated summary so far:
>>>>>> >>
>>>>>> >> ----
>>>>>> >> Use cases:
>>>>>> >>
>>>>>> >> * Optimize query plan: e.g. JOIN for DuckDB
>>>>>> >>
>>>>>> >> Out of scope:
>>>>>> >>
>>>>>> >> * Transmit statistics through something other than the C data interface
>>>>>> >>   Examples:
>>>>>> >>   * Transmit statistics through Apache Arrow IPC file
>>>>>> >>   * Transmit statistics through Apache Arrow Flight
>>>>>> >> * Multi-column statistics
>>>>>> >> * Constraints information
>>>>>> >> * Indexes information
>>>>>> >>
>>>>>> >> Approach under discussion:
>>>>>> >>
>>>>>> >> Standardize an Apache Arrow schema for statistics and transmit
>>>>>> >> statistics via a separate API call that uses the C data
>>>>>> >> interface.
>>>>>> >>
>>>>>> >> This also works for per-batch statistics.
>>>>>> >>
>>>>>> >> Candidate schema:
>>>>>> >>
>>>>>> >> map<
>>>>>> >>   // The column index, or null if the statistics refer to the
>>>>>> >>   // whole table or batch.
>>>>>> >>   column: int32,
>>>>>> >>   // Statistics key is int32.
>>>>>> >>   // Different keys are assigned for exact value and
>>>>>> >>   // approximate value.
>>>>>> >>   map<int32, dense_union<...needed types based on stat kinds in the keys...>>
>>>>>> >> >
>>>>>> >>
>>>>>> >> Discussions:
>>>>>> >>
>>>>>> >> 1. Can we use int32 for statistic keys?
>>>>>> >>    Should we use utf8 (or dictionary<int32, utf8>) for
>>>>>> >>    statistic keys?
>>>>>> >> 2. How to support non-standard (vendor-specific)
>>>>>> >>    statistic keys?
>>>>>> >> ---- >>>>>> >> >>>>>> >> Here is my idea: >>>>>> >> >>>>>> >> 1. We can use int32 for statistic keys. >>>>>> >> 2. We can reserve a specific range for non-standard >>>>>> >> statistic keys. Prerequisites of this: >>>>>> >> * There is no use case to merge some statistics for >>>>>> >> the same data. >>>>>> >> * We can't merge statistics for different data. >>>>>> >> >>>>>> >> If the prerequisites aren't satisfied: >>>>>> >> >>>>>> >> 1. We should use utf8 (or dictionary<int32, utf8>) for >>>>>> >> statistic keys? >>>>>> >> 2. We can use reserved prefix such as "ARROW:"/"arrow." for >>>>>> >> standard statistic keys or use prefix such as >>>>>> >> "vendor1:"/"vendor1." for non-standard statistic keys. >>>>>> >> >>>>>> >> Here is Felipe's idea: >>>>>> >> https://lists.apache.org/thread/gr2nmlrwr7d5wkz3zgq6vy5q0ow8xof2 >>>>>> >> >>>>>> >> 1. We can use int32 for statistic keys. >>>>>> >> 2. We can use the special statistic key + a string identifier >>>>>> >> for non-standard statistic keys. >>>>>> >> >>>>>> >> >>>>>> >> What do you think about this? >>>>>> >> >>>>>> >> >>>>>> >> Thanks, >>>>>> >> -- >>>>>> >> kou >>>>>> >> >>>>>> >> In <20240606.182727.1004633558059795207....@clear-code.com> >>>>>> >> "Re: [DISCUSS] Statistics through the C data interface" on Thu, 06 >>>>>> >> Jun >>>>>> >> 2024 18:27:27 +0900 (JST), >>>>>> >> Sutou Kouhei <k...@clear-code.com> wrote: >>>>>> >> >>>>>> >> > Hi, >>>>>> >> > >>>>>> >> > Thanks for sharing your comments. Here is a summary so far: >>>>>> >> > >>>>>> >> > ---- >>>>>> >> > >>>>>> >> > Use cases: >>>>>> >> > >>>>>> >> > * Optimize query plan: e.g. JOIN for DuckDB >>>>>> >> > >>>>>> >> > Out of scope: >>>>>> >> > >>>>>> >> > * Transmit statistics through not the C data interface >>>>>> >> > Examples: >>>>>> >> > * Transmit statistics through Apache Arrow IPC file >>>>>> >> > * Transmit statistics through Apache Arrow Flight >>>>>> >> > >>>>>> >> > Candidate approaches: >>>>>> >> > >>>>>> >> > 1. 
Pass statistics (encoded as an Apache Arrow data) via >>>>>> >> > ArrowSchema metadata >>>>>> >> > * This embeds statistics address into metadata >>>>>> >> > * It's for avoiding using Apache Arrow IPC format with >>>>>> >> > the C data interface >>>>>> >> > 2. Embed statistics (encoded as an Apache Arrow data) into >>>>>> >> > ArrowSchema metadata >>>>>> >> > * This adds statistics to metadata in Apache Arrow IPC >>>>>> >> > format >>>>>> >> > 3. Embed statistics (encoded as JSON) into ArrowArray >>>>>> >> > metadata >>>>>> >> > 4. Standardize Apache Arrow schema for statistics and >>>>>> >> > transmit statistics via separated API call that uses the >>>>>> >> > C data interface >>>>>> >> > 5. Use ADBC >>>>>> >> > >>>>>> >> > ---- >>>>>> >> > >>>>>> >> > I think that 4. is the best approach in these candidates. >>>>>> >> > >>>>>> >> > 1. Embedding statistics address is tricky. >>>>>> >> > 2. Consumers need to parse Apache Arrow IPC format data. >>>>>> >> > (The C data interface consumers may not have the >>>>>> >> > feature.) >>>>>> >> > 3. This will work but 4. is more generic. >>>>>> >> > 5. ADBC is too large to use only for statistics. >>>>>> >> > >>>>>> >> > What do you think about this? >>>>>> >> > >>>>>> >> > >>>>>> >> > If we select 4., we need to standardize Apache Arrow schema >>>>>> >> > for statistics. How about the following schema? >>>>>> >> > >>>>>> >> > ---- >>>>>> >> > Metadata: >>>>>> >> > >>>>>> >> > | Name | Value | Comments | >>>>>> >> > |----------------------------|-------|--------- | >>>>>> >> > | ARROW::statistics::version | 1.0.0 | (1) | >>>>>> >> > >>>>>> >> > (1) This follows semantic versioning. 
>>>>>> >> >
>>>>>> >> > Fields:
>>>>>> >> >
>>>>>> >> > | Name           | Type                  | Comments |
>>>>>> >> > |----------------|-----------------------|----------|
>>>>>> >> > | column         | utf8                  | (2)      |
>>>>>> >> > | key            | utf8 not null         | (3)      |
>>>>>> >> > | value          | VALUE_SCHEMA not null |          |
>>>>>> >> > | is_approximate | bool not null         | (4)      |
>>>>>> >> >
>>>>>> >> > (2) If null, then the statistic applies to the entire table.
>>>>>> >> >     It's for "row_count".
>>>>>> >> > (3) We'll provide pre-defined keys such as "max", "min",
>>>>>> >> >     "byte_width" and "distinct_count", but users can also use
>>>>>> >> >     application-specific keys.
>>>>>> >> > (4) If true, then the value is approximate or best-effort.
>>>>>> >> >
>>>>>> >> > VALUE_SCHEMA is a dense union with members:
>>>>>> >> >
>>>>>> >> > | Name    | Type    |
>>>>>> >> > |---------|---------|
>>>>>> >> > | int64   | int64   |
>>>>>> >> > | uint64  | uint64  |
>>>>>> >> > | float64 | float64 |
>>>>>> >> > | binary  | binary  |
>>>>>> >> >
>>>>>> >> > If a column is an int32 column, it uses int64 for
>>>>>> >> > "max"/"min". We don't provide all types here. Users should
>>>>>> >> > use a compatible type (int64 for an int32 column) instead.
>>>>>> >> > ----
>>>>>> >> >
>>>>>> >> >
>>>>>> >> > Thanks,
>>>>>> >> > --
>>>>>> >> > kou
>>>>>> >> >
>>>>>> >> >
>>>>>> >> > In <20240522.113708.2023905028549001143....@clear-code.com>
>>>>>> >> >   "[DISCUSS] Statistics through the C data interface" on Wed, 22 May 2024 11:37:08 +0900 (JST),
>>>>>> >> >   Sutou Kouhei <k...@clear-code.com> wrote:
>>>>>> >> >
>>>>>> >> >> Hi,
>>>>>> >> >>
>>>>>> >> >> We're discussing how to provide statistics through the C
>>>>>> >> >> data interface at:
>>>>>> >> >> https://github.com/apache/arrow/issues/38837
>>>>>> >> >>
>>>>>> >> >> If you're interested in this feature, could you share your
>>>>>> >> >> comments?
>>>>>> >> >> >>>>>> >> >> >>>>>> >> >> Motivation: >>>>>> >> >> >>>>>> >> >> We can interchange Apache Arrow data by the C data interface >>>>>> >> >> in the same process. For example, we can pass Apache Arrow >>>>>> >> >> data read by Apache Arrow C++ (provider) to DuckDB >>>>>> >> >> (consumer) through the C data interface. >>>>>> >> >> >>>>>> >> >> A provider may know Apache Arrow data statistics. For >>>>>> >> >> example, a provider can know statistics when it reads Apache >>>>>> >> >> Parquet data because Apache Parquet may provide statistics. >>>>>> >> >> >>>>>> >> >> But a consumer can't know statistics that are known by a >>>>>> >> >> producer. Because there isn't a standard way to provide >>>>>> >> >> statistics through the C data interface. If a consumer can >>>>>> >> >> know statistics, it can process Apache Arrow data faster >>>>>> >> >> based on statistics. >>>>>> >> >> >>>>>> >> >> >>>>>> >> >> Proposal: >>>>>> >> >> >>>>>> >> >> https://github.com/apache/arrow/issues/38837#issuecomment-2123728784 >>>>>> >> >> >>>>>> >> >> How about providing statistics as a metadata in ArrowSchema? >>>>>> >> >> >>>>>> >> >> We reserve "ARROW" namespace for internal Apache Arrow use: >>>>>> >> >> >>>>>> >> >> >>>>>> >> >>>>>> https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata >>>>>> >> >> >>>>>> >> >>> The ARROW pattern is a reserved namespace for internal >>>>>> >> >>> Arrow use in the custom_metadata fields. For example, >>>>>> >> >>> ARROW:extension:name. >>>>>> >> >> >>>>>> >> >> So we can use "ARROW:statistics" for the metadata key. >>>>>> >> >> >>>>>> >> >> We can represent statistics as a ArrowArray like ADBC does. 
>>>>>> >> >>
>>>>>> >> >> Here is an example ArrowSchema for a record batch
>>>>>> >> >> that has "int32 column1" and "string column2":
>>>>>> >> >>
>>>>>> >> >> ArrowSchema {
>>>>>> >> >>   .format = "+siu",
>>>>>> >> >>   .metadata = {
>>>>>> >> >>     "ARROW:statistics" => ArrowArray*, /* table-level statistics such as row count */
>>>>>> >> >>   },
>>>>>> >> >>   .children = {
>>>>>> >> >>     ArrowSchema {
>>>>>> >> >>       .name = "column1",
>>>>>> >> >>       .format = "i",
>>>>>> >> >>       .metadata = {
>>>>>> >> >>         "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
>>>>>> >> >>       },
>>>>>> >> >>     },
>>>>>> >> >>     ArrowSchema {
>>>>>> >> >>       .name = "column2",
>>>>>> >> >>       .format = "u",
>>>>>> >> >>       .metadata = {
>>>>>> >> >>         "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
>>>>>> >> >>       },
>>>>>> >> >>     },
>>>>>> >> >>   },
>>>>>> >> >> }
>>>>>> >> >>
>>>>>> >> >> The metadata value (the ArrowArray* part) of '"ARROW:statistics"
>>>>>> >> >> => ArrowArray*' is a base-10 string of the address of the
>>>>>> >> >> ArrowArray, because metadata values can only be strings.
>>>>>> >> >> You can't release the statistics ArrowArray*. (Its
>>>>>> >> >> release is a no-op function.) It follows
>>>>>> >> >> https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation
>>>>>> >> >> semantics. (The base ArrowSchema owns the statistics
>>>>>> >> >> ArrowArray*.)
>>>>>> >> >>
>>>>>> >> >> The ArrowArray* for statistics uses the following schema:
>>>>>> >> >>
>>>>>> >> >> | Field Name     | Field Type              | Comments |
>>>>>> >> >> |----------------|-------------------------|----------|
>>>>>> >> >> | key            | string not null         | (1)      |
>>>>>> >> >> | value          | `VALUE_SCHEMA` not null |          |
>>>>>> >> >> | is_approximate | bool not null           | (2)      |
>>>>>> >> >>
>>>>>> >> >> 1. We'll provide pre-defined keys such as "max", "min",
>>>>>> >> >>    "byte_width" and "distinct_count", but users can also use
>>>>>> >> >>    application-specific keys.
>>>>>> >> >>
>>>>>> >> >> 2. If true, then the value is approximate or best-effort.
>>>>>> >> >>
>>>>>> >> >> VALUE_SCHEMA is a dense union with members:
>>>>>> >> >>
>>>>>> >> >> | Field Name | Field Type                       | Comments |
>>>>>> >> >> |------------|----------------------------------|----------|
>>>>>> >> >> | int64      | int64                            |          |
>>>>>> >> >> | uint64     | uint64                           |          |
>>>>>> >> >> | float64    | float64                          |          |
>>>>>> >> >> | value      | The same type as the ArrowSchema | (3)      |
>>>>>> >> >> |            | it belongs to.                   |          |
>>>>>> >> >>
>>>>>> >> >> 3. If the ArrowSchema's type is string, this type is also string.
>>>>>> >> >>
>>>>>> >> >> TODO: Is "value" a good name? If we refer to it from the
>>>>>> >> >> top-level statistics schema, we need to use
>>>>>> >> >> "value.value". It's a bit strange...
>>>>>> >> >>
>>>>>> >> >> What do you think about this proposal? Could you share your
>>>>>> >> >> comments?
>>>>>> >> >>
>>>>>> >> >>
>>>>>> >> >> Thanks,
>>>>>> >> >> --
>>>>>> >> >> kou