Hi, You can promise that well-known int32 statistic keys won't ever be higher than a certain value (2^18) [1] like TCP IP ports (well-known ports in [0, 2^10)) but for non-standard statistics from open-source products the key=0 combined with string label is the way to go, otherwise collisions would be inevitable and everyone would hate us for having integer keys.
This is not a very weird proposal from my part because integer keys representing labels are common in most low-level standardized C APIs (e.g. linux syscalls, ioctls, OpenGL, Vulcan...). I expect higher level APIs to map all these keys to strings, but with them we keep the "C Data Interface" low-level and portable as it should be. -- Felipe [1] 2^16 is too small. For instance, OpenGL constants can't be enums because C limits enum to 2^16 and that is *not enough*. On Thu, Jun 20, 2024 at 7:43 AM Sutou Kouhei <[email protected]> wrote: > Hi, > > Here is an updated summary so far: > > ---- > Use cases: > > * Optimize query plan: e.g. JOIN for DuckDB > > Out of scope: > > * Transmit statistics through not the C data interface > Examples: > * Transmit statistics through Apache Arrow IPC file > * Transmit statistics through Apache Arrow Flight > * Multi-column statistics > * Constraints information > * Indexes information > > Discussing approach: > > Standardize Apache Arrow schema for statistics and transmit > statistics via separated API call that uses the C data > interface. > > This also works for per-batch statistics. > > Candidate schema: > > map< > // The column index or null if the statistics refer to whole table or > batch. > column: int32, > // Statistics key is int32. > // Different keys are assigned for exact value and > // approximate value. > map<int32, dense_union<...needed types based on stat kinds in the > keys...>> > > > > Discussions: > > 1. Can we use int32 for statistic keys? > Should we use utf8 (or dictionary<int32, utf8>) for > statistic keys? > 2. Hot to support non-standard (vendor-specific) > statistic keys? > ---- > > Here is my idea: > > 1. We can use int32 for statistic keys. > 2. We can reserve a specific range for non-standard > statistic keys. Prerequisites of this: > * There is no use case to merge some statistics for > the same data. > * We can't merge statistics for different data. > > If the prerequisites aren't satisfied: > > 1. We should use utf8 (or dictionary<int32, utf8>) for > statistic keys? > 2. We can use reserved prefix such as "ARROW:"/"arrow." for > standard statistic keys or use prefix such as > "vendor1:"/"vendor1." for non-standard statistic keys. > > Here is Felipe's idea: > https://lists.apache.org/thread/gr2nmlrwr7d5wkz3zgq6vy5q0ow8xof2 > > 1. We can use int32 for statistic keys. > 2. We can use the special statistic key + a string identifier > for non-standard statistic keys. > > > What do you think about this? > > > Thanks, > -- > kou > > In <[email protected]> > "Re: [DISCUSS] Statistics through the C data interface" on Thu, 06 Jun > 2024 18:27:27 +0900 (JST), > Sutou Kouhei <[email protected]> wrote: > > > Hi, > > > > Thanks for sharing your comments. Here is a summary so far: > > > > ---- > > > > Use cases: > > > > * Optimize query plan: e.g. JOIN for DuckDB > > > > Out of scope: > > > > * Transmit statistics through not the C data interface > > Examples: > > * Transmit statistics through Apache Arrow IPC file > > * Transmit statistics through Apache Arrow Flight > > > > Candidate approaches: > > > > 1. Pass statistics (encoded as an Apache Arrow data) via > > ArrowSchema metadata > > * This embeds statistics address into metadata > > * It's for avoiding using Apache Arrow IPC format with > > the C data interface > > 2. Embed statistics (encoded as an Apache Arrow data) into > > ArrowSchema metadata > > * This adds statistics to metadata in Apache Arrow IPC > > format > > 3. Embed statistics (encoded as JSON) into ArrowArray > > metadata > > 4. Standardize Apache Arrow schema for statistics and > > transmit statistics via separated API call that uses the > > C data interface > > 5. Use ADBC > > > > ---- > > > > I think that 4. is the best approach in these candidates. > > > > 1. Embedding statistics address is tricky. > > 2. Consumers need to parse Apache Arrow IPC format data. > > (The C data interface consumers may not have the > > feature.) > > 3. This will work but 4. is more generic. > > 5. ADBC is too large to use only for statistics. > > > > What do you think about this? > > > > > > If we select 4., we need to standardize Apache Arrow schema > > for statistics. How about the following schema? > > > > ---- > > Metadata: > > > > | Name | Value | Comments | > > |----------------------------|-------|--------- | > > | ARROW::statistics::version | 1.0.0 | (1) | > > > > (1) This follows semantic versioning. > > > > Fields: > > > > | Name | Type | Comments | > > |----------------|-----------------------| -------- | > > | column | utf8 | (2) | > > | key | utf8 not null | (3) | > > | value | VALUE_SCHEMA not null | | > > | is_approximate | bool not null | (4) | > > > > (2) If null, then the statistic applies to the entire table. > > It's for "row_count". > > (3) We'll provide pre-defined keys such as "max", "min", > > "byte_width" and "distinct_count" but users can also use > > application specific keys. > > (4) If true, then the value is approximate or best-effort. > > > > VALUE_SCHEMA is a dense union with members: > > > > | Name | Type | > > |---------|---------| > > | int64 | int64 | > > | uint64 | uint64 | > > | float64 | float64 | > > | binary | binary | > > > > If a column is an int32 column, it uses int64 for > > "max"/"min". We don't provide all types here. Users should > > use a compatible type (int64 for a int32 column) instead. > > ---- > > > > > > Thanks, > > -- > > kou > > > > > > In <[email protected]> > > "[DISCUSS] Statistics through the C data interface" on Wed, 22 May > 2024 11:37:08 +0900 (JST), > > Sutou Kouhei <[email protected]> wrote: > > > >> Hi, > >> > >> We're discussing how to provide statistics through the C > >> data interface at: > >> https://github.com/apache/arrow/issues/38837 > >> > >> If you're interested in this feature, could you share your > >> comments? > >> > >> > >> Motivation: > >> > >> We can interchange Apache Arrow data by the C data interface > >> in the same process. For example, we can pass Apache Arrow > >> data read by Apache Arrow C++ (provider) to DuckDB > >> (consumer) through the C data interface. > >> > >> A provider may know Apache Arrow data statistics. For > >> example, a provider can know statistics when it reads Apache > >> Parquet data because Apache Parquet may provide statistics. > >> > >> But a consumer can't know statistics that are known by a > >> producer. Because there isn't a standard way to provide > >> statistics through the C data interface. If a consumer can > >> know statistics, it can process Apache Arrow data faster > >> based on statistics. > >> > >> > >> Proposal: > >> > >> https://github.com/apache/arrow/issues/38837#issuecomment-2123728784 > >> > >> How about providing statistics as a metadata in ArrowSchema? > >> > >> We reserve "ARROW" namespace for internal Apache Arrow use: > >> > >> > https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata > >> > >>> The ARROW pattern is a reserved namespace for internal > >>> Arrow use in the custom_metadata fields. For example, > >>> ARROW:extension:name. > >> > >> So we can use "ARROW:statistics" for the metadata key. > >> > >> We can represent statistics as a ArrowArray like ADBC does. > >> > >> Here is an example ArrowSchema that is for a record batch > >> that has "int32 column1" and "string column2": > >> > >> ArrowSchema { > >> .format = "+siu", > >> .metadata = { > >> "ARROW:statistics" => ArrowArray*, /* table-level statistics such > as row count */ > >> }, > >> .children = { > >> ArrowSchema { > >> .name = "column1", > >> .format = "i", > >> .metadata = { > >> "ARROW:statistics" => ArrowArray*, /* column-level statistics > such as count distinct */ > >> }, > >> }, > >> ArrowSchema { > >> .name = "column2", > >> .format = "u", > >> .metadata = { > >> "ARROW:statistics" => ArrowArray*, /* column-level statistics > such as count distinct */ > >> }, > >> }, > >> }, > >> } > >> > >> The metadata value (ArrowArray* part) of '"ARROW:statistics" > >> => ArrowArray*' is a base 10 string of the address of the > >> ArrowArray. Because we can use only string for metadata > >> value. You can't release the statistics ArrowArray*. (Its > >> release is a no-op function.) It follows > >> > https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation > >> semantics. (The base ArrowSchema owns statistics > >> ArrowArray*.) > >> > >> > >> ArrowArray* for statistics use the following schema: > >> > >> | Field Name | Field Type | Comments | > >> |----------------|----------------------------------| -------- | > >> | key | string not null | (1) | > >> | value | `VALUE_SCHEMA` not null | | > >> | is_approximate | bool not null | (2) | > >> > >> 1. We'll provide pre-defined keys such as "max", "min", > >> "byte_width" and "distinct_count" but users can also use > >> application specific keys. > >> > >> 2. If true, then the value is approximate or best-effort. > >> > >> VALUE_SCHEMA is a dense union with members: > >> > >> | Field Name | Field Type | Comments | > >> |------------|----------------------------------| -------- | > >> | int64 | int64 | | > >> | uint64 | uint64 | | > >> | float64 | float64 | | > >> | value | The same type of the ArrowSchema | (3) | > >> | | that is belonged to. | | > >> > >> 3. If the ArrowSchema's type is string, this type is also string. > >> > >> TODO: Is "value" good name? If we refer it from the > >> top-level statistics schema, we need to use > >> "value.value". It's a bit strange... > >> > >> > >> What do you think about this proposal? Could you share your > >> comments? > >> > >> > >> Thanks, > >> -- > >> kou >
