I want to +1 on what Dewey is saying here and add some comments.

Sutou Kouhei wrote:
> ADBC may be a bit larger to use only for transmitting statistics. ADBC has
> statistics related APIs but it has more other APIs.

It's impossible to keep the responsibilities of communication protocols
cleanly separated, but IMO we should strive to keep the C Data Interface
more of a transport protocol than an application protocol. Statistics are
application dependent and can complicate the implementation of
importers/exporters, which would hinder the adoption of the C Data
Interface.

Statistics also bring in security concerns that are application-specific.
E.g. can an algorithm trust min/max stats and risk producing incorrect
results if the statistics are incorrect? That's a question that can't
really be answered at the C Data Interface level.

The need for more sophisticated statistics only grows with time, so there
is no such thing as a "simple statistics schema".

Protocols that produce/consume statistics might want to use the C Data
Interface as a primitive for passing Arrow arrays of statistics. ADBC
might be too big of a leap in complexity now, but "we just need C Data
Interface + statistics" is unlikely to remain true for very long as
projects grow in complexity.

--
Felipe

On Thu, May 23, 2024 at 9:57 AM Dewey Dunnington
<de...@voltrondata.com.invalid> wrote:
>
> Thank you for the background! I understand that these statistics are
> important for query planning; however, I am not sure that I follow why
> we are constrained to the ArrowSchema to represent them. The examples
> given seem to be going through Python...would it be easier to request
> statistics at a higher level of abstraction? There would already need
> to be a separate mechanism to request an ArrowArrayStream with
> statistics (unless the PyCapsule `requested_schema` argument would
> suffice).
>
> > ADBC may be a bit larger to use only for transmitting
> > statistics. ADBC has statistics related APIs but it has more
> > other APIs.
>
> Some examples of producers given in the linked threads (Delta Lake,
> Arrow Dataset) are well-suited to being wrapped by an ADBC driver. One
> can implement an ADBC driver without defining all the methods (where
> the producer could call AdbcConnectionGetStatistics(), although
> AdbcStatementGetStatistics() might be more relevant here and doesn't
> exist). One example listed (using an Arrow Table as a source) seems a
> bit light to wrap in an ADBC driver; however, it would not take much
> code to do so, and the overhead of getting the reader via ADBC is
> something like 100 microseconds (tested via the ADBC R package's
> "monkey driver", which wraps an existing stream as a statement). In any
> case, the bulk of the code is building the statistics array.
>
> > How about the following schema for the
> > statistics ArrowArray? It's based on ADBC.
>
> Whatever format for statistics is decided on, I imagine it should be
> exactly the same as the ADBC standard? (Perhaps pushing changes
> upstream if needed?)
>
> On Thu, May 23, 2024 at 3:21 AM Sutou Kouhei <k...@clear-code.com> wrote:
> >
> > Hi,
> >
> > > Why not simply pass the statistics ArrowArray separately in your
> > > producer API of choice
> >
> > It seems that we should use that approach because all
> > feedback said so. How about the following schema for the
> > statistics ArrowArray? It's based on ADBC.
> >
> > | Field Name               | Field Type            | Comments |
> > |--------------------------|-----------------------|----------|
> > | column_name              | utf8                  | (1)      |
> > | statistic_key            | utf8 not null         | (2)      |
> > | statistic_value          | VALUE_SCHEMA not null |          |
> > | statistic_is_approximate | bool not null         | (3)      |
> >
> > 1. If null, then the statistic applies to the entire table.
> > It's for "row_count". > > 2. We'll provide pre-defined keys such as "max", "min", > > "byte_width" and "distinct_count" but users can also use > > application specific keys. > > 3. If true, then the value is approximate or best-effort. > > > > VALUE_SCHEMA is a dense union with members: > > > > | Field Name | Field Type | > > |------------|------------| > > | int64 | int64 | > > | uint64 | uint64 | > > | float64 | float64 | > > | binary | binary | > > > > If a column is an int32 column, it uses int64 for > > "max"/"min". We don't provide all types here. Users should > > use a compatible type (int64 for a int32 column) instead. > > > > > > Thanks, > > -- > > kou > > > > In <a3ce5e96-176c-4226-9d74-6a458317a...@python.org> > > "Re: [DISCUSS] Statistics through the C data interface" on Wed, 22 May > > 2024 17:04:57 +0200, > > Antoine Pitrou <anto...@python.org> wrote: > > > > > > > > Hi Kou, > > > > > > I agree that Dewey that this is overstretching the capabilities of the > > > C Data Interface. In particular, stuffing a pointer as metadata value > > > and decreeing it immortal doesn't sound like a good design decision. > > > > > > Why not simply pass the statistics ArrowArray separately in your > > > producer API of choice (Dewey mentioned ADBC but it is of course just > > > a possible API among others)? > > > > > > Regards > > > > > > Antoine. > > > > > > > > > Le 22/05/2024 à 04:37, Sutou Kouhei a écrit : > > >> Hi, > > >> We're discussing how to provide statistics through the C > > >> data interface at: > > >> https://github.com/apache/arrow/issues/38837 > > >> If you're interested in this feature, could you share your > > >> comments? > > >> Motivation: > > >> We can interchange Apache Arrow data by the C data interface > > >> in the same process. For example, we can pass Apache Arrow > > >> data read by Apache Arrow C++ (provider) to DuckDB > > >> (consumer) through the C data interface. > > >> A provider may know Apache Arrow data statistics. For > > >> example, a provider can know statistics when it reads Apache > > >> Parquet data because Apache Parquet may provide statistics. > > >> But a consumer can't know statistics that are known by a > > >> producer. Because there isn't a standard way to provide > > >> statistics through the C data interface. If a consumer can > > >> know statistics, it can process Apache Arrow data faster > > >> based on statistics. > > >> Proposal: > > >> https://github.com/apache/arrow/issues/38837#issuecomment-2123728784 > > >> How about providing statistics as a metadata in ArrowSchema? > > >> We reserve "ARROW" namespace for internal Apache Arrow use: > > >> https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata > > >> > > >>> The ARROW pattern is a reserved namespace for internal > > >>> Arrow use in the custom_metadata fields. For example, > > >>> ARROW:extension:name. > > >> So we can use "ARROW:statistics" for the metadata key. > > >> We can represent statistics as a ArrowArray like ADBC does. 
> > >> Here is an example ArrowSchema that is for a record batch
> > >> that has "int32 column1" and "string column2":
> > >>
> > >> ArrowSchema {
> > >>   .format = "+siu",
> > >>   .metadata = {
> > >>     "ARROW:statistics" => ArrowArray*, /* table-level statistics
> > >>                                           such as row count */
> > >>   },
> > >>   .children = {
> > >>     ArrowSchema {
> > >>       .name = "column1",
> > >>       .format = "i",
> > >>       .metadata = {
> > >>         "ARROW:statistics" => ArrowArray*, /* column-level statistics
> > >>                                               such as count distinct */
> > >>       },
> > >>     },
> > >>     ArrowSchema {
> > >>       .name = "column2",
> > >>       .format = "u",
> > >>       .metadata = {
> > >>         "ARROW:statistics" => ArrowArray*, /* column-level statistics
> > >>                                               such as count distinct */
> > >>       },
> > >>     },
> > >>   },
> > >> }
> > >>
> > >> The metadata value (the ArrowArray* part) of '"ARROW:statistics"
> > >> => ArrowArray*' is a base 10 string of the address of the
> > >> ArrowArray, because we can only use strings for metadata
> > >> values. You can't release the statistics ArrowArray*. (Its
> > >> release is a no-op function.) It follows
> > >> https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation
> > >> semantics. (The base ArrowSchema owns the statistics
> > >> ArrowArray*.)
> > >>
> > >> The ArrowArray* for statistics uses the following schema:
> > >>
> > >> | Field Name     | Field Type              | Comments |
> > >> |----------------|-------------------------|----------|
> > >> | key            | string not null         | (1)      |
> > >> | value          | `VALUE_SCHEMA` not null |          |
> > >> | is_approximate | bool not null           | (2)      |
> > >>
> > >> 1. We'll provide pre-defined keys such as "max", "min",
> > >>    "byte_width" and "distinct_count", but users can also use
> > >>    application-specific keys.
> > >> 2. If true, then the value is approximate or best-effort.
> > >>
> > >> VALUE_SCHEMA is a dense union with members:
> > >>
> > >> | Field Name | Field Type                       | Comments |
> > >> |------------|----------------------------------|----------|
> > >> | int64      | int64                            |          |
> > >> | uint64     | uint64                           |          |
> > >> | float64    | float64                          |          |
> > >> | value      | The same type as the ArrowSchema | (3)      |
> > >> |            | that it belongs to.              |          |
> > >>
> > >> 3. If the ArrowSchema's type is string, this type is also string.
> > >>
> > >> TODO: Is "value" a good name? If we refer to it from the
> > >> top-level statistics schema, we need to use
> > >> "value.value". It's a bit strange...
> > >>
> > >> What do you think about this proposal? Could you share your
> > >> comments?
> > >>
> > >> Thanks,
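
For concreteness, here is a minimal sketch (assuming pyarrow; the statistic
keys and the concrete values are illustrative only, not part of the proposal)
of building a statistics batch that follows the ADBC-based schema quoted above:

import pyarrow as pa

# Dense union for statistic values, mirroring the proposed VALUE_SCHEMA.
value_type = pa.dense_union([
    pa.field("int64", pa.int64()),
    pa.field("uint64", pa.uint64()),
    pa.field("float64", pa.float64()),
    pa.field("binary", pa.binary()),
])

# Three example statistics, all stored in the int64 union member: a
# table-level row count plus "max" and "distinct_count" for "column1".
type_codes = pa.array([0, 0, 0], type=pa.int8())      # union member used per row
value_offsets = pa.array([0, 1, 2], type=pa.int32())  # offsets into the int64 child
statistic_value = pa.UnionArray.from_dense(
    type_codes,
    value_offsets,
    [
        pa.array([3, 100, 2], type=pa.int64()),
        pa.array([], type=pa.uint64()),
        pa.array([], type=pa.float64()),
        pa.array([], type=pa.binary()),
    ],
    field_names=["int64", "uint64", "float64", "binary"],
)

# Note: the "not null" constraints from the proposed schema are not
# enforced in this sketch.
statistics = pa.RecordBatch.from_arrays(
    [
        pa.array([None, "column1", "column1"], type=pa.utf8()),  # null => table-level
        pa.array(["row_count", "max", "distinct_count"]),
        statistic_value,
        pa.array([False, False, True]),
    ],
    names=[
        "column_name",
        "statistic_key",
        "statistic_value",
        "statistic_is_approximate",
    ],
)

A producer could then hand such a batch to a consumer separately from the
data itself, for example through its own API or as an additional C Data
Interface export, which is the approach suggested in this thread.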