Hi,

Thanks for sharing your opinions. There is no objection to
removing the C data interface limitation.
I opened a new PR that focuses only on the statistics schema:

* PR: https://github.com/apache/arrow/pull/45058
* Preview: http://crossbow.voltrondata.com/pr_docs/45058/format/StatisticsSchema.html

It still includes the original motivation (passing
statistics through the C data/stream interface for query
planning) but it isn't limited to the C data interface. It
also includes "this is the canonical schema to represent
statistics about an Arrow dataset as Arrow data" provided by
David.

There are some TODOs in the specification. If you have any
opinions on them, please share them.

1. Should we standardize the field names of the dense union
   that is used for statistics values?

   See also:

   * The related part in the PR:
     https://github.com/apache/arrow/pull/45058/files#diff-516092add42944da6a0cd1a601014c5b89a41a61098adb57536e79c105211f4fR136-R150
   * The discussion:
     https://github.com/apache/arrow/pull/43553/files#r1871755575

   I think that we don't need to standardize them because
   we can access dense union values via type codes, not
   names.

   If there aren't any opinions, I will not standardize
   these field names.

2. Should we use int64 instead of float64 for approximate
   max_byte_width?

   See also:

   * The related part in the PR:
     https://github.com/apache/arrow/pull/45058/files#diff-516092add42944da6a0cd1a601014c5b89a41a61098adb57536e79c105211f4fR199-R202
   * The discussion:
     https://github.com/apache/arrow/pull/43553/files#r1871759234

   We can use ceil(float_max_byte_width) for int64
   approximate max_byte_width.

   If there aren't any opinions... I don't have a strong
   opinion on this... Hmm. I will use float64.

There are examples for record batches but not for arrays
for now. I'll add examples for arrays later...


Thanks,
--
kou

In <20241212.102712.700919417179383555....@clear-code.com>
  "[DISCUSS] Arrow array representation of statistics" on Thu, 12 Dec 2024 10:27:12 +0900 (JST),
  Sutou Kouhei <k...@clear-code.com> wrote:

> Hi,
>
> I want to discuss Arrow array representation of statistics
> and usable contexts of it.
>
> Background:
>
> We discussed how to pass statistics through the C data
> interface:
>
> * [DISCUSS] Statistics through the C data interface
>   https://lists.apache.org/thread/z0jz2bnv61j7c6lbk7lympdrs49f69cx
> * [VOTE] Statistics through the C data interface
>   https://lists.apache.org/thread/rsw3wsyj68dksc98s5rpdp6dn8hfk0yd
> * GH-38837: [Format] Add the specification to pass
>   statistics through the Arrow C data interface
>   https://github.com/apache/arrow/pull/43553
>
> The latest proposal is that we standardize schema for Arrow
> array that represents statistics. See the above PR for
> details.
>
> I think that the proposed approach is the best approach for
> the C data interface. But I'm not sure whether the approach
> is the best approach for other contexts such as IPC format,
> Flight, ADBC and so on. So the latest proposal limits its
> target to only the C data interface.
>
> But there are comments that can we standardize this approach
> for all contexts including the C data interface?
> I want to discuss this in this thread.
>
> Here are related comments so far:
>
> * https://github.com/apache/arrow/pull/43553/files#r1871749972
> * https://github.com/apache/arrow/pull/43553/files#r1704373291
> * https://github.com/apache/arrow/pull/43553/files#r1871757604
>
>
> Could you share your opinions?
>
>
> If we can remove the C data interface only limitation, I'll
> open a new PR for it.
>
>
> Thanks,
> --
> kou
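
For reference, the two TODO points in the reply (reading dense
union values via type codes rather than field names, and
ceil(float_max_byte_width) as an int64 value) can be sketched
roughly as follows. This is only an illustration in plain
Python, not the pyarrow API and not the schema in the PR: the
DenseUnion class, the type code constants and the helper
function name are all hypothetical.

```python
import math

# Hypothetical type codes for this sketch only; the real ones
# are whatever the statistics schema in the PR defines.
TYPE_CODE_INT64 = 0
TYPE_CODE_FLOAT64 = 1


class DenseUnion:
    """Toy model of a dense union layout (not the pyarrow API)."""

    def __init__(self, type_ids, offsets, children):
        self.type_ids = type_ids  # one type code per slot
        self.offsets = offsets    # index into the selected child per slot
        self.children = children  # type code -> child values, no field names

    def value(self, i):
        # The child is selected by type code only; a field name is
        # never consulted.
        type_code = self.type_ids[i]
        return self.children[type_code][self.offsets[i]]


def approximate_max_byte_width_as_int64(float_max_byte_width):
    # Round up so that the integer value never understates the width.
    return math.ceil(float_max_byte_width)


values = DenseUnion(
    type_ids=[TYPE_CODE_INT64, TYPE_CODE_FLOAT64, TYPE_CODE_INT64],
    offsets=[0, 0, 1],
    children={TYPE_CODE_INT64: [29, 0], TYPE_CODE_FLOAT64: [2.5]},
)
```

In this toy layout, values.value(1) selects the float64 child
through its type code alone, which is why the child field names
would not need to be standardized for access to work.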