> I think it's fair not to mention any other Arrow-like transport mechanism
> since the benefits of transporting the statistics as an Arrow array are
> less clear right now.

Once we (or applications) start thinking about more advanced statistics,
such as compressed histograms and sketch data structures, the benefits of
using a columnar format (Arrow itself!) to represent the data will be
clearer.

--
Felipe

On Fri, Dec 13, 2024 at 11:52 PM Dewey Dunnington <dewey.dunning...@gmail.com> wrote:

> First of all, thank you for driving this proposal! I don't think there's
> anything particularly bad or wrong about mentioning the C data interface in
> the title of the document...my initial comment was mostly a reaction to the
> fact that most of the content of the proposal is describing this schema,
> which I think is a reflection that the schema itself is an extremely
> valuable (and orthogonal) thing to agree on!
>
> > Or in other words, we can just say, "this is the canonical schema to
> > represent statistics about an Arrow dataset as Arrow data", without
> > defining anything about how or where to use it.
>
> +1!
>
> > I think it's still useful to have context/examples of why we are
> > motivated to define this (and note that these are examples only), which may
> > use C Data Interface or something as an example
>
> +1! I think it's OK to use a C Data interface example as the only example
> (as a reflection that it's the thing that makes the most sense):
>
> A file reader or table-like object providing an API using the Arrow C
> Stream interface (e.g., `MyObjectExportAsArrayStream(MyObject* object,
> ArrowArrayStream* out)`) may wish to also provide a parallel function
> (e.g., `MyObjectGetStatistics(MyObject* object, ArrowArrayStream* out)`)
> allowing the caller to inspect the properties of the object to perform the
> appropriate query planning or memory allocation.
>
> I think it's fair not to mention any other Arrow-like transport mechanism
> since the benefits of transporting the statistics as an Arrow array are
> less clear right now.
>
> Cheers,
>
> -dewey
>
> On Wed, Dec 11, 2024 at 7:48 PM David Li <lidav...@apache.org> wrote:
>
> > I think the feedback is more along the lines of: we can just standardize a
> > representation of statistics, without referencing where it's used (C Data
> > Interface or otherwise). So people are free to use it wherever they want,
> > whether C Data Interface or IPC or somewhere else. At the same time, we are
> > not saying that we are going to (for example) embed this into IPC files as
> > the official way to have Parquet-like statistics. It is simply an agreed
> > upon schema for interoperability, and the details of how it is passed
> > around are up to the application. (At least for now.)
> >
> > Or in other words, we can just say, "this is the canonical schema to
> > represent statistics about an Arrow dataset as Arrow data", without
> > defining anything about how or where to use it.
> >
> > I think it's still useful to have context/examples of why we are motivated
> > to define this (and note that these are examples only), which may use C
> > Data Interface or something as an example, but others may disagree.
> >
> > On Thu, Dec 12, 2024, at 10:27, Sutou Kouhei wrote:
> > > Hi,
> > >
> > > I want to discuss Arrow array representation of statistics
> > > and usable contexts of it.
> > >
> > > Background:
> > >
> > > We discussed how to pass statistics through the C data
> > > interface:
> > >
> > > * [DISCUSS] Statistics through the C data interface
> > >   https://lists.apache.org/thread/z0jz2bnv61j7c6lbk7lympdrs49f69cx
> > > * [VOTE] Statistics through the C data interface
> > >   https://lists.apache.org/thread/rsw3wsyj68dksc98s5rpdp6dn8hfk0yd
> > > * GH-38837: [Format] Add the specification to pass
> > >   statistics through the Arrow C data interface
> > >   https://github.com/apache/arrow/pull/43553
> > >
> > > The latest proposal is that we standardize schema for Arrow
> > > array that represents statistics. See the above PR for
> > > details.
> > >
> > > I think that the proposed approach is the best approach for
> > > the C data interface. But I'm not sure whether the approach
> > > is the best approach for other contexts such as IPC format,
> > > Flight, ADBC and so on. So the latest proposal limits its
> > > target to only the C data interface.
> > >
> > > But there are comments that can we standardize this approach
> > > for all contexts including the C data interface?
> > > I want to discuss this in this thread.
> > >
> > > Here are related comments so far:
> > >
> > > * https://github.com/apache/arrow/pull/43553/files#r1871749972
> > > * https://github.com/apache/arrow/pull/43553/files#r1704373291
> > > * https://github.com/apache/arrow/pull/43553/files#r1871757604
> > >
> > >
> > > Could you share your opinions?
> > >
> > >
> > > If we can remove the C data interface only limitation, I'll
> > > open a new PR for it.
> > >
> > >
> > > Thanks,
> > > --
> > > kou
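
The parallel-function idea in Dewey's message (`MyObjectGetStatistics` next to `MyObjectExportAsArrayStream`) could look roughly like the C sketch below. This is only an illustration under assumptions, not part of the proposal: `MyObject` and its `num_rows` field are hypothetical, the `Stats*` callbacks are placeholder names, and the statistics schema itself is deliberately not built (`get_schema` returns `ENOSYS`, and `get_next` reports end of stream immediately). The `ArrowArray` and `ArrowArrayStream` layouts are the ones from the Arrow C data and C stream interface specifications.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Defined by the Arrow C data interface; only used through a pointer here. */
struct ArrowSchema;

/* Layout from the Arrow C data interface specification. */
struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

/* Layout from the Arrow C stream interface specification. */
struct ArrowArrayStream {
  int (*get_schema)(struct ArrowArrayStream*, struct ArrowSchema* out);
  int (*get_next)(struct ArrowArrayStream*, struct ArrowArray* out);
  const char* (*get_last_error)(struct ArrowArrayStream*);
  void (*release)(struct ArrowArrayStream*);
  void* private_data;
};

/* Hypothetical table-like object; a real one would hold data and metadata. */
typedef struct MyObject {
  int64_t num_rows;
} MyObject;

/* Placeholder: a real implementation would export the agreed statistics schema. */
static int StatsGetSchema(struct ArrowArrayStream* stream, struct ArrowSchema* out) {
  (void)stream;
  (void)out;
  return ENOSYS;
}

/* Placeholder: signals end of stream at once by yielding a released array. */
static int StatsGetNext(struct ArrowArrayStream* stream, struct ArrowArray* out) {
  (void)stream;
  assert(out != NULL);
  out->release = NULL;
  return 0;
}

static const char* StatsGetLastError(struct ArrowArrayStream* stream) {
  (void)stream;
  return "statistics schema export is not implemented in this sketch";
}

/* The object is borrowed, not owned, so release only marks the stream released. */
static void StatsRelease(struct ArrowArrayStream* stream) {
  stream->release = NULL;
}

/* Parallel to MyObjectExportAsArrayStream(): same output type, different content. */
int MyObjectGetStatistics(MyObject* object, struct ArrowArrayStream* out) {
  if (object == NULL || out == NULL) return EINVAL;
  out->get_schema = StatsGetSchema;
  out->get_next = StatsGetNext;
  out->get_last_error = StatsGetLastError;
  out->release = StatsRelease;
  out->private_data = object;
  return 0;
}
```

A caller would consume this stream the same way as the data stream, which is the point of reusing `ArrowArrayStream`: no new transport type is needed, only agreement on the schema of the arrays it yields.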