> > What is the parallel-list means?

Something like:
table RecordBatch {
  nodes: [FieldNode];
  // Statistics related to the data represented by each FieldNode
  // This field is either length=0 or has the same length as nodes.
  statistics: [Statistic];
}

On Wed, Feb 17, 2021 at 8:34 PM Kohei KaiGai <kai...@heterodb.com> wrote:

> Thanks for the clarification.
>
> > There is key-value metadata available on Message which might be able to
> > work in the short term (some sort of encoded message). I think
> > standardizing how we store statistics per batch does make sense.
> >
> For example, JSON array of min/max values as a key-value metadata
> in the Footer->Schema->Fields[]->custom_metadata?
> Even though the metadata field must be less than INT_MAX, I think it
> is enough portable and not invasive way.
>
> > We unfortunately can't add anything to field-node without breaking
> > compatibility. But another option would be to add a new structure as a
> > parallel list on RecordBatch itself.
> >
> > If we do add a new structure or arbitrary key-value pair we should not use
> > KeyValue but should have something where the values can be bytes.
> >
> What is the parallel-list means?
> If we would have a standardized binary structure, like DictionaryBatch,
> to store the statistics including min/max values, it exactly makes sense
> more than text-encoded key-value metadata, of course.
>
> Best regards,
>
> On Thu, Feb 18, 2021 at 12:37, Micah Kornfield <emkornfi...@gmail.com> wrote:
> >
> > There is key-value metadata available on Message which might be able to
> > work in the short term (some sort of encoded message). I think
> > standardizing how we store statistics per batch does make sense.
> >
> > We unfortunately can't add anything to field-node without breaking
> > compatibility. But another option would be to add a new structure as a
> > parallel list on RecordBatch itself.
> >
> > If we do add a new structure or arbitrary key-value pair we should not use
> > KeyValue but should have something where the values can be bytes.
> >
> > On Wed, Feb 17, 2021 at 7:17 PM Kohei KaiGai <kai...@heterodb.com> wrote:
> >
> > > Hello,
> > >
> > > Does Apache Arrow have any standard way to embed min/max values of the fields
> > > per record-batch basis?
> > > It looks FieldNode supports neither dedicated min/max attribute nor
> > > custom-metadata.
> > > https://github.com/apache/arrow/blob/master/format/Message.fbs#L28
> > >
> > > If we embed an array of min/max values into the custom-metadata of the
> > > Field-node,
> > > we may be able to implement.
> > > https://github.com/apache/arrow/blob/master/format/Schema.fbs#L344
> > >
> > > What I like to implement is something like BRIN index at PostgreSQL.
> > > http://heterodb.github.io/pg-strom/brin/
> > >
> > > This index contains only min/max values for a particular block ranges, and query
> > > executor can skip blocks that obviously don't contain the target data.
> > > If we can skip 9990 of 10000 record batch by checking metadata on a query that
> > > tries to fetch items in very narrow timestamps, it is a great acceleration more than
> > > full file scans.
> > >
> > > Best regards,
> > > --
> > > HeteroDB, Inc / The PG-Strom Project
> > > KaiGai Kohei <kai...@heterodb.com>
> > >
> >
>
> --
> HeteroDB, Inc / The PG-Strom Project
> KaiGai Kohei <kai...@heterodb.com>
>
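
To make the "parallel list" idea in this thread concrete, here is a minimal sketch in plain Python. It does not use the Arrow libraries and is not an Arrow API. The names `Statistic`, `RecordBatch`, `make_batch`, `may_contain`, `scan` and `encode_stats_json` are hypothetical and only mirror the shape of the FlatBuffers snippet above: a `statistics` list that is either empty or has one entry per field. It assumes integer columns and non-empty batches, and it also shows the JSON key-value variant discussed as a short-term option.

```python
import json
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Statistic:
    """Hypothetical per-field statistics entry (min/max only)."""
    min_value: int
    max_value: int


@dataclass
class RecordBatch:
    """Toy stand-in for a record batch; NOT the Arrow structure."""
    columns: List[List[int]]      # one list of values per field
    statistics: List[Statistic]   # empty, or same length as `columns`


def make_batch(columns: List[List[int]]) -> RecordBatch:
    # Assumes every column is non-empty, so min()/max() are defined.
    stats = [Statistic(min(c), max(c)) for c in columns]
    return RecordBatch(columns, stats)


def may_contain(batch: RecordBatch, field: int, lo: int, hi: int) -> bool:
    """Return False only when the statistics prove no value lies in [lo, hi]."""
    if not batch.statistics:
        return True  # no statistics recorded: the batch cannot be skipped
    s = batch.statistics[field]
    return s.max_value >= lo and s.min_value <= hi


def scan(batches: List[RecordBatch], field: int,
         lo: int, hi: int) -> Tuple[List[int], int]:
    """Collect matching values and count how many batches were actually read."""
    hits: List[int] = []
    scanned = 0
    for b in batches:
        if not may_contain(b, field, lo, hi):
            continue  # BRIN-style skip based on min/max alone
        scanned += 1
        hits.extend(v for v in b.columns[field] if lo <= v <= hi)
    return hits, scanned


def encode_stats_json(batch: RecordBatch) -> str:
    """Text-encoded alternative: a JSON array usable as a key-value metadata value."""
    return json.dumps(
        [{"min": s.min_value, "max": s.max_value} for s in batch.statistics]
    )


# 100 batches of 100 consecutive integers each, one field per batch.
batches = [make_batch([list(range(i * 100, (i + 1) * 100))]) for i in range(100)]
hits, scanned = scan(batches, field=0, lo=250, hi=260)
print(len(hits), scanned)
```

With sorted or clustered data such as timestamps, a narrow range touches only a few batches, which is the acceleration described in the original question. A batch with an empty `statistics` list is always read, matching the "length=0" case in the proposed schema comment.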