TableProvider has a statistics method already. The approach that Calcite takes is to include sort order as part of statistics [1], so that could be one approach to consider.
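
To make that concrete, here is a rough sketch of statistics that carry a sort order, plus the check a planner could use to skip a redundant sort. Everything in it is hypothetical: `SortKey`, `TableStatistics`, and `sort_is_redundant` are illustrative names, not DataFusion's or Calcite's actual API. It also assumes that a requested ordering is satisfied whenever it is a prefix of the known ordering.

```rust
/// Hypothetical description of one sort key; not a DataFusion type.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SortKey {
    pub column: String,
    pub descending: bool,
}

impl SortKey {
    /// Ascending sort key on `column`.
    pub fn asc(column: &str) -> Self {
        SortKey { column: column.to_string(), descending: false }
    }

    /// Descending sort key on `column`.
    pub fn desc(column: &str) -> Self {
        SortKey { column: column.to_string(), descending: true }
    }
}

/// Hypothetical statistics struct extended with an optional sort order.
#[derive(Debug, Clone, Default)]
pub struct TableStatistics {
    pub num_rows: Option<usize>,
    /// `None` means the ordering is unknown, not that the data is unsorted.
    pub sort_order: Option<Vec<SortKey>>,
}

/// Returns true when the known ordering already satisfies `required`,
/// i.e. `required` is a prefix of the known sort keys.
pub fn sort_is_redundant(stats: &TableStatistics, required: &[SortKey]) -> bool {
    match &stats.sort_order {
        Some(known) => known.starts_with(required),
        None => false,
    }
}
```

With statistics reporting a sort order of (`time`, `host`), a query that sorts by `time` alone would pass this check and the sort could be dropped, while a sort by `host` alone or by `time` descending would not.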

We may also want to add a method to LogicalPlan for returning the sort order (or statistics) for a particular operator.

[1] https://calcite.apache.org/javadocAggregate/org/apache/calcite/schema/Statistic.html

On Tue, May 11, 2021 at 12:14 PM Andrew Lamb <al...@influxdata.com> wrote:

> I was imagining something known at Query Planning time (e.g if the data you
> are reading in from a parquet file is already sorted by `time` and the
> query calls for sorting by time, the sort can be omitted). In this case, I
> was thinking "how would we communicate this information to DataFusion from
> a TableProvider"
>
> Another usecase for sortedness is if you are merging two parquet files into
> a single sorted output and you want to know the inputs are already sorted,
> you can simply merge the two streams together and save quite a lot of
> processing time and intermediate buffers.
>
> On Tue, May 11, 2021 at 2:01 PM Andy Grove <andygrov...@gmail.com> wrote:
>
> > I had been planning on adding a method to DataFusion's execution plan to
> > indicate the sort-order of the results (if known), similar to how we
> > currently have information about output partitioning.
> >
> > Would this cover your requirement or are you looking for something outside
> > the context of execution plans?
> >
> > On Tue, May 11, 2021 at 11:52 AM Andrew Lamb <al...@influxdata.com> wrote:
> >
> > > We are building a system that will likely make heavy use of sorted data,
> > > and we are trying to figure out how to encode the metadata of "how is this
> > > data sorted". We can certainly use our own custom metadata fields, but
> > > wanted to check for prior art and gauge community interest in adding
> > > something to Arrow. More details are on [1].
> > >
> > > Recording sort-order in Schema would likely be useful for DataFusion as
> > > well (to optimize away redundant computation if the data is already sorted
> > > or pick more efficient algorithms (e.g. a MERGING grouping operator).
> > >
> > > I didn't see any obvious prior art on the mailing list [2] or in JIRA
> > > [3][4] so I figured I would ask if others had any backstory or other
> > > reactions.
> > >
> > > Thank you
> > > Andrew
> > >
> > > [1] https://github.com/apache/arrow-rs/issues/284
> > > [2] https://lists.apache.org/list.html?dev@arrow.apache.org:lte=1y:sort
> > > [3]
> > > https://issues.apache.org/jira/browse/ARROW-12087?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20summary%20~%20sort%20ORDER%20BY%20created%20DESC
> > > [4]
> > > https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20description%20~%20sort%20and%20component%20in%20(format)