COUNT(DISTINCT varargs...) can be used either as a scalar aggregate
function or as a group aggregate function. For example,

SELECT COUNT(DISTINCT expr1, expr2, ...) FROM TABLE;

returns a single value. It can also be used with GROUP BY to produce a
distinct count per group. I think it would be useful to have it available
as a scalar aggregate function. Either way, it is good to know that our
aggregation exprs will need to support varargs.

SELECT DISTINCT is equivalent to our Unique. So

SELECT DISTINCT expr1, expr2, ... FROM TABLE;

could be implemented by internally grouping the exprs into a StructArray
and calling Unique on that struct array. We could also simply call the
aggregation machinery with no aggregate exprs.

We might want to make some Jira issues for the above if there are not
already.

On Fri, Jun 18, 2021 at 4:37 PM Ian Cook <i...@ursacomputing.com> wrote:
>
> > Aren't SELECT DISTINCT and COUNT DISTINCT just condensed variants of a
> > GROUP BY query? Do they need to be exposed as standalone kernels?
>
> I listed SELECT DISTINCT and COUNT DISTINCT in the document only as
> examples of SQL statements that take a variable number of arguments,
> not to imply that these should be exposed as compute kernels in Arrow.
> But I think you are right to suggest that they do not really belong in
> this list, because as you say it is probably best to think of them as
> shortcut SQL syntax for obtaining results that could instead be
> obtained through a GROUP BY query. I have removed them.
>
> Thank you,
> Ian
>
> On Fri, Jun 18, 2021 at 2:26 PM Antoine Pitrou <anto...@python.org> wrote:
> >
> >
> > Aren't SELECT DISTINCT and COUNT DISTINCT just condensed variants of a
> > GROUP BY query? Do they need to be exposed as standalone kernels?
> >
> >
> > On 18/06/2021 at 00:58, Ian Cook wrote:
> > > Arrow developers,
> > >
> > > A couple of recent PRs have added new variadic scalar kernels to the
> > > Arrow C++ library (ARROW-12751, ARROW-12709).
> > > There were some questions raised in comments on Jira and GitHub
> > > about whether these could instead be implemented as unary or binary
> > > kernels that take ListArray or StructArray input. Since I believe we
> > > plan to add at least a few more variadic kernels, I wrote a document
> > > [1] with help from some colleagues at Ursa to describe the rationale
> > > behind why we believe it is best to implement these as variadic
> > > kernels. Feedback is welcome.
> > >
> > > Thank you,
> > > Ian
> > >
> > > [1]
> > > https://docs.google.com/document/d/1ExysJ43WpjZ_P6vnfx6dzCRSvM-3qlqpc6gPjy9cNXM/
> > >
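
For reference, here is a rough sketch of the ideas in the reply at the top of
this message, in plain Python rather than Arrow. Everything in it is a
stand-in: tuples play the role of StructArray rows, the helper names
(unique_structs, count_distinct, grouped_count_distinct) are made up for
illustration and are not Arrow APIs, and NULL handling is ignored entirely.

```python
def unique_structs(*columns):
    """Zip the columns into per-row tuples (a stand-in for a StructArray)
    and return the distinct rows in first-seen order (a stand-in for Unique).
    This is also what a GROUP BY with no aggregate exprs would return:
    just the group keys."""
    seen = {}
    for row in zip(*columns):
        seen.setdefault(row, None)  # dicts preserve insertion order
    return list(seen)


def count_distinct(*columns):
    """Scalar aggregate: COUNT(DISTINCT c1, c2, ...) over the whole table."""
    return len(unique_structs(*columns))


def grouped_count_distinct(keys, *columns):
    """Group aggregate: COUNT(DISTINCT c1, c2, ...) ... GROUP BY keys."""
    per_group = {}
    for key, row in zip(keys, zip(*columns)):
        per_group.setdefault(key, set()).add(row)
    return {key: len(rows) for key, rows in per_group.items()}


# Hypothetical example data: three columns of a five-row table.
region = ["east", "east", "west", "west", "east"]
city = ["NYC", "NYC", "SF", "SF", "NYC"]
year = [2020, 2020, 2020, 2021, 2021]

distinct_rows = unique_structs(city, year)   # SELECT DISTINCT city, year
total = count_distinct(city, year)           # SELECT COUNT(DISTINCT city, year)
per_region = grouped_count_distinct(region, city, year)  # ... GROUP BY region
```

Both aggregate helpers take a variable number of columns, which is the varargs
requirement mentioned above. The scalar form is just the grouped form with a
single implicit group.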