Hi Kenta,

Yes, I think it only makes sense to implement this in the context of
the query engine project. Here's a list of assorted thoughts about it:

* I have been mentally planning to follow the Vectorwise-type query
engine architecture that's discussed in [1] [2] and many other
academic papers. I believe this is how some other current generation
open source columnar query engines work, such as Dremio [3] and DuckDB
[4][5].
* Hash (aka "group") aggregations need to be able to process arbitrary
expressions, not only a plain input column. So it's not enough to be
able to compute "sum(x) group by y" where "x" and "y" are fields in a
RecordBatch, we need to be able to compute "$AGG_FUNC($EXPR) GROUP BY
$GROUP_EXPR_1, $GROUP_EXPR_2, ..." where $EXPR / $GROUP_EXPR_1 / ...
are any column expressions computing from the input relations (keep in
mind that an aggregation could apply to stream of record batches
produced by a join). In any case, expression evaluation is a
closely-related task and should be implemented ASAP.
* Hash aggregation functions themselves should probably be introduced
as a new Function type in arrow::compute. I don't think it would be
appropriate to reuse the existing "SCALAR_AGGREGATE" functions;
instead we should introduce a new HASH_AGGREGATE function type that
accepts the input data to be aggregated along with an array of
pre-computed bucket ids (computed by probing the hash table). So
rather than Update(state, args) like we have for scalar aggregates,
the primary interface for group aggregation is
Update(state, bucket_ids, args) -- see the rough sketch after this
list.
* The HashAggregation operator should be able to process an arbitrary
iterator of record batches
* We will probably want to adapt an existing concurrent hash table or
implement a new one so that aggregations can be performed in parallel
without requiring a post-aggregation merge step
* There's some general support machinery needed for hashing multiple
fields and then doing efficient vectorized hash table probes (to
assign an aggregation bucket id to each row position) -- also
sketched below
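
To make the HASH_AGGREGATE idea concrete, here is a rough sketch.
Every name in it (GroupedAggregator, GroupedSum, Resize, and so on)
is invented for illustration and is not an existing arrow::compute
API, and plain std::vector stands in for Arrow arrays. The only point
is the shape of the interface: unlike a SCALAR_AGGREGATE's
Update(state, args), a hash aggregate kernel also receives one
pre-computed bucket id per input row.

    #include <cstdint>
    #include <vector>

    // Illustration only: the per-group state lives in the object;
    // Update takes the pre-computed bucket ids plus the values to
    // aggregate.
    class GroupedAggregator {
     public:
      virtual ~GroupedAggregator() = default;
      // Make room for newly observed groups.
      virtual void Resize(int64_t num_groups) = 0;
      // bucket_ids[i] is the group that row i of `values` belongs
      // to, as assigned by the hash table probe.
      virtual void Update(const std::vector<int64_t>& bucket_ids,
                          const std::vector<double>& values) = 0;
      // One output value per group.
      virtual std::vector<double> Finalize() = 0;
    };

    // Example kernel: grouped sum.
    class GroupedSum : public GroupedAggregator {
     public:
      void Resize(int64_t num_groups) override {
        sums_.resize(num_groups, 0.0);
      }
      void Update(const std::vector<int64_t>& bucket_ids,
                  const std::vector<double>& values) override {
        for (size_t i = 0; i < values.size(); ++i) {
          sums_[bucket_ids[i]] += values[i];
        }
      }
      std::vector<double> Finalize() override { return sums_; }

     private:
      std::vector<double> sums_;
    };

A real kernel would be templated over the input type and handle
nulls, but the state / bucket_ids / args split is the part that
matters here.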
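
And an equally rough sketch of the other half: assigning a bucket id
to each row by probing a hash table keyed on the group expressions,
then driving that over a stream of batches. Again, every name is made
up, the group key is collapsed to a single int64 column,
std::unordered_map stands in for the vectorized / concurrent hash
table we would actually want, and in the real operator the keys and
values would come from evaluating $GROUP_EXPR_i and $EXPR against
each incoming record batch.

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    // Map each row's key to a dense group id, assigning fresh ids to
    // keys seen for the first time.
    std::vector<int64_t> AssignBucketIds(
        const std::vector<int64_t>& keys,
        std::unordered_map<int64_t, int64_t>* key_to_group) {
      std::vector<int64_t> bucket_ids(keys.size());
      for (size_t i = 0; i < keys.size(); ++i) {
        auto result = key_to_group->try_emplace(
            keys[i], static_cast<int64_t>(key_to_group->size()));
        bucket_ids[i] = result.first->second;
      }
      return bucket_ids;
    }

    // One incoming batch, already reduced to pre-evaluated group-key
    // and aggregate-argument columns.
    struct MiniBatch {
      std::vector<int64_t> group_keys;
      std::vector<double> values;
    };

    // Consume an arbitrary stream of batches (e.g. the output of a
    // join) and produce one sum per group, in first-seen order.
    std::vector<double> HashAggregateSum(
        const std::vector<MiniBatch>& stream) {
      std::unordered_map<int64_t, int64_t> key_to_group;
      std::vector<double> sums;
      for (const auto& batch : stream) {
        std::vector<int64_t> bucket_ids =
            AssignBucketIds(batch.group_keys, &key_to_group);
        // Grow the output to cover any newly observed groups.
        sums.resize(key_to_group.size(), 0.0);
        for (size_t i = 0; i < batch.values.size(); ++i) {
          sums[bucket_ids[i]] += batch.values[i];
        }
      }
      return sums;
    }

The concurrent hash table is what would let several threads run this
loop on different batches against a shared (or partitioned) key ->
group id mapping without a post-aggregation merge step.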

I think it is worth investing the effort to build something that is
reasonably consistent with the "state of the art" in database systems
(at least according to what we are able to build with our current
resources) rather than building something more crude that has to be
replaced with a new implementation later.

I'd like to help personally with this work (particularly since the
natural next step with my recent work in arrow/compute is to implement
expression evaluation) but I won't have significant bandwidth for it
until later this month or early September. If someone feels that they
sufficiently understand the state of the art for this type of workload
and wants to help with laying down the abstract C++ APIs for
Volcano-style query execution and an implementation of hash
aggregation, that sounds great.

Thanks,
Wes

[1]: https://www.vldb.org/pvldb/vol11/p2209-kersten.pdf
[2]: https://github.com/TimoKersten/db-engine-paradigms
[3]: https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/sabot/op/aggregate/hash
[4]: https://github.com/cwida/duckdb/blob/master/src/include/duckdb/execution/aggregate_hashtable.hpp
[5]: https://github.com/cwida/duckdb/blob/master/src/execution/aggregate_hashtable.cpp

On Wed, Aug 5, 2020 at 10:23 AM Kenta Murata <m...@mrkn.jp> wrote:
>
> Hi folks,
>
> Red Arrow, the Ruby binding of Arrow GLib, implements grouped aggregation
> features for RecordBatch and Table.  Because these features are written in
> Ruby, they are too slow for large size data.  We need to make them much
> faster.
>
> To improve their calculation speed, they should be written in C++, and
> should be put in Arrow C++ instead of Red Arrow.
>
> Is anyone working on implementing group-by operation for RecordBatch and
> Table in Arrow C++?  If no one has worked on it, I would like to try it.
>
> By the way, I found that the grouped aggregation feature is mentioned in
> the design document of Arrow C++ Query Engine.  Is Query Engine, not Arrow
> C++ Core, a suitable location to implement group-by operation?
