[Proposal] Support approximate TopN function for Doris

杨文波 Sat, 26 Sep 2020 20:32:34 -0700

Is your feature request related to a problem? Please describe.
It is a common scenario for a database to calculate top-n from a dataset, sort 
a column and return the Top elements. TopN could uses an approximation 
algorithms to quickly return results with little overhead.


There are many algorithms to quickly calculate frequent items from the data 
set, among which the SpaceSaving algorithm has a good performance for reference 
[1]. Kylin and PostgresSQL have implemented TopN approximation algorithms based 
on this algorithm and have performed well. We can implement this algorithm in 
Doris to compute TopN quickly.

Describe the solution you'd like
Kylin's current implementation computes the topN of a measure column based on 
dimensional columns, such like group by dimension_cols order by measure_column 
limit 10 (refer 
https://kylin.apache.org/blog/2016/03/19/approximate-topn-measure/) while 
PostgresSQL uses the topN function to computes the frequent entries of a column 
(refer https://github.com/citusdata/postgresql-topn). 

For the current Doris, the lack of UDTF support makes it difficult to calculate 
topN on dimension columns and measure column like Kylin. So it is more 
appropriate to calculate topN using an aggregate function like PostgresSQL, and 
present the results in JSON format.

eg:

CREATE TABLE popular_ad
(
  event_date date NOT NULL ,
  ad_id BIGINT NOY NULL
);


For example, we can calculate the ad_id that is displayed or clicked most per 
day.

select event_date, topn(ad_id, 1)  as top_1 from popular_ad group by event_date;

               event_date         |      top_1
----------------------------------+--------------------
2020-01-01                        |  {"xxxx":"10000"}
2020-01-02                        |  {"xxx":"11111"}


The result represents the top1 item and its corresponding frequency.

In the future, with Doris supporting UDTF, we can display item and frequency 
based on this function, or topN for the SUM aggregate type column group by key 
columns

Additional context
[1] Efficient Computation of Frequent and Top-k Elements in Data Streams by 
Metwally, Agrawal, and Abbadi.

[Proposal] Support approximate TopN function for Doris

Reply via email to