Re:[Proposal] Support approximate TopN function for Doris

陈明雨 Sun, 27 Sep 2020 18:44:43 -0700

Hi Wenbo:


Nice proposal! Waiting for your PR!.
I have a question that the TopN result is shown as Json?




--

此致！Best Regards
陈明雨 Mingyu Chen

Email:
chenmin...@apache.org





At 2020-09-27 10:36:38, "杨文波" <yangwenbo_mail...@163.com> wrote:
>Is your feature request related to a problem? Please describe.
>It is a common scenario for a database to calculate top-n from a dataset, sort 
>a column and return the Top elements. TopN could uses an approximation 
>algorithms to quickly return results with little overhead.
>
>There are many algorithms to quickly calculate frequent items from the data 
>set, among which the SpaceSaving algorithm has a good performance for 
>reference [1]. Kylin and PostgresSQL have implemented TopN approximation 
>algorithms based on this algorithm and have performed well. We can implement 
>this algorithm in Doris to compute TopN quickly.
>
>Describe the solution you'd like
>Kylin's current implementation computes the topN of a measure column based on 
>dimensional columns, such like group by dimension_cols order by measure_column 
>limit 10 (refer 
>https://kylin.apache.org/blog/2016/03/19/approximate-topn-measure/) while 
>PostgresSQL uses the topN function to computes the frequent entries of a 
>column (refer https://github.com/citusdata/postgresql-topn). 
>
>For the current Doris, the lack of UDTF support makes it difficult to 
>calculate topN on dimension columns and measure column like Kylin. So it is 
>more appropriate to calculate topN using an aggregate function like 
>PostgresSQL, and present the results in JSON format.
>
>eg:
>
>CREATE TABLE popular_ad
>(
>  event_date date NOT NULL ,
>  ad_id BIGINT NOY NULL
>);
>
>
>For example, we can calculate the ad_id that is displayed or clicked most per 
>day.
>
>select event_date, topn(ad_id, 1)  as top_1 from popular_ad group by 
>event_date;
>
>               event_date         |      top_1
>----------------------------------+--------------------
>2020-01-01                        |  {"xxxx":"10000"}
>2020-01-02                        |  {"xxx":"11111"}
>
>
>The result represents the top1 item and its corresponding frequency.
>
>In the future, with Doris supporting UDTF, we can display item and frequency 
>based on this function, or topN for the SUM aggregate type column group by key 
>columns
>
>Additional context
>[1] Efficient Computation of Frequent and Top-k Elements in Data Streams by 
>Metwally, Agrawal, and Abbadi.

Re:[Proposal] Support approximate TopN function for Doris

Reply via email to