Hi Wenbo:
Nice proposal! Waiting for your PR!. I have a question that the TopN result is shown as Json? -- 此致!Best Regards 陈明雨 Mingyu Chen Email: chenmin...@apache.org At 2020-09-27 10:36:38, "杨文波" <yangwenbo_mail...@163.com> wrote: >Is your feature request related to a problem? Please describe. >It is a common scenario for a database to calculate top-n from a dataset, sort >a column and return the Top elements. TopN could uses an approximation >algorithms to quickly return results with little overhead. > >There are many algorithms to quickly calculate frequent items from the data >set, among which the SpaceSaving algorithm has a good performance for >reference [1]. Kylin and PostgresSQL have implemented TopN approximation >algorithms based on this algorithm and have performed well. We can implement >this algorithm in Doris to compute TopN quickly. > >Describe the solution you'd like >Kylin's current implementation computes the topN of a measure column based on >dimensional columns, such like group by dimension_cols order by measure_column >limit 10 (refer >https://kylin.apache.org/blog/2016/03/19/approximate-topn-measure/) while >PostgresSQL uses the topN function to computes the frequent entries of a >column (refer https://github.com/citusdata/postgresql-topn). > >For the current Doris, the lack of UDTF support makes it difficult to >calculate topN on dimension columns and measure column like Kylin. So it is >more appropriate to calculate topN using an aggregate function like >PostgresSQL, and present the results in JSON format. > >eg: > >CREATE TABLE popular_ad >( > event_date date NOT NULL , > ad_id BIGINT NOY NULL >); > > >For example, we can calculate the ad_id that is displayed or clicked most per >day. > >select event_date, topn(ad_id, 1) as top_1 from popular_ad group by >event_date; > > event_date | top_1 >----------------------------------+-------------------- >2020-01-01 | {"xxxx":"10000"} >2020-01-02 | {"xxx":"11111"} > > >The result represents the top1 item and its corresponding frequency. > >In the future, with Doris supporting UDTF, we can display item and frequency >based on this function, or topN for the SUM aggregate type column group by key >columns > >Additional context >[1] Efficient Computation of Frequent and Top-k Elements in Data Streams by >Metwally, Agrawal, and Abbadi.