[ 
https://issues.apache.org/jira/browse/FLINK-12173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17906076#comment-17906076
 ] 

lincoln lee commented on FLINK-12173:
-------------------------------------

Thanks [~yiyutian] for your contribution!

As I understand it, this is a performance optimization for deduplication, 
i.e., an optimization of the runtime operator implementation that does not 
necessarily require changes at the plan optimization stage.

Have any benchmarks (e.g., Nexmark) been run with the current solution? 
Concrete numbers on the benefit would help us confirm the effectiveness of 
the optimization.

Additionally, the implementation needs to consider both the streaming and 
batch scenarios. For streaming in particular, the choice of algorithm has to 
account for the kind of input (append-only vs. updating).
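
To illustrate why the input kind matters, here is a minimal sketch in plain 
Python (not Flink code; the function names are hypothetical, and only the 
change kinds +I/-U/+U/-D follow Flink's changelog notation). On append-only 
input, keep-first-row only needs one "seen" entry per key. On updating input, 
a row can be retracted, so the operator has to count occurrences per key and 
emit a delete when the last one goes away.

```python
def distinct_append_only(rows):
    """Keep-first-row on append-only input: emit a row the first time it is seen."""
    seen = set()
    out = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            out.append(("+I", row))
    return out


def distinct_updating(changes):
    """Distinct on updating input: count occurrences so retractions can be handled."""
    counts = {}
    out = []
    for kind, row in changes:
        if kind in ("+I", "+U"):
            counts[row] = counts.get(row, 0) + 1
            if counts[row] == 1:
                # First live occurrence of this row: it enters the result.
                out.append(("+I", row))
        elif kind in ("-U", "-D"):
            if counts.get(row, 0) == 0:
                # Retraction for a row we never saw: ignored in this sketch.
                continue
            counts[row] -= 1
            if counts[row] == 0:
                # Last live occurrence is gone: the row leaves the result.
                del counts[row]
                out.append(("-D", row))
    return out
```

The first variant never emits a retraction and keeps less state, which is 
where the expected gain over a count-based approach would come from. It is 
only correct when the input is guaranteed to be append-only.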

> Optimize "SELECT DISTINCT" into Deduplicate with keep first row
> ---------------------------------------------------------------
>
>                 Key: FLINK-12173
>                 URL: https://issues.apache.org/jira/browse/FLINK-12173
>             Project: Flink
>          Issue Type: New Feature
>          Components: Table SQL / Planner
>            Reporter: Jark Wu
>            Assignee: Yiyu Tian
>            Priority: Major
>              Labels: pull-request-available
>
> The following distinct query can be optimized into deduplicate on keys "a, b, 
> c, d" and keep the first row.
> {code:sql}
> SELECT DISTINCT a, b, c, d;
> {code}
> We can optimize this query into Deduplicate to get better performance than 
> GroupAggregate.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
