[ https://issues.apache.org/jira/browse/FLINK-12173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17906076#comment-17906076 ]
lincoln lee commented on FLINK-12173:
-------------------------------------

Thanks [~yiyutian] for your contribution!

I understand this is a performance optimization for deduplication (that is, an optimization of the runtime operator implementation, which does not necessarily require intervention at the plan optimization stage). Has the current solution been benchmarked (e.g., with Nexmark)? Concrete performance data would help us confirm the effectiveness of the optimization.

Additionally, the implementation needs to consider both the streaming and batch scenarios. For streaming in particular, the choice of algorithm needs to take into account the kind of input (append-only vs. updating).

> Optimize "SELECT DISTINCT" into Deduplicate with keep first row
> ---------------------------------------------------------------
>
>                 Key: FLINK-12173
>                 URL: https://issues.apache.org/jira/browse/FLINK-12173
>             Project: Flink
>          Issue Type: New Feature
>          Components: Table SQL / Planner
>            Reporter: Jark Wu
>            Assignee: Yiyu Tian
>            Priority: Major
>              Labels: pull-request-available
>
> The following distinct query can be optimized into deduplicate on keys "a, b,
> c, d" and keep the first row.
> {code:sql}
> SELECT DISTINCT a, b, c, d;
> {code}
> We can optimize this query into Deduplicate to get a better performance than
> GroupAggregate.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
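
To make the append-only vs. updating distinction raised in the comment concrete, here is a minimal Python sketch. It is not Flink code and does not reflect the actual operator implementations: the function names are hypothetical, and the "+I" / "-D" markers are only borrowed from Flink's changelog notation for illustration. The assumption it illustrates is that with append-only input a keep-first-row deduplication only needs to remember which keys were seen, whereas with updating input some per-key count is needed to know when a distinct row should be retracted.

```python
from collections import defaultdict


def dedup_keep_first(rows):
    """Append-only input: emit each distinct row once, on first arrival.

    State per key is just "seen or not" (hypothetical simplification).
    """
    seen = set()
    out = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            out.append(("+I", row))
    return out


def distinct_with_retraction(changes):
    """Updating input: `changes` is a list of (op, row), op in {"+I", "-D"}.

    A per-key count is kept so that a distinct row is retracted only
    when its last copy is deleted.
    """
    counts = defaultdict(int)
    out = []
    for op, row in changes:
        if op == "+I":
            counts[row] += 1
            if counts[row] == 1:
                out.append(("+I", row))
        elif op == "-D":
            if counts[row] == 0:
                continue  # deletion for an unknown row: ignored in this sketch
            counts[row] -= 1
            if counts[row] == 0:
                del counts[row]
                out.append(("-D", row))
    return out


if __name__ == "__main__":
    r1, r2 = (1, 2, 3, 4), (5, 6, 7, 8)
    print(dedup_keep_first([r1, r1, r2]))
    print(distinct_with_retraction([("+I", r1), ("+I", r1), ("-D", r1), ("-D", r1)]))
```

The first function can never emit a retraction, which is why it is only a valid replacement for DISTINCT when the input itself never retracts rows; the second shows the extra bookkeeping an updating input would seem to require.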