Re: Duplicate tasks for the same query

2020-01-07 Thread RKandoji
Hi Kurt,
Thanks for the additional info.
RK
On Sun, Jan 5, 2020 at 8:33 PM Kurt Young wrote:
> Another common skew case we've seen is null handling, the value of the
> join key
> is NULL. We will shuffle the NULL value into one task even if the join
> condition
> won't stand by definition.

Re: Duplicate tasks for the same query

2020-01-05 Thread Kurt Young
Another common skew case we've seen is null handling, where the value of the join key is NULL. We shuffle all NULL values into one task even though, by definition, the join condition can never hold for them. For DeDuplication, I just want to make sure this behavior meets your requirement. Because for some other usage
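To make the NULL-key skew concrete, here is a minimal plain-Java sketch (no Flink dependency). It assumes a simple hash-modulo partitioner; `partitionFor` and `countPerPartition` are hypothetical helpers for illustration and are not Flink's actual key-group assignment.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

final class NullKeySkewSketch {

    // Hypothetical stand-in for "shuffle by join key": hash the key, take it modulo
    // the parallelism. Objects.hashCode(null) is 0, so every NULL key lands in partition 0.
    static int partitionFor(String joinKey, int parallelism) {
        return Math.floorMod(Objects.hashCode(joinKey), parallelism);
    }

    // Counts how many input rows each partition would receive.
    static int[] countPerPartition(List<String> joinKeys, int parallelism) {
        int[] counts = new int[parallelism];
        for (String key : joinKeys) {
            counts[partitionFor(key, parallelism)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Four of the seven rows have a NULL join key; all four go to the same partition.
        List<String> keys = Arrays.asList("u1", null, "u2", null, null, "u3", null);
        System.out.println(Arrays.toString(countPerPartition(keys, 4)));
    }
}
```

In this toy model the partition that receives the NULL keys does extra work that can never produce a join result, which matches the skew pattern described above.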

Re: Duplicate tasks for the same query

2020-01-05 Thread RKandoji
Hi Kurt, I understand what you mean: some userIds may appear more frequently than others, but this distribution doesn't look proportionate to the data skew. Can you think of any other possible reasons, or anything I can try in order to investigate this further? For DeDuplication, I query for the la

Re: Duplicate tasks for the same query

2020-01-03 Thread Kurt Young
Hi RKandoji, It looks like you have a data skew issue with your input data. Some, or maybe only one, "userId" appears more frequently than others. For the join operator to work correctly, Flink will apply "shuffle by join key" before the operator, so the same "userId" will go to the same sub-task to perform j
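The effect of "shuffle by join key" on a skewed input can be sketched in plain Java. This is only an illustration under stated assumptions: `subtaskFor` is a hypothetical hash-modulo router, not Flink's real partitioner, and the userId values are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ShuffleByKeySketch {

    // Hypothetical router: the same userId always maps to the same sub-task index.
    static int subtaskFor(String userId, int parallelism) {
        return Math.floorMod(userId.hashCode(), parallelism);
    }

    // Number of records each sub-task receives after the shuffle.
    static long[] loadPerSubtask(List<String> userIds, int parallelism) {
        long[] load = new long[parallelism];
        for (String userId : userIds) {
            load[subtaskFor(userId, parallelism)]++;
        }
        return load;
    }

    public static void main(String[] args) {
        List<String> userIds = new ArrayList<>();
        // One "hot" userId dominates the input; the remaining ids appear once each.
        for (int i = 0; i < 1000; i++) {
            userIds.add("hotUser");
        }
        for (int i = 0; i < 10; i++) {
            userIds.add("user" + i);
        }
        // The sub-task that owns "hotUser" receives at least 1000 of the 1010 records.
        System.out.println(Arrays.toString(loadPerSubtask(userIds, 4)));
    }
}
```

Raising the parallelism does not help in this model, because every record of the hot key still hashes to the same sub-task.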

Re: Duplicate tasks for the same query

2020-01-03 Thread RKandoji
Hi, Thanks a ton for the help with my earlier questions. I updated my code to version 1.9 and started using the Blink planner (DeDuplication). This is working as expected! I have a new question, but thought of asking in the same email chain as this has more context about my use case etc. Workflow: Current

Re: Duplicate tasks for the same query

2020-01-03 Thread RKandoji
Thanks!
On Thu, Jan 2, 2020 at 9:45 PM Jingsong Li wrote:
> Yes,
>
> 1.9.2 or Coming soon 1.10
>
> Best,
> Jingsong Lee
>
> On Fri, Jan 3, 2020 at 12:43 AM RKandoji wrote:
>
>> Ok thanks, does it mean version 1.9.2 is what I need to use?
>>
>> On Wed, Jan 1, 2020 at 10:25 PM Jingsong Li
>> wro

Re: Duplicate tasks for the same query

2020-01-02 Thread Jingsong Li
Yes, 1.9.2 or the upcoming 1.10.
Best,
Jingsong Lee
On Fri, Jan 3, 2020 at 12:43 AM RKandoji wrote:
> Ok thanks, does it mean version 1.9.2 is what I need to use?
>
> On Wed, Jan 1, 2020 at 10:25 PM Jingsong Li
> wrote:
>
>> Blink planner was introduced in 1.9. We recommend use blink planner af

Re: Duplicate tasks for the same query

2020-01-02 Thread RKandoji
Ok thanks, does it mean version 1.9.2 is what I need to use?
On Wed, Jan 1, 2020 at 10:25 PM Jingsong Li wrote:
> Blink planner was introduced in 1.9. We recommend use blink planner after
> 1.9.
> After some bug fix, I think the latest version of 1.9 is OK. The
> production environment has also

Re: Duplicate tasks for the same query

2020-01-01 Thread Jingsong Li
The Blink planner was introduced in 1.9, and we recommend using it from 1.9 onward. After some bug fixes, I think the latest 1.9 release is OK. It has also been deployed to production in some places.
Best,
Jingsong Lee
On Wed, Jan 1, 2020 at 3:24 AM RKandoji wrote:
> Thanks Jingsong and Kur

Re: Duplicate tasks for the same query

2019-12-31 Thread RKandoji
Thanks Jingsong and Kurt for more details. Yes, I'm planning to try out DeDuplication when I'm done upgrading to version 1.9. Hopefully deduplication is done by only one task and reused everywhere else. One more follow-up question, I see "For production use cases, we recommend the old planner tha

Re: Duplicate tasks for the same query

2019-12-30 Thread Kurt Young
BTW, you could also deduplicate the user table more efficiently by using the Top-N feature [1].
Best,
Kurt
[1] https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/table/sql.html#top-n
On Tue, Dec 31, 2019 at 9:24 AM Jingsong Li wrote:
> Hi RKandoji,
>
> In theory, you

Re: Duplicate tasks for the same query

2019-12-30 Thread Jingsong Li
Hi RKandoji, In theory, you don't need to do anything. First, the optimizer will try to eliminate duplicate nodes during optimization. Second, after SQL optimization, if the optimized plan still has duplicate nodes, the planner will automatically reuse them. There are config options to control whether we should reu

Re: Duplicate tasks for the same query

2019-12-30 Thread RKandoji
Thanks Terry and Jingsong,
I'm currently on version 1.8, using the Flink planner for stream processing. I'll switch to version 1.9 to try out the Blink planner. Could you please point me to any examples (Java preferred) using SubplanReuser?
Thanks,
RK
On Sun, Dec 29, 2019 at 11:32 PM Jingsong Li wrote:

Re: Duplicate tasks for the same query

2019-12-29 Thread Jingsong Li
Hi RKandoji, FYI: Blink planner subplan reuse [1], available in 1.9.
       Join                      Join
     /      \                  /      \
 Filter1  Filter2          Filter1  Filter2
    |        |        =>       \     /
 Project1 Project2            Project1
    |        |

Re: Duplicate tasks for the same query

2019-12-29 Thread Terry Wang
Hi RKandoji~ Could you provide more info about your POC environment? Stream or batch? Flink planner or Blink planner? AFAIK, the Blink planner includes some optimizations to deal with such duplicate tasks for the same query. You can give the Blink planner a try: https://ci.apache.org/projects/flink/flin

Fwd: Duplicate tasks for the same query

2019-12-29 Thread RKandoji
Hi Team, I'm doing a POC with Flink to understand if it's a good fit for my use case. As part of the process, I need to filter duplicate items, and I created the query below to get only the latest records based on timestamp. For instance, I have a "Users" table which may contain multiple messages for the
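The requirement described here (keep only the latest record per user, based on timestamp) can be sketched in plain Java to pin down the expected semantics. This is not the query from the original message, which is cut off in this archive; the `UserRecord` class and its `userId`, `timestamp` and `payload` fields are hypothetical names chosen for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LatestPerUserSketch {

    // Hypothetical row type standing in for one message in the "Users" table.
    static final class UserRecord {
        final String userId;
        final long timestamp;
        final String payload;

        UserRecord(String userId, long timestamp, String payload) {
            this.userId = userId;
            this.timestamp = timestamp;
            this.payload = payload;
        }
    }

    // Keeps, for each userId, only the record with the greatest timestamp.
    // On equal timestamps the record seen later in the input wins.
    static Map<String, UserRecord> latestPerUser(List<UserRecord> records) {
        Map<String, UserRecord> latest = new HashMap<>();
        for (UserRecord record : records) {
            latest.merge(record.userId, record,
                    (old, cur) -> cur.timestamp >= old.timestamp ? cur : old);
        }
        return latest;
    }

    public static void main(String[] args) {
        List<UserRecord> records = Arrays.asList(
                new UserRecord("u1", 1L, "a"),
                new UserRecord("u1", 5L, "b"),
                new UserRecord("u1", 3L, "c"),
                new UserRecord("u2", 2L, "d"));
        // prints "b": the u1 record with the highest timestamp
        System.out.println(latestPerUser(records).get("u1").payload);
    }
}
```

Each input record is inspected once and only one record per user is retained, which is the behaviour a streaming deduplication is expected to reproduce incrementally.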