Re: Duplicate tasks for the same query

RKandoji Fri, 03 Jan 2020 10:08:04 -0800

Thanks!

On Thu, Jan 2, 2020 at 9:45 PM Jingsong Li <[email protected]> wrote:


> Yes,
>
> 1.9.2 or Coming soon 1.10
>
> Best,
> Jingsong Lee
>
> On Fri, Jan 3, 2020 at 12:43 AM RKandoji <[email protected]> wrote:
>
>> Ok thanks, does it mean version 1.9.2 is what I need to use?
>>
>> On Wed, Jan 1, 2020 at 10:25 PM Jingsong Li <[email protected]>
>> wrote:
>>
>>> Blink planner was introduced in 1.9. We recommend use blink planner
>>> after 1.9.
>>> After some bug fix, I think the latest version of 1.9 is OK. The
>>> production environment has also been set up in some places.
>>>
>>> Best,
>>> Jingsong Lee
>>>
>>> On Wed, Jan 1, 2020 at 3:24 AM RKandoji <[email protected]> wrote:
>>>
>>>> Thanks Jingsong and Kurt for more details.
>>>>
>>>> Yes, I'm planning to try out DeDuplication when I'm done upgrading to
>>>> version 1.9. Hopefully deduplication is done by only one task and reused
>>>> everywhere else.
>>>>
>>>> One more follow-up question, I see "For production use cases, we
>>>> recommend the old planner that was present before Flink 1.9 for now." 
>>>> warning
>>>> here https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/
>>>> This is actually the reason why started with version 1.8, could you
>>>> please let me know your opinion about this? and do you think there is any
>>>> production code running on version 1.9
>>>>
>>>> Thanks,
>>>> Reva
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Dec 30, 2019 at 9:02 PM Kurt Young <[email protected]> wrote:
>>>>
>>>>> BTW, you could also have a more efficient version of deduplicating
>>>>> user table by using the topn feature [1].
>>>>>
>>>>> Best,
>>>>> Kurt
>>>>>
>>>>> [1]
>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/table/sql.html#top-n
>>>>>
>>>>>
>>>>> On Tue, Dec 31, 2019 at 9:24 AM Jingsong Li <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi RKandoji,
>>>>>>
>>>>>> In theory, you don't need to do something.
>>>>>> First, the optimizer will optimize by doing duplicate nodes.
>>>>>> Second, after SQL optimization, if the optimized plan still has
>>>>>> duplicate nodes, the planner will automatically reuse them.
>>>>>> There are config options to control whether we should reuse plan,
>>>>>> their default value is true. So you don't need modify them.
>>>>>> - table.optimizer.reuse-sub-plan-enabled
>>>>>> - table.optimizer.reuse-source-enabled
>>>>>>
>>>>>> Best,
>>>>>> Jingsong Lee
>>>>>>
>>>>>> On Tue, Dec 31, 2019 at 6:29 AM RKandoji <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks Terry and Jingsong,
>>>>>>>
>>>>>>> Currently I'm on 1.8 version using Flink planner for stream
>>>>>>> proessing, I'll switch to 1.9 version to try out blink planner.
>>>>>>>
>>>>>>> Could you please point me to any examples (Java preferred) using
>>>>>>> SubplanReuser?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> RK
>>>>>>>
>>>>>>> On Sun, Dec 29, 2019 at 11:32 PM Jingsong Li <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi RKandoji,
>>>>>>>>
>>>>>>>> FYI: Blink-planner subplan reusing: [1] 1.9 available.
>>>>>>>>
>>>>>>>>        Join                      Join
>>>>>>>>      /      \                  /      \
>>>>>>>>  Filter1  Filter2          Filter1  Filter2
>>>>>>>>     |        |        =>       \     /
>>>>>>>>  Project1 Project2            Project1
>>>>>>>>     |        |                   |
>>>>>>>>   Scan1    Scan2               Scan1
>>>>>>>>
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://github.com/apache/flink/blob/master/flink-table/flink-table-planner-blink/src/main/scala/org/apache/flink/table/planner/plan/reuse/SubplanReuser.scala
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Jingsong Lee
>>>>>>>>
>>>>>>>> On Mon, Dec 30, 2019 at 12:28 PM Terry Wang <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi RKandoji~
>>>>>>>>>
>>>>>>>>> Could you provide more info about your poc environment?
>>>>>>>>> Stream or batch? Flink planner or blink planner?
>>>>>>>>> AFAIK, blink planner has done some optimization to deal such
>>>>>>>>> duplicate task for one same query. You can have a try with blink 
>>>>>>>>> planner :
>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-master/dev/table/common.html#create-a-tableenvironment
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Terry Wang
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2019年12月30日 03:07，RKandoji <[email protected]> 写道：
>>>>>>>>>
>>>>>>>>> Hi Team,
>>>>>>>>>
>>>>>>>>> I'm doing a POC with flink to understand if it's a good fit for my
>>>>>>>>> use case.
>>>>>>>>>
>>>>>>>>> As part of the process, I need to filter duplicate items and
>>>>>>>>> created below query to get only the latest records based on 
>>>>>>>>> timestamp. For
>>>>>>>>> instance, I have "Users" table which may contain multiple messages 
>>>>>>>>> for the
>>>>>>>>> same "userId". So I wrote below query to get only the latest message 
>>>>>>>>> for a
>>>>>>>>> given "userId"
>>>>>>>>>
>>>>>>>>> Table uniqueUsers = tEnv.sqlQuery("SELECT * FROM Users WHERE
>>>>>>>>> (userId, userTimestamp) IN (SELECT userId, MAX(userTimestamp) FROM 
>>>>>>>>> Users
>>>>>>>>> GROUP BY userId)");
>>>>>>>>>
>>>>>>>>> The above query works as expected and contains only the latest
>>>>>>>>> users based on timestamp.
>>>>>>>>>
>>>>>>>>> The issue is when I use "uniqueUsers" table multiple times in a
>>>>>>>>> JOIN operation, I see multiple tasks in the flink dashboard for the 
>>>>>>>>> same
>>>>>>>>> query that is creating "uniqueUsers" table. It is simply creating as 
>>>>>>>>> many
>>>>>>>>> tasks as many times I'm using the table.
>>>>>>>>>
>>>>>>>>> Below is the JOIN query.
>>>>>>>>> tEnv.registerTable("uniqueUsersTbl", uniqueUsers);
>>>>>>>>> Table joinOut = tEnv.sqlQuery("SELECT * FROM Car c
>>>>>>>>>                                        LEFT JOIN uniqueUsersTbl aa
>>>>>>>>> ON c.userId = aa.userId
>>>>>>>>>                                        LEFT JOIN uniqueUsersTbl
>>>>>>>>> ab ON c.ownerId = ab.userId
>>>>>>>>>                                        LEFT JOIN uniqueUsersTbl
>>>>>>>>> ac ON c.sellerId = ac.userId
>>>>>>>>>                                        LEFT JOIN uniqueUsersTbl
>>>>>>>>> ad ON c.buyerId = ad.userId");
>>>>>>>>>
>>>>>>>>> Could someone please help me understand how I can avoid these
>>>>>>>>> duplicate tasks?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> R Kandoji
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Best, Jingsong Lee
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Best, Jingsong Lee
>>>>>>
>>>>>
>>>
>>> --
>>> Best, Jingsong Lee
>>>
>>
>
> --
> Best, Jingsong Lee
>

Re: Duplicate tasks for the same query

Reply via email to