hi Kurt,
Thanks for the additional info.
RK
On Sun, Jan 5, 2020 at 8:33 PM Kurt Young wrote:
> Another common skew case we've seen is null handling, the value of the
> join key
> is NULL. We will shuffle the NULL value into one task even if the join
> condition
> won't stand by definition.
>
>
Another common skew case we've seen is null handling, the value of the join
key
is NULL. We will shuffle the NULL value into one task even if the join
condition
won't stand by definition.
For DeDuplication, I just want to make sure this behavior meets your
requirement.
Because for some other usage
Hi Kurt,
I understand what you mean, some userIds may appear more frequently than
the others but this distribution doesn't look in proportionate with the
data skew. Do you think of any other possible reasons or anything I can try
out to investigate this more?
For DeDuplication, I query for the la
Hi RKandoji,
It looks like you have a data skew issue with your input data. Some or
maybe only one "userId" appears more frequent than others. For join
operator to work correctly, Flink will apply "shuffle by join key" before
the
operator, so same "userId" will go to the same sub-task to perform j
Hi,
Thanks a ton for the help with earlier questions, I updated code to version
1.9 and started using Blink Planner (DeDuplication). This is working as
expected!
I have a new question, but thought of asking in the same email chain as
this has more context about my use case etc.
Workflow:
Current
Thanks!
On Thu, Jan 2, 2020 at 9:45 PM Jingsong Li wrote:
> Yes,
>
> 1.9.2 or Coming soon 1.10
>
> Best,
> Jingsong Lee
>
> On Fri, Jan 3, 2020 at 12:43 AM RKandoji wrote:
>
>> Ok thanks, does it mean version 1.9.2 is what I need to use?
>>
>> On Wed, Jan 1, 2020 at 10:25 PM Jingsong Li
>> wro
Yes,
1.9.2 or Coming soon 1.10
Best,
Jingsong Lee
On Fri, Jan 3, 2020 at 12:43 AM RKandoji wrote:
> Ok thanks, does it mean version 1.9.2 is what I need to use?
>
> On Wed, Jan 1, 2020 at 10:25 PM Jingsong Li
> wrote:
>
>> Blink planner was introduced in 1.9. We recommend use blink planner af
Ok thanks, does it mean version 1.9.2 is what I need to use?
On Wed, Jan 1, 2020 at 10:25 PM Jingsong Li wrote:
> Blink planner was introduced in 1.9. We recommend use blink planner after
> 1.9.
> After some bug fix, I think the latest version of 1.9 is OK. The
> production environment has also
Blink planner was introduced in 1.9. We recommend use blink planner after
1.9.
After some bug fix, I think the latest version of 1.9 is OK. The production
environment has also been set up in some places.
Best,
Jingsong Lee
On Wed, Jan 1, 2020 at 3:24 AM RKandoji wrote:
> Thanks Jingsong and Kur
Thanks Jingsong and Kurt for more details.
Yes, I'm planning to try out DeDuplication when I'm done upgrading to
version 1.9. Hopefully deduplication is done by only one task and reused
everywhere else.
One more follow-up question, I see "For production use cases, we recommend
the old planner tha
BTW, you could also have a more efficient version of deduplicating
user table by using the topn feature [1].
Best,
Kurt
[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/table/sql.html#top-n
On Tue, Dec 31, 2019 at 9:24 AM Jingsong Li wrote:
> Hi RKandoji,
>
> In theory, you
Hi RKandoji,
In theory, you don't need to do something.
First, the optimizer will optimize by doing duplicate nodes.
Second, after SQL optimization, if the optimized plan still has duplicate
nodes, the planner will automatically reuse them.
There are config options to control whether we should reu
Thanks Terry and Jingsong,
Currently I'm on 1.8 version using Flink planner for stream proessing, I'll
switch to 1.9 version to try out blink planner.
Could you please point me to any examples (Java preferred) using
SubplanReuser?
Thanks,
RK
On Sun, Dec 29, 2019 at 11:32 PM Jingsong Li wrote:
Hi RKandoji,
FYI: Blink-planner subplan reusing: [1] 1.9 available.
Join Join
/ \ / \
Filter1 Filter2 Filter1 Filter2
||=> \ /
Project1 Project2Project1
||
Hi RKandoji~
Could you provide more info about your poc environment?
Stream or batch? Flink planner or blink planner?
AFAIK, blink planner has done some optimization to deal such duplicate task for
one same query. You can have a try with blink planner :
https://ci.apache.org/projects/flink/flin
Hi Team,
I'm doing a POC with flink to understand if it's a good fit for my use
case.
As part of the process, I need to filter duplicate items and created below
query to get only the latest records based on timestamp. For instance, I
have "Users" table which may contain multiple messages for the
16 matches
Mail list logo