Re: Duplicate operators generated by plan

Yun Gao Thu, 03 Dec 2020 19:50:12 -0800

Hi Rex,

    Could  you also attach one example for these sql / table ? And one possible 
issue to confirm is that does the operators with the same names also have the 
same inputs ?


Best,
Yun

 ------------------Original Mail ------------------
Sender:Rex Fenley <r...@remind101.com>
Send Date:Fri Dec 4 02:55:41 2020
Recipients:user <user@flink.apache.org>
Subject:Duplicate operators generated by plan

Hello,

I'm running into an issue where my execution plan is creating the same exact 
join operator multiple times simply because the subsequent operator filters on 
a different boolean value. This is a massive duplication of storage and work. 
The filtered operators which follow result in only a small set of elements 
filtered out per set too.

eg. of two separate operators that are equal

 Join(joinType=[InnerJoin], where=[(id = id_splatted)], select=[id, 
organization_id, user_id, roles, id_splatted, org_user_is_admin, 
org_user_is_teacher, org_user_is_student, org_user_is_parent], 
leftInputSpec=[JoinKeyContainsUniqueKey], 
rightInputSpec=[JoinKeyContainsUniqueKey]) -> Calc(select=[id, organization_id, 
user_id, roles AS org_user_roles, org_user_is_admin, org_user_is_teacher, 
org_user_is_student, org_user_is_parent]

 Join(joinType=[InnerJoin], where=[(id = id_splatted)], select=[id, 
organization_id, user_id, roles, id_splatted, org_user_is_admin, 
org_user_is_teacher, org_user_is_student, org_user_is_parent], 
leftInputSpec=[JoinKeyContainsUniqueKey], 
rightInputSpec=[JoinKeyContainsUniqueKey]) -> Calc(select=[id, organization_id, 
user_id, roles AS org_user_roles, org_user_is_admin, org_user_is_teacher, 
org_user_is_student, org_user_is_parent]) 

Which are entirely the same datasets being processed.

The first one points to 
 GroupAggregate(groupBy=[user_id], select=[user_id, 
IDsAgg_RETRACT(organization_id) AS TMP_0]) -> Calc(select=[user_id, TMP_0.f0 AS 
admin_organization_ids]) 

The second one points to
 GroupAggregate(groupBy=[user_id], select=[user_id, 
IDsAgg_RETRACT(organization_id) AS TMP_0]) -> Calc(select=[user_id, TMP_0.f0 AS 
teacher_organization_ids]) 

And these are both intersecting sets of data though slightly different. I don't 
see why that would make the 1 join from before split into 2 though. There's 
even a case where I'm seeing a join tripled.

Is there a good reason why this should happen? Is there a way to tell flink to 
not duplicate operators where it doesn't need to?

Thanks!

-- 

Rex Fenley | Software Engineer - Mobile and Backend

Remind.com |  BLOG  |  FOLLOW US  |  LIKE US

Re: Duplicate operators generated by plan

Reply via email to