Hi Rex, Could you also attach one example for these sql / table ? And one possible issue to confirm is that does the operators with the same names also have the same inputs ?
Best, Yun ------------------Original Mail ------------------ Sender:Rex Fenley <r...@remind101.com> Send Date:Fri Dec 4 02:55:41 2020 Recipients:user <user@flink.apache.org> Subject:Duplicate operators generated by plan Hello, I'm running into an issue where my execution plan is creating the same exact join operator multiple times simply because the subsequent operator filters on a different boolean value. This is a massive duplication of storage and work. The filtered operators which follow result in only a small set of elements filtered out per set too. eg. of two separate operators that are equal Join(joinType=[InnerJoin], where=[(id = id_splatted)], select=[id, organization_id, user_id, roles, id_splatted, org_user_is_admin, org_user_is_teacher, org_user_is_student, org_user_is_parent], leftInputSpec=[JoinKeyContainsUniqueKey], rightInputSpec=[JoinKeyContainsUniqueKey]) -> Calc(select=[id, organization_id, user_id, roles AS org_user_roles, org_user_is_admin, org_user_is_teacher, org_user_is_student, org_user_is_parent] Join(joinType=[InnerJoin], where=[(id = id_splatted)], select=[id, organization_id, user_id, roles, id_splatted, org_user_is_admin, org_user_is_teacher, org_user_is_student, org_user_is_parent], leftInputSpec=[JoinKeyContainsUniqueKey], rightInputSpec=[JoinKeyContainsUniqueKey]) -> Calc(select=[id, organization_id, user_id, roles AS org_user_roles, org_user_is_admin, org_user_is_teacher, org_user_is_student, org_user_is_parent]) Which are entirely the same datasets being processed. The first one points to GroupAggregate(groupBy=[user_id], select=[user_id, IDsAgg_RETRACT(organization_id) AS TMP_0]) -> Calc(select=[user_id, TMP_0.f0 AS admin_organization_ids]) The second one points to GroupAggregate(groupBy=[user_id], select=[user_id, IDsAgg_RETRACT(organization_id) AS TMP_0]) -> Calc(select=[user_id, TMP_0.f0 AS teacher_organization_ids]) And these are both intersecting sets of data though slightly different. I don't see why that would make the 1 join from before split into 2 though. There's even a case where I'm seeing a join tripled. Is there a good reason why this should happen? Is there a way to tell flink to not duplicate operators where it doesn't need to? Thanks! -- Rex Fenley | Software Engineer - Mobile and Backend Remind.com | BLOG | FOLLOW US | LIKE US