[ 
https://issues.apache.org/jira/browse/PIG-3492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13799728#comment-13799728
 ] 

Daniel Dai commented on PIG-3492:
---------------------------------

Thanks [~horaguchi] for detailed explanation. This patch fixed several issues:
1. Remove fixDuplicateUids from LOJoin/LOGenerate
That's for sure need to be removed. We cannot arbitrary reassign uid. Every 
operator should have lineage knowledge and be able to trace to input uids which 
generate a particular output uid. Sometimes we do need to reassign the uid, but 
if this happens, we need to remember the mapping inside the operator, and 
update the logic ColumnDependencyVisitor of tracing every output uid to its 
input uid. 

Take LOJoin for example, each output column is derived from one input column 
without transformation, so we keep uid unchanged. If input#1 of join has the 
uid (1, 2, 3), input#2 has the uid (4, 5), the join output will have the uid 
(1, 2, 3, 4, 5). When ColumnDependencyVisitor try to prune uid 5, we know that 
is from second column of input#2. In the case we do want to regenerate output 
uid to (21, 22, 23, 24, 25), join operator need to remember the mapping (1->21, 
2->22, 3->23, 4->24, 5->25), so that when ColumnDependencyVisitor try to prune 
uid 25, we know it is from second column of input#2 as well. One example of uid 
regeneration is LOSplit, which has a field uidMapping to remember the mapping. 
ColumnDependencyVisitor will make use of this information to trace the lineage. 

As we already mentioned, uid regeneration is already done in LOSplitOutput, so 
after ImplicitSplitInserter, we shall not have duplicate uids and should not 
have uid conflict. The issue seen in PIG-3020 is due to a bug in LOSplitOutput 
uid mapping process as Koji mentioned.

2. LogicalPlanBuilder.expandAndResetVisitor change
Not sure what drive the original change, seems there is no reason to visit the 
whole plan after building one operator. The logical plan is not in a consistent 
state at this moment, global SchemaResetter might not work

3. move DuplicateForEachColumnRewrite and ImplicitSplitInserter
When we design optimizer, we try to include every plan transformation into it. 
But now I think we may separate it into two part. One is to bring the plan into 
a valid state, the other is the real optimizer rule. The reason is at validate 
stage, the plan is still inconsistent and many optimizer rule cannot handle, 
and the global SchemaReset/UidResetter does not work. We may want to do some 
refactory and make the separation clear in another Jira. "describe" should 
happen after validation, but might not need to go through optimizer. Move 
DuplicateForEachColumnRewrite/ImplicitSplitInserter into compile for now seems 
to be ok.

Overall, the patch looks pretty good. It fix above mentioned issue cleanly. We 
might want to preceed with clear validator/optmizer separation in anther Jira.

Will commit if tests pass.

> ColumnPrune dropping used column due to 
> LogicalRelationalOperator.fixDuplicateUids changes not propagating
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-3492
>                 URL: https://issues.apache.org/jira/browse/PIG-3492
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.11.1, 0.12.1, 0.13.0
>            Reporter: Koji Noguchi
>            Assignee: Koji Noguchi
>         Attachments: pig-3492-trunk_04.patch, pig-3492-v0.12_01.patch
>
>
> I don't have a testcase I can upload at the moment, but here's my observation.
> SplitFilter -> schemaResetter -> LOGenerate.getSchema -> 
> LogicalRelationalOperator.fixDuplicateUids() creating a new UID but that UID 
> is not propagated to the entire plan (since SplitFilter.reportChanges only 
> returns subplan).
> As a result, I am seeing ColumnPruning cutting off those used columns.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

Reply via email to