[ 
https://issues.apache.org/jira/browse/PIG-3417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15741601#comment-15741601
 ] 

Nandor Kollar commented on PIG-3417:
------------------------------------

I investigated this issue, and the problem appears to be in the optimization 
of the sampling job. When the join key is composite, the sampling job flattens 
it, but the secondary key optimizer expects composite keys to be wrapped in 
tuples, so we get a ClassCastException (this also explains why the query 
doesn't fail when the secondary key optimizer is switched off). I think not 
flattening the tuples in the sampling job would solve the problem: 
PartitionSkewedKeys would work on the ((key1, key2, ...), (tuple mem size, 
key count)) format for composite keys, and on the (key, (tuple mem size, 
key count)) format for non-composite keys. This way we can apply the secondary 
key optimizer to the sampling job too. I've attached a patch; the tests in 
TestSkewedJoin (including Nick's test case) pass in both MR and Tez mode. I'm 
waiting for the results of the entire test suite to make sure the patch 
doesn't break other test cases.
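
To make the two sample-row layouts concrete, here is a rough sketch. It is not 
taken from the patch and does not use Pig's classes: plain java.util.List 
stands in for a Pig tuple, and the names SampleRowSketch, nestedRow, 
extractKey and compositeKeyArity are hypothetical. It only illustrates why a 
consumer that expects a wrapped composite key fails on a flattened row.

```java
import java.util.Arrays;
import java.util.List;

public class SampleRowSketch {
    // Hypothetical helper: builds a non-flattened sample row,
    // (key, (tuple mem size, key count)). For a composite key, `key` is
    // itself a List standing in for a Pig tuple.
    static List<Object> nestedRow(Object key, long tupleMemSize, long keyCount) {
        return Arrays.asList(key, Arrays.asList(tupleMemSize, keyCount));
    }

    // With the nested layout the key is always field 0, composite or not.
    static Object extractKey(List<Object> row) {
        return row.get(0);
    }

    // A consumer that assumes composite keys are wrapped in a tuple.
    // Throws ClassCastException if field 0 is a scalar, as in a flattened row.
    @SuppressWarnings("unchecked")
    static int compositeKeyArity(List<Object> row) {
        return ((List<Object>) row.get(0)).size();
    }

    public static void main(String[] args) {
        List<Object> composite = nestedRow(Arrays.asList("a", 1), 128L, 3L);
        List<Object> simple = nestedRow("a", 64L, 5L);
        // Flattened layout: the composite key's fields are spilled into the row.
        List<Object> flattened = Arrays.asList("a", 1, 128L, 3L);

        System.out.println(extractKey(composite));
        System.out.println(extractKey(simple));
        System.out.println(compositeKeyArity(composite));
        try {
            compositeKeyArity(flattened);
        } catch (ClassCastException e) {
            System.out.println("ClassCastException on flattened row");
        }
    }
}
```

In this sketch the nested layout keeps the key at a fixed position for both 
composite and non-composite keys, which is the property the proposed change 
relies on.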

> Skewed Join On Tuple Column Kills Job 
> --------------------------------------
>
>                 Key: PIG-3417
>                 URL: https://issues.apache.org/jira/browse/PIG-3417
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.11.1
>            Reporter: Nick White
>            Priority: Critical
>         Attachments: TestSkewJoinWithTuples.java
>
>
> I've attached a test case that fails, but should pass. The test case groups 
> two relations separately, then full-outer joins them on the grouped columns. 
> The test case passes if "using 'skewed'" is removed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
