[
https://issues.apache.org/jira/browse/PIG-3417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15741601#comment-15741601
]
Nandor Kollar commented on PIG-3417:
------------------------------------
I investigated this issue, and the problem appears to be in the optimization
of the sampling job. When the join key is a composite key, it gets flattened in
the sampling job, but since the secondary key optimizer expects composite keys
to be wrapped in tuples, we get a ClassCastException (this also explains why
the query didn't fail when the secondary key optimizer was switched off). I
think not flattening the tuples in the sampling job would solve the problem:
PartitionSkewedKeys would work on the ((key1, key2, ...), (tuple mem size, key
count)) format for composite keys, and on the (key, (tuple mem size, key
count)) format for non-composite keys. This way we can apply the secondary key
optimizer to the sampling job too. I attached a patch; the tests in
TestSkewedJoin (including Nick's test case) passed in both MR and Tez mode. I'm
waiting for the result of the entire test suite to make sure it doesn't break
other test cases.
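
To make the two record shapes concrete, here is a rough sketch in plain Python.
It is not Pig code: Python tuples stand in for Pig tuples, the function names
are hypothetical, and the exact layout of the flattened record is an assumption
based only on the description above.

```python
# Illustration only: Python tuples stand in for Pig tuples, and every
# function name here is hypothetical (none of this is Pig's API).

def flattened_record(key_fields, mem_size, key_count):
    """Assumed flattened shape: (key1, key2, ..., mem_size, key_count)."""
    return tuple(key_fields) + (mem_size, key_count)


def nested_record(key, mem_size, key_count):
    """Proposed shape: (key, (mem_size, key_count)).

    A composite key is passed in as a tuple and stays wrapped.
    """
    return (key, (mem_size, key_count))


def read_key(record):
    """A consumer that always expects the whole join key at position 0."""
    return record[0]


composite = nested_record(("a", 1), 128, 3)
simple = nested_record("a", 128, 3)
flat = flattened_record(("a", 1), 128, 3)

# Nested shape: the key is always the first field, composite or not.
assert read_key(composite) == ("a", 1)
assert read_key(simple) == "a"

# Flattened shape: position 0 holds only the first key field, so a consumer
# that expects a tuple there gets a scalar instead (the Python analogue of
# the ClassCastException).
assert not isinstance(read_key(flat), tuple)
```

With the nested shape, a consumer never needs to know how many fields the key
has, which is why the same code path can serve both composite and non-composite
keys.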
> Skewed Join On Tuple Column Kills Job
> --------------------------------------
>
> Key: PIG-3417
> URL: https://issues.apache.org/jira/browse/PIG-3417
> Project: Pig
> Issue Type: Bug
> Components: impl
> Affects Versions: 0.11.1
> Reporter: Nick White
> Priority: Critical
> Attachments: TestSkewJoinWithTuples.java
>
>
> I've attached a test case that fails, but should pass. The test case groups
> two relations separately, then full-outer joins them on the grouped columns.
> The test case passes if "using 'skewed'" is removed.