[jira] [Commented] (HIVE-5973) SMB joins produce incorrect results with multiple partitions and buckets

Vikram Dixit K (JIRA) Thu, 05 Dec 2013 20:24:04 -0800

    [ https://issues.apache.org/jira/browse/HIVE-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13840956#comment-13840956 ]


Vikram Dixit K commented on HIVE-5973:
--------------------------------------

The naive fix is to have
{noformat}
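// Allocate a fresh array for every row instead of reusing the shared one.
output = new Object[eval.length];
try {
  for (; i < eval.length; ++i) {
    output[i] = eval[i].evaluate(row);
  }
}
// ... (catch blocks as in the existing code)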
{noformat}

in the select operator's processOp method.
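
To illustrate why a fresh array per row avoids the problem, here is a minimal 
standalone sketch. It is not Hive code: the class ReusedRowBufferDemo, the 
drain method and the single-column rows are hypothetical stand-ins for the 
select operator and for the priority queue that holds rows of the non-big 
tables.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical stand-in for the select operator plus the SMB priority queue.
class ReusedRowBufferDemo {

  // Evaluates each key into an output array, queues the array, then drains
  // the queue and returns the keys in the order they come out.
  static List<Integer> drain(int[] keys, boolean freshArrayPerRow) {
    PriorityQueue<Object[]> queue =
        new PriorityQueue<>(Comparator.comparingInt((Object[] r) -> (Integer) r[0]));
    Object[] output = new Object[1];   // shared buffer, allocated once
    for (int key : keys) {
      if (freshArrayPerRow) {
        output = new Object[1];        // the naive fix: one array per row
      }
      output[0] = key;                 // stands in for eval[i].evaluate(row)
      queue.add(output);               // the queue keeps a reference, not a copy
    }
    List<Integer> seen = new ArrayList<>();
    while (!queue.isEmpty()) {
      seen.add((Integer) queue.poll()[0]);
    }
    return seen;
  }

  public static void main(String[] args) {
    int[] keys = {3, 1, 2};
    System.out.println("reused buffer: " + drain(keys, false));
    System.out.println("fresh arrays:  " + drain(keys, true));
  }
}
```

With the shared buffer every queued entry is the same array, so the sketch 
prints [2, 2, 2] (only the last row survives). With a fresh array per row it 
prints [1, 2, 3].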

However, this allocates a new array for every row in all the other operations 
that go through the select operator as well, possibly leading to memory churn. 
All other approaches I could think of seem cumbersome.

1. Copy the object using the copyToStandardObject method in 
ObjectInspectorUtils. This modifies the object itself and requires 
re-initializing the joinKeys (ExprNodeEvaluators) with the new object 
inspector. However, this doesn't work with just these changes because we 
cannot re-initialize an ExprNodeEvaluator with a StandardObjectInspector. It 
expects a StructObjectInspector, which will have to be re-worked if we go with 
this approach.

2. Try to create a new object of the same composition with a shallow copy. 
However, this is not straightforward either. It requires the struct object 
inspector to be re-worked to return an object in the same composition as the 
original.

3. Special-case SMB with an if in the select operator so that it creates a new 
output object. This would hurt vectorization, though, because it adds an if 
condition to the tight loop.

4. Create a new select operator for SMB join which extends the current select 
operator. It could use the naive solution above without imposing the memory 
penalty on the other operations. However, this requires some plan-side 
changes.
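
Approach 4 could look roughly like the sketch below. The classes 
SketchSelectOperator and SketchSMBSelectOperator and the nextOutput hook are 
hypothetical and heavily simplified. Hive's real SelectOperator has a 
different interface, and the plan-side changes needed to pick the subclass 
for SMB joins are not shown.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical, simplified select operator: reuses one output array.
class SketchSelectOperator {
  protected final List<Function<Object, Object>> eval;
  private final Object[] output;

  SketchSelectOperator(List<Function<Object, Object>> eval) {
    this.eval = eval;
    this.output = new Object[eval.size()];
  }

  // Hook: the base operator hands back the same array for every row.
  protected Object[] nextOutput() {
    return output;
  }

  // No SMB-specific check here; the subclass decides where the array
  // comes from.
  Object[] processOp(Object row) {
    Object[] out = nextOutput();
    for (int i = 0; i < eval.size(); ++i) {
      out[i] = eval.get(i).apply(row);
    }
    return out;
  }
}

// Hypothetical SMB variant: pays for one allocation per row so that callers
// can safely hold on to the returned array.
class SketchSMBSelectOperator extends SketchSelectOperator {
  SketchSMBSelectOperator(List<Function<Object, Object>> eval) {
    super(eval);
  }

  @Override
  protected Object[] nextOutput() {
    return new Object[eval.size()];
  }

  public static void main(String[] args) {
    List<Function<Object, Object>> eval =
        Arrays.asList(r -> r, r -> ((Integer) r) * 10);
    SketchSelectOperator smb = new SketchSMBSelectOperator(eval);
    Object[] first = smb.processOp(1);
    Object[] second = smb.processOp(2);
    System.out.println(Arrays.toString(first) + " " + Arrays.toString(second));
  }
}
```

Only the SMB variant allocates per row, so the other operations keep the 
existing reuse behaviour.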

I am not sure if I have missed any other way of solving this. [~navis], could 
you please provide your comments?

Thanks
Vikram.


> SMB joins produce incorrect results with multiple partitions and buckets
> ------------------------------------------------------------------------
>
>                 Key: HIVE-5973
>                 URL: https://issues.apache.org/jira/browse/HIVE-5973
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.13.0
>            Reporter: Vikram Dixit K
>            Assignee: Vikram Dixit K
>             Fix For: 0.13.0
>
>
> It looks like there is an issue with re-using the output object array in the 
> select operator. When we read rows of the non-big tables, we hold on to the 
> output object in the priority queue. This causes Hive to produce incorrect 
> results because all the elements in the priority queue refer to the same 
> object and the join happens on only one of the buckets.
> {noformat}
> output[i] = eval[i].evaluate(row);
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.1#6144)
