Hi all,

I'm working on HIVE-5595 to add vectorization support for SMB join operators.
The problem I'm facing is that the vectorized record readers (e.g.
VectorizedOrcRecordReader) depend on MapWork.pathToPartitionInfo
(see VectorizedRowBatchCtx.init).

What I discovered, though, is that for SMB join plans this map (along with the
related pathToAliases map) is incomplete. During population, which occurs
in GenMapRedUtils.setTaskPlan, aliasToPartnInfo is always populated:

plan.getAliasToPartnInfo().put(alias_id, aliasPartnDesc);

but the pathToAliases and pathToPartitionInfo maps are skipped in the local case:

    if (!local) {
      while (iterPath.hasNext()) {
...
        plan.getPathToAliases().get(path).add(alias_id);
        plan.getPathToPartitionInfo().put(path, prtDesc);
...

And local in this case, for the 'small' alias, is true; it is set further up
the call stack by MapJoinFactory$TableScanMapJoinProcessor.process:

      boolean local = pos != mapJoin.getConf().getPosBigTable();
      if (oldTask == null) {
        assert currPlan.getReduceWork() == null;
        initMapJoinPlan(mapJoin, currTask, ctx, local);
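
To make the anomaly concrete, here is a toy model of the population logic
quoted above. It is only a sketch and not the Hive code: the class name
SmbPlanSketch, the table paths, and the string stand-ins for the partition
descriptors are all made up for illustration. It assumes two aliases, with
the big table at position 0, so the 'small' alias ends up with local == true.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the map population described above; NOT the Hive implementation. */
public class SmbPlanSketch {
    // Hypothetical stand-ins for the three MapWork maps; values are plain strings.
    public final Map<String, String> aliasToPartnInfo = new LinkedHashMap<>();
    public final Map<String, List<String>> pathToAliases = new LinkedHashMap<>();
    public final Map<String, String> pathToPartitionInfo = new LinkedHashMap<>();

    /** Same shape as the quoted logic: the alias map is always filled, the path maps only when !local. */
    public void setTaskPlan(String alias, List<String> paths, String partDesc, boolean local) {
        aliasToPartnInfo.put(alias, partDesc);
        if (!local) {
            for (String path : paths) {
                pathToAliases.computeIfAbsent(path, p -> new ArrayList<>()).add(alias);
                pathToPartitionInfo.put(path, partDesc);
            }
        }
    }

    /** Builds a two-table plan, deriving local the way the quoted process() code does. */
    public static SmbPlanSketch build(int posBigTable) {
        SmbPlanSketch plan = new SmbPlanSketch();
        String[] aliases = {"big", "small"};                      // made-up aliases
        String[] paths = {"/warehouse/big", "/warehouse/small"};  // made-up paths
        for (int pos = 0; pos < aliases.length; pos++) {
            boolean local = pos != posBigTable;
            plan.setTaskPlan(aliases[pos], Collections.singletonList(paths[pos]),
                    aliases[pos] + "_desc", local);
        }
        return plan;
    }

    public static void main(String[] args) {
        SmbPlanSketch plan = build(0);
        System.out.println("aliasToPartnInfo    = " + plan.aliasToPartnInfo);
        System.out.println("pathToPartitionInfo = " + plan.pathToPartitionInfo);
        // A reader that resolves its partition by path finds nothing for the small table.
        System.out.println("lookup /warehouse/small -> "
                + plan.pathToPartitionInfo.get("/warehouse/small"));
    }
}
```

In this model both aliases appear in aliasToPartnInfo, but a lookup by path
for the small table comes back empty, which is the situation a path-keyed
reader would run into.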


My question to the SMB/MapJoin experts is for clarification on this anomaly.
An SMB join is not local, yet it is treated as local. The resulting plan info
is inconsistent: the aforementioned maps are incomplete. Is local=true
intentional in the SMB case, or is it just a leftover from the original MapJoin
implementation? Should SMB join set it to false, or will the sky collapse? I
can think of several 'workarounds', but there is too much context here that I
don't fully grok.

Relevant stack:

GenMapRedUtils.setTaskPlan(String, Operator<OperatorDesc>, Task<?>, boolean, GenMRProcContext, PrunedPartitionList) line: 658
GenMapRedUtils.setTaskPlan(String, Operator<OperatorDesc>, Task<?>, boolean, GenMRProcContext) line: 400
MapJoinFactory$TableScanMapJoinProcessor.initMapJoinPlan(AbstractMapJoinOperator<MapJoinDesc>, Task<Serializable>, GenMRProcContext, boolean) line: 157
MapJoinFactory$TableScanMapJoinProcessor.process(Node, Stack<Node>, NodeProcessorCtx, Object...) line: 219
DefaultRuleDispatcher.dispatch(Node, Stack<Node>, Object...) line: 90
GenMapRedWalker(DefaultGraphWalker).dispatchAndReturn(Node, Stack<Node>) line: 94
GenMapRedWalker.walk(Node) line: 54
GenMapRedWalker.walk(Node) line: 65
GenMapRedWalker.walk(Node) line: 65
GenMapRedWalker(DefaultGraphWalker).startWalking(Collection<Node>, HashMap<Node,Object>) line: 109
MapReduceCompiler.compile(ParseContext, List<Task<Serializable>>, HashSet<ReadEntity>, HashSet<WriteEntity>) line: 267
SemanticAnalyzer.analyzeInternal(ASTNode) line: 8927


Thanks,
~Remus

