Hi all,
I'm working on HIVE-5595 to add vectorization support for SMB join operators.
The problem I'm facing is that the vectorized record readers (e.g.
VectorizedOrcRecordReader) depend on MapWork.pathToPartitionInfo
(see VectorizedRowBatchCtx.init).
What I discovered, though, is that for SMB join plans this map (along with the
related pathToAliases map) is incomplete. During population, which occurs
in GenMapRedUtils.setTaskPlan, aliasToPartnInfo is always populated:
    plan.getAliasToPartnInfo().put(alias_id, aliasPartnDesc);
but the pathToAliases and pathToPartitionInfo maps are skipped in the local case:
    if (!local) {
      while (iterPath.hasNext()) {
        ...
        plan.getPathToAliases().get(path).add(alias_id);
        plan.getPathToPartitionInfo().put(path, prtDesc);
        ...
In this case local is true for the 'small' alias; it is set further up the
call stack by MapJoinFactory$TableScanMapJoinProcessor.process:
    boolean local = pos != mapJoin.getConf().getPosBigTable();
    if (oldTask == null) {
      assert currPlan.getReduceWork() == null;
      initMapJoinPlan(mapJoin, currTask, ctx, local);
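To make the effect concrete, here is a toy model of the behavior described above. It is only a sketch: the class LocalPlanSketch, the string-valued maps, and the example paths are hypothetical stand-ins and not Hive code. It only mirrors the two facts quoted from the source, namely that local is computed as "pos != posBigTable" and that the path-keyed maps are filled only when local is false.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: plain maps stand in for the MapWork fields. Not Hive code.
public class LocalPlanSketch {
    public final Map<String, String> aliasToPartnInfo = new LinkedHashMap<>();
    public final Map<String, List<String>> pathToAliases = new LinkedHashMap<>();
    public final Map<String, String> pathToPartitionInfo = new LinkedHashMap<>();

    // Mirrors the shape of the quoted logic: the alias map is always filled,
    // the path-keyed maps only when the alias is not local.
    public void setTaskPlan(String aliasId, List<String> paths, String partDesc, boolean local) {
        aliasToPartnInfo.put(aliasId, partDesc);
        if (!local) {
            for (String path : paths) {
                pathToAliases.computeIfAbsent(path, p -> new ArrayList<>()).add(aliasId);
                pathToPartitionInfo.put(path, partDesc);
            }
        }
    }

    public static void main(String[] args) {
        LocalPlanSketch plan = new LocalPlanSketch();
        int posBigTable = 0;                      // assumed: alias 0 is the big table
        String[] aliases = {"big", "small"};      // hypothetical alias names
        for (int pos = 0; pos < aliases.length; pos++) {
            boolean local = pos != posBigTable;   // same test as in process()
            plan.setTaskPlan(aliases[pos],
                    Collections.singletonList("/warehouse/" + aliases[pos]),
                    aliases[pos] + "-desc", local);
        }
        // Both aliases are known, but only the big table's path is mapped.
        System.out.println("aliases: " + plan.aliasToPartnInfo.keySet());
        System.out.println("paths:   " + plan.pathToPartitionInfo.keySet());
    }
}
```

In this model a reader that looks up the 'small' alias's path in pathToPartitionInfo finds nothing, which is the situation the vectorized readers run into.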
My question is for the SMB/MapJoin experts, for clarification on this anomaly.
An SMB join is not local, but it is treated as local, and as a result the
aforementioned maps in the plan are incomplete. Is local=true
intentional in the SMB case, or is it just a leftover from the original MapJoin
implementation? Should SMB join set it to false, or will the sky collapse? I
can think of several 'workarounds', but there is too much context here that I
don't fully grok.
Relevant stack:
GenMapRedUtils.setTaskPlan(String, Operator<OperatorDesc>, Task<?>, boolean,
GenMRProcContext, PrunedPartitionList) line: 658
GenMapRedUtils.setTaskPlan(String, Operator<OperatorDesc>, Task<?>, boolean,
GenMRProcContext) line: 400
MapJoinFactory$TableScanMapJoinProcessor.initMapJoinPlan(AbstractMapJoinOperator<MapJoinDesc>,
Task<Serializable>, GenMRProcContext, boolean) line: 157
MapJoinFactory$TableScanMapJoinProcessor.process(Node, Stack<Node>,
NodeProcessorCtx, Object...) line: 219
DefaultRuleDispatcher.dispatch(Node, Stack<Node>, Object...) line: 90
GenMapRedWalker(DefaultGraphWalker).dispatchAndReturn(Node, Stack<Node>) line: 94
GenMapRedWalker.walk(Node) line: 54
GenMapRedWalker.walk(Node) line: 65
GenMapRedWalker.walk(Node) line: 65
GenMapRedWalker(DefaultGraphWalker).startWalking(Collection<Node>,
HashMap<Node,Object>) line: 109
MapReduceCompiler.compile(ParseContext, List<Task<Serializable>>,
HashSet<ReadEntity>, HashSet<WriteEntity>) line: 267
SemanticAnalyzer.analyzeInternal(ASTNode) line: 8927
Thanks,
~Remus