> On Nov. 7, 2014, 9:51 p.m., Suhas Satish wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/HashTableSinkOperator.java, line 314
> > <https://reviews.apache.org/r/27745/diff/1/?file=754765#file754765line314>
> >
> > What if there are 2 partitions for big table? I guess they will then
> > be processed on 2 separate spark nodes, right?
> >
> > So in this case, there are 2 replicas created for this HashTableSink.
> > How do we control that these 2 replicas will be on the same data nodes as
> > the ones where the 2 big table partitions will be processing map-joins?
We can't, if we don't know where the big table partitions are. If there are just two partitions, wouldn't copying the small table to more nodes take more time than fetching the data over the network?


- Jimmy


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27745/#review60388
-----------------------------------------------------------


On Nov. 7, 2014, 9:34 p.m., Jimmy Xiang wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/27745/
> -----------------------------------------------------------
>
> (Updated Nov. 7, 2014, 9:34 p.m.)
>
>
> Review request for hive and Xuefu Zhang.
>
>
> Bugs: HIVE-8621
>     https://issues.apache.org/jira/browse/HIVE-8621
>
>
> Repository: hive-git
>
>
> Description
> -------
>
> In the Spark case, HashTableSinkOperator should dump files to a folder
> expected by HashTableLoader.
>
>
> Diffs
> -----
>
>   ql/src/java/org/apache/hadoop/hive/ql/exec/HashTableSinkOperator.java f0e04e7
>
> Diff: https://reviews.apache.org/r/27745/diff/
>
>
> Testing
> -------
>
>
> Thanks,
>
> Jimmy Xiang
>
>