> On May 19, 2016, 12:06 a.m., Ashutosh Chauhan wrote: > > ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java, > > line 556 > > <https://reviews.apache.org/r/47554/diff/1/?file=1387390#file1387390line556> > > > > Can you also add comment why this is valid only for RS added by SPDO > > and not in general? > > Jesús Camacho Rodríguez wrote: > I was thinking further about this, and indeed I made many assumptions > thinking that SPDO will not add the sort columns at the end of the additional > RS keys, so then we need to make sure the ordering is correct... but this is > not true, as it will add them. > > I think this optimization can be done general, as ordering before > ordering again is a noop. I have changed the code accordingly, I would like > your opinion. Am I missing something? Let's see what QA returns... > > Ashutosh Chauhan wrote: > yeah.. thats what I think.. this is not specific to SPDO at logical > layer. > Thing I am not 100% sure about is in FileSink the way SPDO is implemented > I *think* it assumes all rows for a particular key come in sorted and in one > batch. Now, when we merge and sort by two keys and if there is a case that FS > was expecting all rows sorted on second key only, now will get them sorted on > 2 keys and it may have to close ORC writer and then reopen again, as oppose > to write it once and never opening it again. > > Jesús Camacho Rodríguez wrote: > I am trying to understand this part... The way the optimization works, we > will keep the second RS with its keys, we just remove the first RS... Thus, > this should not be a problem? Or maybe I am not understanding it correctly > > Ashutosh Chauhan wrote: > I think you are correct. Since relevant FS is active in last reducer > containing RS which we is not changing, we are not changing any assumptions > for FS. > > Ashutosh Chauhan wrote: > On a second thought I think there is more to it. e.g. we are transforming > RS (a) followed by RS (a,b) to RS(a,b) Than this is not valid transformation > in general, only if second RS is introduced by SPDO. e.g, if these two RS are > part of two GBYs (or a join & GBy), than its invalid because partitioning on > (a,b) != partitioning on (a)
But aggressiveDedup is really restrictive: only pRS-SEL*-cRS, thus we would bail out in those specific cases. - Jesús ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/47554/#review133844 ----------------------------------------------------------- On May 19, 2016, 10:49 a.m., Jesús Camacho Rodríguez wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/47554/ > ----------------------------------------------------------- > > (Updated May 19, 2016, 10:49 a.m.) > > > Review request for hive and Ashutosh Chauhan. > > > Bugs: HIVE-13750 > https://issues.apache.org/jira/browse/HIVE-13750 > > > Repository: hive-git > > > Description > ------- > > HIVE-13750 > > > Diffs > ----- > > > ql/src/java/org/apache/hadoop/hive/ql/optimizer/SortedDynPartitionOptimizer.java > 010c89ed978296709b052cc7bc80256a27658e2b > > ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java > 733620b84657a21829248afe72ab16ad9692f37e > ql/src/test/results/clientpositive/dynpart_sort_opt_vectorization.q.out > d03bfe422743d9a5a6b85f9a6198e1e27024f129 > ql/src/test/results/clientpositive/dynpart_sort_optimization.q.out > dec872ab0eef54bd92d5c2bc068e2805cc14e272 > ql/src/test/results/clientpositive/dynpart_sort_optimization_acid.q.out > 832580325873dee741ba86239ee571873994a808 > ql/src/test/results/clientpositive/reducesink_dedup.q.out > b89df52f965385b85894757896eee487b29c52ae > ql/src/test/results/clientpositive/tez/dynpart_sort_opt_vectorization.q.out > a90e3f63b4646cf0ade9785a501ebd1a6b2a3406 > ql/src/test/results/clientpositive/tez/dynpart_sort_optimization.q.out > 723e8192f2735059005fc3c5c96732a2c4be49c1 > > Diff: https://reviews.apache.org/r/47554/diff/ > > > Testing > ------- > > > Thanks, > > Jesús Camacho Rodríguez > >