great question. your parallelization seems to trump hadoop's. I guess i'd ask what are the _total_ number of Mappers and Reducers that run on your cluster for these two scenarios? I'd be curious if there are the same.
On Fri, Jun 28, 2013 at 8:40 AM, Felix.徐 <ygnhz...@gmail.com> wrote: > Hi all, > > Here is the scenario, suppose I have 2 tables A and B, I would like to > perform a simple join on them, > > We can do it like this: > > INSERT OVERWRITE TABLE C > SELECT .... FROM A JOIN B on A.id=B.id > > In order to speed up this query since table A and B have lots of data, > another approach is : > > Say I partition table A and B into 10 partitions respectively, and write > the query like this > > INSERT OVERWRITE TABLE C PARTITION(pid=1) > SELECT .... FROM A JOIN B on A.id=B.id WHERE A.pid=1 AND B.pid=1 > > then I run this query 10 times concurrently (pid ranges from 1 to 10) > > And my question is that , in my observation of some more complex queries, > the second solution is about 15% faster than the first solution, > is it simply because the setting of reducer num is not optimal? > If the resource is not a limit and it is possible to set the proper > reducer nums in the first solution , can they achieve the same performance? > Is there any other fact that can cause performance difference between > them(non-partition VS partition+concurrent) besides the job parameter > issues? > > Thanks! >