Re: Cross join/cartesian product explanation

2015-11-17 Thread Gopal Vijayaraghavan
>It's really a very simple query that I'm trying to run: >select ... >bloom_contains(a_id, b_id_bloom) That's nearly impossible to optimize directly - there is no way to limit the number of table_b rows which may match table_a. More than one bloom filter can successfully match a single row from …

Re: Cross join/cartesian product explanation

2015-11-13 Thread Rory Sawyer
Hi Gopal, Thanks for the detailed response. It's really a very simple query that I'm trying to run: select a.a_id, b.b_id, count(*) as c from table_a a, table_b b where bloom_contains(a_id, b_id_bloom) group by a.a_id, b.b_id; Where "bloom_contains" is a custom UDF …
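
Laid out on separate lines, the query quoted in that message is:

    select a.a_id, b.b_id, count(*) as c
    from table_a a, table_b b
    where bloom_contains(a_id, b_id_bloom)
    group by a.a_id, b.b_id;

The comma in the FROM clause is what makes this a cartesian product: there is no equality condition for Hive to shuffle on, only the UDF filter applied after every pair has been formed.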

Re: Cross join/cartesian product explanation

2015-11-10 Thread Gopal Vijayaraghavan
>I'm having trouble doing a cross join between two tables that are too big for a map-side join. The actual query would help btw. Usually what is planned as a cross-join can be optimized out into a binning query with a custom UDF. In particular with 2-D geo queries with binning, which people tend …
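
A minimal sketch of the binning pattern for the 2-D geo case mentioned there; grid_bin() and distance() are hypothetical UDFs used only for illustration. Both sides are tagged with a coarse bin key, so the planner gets an equi-join on the bin and the expensive predicate only runs within each bin instead of over the full cross product:

    select a.id, b.id
    from (select id, x, y, grid_bin(x, y) as bin from table_a) a
    join (select id, x, y, grid_bin(x, y) as bin from table_b) b
      on (a.bin = b.bin)
    where distance(a.x, a.y, b.x, b.y) < 10.0;

A real version also has to handle pairs that straddle bin boundaries, for example by emitting a point's neighbouring bins as well, which is why a custom UDF is usually involved.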

Re: Cross join/cartesian product explanation

2015-11-09 Thread Rory Sawyer
Hi Gopal, Thanks for the speedy response! A follow-up question though: 10Mb input sounds like that would work for a map join. I'm having trouble doing a cross join between two tables that are too big for a map-side join. Trying to break down one table into small enough partitions and then union …
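
The partition-and-union idea described there might look roughly like the sketch below; the partition column part is hypothetical and only two slices are shown. Each slice of table_b is small enough to be broadcast for a map-side join, at the cost of one pass over table_a per slice:

    select * from (
      select /*+ MAPJOIN(b) */ a.a_id, b.b_id
      from table_a a
      cross join (select b_id from table_b where part = 0) b
      union all
      select /*+ MAPJOIN(b) */ a.a_id, b.b_id
      from table_a a
      cross join (select b_id from table_b where part = 1) b
    ) u;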

Re: Cross join/cartesian product explanation

2015-11-06 Thread Gopal Vijayaraghavan
> Over the last few weeks I've been trying to use cross joins/cartesian >products and was wondering why, exactly, this all gets sent to one >reducer. All I've heard or read is that Hive can't/doesn't parallelize >the job. The hashcode of the shuffle key is 0, since you need to process every row against …
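
That constant shuffle key is why everything lands on one reducer: with no join key there is nothing to partition on. One commonly used workaround, sketched here rather than taken from this reply, is to introduce a synthetic equi-join key so the shuffle spreads across several reducers; the bucket count of 4 and the column names are illustrative:

    select a.a_id, b.b_id
    from (select a_id, pmod(hash(a_id), 4) as bucket from table_a) a
    join (select b_id, n as bucket
          from table_b lateral view explode(array(0, 1, 2, 3)) t as n) b
      on (a.bucket = b.bucket);

Every table_a row still meets every table_b row, because table_b is replicated once per bucket, but the pairs are now spread over up to four reducers instead of one.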

Cross join/cartesian product explanation

2015-11-06 Thread Rory Sawyer
Hi all, Over the last few weeks I've been trying to use cross joins/cartesian products and was wondering why, exactly, this all gets sent to one reducer. All I've heard or read is that Hive can't/doesn't parallelize the job. Is there some code people can point me to? Does anyone have a workaround …