>It’s really a very simple query that I’m trying to run:
>select
...
>bloom_contains(a_id, b_id_bloom)
That's nearly impossible to optimize directly - there is no way to limit
the number of table_b rows which may match a given table_a row.
More than one bloom filter can successfully match a single row f
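Spelled out, the comma join is just a cross join plus a filter - there is
no equality between the two sides that could serve as a shuffle key. This
is the same query as below, with the cross join made explicit:

-- bloom_contains has to be evaluated for every (a, b) pair
select a.a_id, b.b_id, count(*) as c
from table_a a
cross join table_b b
where bloom_contains(a.a_id, b.b_id_bloom)
group by a.a_id, b.b_id;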
Hi Gopal,
Thanks for the detailed response.
It’s really a very simple query that I’m trying to run:
select
  a.a_id,
  b.b_id,
  count(*) as c
from
  table_a a,
  table_b b
where
  bloom_contains(a_id, b_id_bloom)
group by
  a.a_id,
  b.b_id;
Where “bloom_contains” is a custom UDF
>I’m having trouble doing a cross join between two tables that are too big
>for a map-side join.
The actual query would help btw. Usually what is planned as a cross-join
can be optimized out into a binning query with a custom UDF.
In particular with 2-D geo queries with binning, which people tend
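As a sketch of that binning idea (points, regions, covering_bins(), and
contains() are all hypothetical names here): assign each point to a grid
cell, explode each region into the cells it overlaps, and equi-join on the
cell so the shuffle parallelizes; the exact predicate then filters within
each cell.

select p.point_id, r.region_id
from
  (select point_id, x, y,
          concat(cast(floor(x / 10.0) as string), '_',
                 cast(floor(y / 10.0) as string)) as bin
   from points) p
join
  (select region_id, shape, t.bin
   from regions
   -- covering_bins(shape, cell_size) would return the array of cell ids
   -- the shape overlaps
   lateral view explode(covering_bins(shape, 10.0)) t as bin) r
on p.bin = r.bin
where contains(r.shape, p.x, p.y);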
Hi Gopal,
Thanks for the speedy response! A follow-up question though: 10Mb input sounds
like that would work for a map join. I’m having trouble doing a cross join
between two tables that are too big for a map-side join. Trying to break down
one table into small enough partitions and then union
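For reference, that split-and-union pattern would look roughly like this
(table_b_part_0/1 stand in for hypothetical slices of table_b, each small
enough to be broadcast as a map join):

set hive.auto.convert.join=true;

select a.a_id, b.b_id
from table_a a, table_b_part_0 b
union all
select a.a_id, b.b_id
from table_a a, table_b_part_1 b;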
> Over the last few weeks I’ve been trying to use cross joins/cartesian
>products and was wondering why, exactly, this all gets sent to one
>reducer. All I’ve heard or read is that Hive can’t/doesn’t parallelize
>the job.
The hashcode of the shuffle key is 0, since you need to process every row
a
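A common workaround (the usual salting trick, sketched here with 4 buckets
picked arbitrarily) is to manufacture a shuffle key: tag each table_a row
with a random bucket and replicate table_b into every bucket, so the rows
spread across reducers instead of all hashing to one.

select a.a_id, b.b_id
from
  (select a.*, cast(rand() * 4 as int) as salt
   from table_a a) a
join
  (select b.*, s.salt
   from table_b b
   -- copy every table_b row into each of the 4 salt buckets
   cross join (select explode(array(0, 1, 2, 3)) as salt) s) b
on a.salt = b.salt;

This still evaluates every (a, b) pair; it just spreads the work over
several reducers instead of sending it all to one.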
Hi all,
Over the last few weeks I’ve been trying to use cross joins/cartesian products
and was wondering why, exactly, this all gets sent to one reducer. All I’ve
heard or read is that Hive can’t/doesn’t parallelize the job. Is there some
code people can point me to? Does anyone have a workaround?