Re: Efficient Equality Joins of Large Tables

Furcy Pin Tue, 10 Jun 2014 01:18:07 -0700

Hi Mark,

I still don't understand if you are trying to perform your join for one
specific known reference_id or for all of them?
In the first case, you should apply a pre-filter on each table first with
subqueries, and this would leave you with only a few rows for the join.


In the second case, you could try applying a GROUP BY on each table first
(again with subqueries) which would leave you with a few rows for the join
too.
This second method might work or not depending on what you are trying to
compute.
For complex GROUP BY operations, cleverly combining the UDFs collect_set
with split might help too...

If it does't work for you, then I would need more details on what you are
willing to compute precisely...

Regards,

Furcy







2014-06-09 20:30 GMT+02:00 Mark Desnoyer <[email protected]>:

> Hi Furcy,
>
> Thanks for the reply. I looked at MapJoin but it won't do what I need
> because all the tables will be large and actually, explicitly going through
> the entire table in an n^2 fashion is very inefficient.
>
> I have large tables, but the intersection is very small. In the Ad Click
> case, I would have two streams of data: one for loads and one for clicks.
> When a page is rendered, a unique reference id is generated, which gets
> tagged to the load and any click that might happen. So, if I had two
> tables, one for the loads and one for the clicks, the reference id in each
> table could be used to link the click that happened to its associated load.
> In other words, for a given reference id, there is probably only one load
> and a small number of clicks. So, my join would look some like:
>
> SELECT <some cols> FROM loads LEFT JOIN clicks ON loads.reference_id =
> clicks.reference_id;
>
> In the MapReduce framework, I could use that reference id as the map
> output key in order to collect those events that are associated with each
> other and then the reducer would generate the rows by doing the join with
> the small number of rows that it received. Technically, you could take this
> approach for any equality join because you're using the partitioning step
> in order to enforce the join predicate instead of actually evaluating it.
> This could be much more efficient in some cases, so I was wondering if Hive
> can do it and if so, how can I trigger it?
>
> Cheers,
> Mark
>
>
> On Mon, Jun 9, 2014 at 10:34 AM, Furcy Pin <[email protected]> wrote:
>
>> Hi Mark,
>>
>> I'm not sure if I understand what your trying to do correctly, do you
>> know the reference id on which you want to do the join beforehand?
>> Or is one of your tables small?
>> Or are they all big with a small intersection?
>>
>> I you haven't yet, I would suggest you to have a look at MapJoin:
>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins
>> Mapjoins allow hive to perform a join on the map side by putting the
>> small table in cache.
>> (depending on your hive version, this might or might not be enabled by
>> default)
>>
>>  Depending on your use case, you might have to pre-filter some tables,
>> for that you can use sub-queries like:
>>
>> FROM
>> (
>>    SELECT id FROM T WHERE id = myID
>> ) T1
>> JOIN T2 ON T1.id=T2.id
>> SELECT *
>>
>> Unlike a standard sql that would simplifies the query, I believe this
>> will force Hive to perform the sub-select with a first mapreduce, and apply
>> the join in the second
>> (as a mapjoin if the result of the subquery is small enough).
>>
>> Hope this helps,
>>
>> Furcy
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> 2014-06-09 18:48 GMT+02:00 Mark Desnoyer <[email protected]>:
>>
>> Hi,
>>>
>>> I was wondering if there was a way in Hive to trigger it to perform an
>>> efficient equality join on large tables? Specifically, I have two or more
>>> tables where the joined key is relatively rare in each table. A good
>>> example would be an AdClick scenario where you would have two tables, one
>>> for ad loads and one for ad clicks, where there is a reference id to
>>> connect a click to a specific ad load and that's the key to join on.
>>>
>>> In the MapReduce framework, this join could be done efficiently by using
>>> the reference id as the key from the map output and then the join would be
>>> done in the reducer since there are a small number of rows associated with
>>> each reference id. However, hive implements join by buffering all but the
>>> last tables in the reducer, and then streams the last table through. This
>>> works great if the first tables are relatively small, but if the tables are
>>> large and the associated rows are rare, it's very inefficient.
>>>
>>> So, can hive perform the join the first way I described? If so, any idea
>>> how I trigger it? Or do I have to write my own MR job?
>>>
>>> Cheers,
>>> Mark
>>>
>>
>>
>

Re: Efficient Equality Joins of Large Tables

Reply via email to