You are correct on the what I am hoping to do, basically emit two records
for every row.  What was interesting was when I just did the union in the
from, it didn't see to do a double table scan. I ended up doing:

INSERT OVERWRITE TABLE table_summary
select col1, unioned_col, count(distinct col4) from
( select col1, col2 as unioned_col, col4 from table
UNION ALL
select col1, col3 as unioned_col, col4 from table
) a
group by col1, unioned_col

When I ran that, it did not run two sets of maps (as I expected it would,
one for each part of the union) is there something that Hive is doing under
the covers to optimize that for me?

I didn't think to use the transform, but you are right, that would be a
very simple transform in any language.

Thoughts on my resultant query? perhaps I am just not getting all the data
I should be?

On Mon, Feb 11, 2013 at 11:08 PM, Mark Grover
<grover.markgro...@gmail.com>wrote:

> John,
> Please correct me if I didn't understand the problem correctly.
>
> I think in this scenario, it's best to think about the query in terms of
> MapReduce. In this case, you would want for each record sent as input to
> your mapper, two records to be emitted, one with col2's value and one with
> col3's value. Then, if you did whatever count/distinct magic you wanted to
> do in the reduce phase, you would have read the table only once.
>
> With the above in mind, if I were doing this, I would consider using
> Hive's transform functionality to use a custom mapper.
>
> Also, FWIW, this part seems unnecessary:
>  from (
>     select
>        col1, col2, col3, col4
>     from table
> ) a
>
> I think you might just be able to do (this is from top of my head, no
> guarantees):
>
> from table
> insert overwrite...
> ....
>
> And, if I misunderstood your problem, my apologies. If you could provide
> an example with sample data and expected output, that might be helpful.
>
>
> Mark
>
> On Mon, Feb 11, 2013 at 7:34 PM, John Omernik <j...@omernik.com> wrote:
>
>> I am trying to do a union, group by, and multi insert all at once. I know
>> this convoluted but I what I am trying to do is avoid having to scan
>> through the original table more than once... if I can get all my data from
>> two columns that I want to pull together, in one round of mappers, I win...
>>  Basically, col2 and col3 are they type of data, one is src, one is dst, I
>> want a single record for every uniq value of both col2 and col3.
>>
>> Any thoughts?
>>
>>
>>
>> from (
>>     select
>>        col1, col2, col3, col4
>>     from table
>> ) a
>> INSERT overwrite TABLE table_summary
>> select col1, unioned_col, count(1), count(distinct col4) from
>> (select col1, col2 as unioned_col, col4
>> UNION ALL
>> select col1, col3 as unioned_col, col4
>> ) b
>> group by col1, unioned_col
>>
>
>

Reply via email to