You are correct on what I am hoping to do: basically, emit two records for every row. What was interesting was that when I just did the union in the from clause, it didn't seem to do a double table scan. I ended up doing:
INSERT OVERWRITE TABLE table_summary
select col1, unioned_col, count(distinct col4) from
(
  select col1, col2 as unioned_col, col4 from table
  UNION ALL
  select col1, col3 as unioned_col, col4 from table
) a
group by col1, unioned_col

When I ran that, it did not run two sets of maps (as I expected it would, one for each part of the union). Is there something that Hive is doing under the covers to optimize that for me?

I didn't think to use the transform, but you are right, that would be a very simple transform in any language.

Thoughts on my resultant query? Perhaps I am just not getting all the data I should be?

On Mon, Feb 11, 2013 at 11:08 PM, Mark Grover <grover.markgro...@gmail.com> wrote:

> John,
> Please correct me if I didn't understand the problem correctly.
>
> I think in this scenario, it's best to think about the query in terms of
> MapReduce. In this case, you would want, for each record sent as input to
> your mapper, two records to be emitted: one with col2's value and one with
> col3's value. Then, if you did whatever count/distinct magic you wanted to
> do in the reduce phase, you would have read the table only once.
>
> With the above in mind, if I were doing this, I would consider using
> Hive's transform functionality to use a custom mapper.
>
> Also, FWIW, this part seems unnecessary:
>
> from (
>   select
>     col1, col2, col3, col4
>   from table
> ) a
>
> I think you might just be able to do (this is off the top of my head, no
> guarantees):
>
> from table
> insert overwrite...
> ....
>
> And, if I misunderstood your problem, my apologies. If you could provide
> an example with sample data and expected output, that might be helpful.
>
> Mark
>
> On Mon, Feb 11, 2013 at 7:34 PM, John Omernik <j...@omernik.com> wrote:
>
>> I am trying to do a union, group by, and multi insert all at once. I know
>> this is convoluted, but what I am trying to do is avoid having to scan
>> through the original table more than once... if I can get all my data from
>> the two columns that I want to pull together in one round of mappers, I win...
>> Basically, col2 and col3 are the same type of data, one is src, one is dst; I
>> want a single record for every unique value of both col2 and col3.
>>
>> Any thoughts?
>>
>> from (
>>   select
>>     col1, col2, col3, col4
>>   from table
>> ) a
>> INSERT overwrite TABLE table_summary
>> select col1, unioned_col, count(1), count(distinct col4) from
>> (select col1, col2 as unioned_col, col4
>> UNION ALL
>> select col1, col3 as unioned_col, col4
>> ) b
>> group by col1, unioned_col
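For concreteness, here is a minimal sketch of the transform approach Mark describes. The script name emit_two.py and its row layout are assumptions, not something from the thread: the idea is that the script reads tab-separated (col1, col2, col3, col4) rows on stdin and writes two tab-separated (col1, unioned_col, col4) rows per input row, one carrying col2 and one carrying col3, so the source table is only read once in the map phase.

-- emit_two.py is a hypothetical mapper script: for each input row it
-- prints two output rows, (col1, col2, col4) and (col1, col3, col4).
ADD FILE emit_two.py;

INSERT OVERWRITE TABLE table_summary
select col1, unioned_col, count(1), count(distinct col4)
from (
  select TRANSFORM (col1, col2, col3, col4)
  USING 'python emit_two.py'
  AS (col1, unioned_col, col4)
  from table
) b
group by col1, unioned_col;

As a side note, whether the UNION ALL version actually scans the table once or twice can be checked by running EXPLAIN on the query: if the plan shows a single map stage reading the table for both branches of the union, Hive has merged them into one scan.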