collect_set() is built into Hive. If you want a version that does not
de-duplicate, look here:
https://github.com/edwardcapriolo/hive-collect
Caution: both of these functions can run out of memory if the
results are larger than a mapper can store in memory.
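
For illustration, here is a minimal sketch of collect_set() against the
table from the question quoted below. The table name my_table is a
placeholder, since the original query does not give one; Col1 and Col2 are
the columns from the question.

```sql
-- my_table is a hypothetical name; the original message elides the real one.
-- collect_set() returns one array per Col1 group with duplicates removed,
-- so the repeated (K1, V1) row contributes V1 only once.
SELECT Col1, collect_set(Col2) AS distinct_vals
FROM my_table
GROUP BY Col1;
```

The order of the elements inside each array is not guaranteed, so do not
rely on it. Each array is built in memory for its group, which is where the
out-of-memory caution above comes from.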
On Thu, Nov 1, 2012 at 2:27 PM, Ratner
Sorry to ask what is probably a very naïve Hive question but here goes:
I have a table as follows:
Col1 Col2
K1 V1
K1 V1
K2 V1
K3 V1
K1 V2
K1 V3
K2 V2
Now I have managed to SELECT Col1,COUNT(DISTINCT Col2) FROM ... BY COL1; to
obtain
K1 3
K2