Hive performance issue

Deepak Khandelwal Thu, 16 Feb 2017 20:11:34 -0800

I have to move data from table A -> table B. 3 different queries running
parallel. Below are the details.


Table A has 90G or 7.2bn data with below columns..
C1, C2, C3, C4, C5, custID, amt
Table A is partitioned on C1 and 256 bucketing on C2.
C1 can have max 10 different values and C2 cardinality is going to be
very high..(it is like an address that will be different most of the time)

I need to group by data based on below column combinations and calculate
distinct custID count for each combination and total amount.. Group by
combinations are
C1, C2
C1, C2, C3, C4
C1, C2, C3, C4, C5


I have made 3 queries (one for each group by combination) to move data from
table a to table b. All queries run parallel and takes about 3.5 hours in
total. I have disabled concurrency on table b and enabled vectorization..

I need to finish this job of moving data in less than max 1 hour or if
possible in 40 mins.. should I use union all and run one query, will that
help improve the job completion time? Or any other good suggestions here
will be much appreciated..

I am open to design suggestions as well to make it as fast as possible.
Thanks a lot

Hive performance issue

Reply via email to