optimize hive query for multitable join where one table is huge

Srinivasan Ramaswamy Wed, 26 Mar 2014 15:55:31 -0700

I have a join query where i am joining huge tables and i am trying to
optimize this hive query.


    INSERT OVERWRITE TABLE result
    SELECT /*+ STREAMTABLE(product) */
    i.IMAGE_ID,
    p.PRODUCT_NO,
    p.STORE_NO,
    p.PRODUCT_CAT_NO,
    p.CAPTION,
    p.PRODUCT_DESC,
    p.IMAGE1_ID,
    p.IMAGE2_ID,
    s.STORE_ID,
    s.STORE_NAME,
    p.CREATE_DATE,
    CASE WHEN custImg.IMAGE_ID is NULL THEN 0 ELSE 1 END,
    CASE WHEN custImg1.IMAGE_ID is NULL THEN 0 ELSE 1 END,
    CASE WHEN custImg2.IMAGE_ID is NULL THEN 0 ELSE 1 END
    FROM image i
    JOIN PRODUCT p ON i.IMAGE_ID = p.IMAGE1_ID
    JOIN PRODUCT_CAT pcat ON p.PRODUCT_CAT_NO = pcat.PRODUCT_CAT_NO
    JOIN STORE s ON p.STORE_NO = s.STORE_NO
    JOIN STOCK_INFO si ON si.STOCK_INFO_ID = pcat.STOCK_INFO_ID
    LEFT OUTER JOIN CUSTOMIZABLE_IMAGE custImg ON i.IMAGE_ID =
custImg.IMAGE_ID
    LEFT OUTER JOIN CUSTOMIZABLE_IMAGE custImg1 ON p.IMAGE1_ID =
custImg1.IMAGE_ID
    LEFT OUTER JOIN CUSTOMIZABLE_IMAGE custImg2 ON p.IMAGE2_ID =
custImg2.IMAGE_ID;

Here are some facts about the tables
image table has 60 million rows
product table has 1 billion rows
product_cat has 1000 rows
store has 1m rows
stock_info has 100 rows
customizable_image has 200k rows

a product can have one or two images (image1 and image2) and product level
information are stored only in product table. i tried moving the join with
product to the bottom but i couldnt as all other following joins require
data from the product table.

Here is what i tried so far:
1. I gave the hint to hive to stream product table as its the biggest one
2. I bucketed the table (during create table of image and product) into 256
buckets (on image_id) and then did the join - didnt give me any significant
performance gain
3. changed the input format to sequence file from textfile(gzip files) , so
that it can be splittable and hence more mappers can be run if hive want to
run more mappers

The query is still taking longer than 5 hours in Hive (running in aws with
3 large nodes) where as in RDBMS it takes only 5 hrs. I need some help in
optimizing this query, so that it executes much faster. what else can i
try, does partitioning the table help in improving join performance ?

This brings me to the question, "is Hive even the right choice (compared to
rdbms) for such complex joins" ?

Thanks
Srini

optimize hive query for multitable join where one table is huge

Reply via email to