I have a join query where i am joining huge tables and i am trying to optimize this hive query.
INSERT OVERWRITE TABLE result SELECT /*+ STREAMTABLE(product) */ i.IMAGE_ID, p.PRODUCT_NO, p.STORE_NO, p.PRODUCT_CAT_NO, p.CAPTION, p.PRODUCT_DESC, p.IMAGE1_ID, p.IMAGE2_ID, s.STORE_ID, s.STORE_NAME, p.CREATE_DATE, CASE WHEN custImg.IMAGE_ID is NULL THEN 0 ELSE 1 END, CASE WHEN custImg1.IMAGE_ID is NULL THEN 0 ELSE 1 END, CASE WHEN custImg2.IMAGE_ID is NULL THEN 0 ELSE 1 END FROM image i JOIN PRODUCT p ON i.IMAGE_ID = p.IMAGE1_ID JOIN PRODUCT_CAT pcat ON p.PRODUCT_CAT_NO = pcat.PRODUCT_CAT_NO JOIN STORE s ON p.STORE_NO = s.STORE_NO JOIN STOCK_INFO si ON si.STOCK_INFO_ID = pcat.STOCK_INFO_ID LEFT OUTER JOIN CUSTOMIZABLE_IMAGE custImg ON i.IMAGE_ID = custImg.IMAGE_ID LEFT OUTER JOIN CUSTOMIZABLE_IMAGE custImg1 ON p.IMAGE1_ID = custImg1.IMAGE_ID LEFT OUTER JOIN CUSTOMIZABLE_IMAGE custImg2 ON p.IMAGE2_ID = custImg2.IMAGE_ID; Here are some facts about the tables image table has 60 million rows product table has 1 billion rows product_cat has 1000 rows store has 1m rows stock_info has 100 rows customizable_image has 200k rows a product can have one or two images (image1 and image2) and product level information are stored only in product table. i tried moving the join with product to the bottom but i couldnt as all other following joins require data from the product table. Here is what i tried so far: 1. I gave the hint to hive to stream product table as its the biggest one 2. I bucketed the table (during create table of image and product) into 256 buckets (on image_id) and then did the join - didnt give me any significant performance gain 3. changed the input format to sequence file from textfile(gzip files) , so that it can be splittable and hence more mappers can be run if hive want to run more mappers The query is still taking longer than 5 hours in Hive (running in aws with 3 large nodes) where as in RDBMS it takes only 5 hrs. I need some help in optimizing this query, so that it executes much faster. what else can i try, does partitioning the table help in improving join performance ? This brings me to the question, "is Hive even the right choice (compared to rdbms) for such complex joins" ? Thanks Srini