Hello, I've encountered a weird issue with Hive and I'm not sure if I'm doing something wrong or if it is a bug. I'm trying to run a multi-group-by SELECT statement on a partitioned table. I want data from only one partition, so all the WHERE clauses are exactly the same and reference only the partition column. I would expect partition pruning to kick in so that Hadoop processes only the files from the given partition, but in fact it reads all the data in the table, which makes this practically unusable for big tables...
Simple test case:

-- create a simple partitioned table
DROP TABLE IF EXISTS partition_test;
CREATE TABLE partition_test (col1 array<bigint>) PARTITIONED BY (part_col bigint);

-- add some partitions filled with some random data
INSERT OVERWRITE TABLE partition_test PARTITION (part_col=1) SELECT array(1,count(*)) FROM partition_test LIMIT 1;
INSERT OVERWRITE TABLE partition_test PARTITION (part_col=2) SELECT array(2,count(*)) FROM partition_test LIMIT 2;

-- see what happens when you try to perform a multi-group-by query on one of the partitions
EXPLAIN EXTENDED
FROM partition_test LATERAL VIEW explode(col1) tmp AS exp_col1
INSERT OVERWRITE DIRECTORY '/test/1' SELECT exp_col1 WHERE (part_col=2)
INSERT OVERWRITE DIRECTORY '/test/2' SELECT exp_col1 WHERE (part_col=2);

-- result: it wants to scan all partitions :-(

I also tried to trick Hive by using a view (CREATE VIEW part_view AS SELECT * FROM partition_test WHERE (part_col=2);) and then running the select on that, but the behavior is still the same...

The version of Hive I tested this on is 0.7.1 (Cloudera distribution), if that matters. I would be very grateful if you could point me to some other way to get the data without reading the entire table... And in case this is a bug, where should I report it?

Best regards,
J. Dolinar
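The only workaround I've come up with so far is to stage the wanted partition into a plain temporary table first, because a simple single-table query with only the partition column in the WHERE clause does get pruned correctly; the multi-insert then runs against the staged copy. It costs an extra write of the whole partition, so I'd still prefer a cleaner solution (the name temp_part is just for illustration):

```sql
-- stage the one partition in an unpartitioned temp table;
-- this plain single-insert query is pruned to part_col=2 only
DROP TABLE IF EXISTS temp_part;
CREATE TABLE temp_part AS SELECT col1 FROM partition_test WHERE part_col=2;

-- run the multi-insert against the staged copy,
-- which no longer touches the other partitions
FROM temp_part LATERAL VIEW explode(col1) tmp AS exp_col1
INSERT OVERWRITE DIRECTORY '/test/1' SELECT exp_col1
INSERT OVERWRITE DIRECTORY '/test/2' SELECT exp_col1;
```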