Hello,

I've encountered a weird issue with hive and I'm not sure if I'm doing
something wrong or if it is a bug. I'm trying to do a multi-group-by select
statement on a partitioned table. I wan't only data from one partition,
therefore all the WHERE statements are exactly the same and contain only
the partition columns. Now, I would expect that the optimization kicks in
and hadoop will process only files from the given partition, but in fact,
it reads all the data in the table. Which makes it practicaly unusable for
big tables...

Simple testcase:

-- create a simple partitioned table
DROP TABLE IF EXISTS partition_test;
CREATE TABLE partition_test (col1 array<bigint>)
  PARTITIONED BY (part_col bigint);

-- add some partitions filled with some random data
INSERT OVERWRITE TABLE partition_test PARTITION (part_col=1) SELECT
array(1,count(*)) FROM partition_test LIMIT 1;
INSERT OVERWRITE TABLE partition_test PARTITION (part_col=2) SELECT
array(2,count(*)) FROM partition_test LIMIT 2;

-- see what happens when you try to perform multi-group-by query on one of
the partitions
EXPLAIN EXTENDED
FROM partition_test
LATERAL VIEW explode(col1) tmp AS exp_col1
INSERT OVERWRITE DIRECTORY '/test/1'
    SELECT exp_col1
    WHERE (part_col=2)
INSERT OVERWRITE DIRECTORY '/test/2'
    SELECT exp_col1
    WHERE (part_col=2);
-- result: it wants to scan all partitions :-(


I also tried to trick Hive by using a view (CREATE VIEW part_view AS SELECT
* FROM partition_test WHERE(part_col=2);) and than running the select on
that, but the behavior is still the same...

The version of Hive I tested this on is 0.7.1 (Cloudera distribution) if
that matters. I would be very grateful if you could point me to some other
way how to get the data without reading entire table...  And in case this
is a bug, where should I report it?

Best regards,
J. Dolinar

Reply via email to