[ https://issues.apache.org/jira/browse/HIVE-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13587776#comment-13587776 ]
Tamas Tarjanyi commented on HIVE-4014:
--------------------------------------

I could not check the code, but my experience differs. See my results.

set dfs.replication=1;
set mapred.submit.replication=1;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set hive.exec.max.dynamic.partitions.pernode=1000;

-- # The command below will create an artificial CSV input file of ~1.5 GB with 10 million rows:
-- # i=0 ; while [ $i -lt 10000000 ] ; do echo "$i,$((i%5)),12345678901234567890123456789012345678901234567890,abcdefghjkabcdefghjkabcdefghjkabcdefghjkabcdefghjk,00000000000000000000000000000000000000000000000000,,," ; ((i=i+1)) ; done >/mnt/hadoop/RCTEST.csv

-- Load this data into a base table.
DROP TABLE RCTEST_CSV;
CREATE TABLE RCTEST_CSV (
  id BIGINT,
  counter STRING,
  value2 STRING,
  value3 STRING,
  value4 STRING,
  value5 STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '\n';
LOAD DATA LOCAL INPATH "/mnt/hadoop/RCTEST.csv" INTO TABLE RCTEST_CSV;

-- Multiply it into a "BAD" table.
-- This will require ~30 GB altogether and will have 200 million rows.
DROP TABLE RCTEST_BAD;
CREATE TABLE RCTEST_BAD STORED AS RCFILE AS SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;

-- Create a table to simulate the expected column pruning.
-- This will require ~1.5 GB and will also have 200 million rows.
DROP TABLE RCTEST_GOOD;
CREATE TABLE RCTEST_GOOD STORED AS RCFILE AS SELECT id, counter FROM RCTEST_BAD;

-- Let's start measuring performance.
-- For this test my cluster had one node with 2 disks in a stripe. (The rest of the nodes were decommissioned to eliminate parallel reads across nodes.)
-- Read speed is roughly 110 MB/sec per disk, i.e. 220 MB/sec max in total.
--
-- Please execute the OS command below to eliminate OS caching every time, before you execute any of the SQL commands below!!!
--
-- sync && echo 3 >/proc/sys/vm/drop_caches
--
-- 1.) Should be slow. This is the reference value for a full read.
SELECT count(*) FROM RCTEST_BAD;
-- RESULT: 158 sec
--
-- 2.) Should be faster than 1 because of column pruning.
SELECT count(id) FROM RCTEST_BAD;
-- RESULT: 156 sec
-- Actually, I believe this is proof that column pruning is not working. But let's go further.
--
-- 3.) Should be roughly like 2, but faster than 1.
SELECT COUNT(counter) FROM RCTEST_BAD;
-- RESULT: 159 sec
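A more direct cross-check than wall-clock time is the HDFS bytes read counter that the issue description refers to: with working pruning, count(id) has to read far fewer bytes than count(*), no matter how noisy the timings are. Below is a minimal cold-cache harness, as a sketch only: it assumes a Linux test box where dropping the page cache is acceptable, and a Hive build whose CLI echoes per-job "HDFS Read:" counters (on builds that do not, read HDFS_BYTES_READ from the JobTracker web UI instead).

# Cold-cache run of a single query, keeping the CLI output for inspection.
sync && echo 3 >/proc/sys/vm/drop_caches
{ time hive -e 'SELECT count(id) FROM RCTEST_BAD;' ; } 2>&1 | tee /tmp/prune_check.log
# With working pruning this figure should be a small fraction of the count(*) run;
# near-identical byte counts for count(*) and count(id) mean every column was read.
grep -i 'HDFS Read' /tmp/prune_check.log

RCFile decides which columns to skip based on hive.io.file.readcolumn.ids in the job configuration, so the job.xml of the submitted job is another place to confirm what the planner actually requested.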
-- Let's see how it works with the simulated pruning.
--
-- 4.) Should be faster than 1. In theory it should be roughly the same as 2 with proper pruning.
SELECT count(*) FROM RCTEST_GOOD;
-- RESULT: 29 sec
--
-- 5.) Should be the same as 4, since pruning does not work here either.
SELECT count(id) FROM RCTEST_GOOD;
-- RESULT: 31 sec
--
-- 6.) Should be the same as 4, since pruning does not work here either.
SELECT count(counter) FROM RCTEST_GOOD;
-- RESULT: 31 sec
--
--
-- SHORT TEST WITH A GOOD VERSION
--
-- Now let's see 1. and 2. on a well-working Apache 1.0.4 release.
-- The machine is my desktop notebook, with one hard disk only.
-- The amount of data is ONLY 10 million rows now.
--
-- 2.1.) Should be slow. This is the reference value for a full read.
SELECT count(*) FROM RCTEST_BAD;
-- RESULT: 80 sec
--
-- 2.2.) Should be faster than 2.1 because of column pruning.
SELECT count(id) FROM RCTEST_BAD;
-- RESULT: 47 sec
-- YEEEES!!!! That is what I assumed. More data would mean more performance gain, especially on better hardware.
--
-- 2.5.) Should be faster than 2.1. In theory it should be roughly the same as 2.2 with proper pruning.
SELECT count(id) FROM RCTEST_GOOD;
-- RESULT: 44 sec
-- HERE WE ARE! AS EXPECTED.

> Hive+RCFile is not doing column pruning and reading much more data than necessary
> ----------------------------------------------------------------------------------
>
>                 Key: HIVE-4014
>                 URL: https://issues.apache.org/jira/browse/HIVE-4014
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>
> With even simple projection queries, I see that the HDFS bytes read counter doesn't show any reduction in the amount of data read.