[ https://issues.apache.org/jira/browse/HIVE-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13587776#comment-13587776 ]

Tamas Tarjanyi commented on HIVE-4014:
--------------------------------------

I could not check the code, but my experience is different. See my results below.

set dfs.replication=1;
set mapred.submit.replication=1;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set hive.exec.max.dynamic.partitions.pernode=1000;

-- # The command below creates an artificial CSV input file of ~1.5Gb with
-- # 10m rows:
-- # i=0 ; while [ $i -lt 10000000 ] ; do echo "$i,$((i%5)),12345678901234567890123456789012345678901234567890,abcdefghjkabcdefghjkabcdefghjkabcdefghjkabcdefghjk,00000000000000000000000000000000000000000000000000,,," ; ((i=i+1)) ; done >/mnt/hadoop/RCTEST.csv
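
-- # Quick check that the file came out as intended (plain OS commands, no
-- # Hive involved):
-- # wc -l /mnt/hadoop/RCTEST.csv     # expect 10000000 lines
-- # ls -lh /mnt/hadoop/RCTEST.csv    # expect roughly 1.5G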

-- Load this data into a base table
DROP TABLE RCTEST_CSV;
CREATE TABLE RCTEST_CSV (
  id       BIGINT,
  counter  STRING,
  value2   STRING,
  value3   STRING,
  value4   STRING,
  value5   STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '\n';
LOAD DATA LOCAL INPATH "/mnt/hadoop/RCTEST.csv" INTO TABLE RCTEST_CSV;
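
-- Sanity check on the load; the generator loop above wrote exactly 10m rows:
SELECT count(*) FROM RCTEST_CSV;
-- Expected: 10000000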

-- Multiply it into a "BAD" table: the CTAS below plus 19 INSERTs yield 20
-- copies of the CSV data.
-- This will require ~30Gb altogether and will hold 200m rows.
DROP TABLE RCTEST_BAD;
CREATE TABLE RCTEST_BAD
STORED AS RCFILE
AS 
  SELECT * FROM RCTEST_CSV
;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
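
-- Optional size check (a sketch; assumes the default Hive warehouse location
-- /user/hive/warehouse and a Hadoop 1.x shell):
-- # hadoop fs -dus /user/hive/warehouse/rctest_bad    # expect ~30Gb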

-- Create a table to simulate the expected column pruning.
-- The table below will require ~1.5Gb and will also hold 200m rows.
DROP TABLE RCTEST_GOOD;
CREATE TABLE RCTEST_GOOD
STORED AS RCFILE
AS 
  SELECT id,counter FROM RCTEST_BAD
;
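
-- Same optional check for the narrow table (same warehouse-path assumption):
-- # hadoop fs -dus /user/hive/warehouse/rctest_good    # expect ~1.5Gb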

-- Let's start measuring performance.
-- For this test my cluster had 1 node with 2 disks in a stripe. (The rest of
-- the nodes were decommissioned to eliminate parallel reads across nodes.)
-- Read speed is roughly 110Mb/sec/disk. In total this means 220Mb/sec max.
--
-- Please execute the OS command below to drop the OS cache every time before
-- you execute any of the SQL commands below!!!
--
-- sync && echo 3 >/proc/sys/vm/drop_caches
--
-- 1.) Should be slow. This is the reference value for a full read.
SELECT count(*) FROM RCTEST_BAD;
-- RESULT: 158 sec
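--
-- To see how much data the scan actually read, the job counter can be pulled
-- from the command line (a sketch; assumes a Hadoop 1.x CLI, where <job_id>
-- stands for the id Hive prints for the MapReduce job):
-- # hadoop job -counter <job_id> FileSystemCounters HDFS_BYTES_READ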
--
-- 2.) Should be faster than 1 because of column pruning.
SELECT count(id) FROM RCTEST_BAD;
-- RESULT: 156 sec
-- Actually I believe this is proof that column pruning is not working; a
-- quick double-check is sketched right below. But let's go further.
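--
-- Double-check sketch: hive.optimize.cp is the column-pruner switch in Hive
-- of this era and should default to true; in the EXPLAIN output, the Select
-- operator under the TableScan should reference only id if pruning fired.
SET hive.optimize.cp;
EXPLAIN SELECT count(id) FROM RCTEST_BAD;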
--
-- 3.) Roughly should be like 2 but faster than 1.
SELECT COUNT(counter) FROM RCTEST_BAD;
-- RESULT: 159 sec
--
--
--
-- Let's see how it works with the simulated pruning.
--
-- 4.) Should be faster than 1. In theory it should be roughly the same as 2
-- with proper pruning.
SELECT count(*) FROM RCTEST_GOOD;
-- RESULT: 29 sec
--
-- 5.) Should be the same as 4, since pruning does not work here either.
SELECT count(id) FROM RCTEST_GOOD;
-- RESULT: 31 sec
--
-- 6.) Should be the same as 4, since pruning does not work here either.
SELECT count(counter) FROM RCTEST_GOOD;
-- RESULT: 31 sec
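--
-- Sanity arithmetic on the numbers above: ~30Gb / 158 sec is roughly
-- 190Mb/sec, close to the ~220Mb/sec disk ceiling, i.e. a full scan no
-- matter which column is projected. The narrow table reads ~1.5Gb in
-- 29-31 sec, about 50Mb/sec, nowhere near disk-bound.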
--
--
-- SHORT TEST WITH A GOOD VERSION
-- 
-- Now let's see tests 1 and 2 on a well-working Apache 1.0.4 release.
-- The machine is my desktop notebook, with one hard disk only.
-- The amount of data is ONLY 10m rows now.
-- 2.1.) Should be slow. This is the reference value for a full read.
SELECT count(*) FROM RCTEST_BAD;
-- RESULT: 80 sec
--
-- 2.2.) Should be faster than 2.1 because of column pruning.
SELECT count(id) FROM RCTEST_BAD;
-- RESULT: 47 sec
-- YEEEES!!!! That is what I assumed. More data would mean more performance
-- gain, especially on better hardware.
--
-- 2.5.) Should be faster than 2.1. In theory it should be roughly the same
-- as 2.2 with proper pruning.
SELECT count(id) FROM RCTEST_GOOD;
-- RESULT: 44 sec
-- HERE WE ARE! AS EXPECTED.


                
> Hive+RCFile is not doing column pruning and reading much more data than necessary
> ---------------------------------------------------------------------------------
>
>                 Key: HIVE-4014
>                 URL: https://issues.apache.org/jira/browse/HIVE-4014
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>
> With even simple projection queries, I see that HDFS bytes read counter 
> doesn't show any reduction in the amount of data read.

