terminal_type = 0 has about 260,000,000 rows, which covers almost half of the whole data. terminal_type = 25066 has just 3,800 rows.
The table is stored as ORC with the following tblproperties:
tblproperties("orc.compress"="SNAPPY","orc.compress.size"="262141","orc.stripe.size"="268435456","orc.row.index.stride"="100000","orc.create.index"="true","orc.bloom.filter.columns"="");
The table "gprs" has sorted by terminal_type.  Before sort, I have another 
table named "gprs_orc", I use sparkSQL to sort the data as follows:(before do 
this, I set  hive.enforce.sorting=true)sql> INSERT INTO TABLE gprs SELECT * 
FROM gprs_orc sort by terminal_type ;Because the table gprs_orc has 800 files, 
so generate 800 Tasks, and create 800 files also in table gprs. But I am not 
sure whether each file be sorted separately or not.
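As far as I understand, SORT BY only sorts within each task, so each output file should be sorted individually but there is no global order across files. One possible way to get rows with the same value into the same files (and therefore tighter min/max ranges per file) would be to redistribute before sorting, for example (just a sketch, not something I have run; note that distributing on a skewed column such as terminal_type, where the value 0 holds about half the rows, would overload a single task):

INSERT OVERWRITE TABLE gprs
SELECT * FROM gprs_orc
DISTRIBUTE BY terminal_type
SORT BY terminal_type;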
I have tried a bloom filter, but it makes no improvement. I know about Tez, but have never used it; I will try it later.
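(For the bloom filter test, the table properties have to be set and the data rewritten, since files written before the change contain no bloom filters. Roughly the following; this is a sketch rather than the exact statements I ran, and the fpp value shown is just the usual default:)

ALTER TABLE gprs SET TBLPROPERTIES (
  "orc.bloom.filter.columns"="terminal_type",
  "orc.bloom.filter.fpp"="0.05");
-- rewrite the data so the new files actually contain the bloom filters
INSERT OVERWRITE TABLE gprs SELECT * FROM gprs_orc SORT BY terminal_type;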
The following is my test in hive 1.2.1:

1. enable hive.optimize.index.filter and hive.optimize.ppd:
   select count(*) from gprs where terminal_type=25080;   will not scan data           Time taken: 353.345 seconds
   select count(*) from gprs where terminal_type=25066;   just scan a few row groups   Time taken: 354.860 seconds
   select count(*) from gprs where terminal_type=0;       scan half of the data        Time taken: 378.312 seconds

2. disable hive.optimize.index.filter and hive.optimize.ppd:
   select count(*) from gprs where terminal_type=25080;   scan all the data            Time taken: 389.700 seconds
   select count(*) from gprs where terminal_type=25066;   scan all the data            Time taken: 386.600 seconds
   select count(*) from gprs where terminal_type=0;       scan all the data            Time taken: 395.968 seconds
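(The two cases were driven by session-level settings, roughly as below; this is only a sketch of the assumed setup, not a transcript:)

-- case 1: push-down and index-based filtering enabled
SET hive.optimize.ppd=true;
SET hive.optimize.index.filter=true;
-- case 2: both disabled, so every query falls back to a full scan
SET hive.optimize.ppd=false;
SET hive.optimize.index.filter=false;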

The following is my environment:
      3 nodes, 12 CPU cores per node, 48 GB free memory per node, 4 disks per node, 3 replicas per block, Hadoop 2.7.2, Hive 1.2.1




Joseph
 
From: Jörn Franke
Date: 2016-03-16 20:27
To: Joseph
CC: user; user
Subject: Re: The build-in indexes in ORC file does not work.
Not sure it should work. How many rows are affected? Is the data sorted?
Have you tried it with Tez? Tez has some summary statistics that tell you whether push-down is used. Maybe you need to use HiveContext.
Perhaps a bloom filter could make sense for you as well.
Perhaps a bloom filter could make sense for you as well.

On 16 Mar 2016, at 12:45, Joseph <wxy81...@sina.com> wrote:

Hi,

I have only one table, named "gprs"; it has 560,000,000 rows and 57 columns. The block size is 256 MB, and there are 800 ORC files in total, each about 51 MB.

My query statements are:
select count(*) from gprs  where  terminal_type = 25080;
select * from gprs  where  terminal_type = 25080;

In the gprs table, the "terminal_type" column's values are in the range [0, 25066].



Joseph
 
From: Jörn Franke
Date: 2016-03-16 19:26
To: Joseph
CC: user; user
Subject: Re: The build-in indexes in ORC file does not work.
How much data are you querying? What is the query? How selective is it supposed to be? What is the block size?

On 16 Mar 2016, at 11:23, Joseph <wxy81...@sina.com> wrote:

Hi all,

I know that ORC provides three levels of indexes within each file: file level, stripe level, and row level.
The file and stripe level statistics are in the file footer, so they are easy to access to determine whether the rest of the file needs to be read at all.
Row level indexes include both column statistics for each row group and the position for seeking to the start of the row group.

The following is my understanding:
1. The file and stripe level indexes are always generated; we cannot control them.
2. The row level indexes can be configured with "orc.create.index" (whether to create row indexes) and "orc.row.index.stride" (number of rows between index entries).
3. Each index has min/max statistics for each column, so sorting the data by the filter column should bring better performance.
4. To use any of the three levels of indexes, we should enable predicate push-down by setting spark.sql.orc.filterPushdown=true (in Spark SQL) or hive.optimize.ppd=true (in Hive), as in the sketch below.
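(On the Spark side, with a HiveContext in 1.5.x, the switch can also be set at the SQL level; roughly as below, shown only as a sketch rather than the exact session I used:)

SET spark.sql.orc.filterPushdown=true;
select count(*) from gprs where terminal_type = 25080;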

But I found that the built-in indexes in ORC files did not work, in either Spark 1.5.2 or Hive 1.2.1:
First, when the query's WHERE clause did not match any record (the filter value was beyond the range of the data), the performance with predicate push-down enabled was almost the same as with it disabled. I think that when the filter value is beyond the range of the data, none of the ORC files should be scanned at all if the file level indexes are used, so the performance should improve noticeably.

Second, when "orc.create.index" was enabled, the data was sorted by the filter column, and the WHERE clause matched only a few records, the performance with predicate push-down enabled was almost the same as with it disabled.

Third, when predicate push-down and "orc.create.index" were enabled, the performance when the filter value was beyond the range of the data was almost the same as when the filter value covered almost the whole data.

So, has anyone used ORC's built-in indexes before (especially in Spark SQL)? What is the issue with my setup?

Thanks!



Joseph
