Re: The build-in indexes in ORC file does not work.

Jörn Franke Sat, 19 Mar 2016 17:50:22 -0700

Not sure it should work. How many rows are affected? The data is sorted?
Have you tried with Tez? Tez has some summary statistics that tells you if you 
use push down. Maybe you need to use HiveContext.
Perhaps a bloom filter could make sense for you as well.


> On 16 Mar 2016, at 12:45, Joseph <wxy81...@sina.com> wrote:
> 
> Hi,
> 
> I have only one table named "gprs",  it has 560,000,000 rows,  and 57 
> columns.  The block size is 256M,  total ORC file number is 800, each of them 
> is about 51M.
> 
> my query statement is :
> select count(*) from gprs  where  terminal_type = 25080;
> select * from gprs  where  terminal_type = 25080;
> 
> In the gprs table, the "terminal_type"  column's  value is in [0, 25066]
> 
> Joseph
>  
> From: Jörn Franke
> Date: 2016-03-16 19:26
> To: Joseph
> CC: user; user
> Subject: Re: The build-in indexes in ORC file does not work.
> How much data are you querying? What is the query? How selective it is 
> supposed to be? What is the block size?
> 
>> On 16 Mar 2016, at 11:23, Joseph <wxy81...@sina.com> wrote:
>> 
>> Hi all,
>> 
>> I have known that ORC provides three level of indexes within each file, file 
>> level, stripe level, and row level. 
>> The file and stripe level statistics are in the file footer so that they are 
>> easy to access to determine if the rest of the file needs to be read at all. 
>> Row level indexes include both column statistics for each row group and 
>> position for seeking to the start of the row group. 
>> 
>> The following is my understanding:
>> 1. The file and stripe level indexes are forcibly generated, we can not 
>> control them.
>> 2. The row level indexes can be configured by "orc.create.index"(whether to 
>> create row indexes) and "orc.row.index.stride"(number of rows between index 
>> entries).
>> 3. Each Index has statistics of min, max for each column, so sort data by 
>> the filter column will bring better performance.
>> 4. To use any one of the three level of indexes,we should enable predicate 
>> push-down by setting spark.sql.orc.filterPushdown=true (in sparkSQL) or 
>> hive.optimize.ppd=true (in hive).
>> 
>> But I found the  build-in indexes in ORC files did not work both in spark 
>> 1.5.2 and hive 1.2.1:
>> First, when the query statement with where clause did't match any record 
>> (the filter column had a value beyond the range of data),  the performance 
>> when enabled  predicate push-down was almost the same with when disabled 
>> predicate push-down.  I think, when the filter column has a value beyond the 
>> range of data, all of the orc files will not be scanned if use file level 
>> indexes,  so the performance should improve obviously.
>> 
>> The second, when enabled "orc.create.index" and sorted data by filter column 
>> and where clause can only match a few records, the performance when enabled  
>> predicate push-down was almost the same with when disabled predicate 
>> push-down. 
>> 
>> The third, when enabled  predicate push-down and "orc.create.index", the 
>> performance when  filter column had a value beyond the range of data was 
>> almost the same with when filter column had a value covering almost the 
>> whole data. 
>> 
>> So,  has anyone used ORC's build-in indexes before (especially in spark 
>> SQL)?  What's my issue?
>> 
>> Thanks!
>> 
>> Joseph

Re: The build-in indexes in ORC file does not work.

Reply via email to