Hi all,

Has anyone used ORC indexes with Spark SQL? Does Spark SQL fully support ORC indexes?
I use the shell script "${SPARK_HOME}/bin/spark-sql" to start the Spark SQL REPL and run my query statements. The following is my test in the REPL:

spark-sql> set spark.sql.orc.filterPushdown=true;
spark-sql> select count(*) from gprs where terminal_type=25080;
Time taken: about 5 seconds
spark-sql> select * from gprs where terminal_type=25080;
Time taken: about 107 seconds

The values of the column terminal_type lie in [0, 25066] in my data, so terminal_type=25080 matches no rows. Neither query should have to scan the whole data set (if the file stats were used), so why is the time gap so large?

spark-sql> set spark.sql.orc.filterPushdown=false;
spark-sql> select count(*) from gprs where terminal_type=25080;
Time taken: about 5 seconds
spark-sql> select * from gprs where terminal_type=25080;
Time taken: about 107 seconds

So with spark.sql.orc.filterPushdown disabled, the times taken (in particular for select * from ...) were no different from those with it enabled.

I have tried the explain extended command, but it did not show any information indicating that the query used the ORC stats. Is there any way to check whether the stats are used? (A sketch of what I have in mind is in the P.S. below.)

Appendix:
Cluster environment: Hadoop 2.7.2, Spark 1.6.1; 3 nodes, 3 workers per node, 8 cores per worker, 16 GB per worker, 16 GB per executor; HDFS block size 256 MB, 3 replicas per block, 4 disks per datanode.
Data: 800 ORC files in total, each about 51 MB; 560,000,000 rows and 57 columns; a single table named gprs (ORC format).

Thanks!
Joseph
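P.S. Here is a minimal sketch (untested on my exact setup, from spark-shell on Spark 1.6.x) of how I intend to check whether the filter is actually handed down to the ORC reader. The path "hdfs:///user/joseph/gprs" is a made-up placeholder for the table's ORC files, and note that a query against the Hive metastore table may take a different (HiveTableScan) code path whose plan looks different:

import sqlContext.implicits._

// Enable predicate pushdown for the ORC data source.
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")

// Read the ORC files directly through the ORC data source
// (placeholder path; point it at the table's files).
val df = sqlContext.read.orc("hdfs:///user/joseph/gprs")
val q  = df.filter($"terminal_type" === 25080)

// Print the physical plan. If the predicate reached the data source,
// the scan node should mention it (e.g. a PushedFilters entry such as
// EqualTo(terminal_type,25080)). If it only appears in a separate
// Filter operator above the scan, it is being evaluated row by row
// in Spark, and the ORC stats are not saving any I/O.
q.explain(true)

Independently of the plan, if the Hive CLI is available, the per-file and per-stripe column statistics of an ORC file can be inspected with "hive --orcfiledump <path-to-orc-file>", which should show whether the min/max values of terminal_type would let whole stripes be skipped for this predicate.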