Hi all,

Has anyone used ORC indexes with Spark SQL? Does Spark SQL fully support ORC indexes?
I use the shell script "${SPARK_HOME}/bin/spark-sql" to start the Spark SQL REPL and run my query statements. The following is my test in the REPL:

spark-sql> set spark.sql.orc.filterPushdown=true;
spark-sql> select count(*) from gprs where terminal_type=25080;
Time taken: about 5 seconds
spark-sql> select * from gprs where terminal_type=25080;
Time taken: about 107 seconds

The values of the column terminal_type lie in [0, 25066] in my data, so terminal_type=25080 matches no rows. Neither query should have to scan the whole data set (if the file stats were used), so why is the time gap so large?

spark-sql> set spark.sql.orc.filterPushdown=false;
spark-sql> select count(*) from gprs where terminal_type=25080;
Time taken: about 5 seconds
spark-sql> select * from gprs where terminal_type=25080;
Time taken: about 107 seconds

So with spark.sql.orc.filterPushdown disabled, the times taken (in particular for select * from ...) were no different from those with it enabled.

I have tried the explain extended command, but it did not show any information indicating that the query used the ORC stats. Is there any way to check whether the stats are used? (A sketch of what I have in mind is in the P.S. below.)

Appendix:
Cluster environment: Hadoop 2.7.2, Spark 1.6.1; 3 nodes, 3 workers per node, 8 cores per worker, 16 GB per worker, 16 GB per executor; HDFS block size 256 MB, 3 replicas per block, 4 disks per datanode.
Data: 800 ORC files in total, each about 51 MB; 560,000,000 rows and 57 columns; a single table named gprs (ORC format).

Thanks!
Joseph
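P.S. Here is a minimal sketch (untested on my exact setup, from spark-shell on Spark 1.6.x) of how I intend to check whether the filter is actually handed down to the ORC reader. The path "hdfs:///user/joseph/gprs" is a made-up placeholder for the table's ORC files, and note that a query against the Hive metastore table may take a different (HiveTableScan) code path whose plan looks different:

import sqlContext.implicits._

// Enable predicate pushdown for the ORC data source.
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")

// Read the ORC files directly through the ORC data source
// (placeholder path; point it at the table's files).
val df = sqlContext.read.orc("hdfs:///user/joseph/gprs")
val q  = df.filter($"terminal_type" === 25080)

// Print the physical plan. If the predicate reached the data source,
// the scan node should mention it (e.g. a PushedFilters entry such as
// EqualTo(terminal_type,25080)). If it only appears in a separate
// Filter operator above the scan, it is being evaluated row by row
// in Spark, and the ORC stats are not saving any I/O.
q.explain(true)

Independently of the plan, if the Hive CLI is available, the per-file and per-stripe column statistics of an ORC file can be inspected with "hive --orcfiledump <path-to-orc-file>", which should show whether the min/max values of terminal_type would let whole stripes be skipped for this predicate.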