> I have tried bloom filter, but it makes no improvement. I know about
> tez, but never use, I will try it later.
...
> select count(*) from gprs where terminal_type=25080;
> will not scan data
> Time taken: 353.345 seconds

CombineInputFormat does not do any split-elimination, so MapReduce does not get container speedups there. Most of your ~300s looks to be the fixed overheads of setting up each task.

We could not fix this in MRv2 due to historical compatibility issues with merge-joins & schema evolution (see HiveSplitGenerator.java).

This is not recommended for regular use (other than in Tez), but you can force split-elimination with

set hive.input.format=${hive.tez.input.format};

>>>> So, has anyone used ORC's build-in indexes before (especially in
>>>> spark SQL)? What's my issue?

We work on SparkSQL perf issues as well - this has to do with OrcRelation

https://github.com/apache/spark/pull/10938
+
https://github.com/apache/spark/pull/10842

Cheers,
Gopal