> I have tried a bloom filter, but it makes no improvement. I know about
> Tez, but have never used it; I will try it later.
...
>    select count(*) from gprs where terminal_type=25080;
>   will not scan data
>      Time taken: 353.345 seconds

CombineInputFormat does not do any split-elimination, so MapReduce does
not get container speedups there.

Most of your ~300s looks to be the fixed overheads of setting up each task.

We could not fix this in MRv2 due to historical compatibility issues with
merge-joins & schema evolution (see HiveSplitGenerator.java).

This is not recommended for regular use (other than in Tez), but you can
force split-elimination with:

set hive.input.format=${hive.tez.input.format};
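For a one-off test, a session might look roughly like the sketch below. It
reuses the table and query from your mail; the hive.optimize.index.filter
line is an assumption about your setup (it is the switch that lets the ORC
reader use its indexes for predicates), so drop it if you already have it on.

```sql
-- Session-scoped only; not something to set globally for MapReduce.
set hive.input.format=${hive.tez.input.format};

-- Assumption: predicate pushdown into the ORC reader is not already enabled.
set hive.optimize.index.filter=true;

-- Query from the original post.
select count(*) from gprs where terminal_type=25080;
```

If the ${...} substitution does not resolve in your CLI, running
"set hive.tez.input.format;" prints the current value, which you can paste
in by hand.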

>>>> So, has anyone used ORC's built-in indexes before (especially in
>>>> Spark SQL)? What's my issue?

We work on SparkSQL perf issues as well; this one has to do with OrcRelation:

https://github.com/apache/spark/pull/10938

+
https://github.com/apache/spark/pull/10842


Cheers,
Gopal

