Hi Zhan Zhang,
Could my problem (the ORC predicate is not generated from the WHERE
clause even though spark.sql.orc.filterPushdown=true) be related to
any of the factors below?
- orc file version (File Version: 0.12 with HIVE_8732)
- hive version (using Hive 1.2.1.2.3.0.0-2557)
- the orc table is not sorted / indexed
- the split strategy hive.exec.orc.split.strategy (see the sketch below)
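For the last one, this is how I would try forcing a different strategy;
the property name comes from Hive, and whether it takes effect through
hiveContext on this path is my assumption:

// Assumption: Hive's ORC split strategy setting (BI / ETL / HYBRID) is
// honored here; ETL reads file footers up front when computing splits.
hiveContext.setConf("hive.exec.orc.split.strategy", "ETL")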
BR,
Patcharee
On 10/09/2015 08:01 PM, Zhan Zhang wrote:
That is weird. Unfortunately, there is no debug info available on this
part. Can you please open a JIRA to add some debug information on the
driver side?
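In the meantime, to make sure whatever is already logged on that path
shows up, something like this should work; the logger name is my guess
at the relevant class:

import org.apache.log4j.{Level, Logger}
// Assumed logger name: the Hive ORC input format that prints the
// "No ORC pushdown predicate" / "ORC pushdown predicate: ..." messages.
Logger.getLogger("org.apache.hadoop.hive.ql.io.orc.OrcInputFormat").setLevel(Level.DEBUG)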
Thanks.
Zhan Zhang
On Oct 9, 2015, at 10:22 AM, patcharee <patcharee.thong...@uni.no> wrote:
I set hiveContext.setConf("spark.sql.orc.filterPushdown", "true"), but
the log reports no ORC pushdown predicate for my query with a WHERE
clause:
15/10/09 19:16:01 DEBUG OrcInputFormat: No ORC pushdown predicate
I do not understand what is wrong here.
BR,
Patcharee
On 09. okt. 2015 19:10, Zhan Zhang wrote:
In your case, you manually set an AND pushdown, and the predicate is
right given your setting: leaf-0 = (EQUALS x 320)
The right way is to enable the predicate pushdown as follows.
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
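For example, a minimal sketch of letting Spark derive the SearchArgument
from the DataFrame filter itself (the path and column names below are
just the ones mentioned in this thread, not anything you must use):

// With filterPushdown enabled, Spark builds the ORC SearchArgument from
// the filter on its own; no manual SearchArgumentFactory or sarg.pushdown.
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
val df = sqlContext.read.format("orc").load("/apps/hive/warehouse/table1")
df.filter("x = 320 and y = 117").select("date", "z").show()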
Thanks.
Zhan Zhang
On Oct 9, 2015, at 9:58 AM, patcharee <patcharee.thong...@uni.no> wrote:
Hi Zhan Zhang,
Actually my query has a WHERE clause: "select date, month, year, hh,
(u*0.9122461 - v*-0.40964267), (v*0.9122461 + u*-0.40964267), z
from 4D where x = 320 and y = 117 and zone == 2 and year=2009
and z >= 2 and z <= 8". Columns "x" and "y" are not partition
columns; the others are partition columns. I expected the system to
use predicate pushdown, but with debug logging on I found that no
pushdown predicate was generated ("DEBUG OrcInputFormat: No ORC
pushdown predicate").
Then I tried to set the search argument explicitly, on the column
"x", which is not a partition column:
import org.apache.hadoop.hive.ql.io.sarg.SearchArgumentFactory

val xs = SearchArgumentFactory.newBuilder().startAnd().equals("x", 320).end().build()
hiveContext.setConf("hive.io.file.readcolumn.names", "x")
hiveContext.setConf("sarg.pushdown", xs.toKryo())
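Printing the SearchArgument should show the same representation that
later appears in the log (my assumption about its toString):

// Assumption: SearchArgument.toString prints the leaf/expr form below.
println(xs)
// leaf-0 = (EQUALS x 320)
// expr = leaf-0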
This time a pushdown predicate was generated in the log, but the
results were wrong (no results at all):
15/10/09 18:36:06 INFO OrcInputFormat: ORC pushdown predicate:
leaf-0 = (EQUALS x 320)
expr = leaf-0
Any idea what is wrong here? Why is the ORC pushdown predicate not
applied by the system?
BR,
Patcharee
On 09. okt. 2015 18:31, Zhan Zhang wrote:
Hi Patcharee,
From the query, it looks like only column pruning will be applied.
Partition pruning and predicate pushdown have no effect. Do you see a
big IO difference between the two methods?
The potential reason for the speed difference I can think of may be
the different versions of OrcInputFormat. The hive path may use
NewOrcInputFormat, but the spark path uses OrcInputFormat.
Thanks.
Zhan Zhang
On Oct 8, 2015, at 11:55 PM, patcharee <patcharee.thong...@uni.no>
wrote:
Yes, the predicate pushdown is enabled, but it still takes longer
than the first method.
BR,
Patcharee
On 08. okt. 2015 18:43, Zhan Zhang wrote:
Hi Patcharee,
Did you enable the predicate pushdown in the second method?
Thanks.
Zhan Zhang
On Oct 8, 2015, at 1:43 AM, patcharee
<patcharee.thong...@uni.no> wrote:
Hi,
I am using Spark SQL 1.5 to query a hive table stored as partitioned
ORC files. We have about 6000 files in total, and each file is about
245MB.
What is the difference between the two query methods below?
1. Querying the hive table directly:
hiveContext.sql("select col1, col2 from table1")
2. Reading the ORC files, registering a temp table, and querying the
temp table:
val c = hiveContext.read.format("orc").load("/apps/hive/warehouse/table1")
c.registerTempTable("regTable")
hiveContext.sql("select col1, col2 from regTable")
When the number of files is large (querying all 6000 files), the
second case is much slower than the first one. Any ideas why?
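For reference, this is roughly how I compare the two; the timing
helper is mine, and count() is just there to force a full scan of
each plan:

// Rough wall-clock comparison of the two query paths.
def time[T](label: String)(f: => T): T = {
  val t0 = System.nanoTime()
  val result = f
  println(s"$label took ${(System.nanoTime() - t0) / 1e9} s")
  result
}
time("hive table") { hiveContext.sql("select col1, col2 from table1").count() }
time("orc path")   { hiveContext.sql("select col1, col2 from regTable").count() }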
BR,
Patcharee