When reading a large number of ORC files from HDFS under a single
directory, Spark doesn't launch any tasks for quite some time, and I
don't see any tasks running during that period. I'm using the command
and spark.sql configs below to read the ORC files.

What is Spark doing under the hood when spark.read.orc is issued?

spark.read.schema(schema1).orc("hdfs://test1").filter("date >= 20181001")

"spark.sql.orc.enabled": "true",
"spark.sql.orc.filterPushdown": "true"

Also, instead of reading the ORC files directly, I tried running a Hive
query on the same dataset, but the filter predicate was not pushed
down. Where should I set the configs below?

"hive.optimize.ppd": "true",
"hive.optimize.ppd.storage": "true"

What is the best way to read ORC files from HDFS, and which parameters
should I tune?
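For context, these are the other knobs from the docs that look relevant
(a sketch; the values shown are just the documented defaults, and
spark.sql.orc.impl only exists on Spark 2.3+):

// Prefer the native ORC reader and its vectorized path (Spark 2.3+)
spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")

// My understanding is that the pause before any tasks launch is
// driver-side file listing; this threshold controls when listing is
// parallelized across the cluster (not verified for a single directory)
spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.threshold", "32")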



