Hi All,
Suppose I have a 100 MB Parquet file in HDFS and my HDFS block size is 64 MB,
so the file occupies 2 blocks.
When I call *sqlContext.parquetFile("path")* followed by an action, two
tasks are started on two partitions.
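
To make the numbers concrete, here is a rough sketch of the arithmetic in plain Python (no Spark involved). The helper name num_blocks is only illustrative, and it assumes one task per HDFS block, which is what I observe:

```python
import math

def num_blocks(file_size_mb: float, block_size_mb: float) -> int:
    """Number of HDFS blocks needed to hold a file (the last block may be partial)."""
    if file_size_mb <= 0:
        return 0
    return math.ceil(file_size_mb / block_size_mb)

# 100 MB file with 64 MB blocks: one full block plus one partial 36 MB block.
blocks = num_blocks(100, 64)
print(blocks)
```

So the two tasks line up with the two blocks, and I am looking for a way to go beyond that number.
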
My intent is to read these 2 blocks in more partitions to fully utilize my
cluster resources and increase parallelism.
Is there a way to do so, as there is with
sc.textFile("path", *numberOfPartitions*)?
Please note, I don't want to use *repartition*, as that would result in a lot of
shuffle.
Thanks in advance.
Regards,
Sam
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Parquet-file-increase-read-parallelism-tp22190.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.