Hi Lan, thanks for the response. Yes, and I have confirmed in the Spark UI that there are only 12 partitions, because of the 12 HDFS blocks; the Hive ORC orc.stripe.size is 33554432 (32 MB).
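For reference, here is roughly what I am trying now. This is only a sketch;
the count of 100 is a guess sized at about 100 MB of decompressed data per
partition for the ~10 GB directory:

    // Read the ORC directory (12 files -> 12 partitions by default).
    DataFrame df =
        hiveContext.read().format("orc").load("/hdfs/path/to/orc/files/");

    // Spread the data over more tasks before the expensive groupBy;
    // 100 partitions is a guess (~100 MB of decompressed data each).
    DataFrame wider = df.repartition(100);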
On Thu, Oct 8, 2015 at 11:55 PM, Lan Jiang <ljia...@gmail.com> wrote:
> The partition number should be the same as the HDFS block number instead
> of the file number. Did you confirm from the Spark UI that only 12
> partitions were created? What is your ORC orc.stripe.size?
>
> Lan
>
> > On Oct 8, 2015, at 1:13 PM, unk1102 <umesh.ka...@gmail.com> wrote:
> >
> > Hi, I have the following code where I read ORC files from HDFS; it
> > loads a directory which contains 12 ORC files. Since the HDFS directory
> > contains 12 files, it will create 12 partitions by default. This
> > directory is huge, and when the ORC files get decompressed it becomes
> > around 10 GB. How do I increase the partitions for the code below so
> > that my Spark job runs faster and does not hang for a long time reading
> > 10 GB of files through a shuffle in 12 partitions? Please guide.
> >
> > DataFrame df =
> > hiveContext.read().format("orc").load("/hdfs/path/to/orc/files/");
> > df.select().groupBy(..)
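P.S. For the archives: besides repartition(), the width of the groupBy
shuffle itself is controlled by spark.sql.shuffle.partitions (default 200).
A sketch assuming the same hiveContext as above; the value 400 is only
illustrative, not a recommendation:

    // Raise the number of post-shuffle partitions used by groupBy/join;
    // the default is 200, and 400 here is only an illustrative value.
    hiveContext.setConf("spark.sql.shuffle.partitions", "400");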