bq. contains 12 files/blocks Looks like you hit the limit of parallelism these files can provide.
If you have larger dataset, you would have more partitions. On Thu, Oct 8, 2015 at 12:21 PM, Umesh Kacha <[email protected]> wrote: > Hi Lan thanks for the reply. I have tried to do the following but it did > not increase partition > > DataFrame df = hiveContext.read().format("orc").load("/hdfs/path/to/orc/ > files/").repartition(100); > > Yes I have checked in namenode ui ORC files contains 12 files/blocks of > 128 MB each and ORC files when decompressed its around 10 GB and its > uncompressed file size is around 1 GB > > On Fri, Oct 9, 2015 at 12:43 AM, Lan Jiang <[email protected]> wrote: > >> Hmm, that’s odd. >> >> You can always use repartition(n) to increase the partition number, but >> then there will be shuffle. How large is your ORC file? Have you used >> NameNode UI to check how many HDFS blocks each ORC file has? >> >> Lan >> >> >> On Oct 8, 2015, at 2:08 PM, Umesh Kacha <[email protected]> wrote: >> >> Hi Lan, thanks for the response yes I know and I have confirmed in UI >> that it has only 12 partitions because of 12 HDFS blocks and hive orc file >> strip size is 33554432. >> >> On Thu, Oct 8, 2015 at 11:55 PM, Lan Jiang <[email protected]> wrote: >> >>> The partition number should be the same as the HDFS block number instead >>> of file number. Did you confirmed from the spark UI that only 12 partitions >>> were created? What is your ORC orc.stripe.size? >>> >>> Lan >>> >>> >>> > On Oct 8, 2015, at 1:13 PM, unk1102 <[email protected]> wrote: >>> > >>> > Hi I have the following code where I read ORC files from HDFS and it >>> loads >>> > directory which contains 12 ORC files. Now since HDFS directory >>> contains 12 >>> > files it will create 12 partitions by default. These directory is huge >>> and >>> > when ORC files gets decompressed it becomes around 10 GB how do I >>> increase >>> > partitions for the below code so that my Spark job runs faster and >>> does not >>> > hang for long time because of reading 10 GB files through shuffle in 12 >>> > partitions. Please guide. >>> > >>> > DataFrame df = >>> > hiveContext.read().format("orc").load("/hdfs/path/to/orc/files/"); >>> > df.select().groupby(..) >>> > >>> > >>> > >>> > >>> > -- >>> > View this message in context: >>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-increase-Spark-partitions-for-the-DataFrame-tp24980.html >>> > Sent from the Apache Spark User List mailing list archive at >>> Nabble.com. >>> > >>> > --------------------------------------------------------------------- >>> > To unsubscribe, e-mail: [email protected] >>> > For additional commands, e-mail: [email protected] >>> > >>> >>> >> >> >
