Hi Lan, thanks for the response. Yes, and I have confirmed in the Spark UI that there are only 12 partitions, because of the 12 HDFS blocks; the Hive ORC orc.stripe.size is 33554432 (32 MB).
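For reference, here is roughly what I am trying now. This is only a sketch;
the count of 100 is a guess sized at about 100 MB of decompressed data per
partition for the ~10 GB directory:

    // Read the ORC directory (12 files -> 12 partitions by default).
    DataFrame df =
        hiveContext.read().format("orc").load("/hdfs/path/to/orc/files/");

    // Spread the data over more tasks before the expensive groupBy;
    // 100 partitions is a guess (~100 MB of decompressed data each).
    DataFrame wider = df.repartition(100);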
On Thu, Oct 8, 2015 at 11:55 PM, Lan Jiang <ljia...@gmail.com> wrote:
> The partition number should be the same as the HDFS block number instead
> of the file number. Did you confirm from the Spark UI that only 12
> partitions were created? What is your ORC orc.stripe.size?
>
> Lan
>
> > On Oct 8, 2015, at 1:13 PM, unk1102 <umesh.ka...@gmail.com> wrote:
> >
> > Hi, I have the following code where I read ORC files from HDFS; it
> > loads a directory which contains 12 ORC files. Since the HDFS directory
> > contains 12 files, it will create 12 partitions by default. This
> > directory is huge, and when the ORC files get decompressed it becomes
> > around 10 GB. How do I increase the partitions for the code below so
> > that my Spark job runs faster and does not hang for a long time reading
> > 10 GB of files through a shuffle in 12 partitions? Please guide.
> >
> > DataFrame df =
> > hiveContext.read().format("orc").load("/hdfs/path/to/orc/files/");
> > df.select().groupBy(..)
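P.S. For the archives: besides repartition(), the width of the groupBy
shuffle itself is controlled by spark.sql.shuffle.partitions (default 200).
A sketch assuming the same hiveContext as above; the value 400 is only
illustrative, not a recommendation:

    // Raise the number of post-shuffle partitions used by groupBy/join;
    // the default is 200, and 400 here is only an illustrative value.
    hiveContext.setConf("spark.sql.shuffle.partitions", "400");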