Re: How to increase Spark partitions for the DataFrame?

Ted Yu Thu, 08 Oct 2015 14:40:11 -0700

bq. contains 12 files/blocks

Looks like you hit the limit of parallelism these files can provide.


If you had a larger dataset, you would get more partitions.
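
To make the block-count reasoning in this thread concrete, here is a rough sketch in plain Java (no Spark dependency). It is not code from any poster. The class name SplitEstimate and the method estimateSplits are hypothetical, and the model is a deliberate simplification: it assumes each non-empty file yields one input split per HDFS block, and it ignores split-size settings and any format-specific behavior of the ORC reader.

```java
class SplitEstimate {
    /**
     * Simplified estimate: each non-empty file contributes
     * ceil(fileSize / blockSize) splits. Empty files are skipped
     * in this sketch.
     */
    static long estimateSplits(long[] fileSizesBytes, long blockSizeBytes) {
        if (blockSizeBytes <= 0) {
            throw new IllegalArgumentException("blockSizeBytes must be positive");
        }
        long splits = 0;
        for (long size : fileSizesBytes) {
            if (size <= 0) {
                continue; // skipped in this simplified model
            }
            // Integer ceiling division.
            splits += (size + blockSizeBytes - 1) / blockSizeBytes;
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;

        // The case described in the thread: 12 files of 128 MB each,
        // assuming a 128 MB HDFS block size.
        long[] files = new long[12];
        java.util.Arrays.fill(files, 128 * mb);

        System.out.println(estimateSplits(files, 128 * mb));
    }
}
```

Under these assumptions, 12 files that each fit in a single 128 MB block give 12 splits, which matches the 12 partitions reported in the Spark UI.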

On Thu, Oct 8, 2015 at 12:21 PM, Umesh Kacha <[email protected]> wrote:

> Hi Lan, thanks for the reply. I tried the following, but it did not
> increase the number of partitions:
>
> DataFrame df = hiveContext.read().format("orc")
>     .load("/hdfs/path/to/orc/files/").repartition(100);
>
> Yes, I checked in the NameNode UI: the ORC data consists of 12 files/blocks
> of 128 MB each. When decompressed, the ORC files are around 10 GB; the
> compressed size is around 1 GB.
>
> On Fri, Oct 9, 2015 at 12:43 AM, Lan Jiang <[email protected]> wrote:
>
>> Hmm, that’s odd.
>>
>> You can always use repartition(n) to increase the number of partitions, but
>> that incurs a shuffle. How large is your ORC file? Have you used the
>> NameNode UI to check how many HDFS blocks each ORC file has?
>>
>> Lan
>>
>>
>> On Oct 8, 2015, at 2:08 PM, Umesh Kacha <[email protected]> wrote:
>>
>> Hi Lan, thanks for the response. Yes, I know, and I have confirmed in the UI
>> that it has only 12 partitions because of the 12 HDFS blocks. The Hive ORC
>> stripe size is 33554432.
>>
>> On Thu, Oct 8, 2015 at 11:55 PM, Lan Jiang <[email protected]> wrote:
>>
>>> The number of partitions should match the number of HDFS blocks, not the
>>> number of files. Did you confirm in the Spark UI that only 12 partitions
>>> were created? What is your orc.stripe.size?
>>>
>>> Lan
>>>
>>>
>>> > On Oct 8, 2015, at 1:13 PM, unk1102 <[email protected]> wrote:
>>> >
>>> > Hi, I have the following code, which reads ORC files from HDFS. It loads a
>>> > directory that contains 12 ORC files. Since the HDFS directory contains 12
>>> > files, Spark creates 12 partitions by default. The directory is huge: when
>>> > the ORC files are decompressed, they come to around 10 GB. How do I
>>> > increase the number of partitions for the code below so that my Spark job
>>> > runs faster and does not hang for a long time reading 10 GB of data
>>> > through a shuffle in 12 partitions? Please guide.
>>> >
>>> > DataFrame df =
>>> > hiveContext.read().format("orc").load("/hdfs/path/to/orc/files/");
>>> > df.select().groupBy(..)
