Hi Mich,

Thanks for taking the time to look into my query. Yes, when we increase the number of objects, all partitions start receiving data. I was actually trying to understand what happens in my particular case.
Thanks!

On Tue, Mar 16, 2021 at 2:10 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Hi,
>
> Well, as it appears, you have 5 entries in your data and 12 cores. The
> theory is that you run multiple tasks in parallel across multiple cores,
> which applies in your desktop case. There are not enough statistics here
> to give a meaningful interpretation of why Spark decided to put all the
> data in one partition. Note also that if an RDD has too many partitions,
> task scheduling may take more time than the actual execution. In summary,
> you simply do not have enough data to draw a meaningful conclusion.
>
> Try generating 100,000 rows, run your query again, and look at the pattern.
>
> HTH
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
> On Tue, 16 Mar 2021 at 04:35, Renganathan Mutthiah <renganatha...@gmail.com> wrote:
>
>> Hi,
>>
>> I have a question about default partitioning in RDDs.
>>
>> case class Animal(id: Int, name: String)
>>
>> val myRDD = session.sparkContext.parallelize(
>>   Array(
>>     Animal(1, "Lion"),
>>     Animal(2, "Elephant"),
>>     Animal(3, "Jaguar"),
>>     Animal(4, "Tiger"),
>>     Animal(5, "Chetah")
>>   )
>> )
>>
>> Console println myRDD.getNumPartitions
>>
>> I am running the above piece of code on my laptop, which has 12 logical
>> cores, and hence I see that 12 partitions are created.
>>
>> My understanding is that hash partitioning is used to determine which
>> object goes to which partition, so in this case the formula would be
>> hashCode() % 12. But when I examine further, I see that all the objects
>> are put in the last partition:
>>
>> myRDD.foreachPartition( e => { println("----------"); e.foreach(println) } )
>>
>> The code above prints the following (the first eleven partitions are
>> empty and the last one has all the objects; the dashed lines separate
>> the partition contents):
>>
>> ----------
>> ----------
>> ----------
>> ----------
>> ----------
>> ----------
>> ----------
>> ----------
>> ----------
>> ----------
>> ----------
>> ----------
>> Animal(2,Elephant)
>> Animal(4,Tiger)
>> Animal(3,Jaguar)
>> Animal(5,Chetah)
>> Animal(1,Lion)
>>
>> I don't know why this happens. Can you please help?
>>
>> Thanks!
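[Editorial aside] One point worth adding to the thread: `sc.parallelize` on a local collection does not hash-partition at all. It splits the collection by element position, so `hashCode() % 12` is not the formula in play; `HashPartitioner` only applies to pair RDDs during a shuffle. Below is a sketch, outside Spark, of the positional slicing as I read it in Spark's `ParallelCollectionRDD.slice`; the helper name `slicePositions` is my own, and this is my re-implementation, not the library API.

```scala
// Sketch of the positional slicing sc.parallelize uses for a Seq
// (modeled on ParallelCollectionRDD.slice in the Spark source).
// Each partition i receives the half-open index range [start, end);
// an element's partition depends only on its position, never hashCode().
def slicePositions(length: Int, numSlices: Int): Seq[(Int, Int)] =
  (0 until numSlices).map { i =>
    val start = ((i * length.toLong) / numSlices).toInt
    val end   = (((i + 1) * length.toLong) / numSlices).toInt
    (start, end)
  }

// 5 elements over 12 slices: five partitions get one element each,
// the other seven stay empty -- no partition gets all five.
slicePositions(5, 12).zipWithIndex.foreach { case ((s, e), i) =>
  println(s"partition $i -> element indices [$s, $e)")
}
```

Given that, the printed output in the question may be an artifact of concurrent console writes: `foreachPartition` tasks run in parallel in local mode, so the twelve separator lines can all flush before any `Animal` rows do, which looks like one full partition. A more reliable way to inspect placement is `myRDD.mapPartitionsWithIndex((i, it) => Iterator((i, it.toList))).collect()`, which returns the contents per partition instead of interleaved prints.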