Hi Mich,

Thanks for taking the time to look into my query. Yes, when we increase the number of objects, all partitions start receiving data. I was actually trying to understand what happens in my particular case.
Thanks!

On Tue, Mar 16, 2021 at 2:10 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Hi,
>
> Well, as it appears, you have 5 entries in your data and 12 cores. The
> theory is that you run multiple tasks in parallel across multiple cores,
> which applies in your desktop case. There are not enough statistics here
> to give a meaningful interpretation of why Spark decided to put all the
> data in one partition. Note also that if an RDD has too many partitions,
> task scheduling may take more time than the actual execution. In summary,
> you simply do not have enough data to draw a meaningful conclusion.
>
> Try generating 100,000 rows, run your query again, and look at the pattern.
>
> HTH
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
> On Tue, 16 Mar 2021 at 04:35, Renganathan Mutthiah <renganatha...@gmail.com> wrote:
>
>> Hi,
>>
>> I have a question about default partitioning in RDDs.
>>
>> case class Animal(id: Int, name: String)
>>
>> val myRDD = session.sparkContext.parallelize(
>>   Array(
>>     Animal(1, "Lion"),
>>     Animal(2, "Elephant"),
>>     Animal(3, "Jaguar"),
>>     Animal(4, "Tiger"),
>>     Animal(5, "Chetah")
>>   )
>> )
>>
>> Console println myRDD.getNumPartitions
>>
>> I am running the above piece of code on my laptop, which has 12 logical
>> cores, and hence I see that 12 partitions are created.
>>
>> My understanding is that hash partitioning is used to determine which
>> object goes to which partition, so in this case the formula would be
>> hashCode() % 12. But when I examine further, I see that all the objects
>> are put in the last partition:
>>
>> myRDD.foreachPartition( e => { println("----------"); e.foreach(println) } )
>>
>> The code above prints the following (the first eleven partitions are
>> empty and the last one has all the objects; the dashed lines separate
>> the partition contents):
>>
>> ----------
>> ----------
>> ----------
>> ----------
>> ----------
>> ----------
>> ----------
>> ----------
>> ----------
>> ----------
>> ----------
>> ----------
>> Animal(2,Elephant)
>> Animal(4,Tiger)
>> Animal(3,Jaguar)
>> Animal(5,Chetah)
>> Animal(1,Lion)
>>
>> I don't know why this happens. Can you please help?
>>
>> Thanks!
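[Editorial aside] One point worth adding to the thread: `sc.parallelize` on a local collection does not hash-partition at all. It splits the collection by element position, so `hashCode() % 12` is not the formula in play; `HashPartitioner` only applies to pair RDDs during a shuffle. Below is a sketch, outside Spark, of the positional slicing as I read it in Spark's `ParallelCollectionRDD.slice`; the helper name `slicePositions` is my own, and this is my re-implementation, not the library API.

```scala
// Sketch of the positional slicing sc.parallelize uses for a Seq
// (modeled on ParallelCollectionRDD.slice in the Spark source).
// Each partition i receives the half-open index range [start, end);
// an element's partition depends only on its position, never hashCode().
def slicePositions(length: Int, numSlices: Int): Seq[(Int, Int)] =
  (0 until numSlices).map { i =>
    val start = ((i * length.toLong) / numSlices).toInt
    val end   = (((i + 1) * length.toLong) / numSlices).toInt
    (start, end)
  }

// 5 elements over 12 slices: five partitions get one element each,
// the other seven stay empty -- no partition gets all five.
slicePositions(5, 12).zipWithIndex.foreach { case ((s, e), i) =>
  println(s"partition $i -> element indices [$s, $e)")
}
```

Given that, the printed output in the question may be an artifact of concurrent console writes: `foreachPartition` tasks run in parallel in local mode, so the twelve separator lines can all flush before any `Animal` rows do, which looks like one full partition. A more reliable way to inspect placement is `myRDD.mapPartitionsWithIndex((i, it) => Iterator((i, it.toList))).collect()`, which returns the contents per partition instead of interleaved prints.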