Re: How default partitioning in spark is deployed

2021-03-16 Thread Attila Zsolt Piros
Oh, sure, that was the reason. You can keep using `foreachPartition` and get the partition ID from the `TaskContext`:

scala> import org.apache.spark.TaskContext
scala> myRDD.foreachPartition( e => { println(TaskContext.getPartitionId + ":" + e.mkString(",")) })
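Put together, the suggestion looks like this (a sketch, not verbatim from the thread; `myRDD` is the RDD of `Animal` records from the original post at the bottom of this thread):

```scala
import org.apache.spark.TaskContext

// Each task prints its own partition ID followed by the records it
// holds. The output order is non-deterministic because partitions
// are processed by parallel tasks.
myRDD.foreachPartition { iter =>
  println(TaskContext.getPartitionId() + ": " + iter.mkString(","))
}
```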

Re: How default partitioning in spark is deployed

2021-03-16 Thread Renganathan Mutthiah
That's a very good idea, thanks for sharing, German!

Re: How default partitioning in spark is deployed

2021-03-16 Thread German Schiavon
Hi all, I guess you could do something like this too: [image: Captura de pantalla 2021-03-16 a las 14.35.46.png]

Re: How default partitioning in spark is deployed

2021-03-16 Thread Renganathan Mutthiah
Hi Attila, Thanks for looking into this! I actually found the issue, and it turned out that the print statements had misled me. The records are indeed stored in different partitions. What happened is that since the `foreachPartition` method is run in parallel by different threads, they all printed the f…
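One way to check the distribution without racing println calls (a sketch, not from the thread; it assumes the `myRDD` from the original post) is to collect each partition to the driver with `glom`:

```scala
// glom() wraps every partition in an Array, so collect() yields one
// array per partition; printing on the driver keeps the output ordered
// instead of interleaving executor-side println output.
myRDD.glom().collect().zipWithIndex.foreach { case (records, i) =>
  println(s"partition $i: ${records.mkString(",")}")
}
```

Collecting is only sensible here because the example data is tiny.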

Re: How default partitioning in spark is deployed

2021-03-16 Thread Attila Zsolt Piros
Hi! This is weird. The code of `foreachPartition` leads to `ParallelCollectionRDD`

Re: How default partitioning in spark is deployed

2021-03-16 Thread Renganathan Mutthiah
Hi Mich, Thanks for taking the time to look into my query. Yes, when we increase the number of objects, all partitions start receiving data. I was actually trying to understand what happens in my particular case. Thanks!

Re: How default partitioning in spark is deployed

2021-03-16 Thread Mich Talebzadeh
Hi, Well, it appears you have 5 entries in your data and 12 cores. The theory is that you run multiple tasks in parallel across multiple cores on a desktop, which applies to your case. There is not enough data for statistics to give a meaningful interpretation of why Spark decided to put all the data in one partiti…
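A minimal sketch of the numbers involved here (assuming a `local[12]` master, which is a guess based on the 12 cores mentioned; `sc` is the SparkContext):

```scala
// With local[12], the default parallelism is 12, so parallelize()
// splits the 5-element array into 12 slices, most of them empty.
println(sc.defaultParallelism)                  // 12

// Passing an explicit slice count overrides the default:
val rdd = sc.parallelize(1 to 5, numSlices = 2)
println(rdd.getNumPartitions)                   // 2
```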

How default partitioning in spark is deployed

2021-03-15 Thread Renganathan Mutthiah
Hi, I have a question with respect to default partitioning in RDDs.

case class Animal(id: Int, name: String)
val myRDD = session.sparkContext.parallelize(Array(
  Animal(1, "Lion"), Animal(2, "Elephant"), Animal(3, "Jaguar"),
  Animal(4, "Tiger"), Animal(5, "Chetah")))
Console println myRDD.getNumPartitions
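For completeness, a runnable version of that setup (a sketch; the `local[12]` master and app name are assumptions, chosen to match the 12 cores discussed above):

```scala
import org.apache.spark.sql.SparkSession

object PartitionDemo {
  case class Animal(id: Int, name: String)

  def main(args: Array[String]): Unit = {
    val session = SparkSession.builder()
      .master("local[12]")   // mirrors the 12 cores mentioned in the thread
      .appName("partition-demo")
      .getOrCreate()

    val myRDD = session.sparkContext.parallelize(Array(
      Animal(1, "Lion"), Animal(2, "Elephant"), Animal(3, "Jaguar"),
      Animal(4, "Tiger"), Animal(5, "Chetah")))

    // 5 records spread over defaultParallelism (12) slices,
    // so several partitions end up empty.
    println(myRDD.getNumPartitions)

    session.stop()
  }
}
```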