Re: How default partitioning in spark is deployed

2021-03-16 Thread Attila Zsolt Piros
Oh, sure that was the reason. You can keep using the `foreachPartition` and get the partition ID from the `TaskContext`: scala> import org.apache.spark.TaskContext import org.apache.spark.TaskContext scala> myRDD.foreachPartition( e => { println(TaskContext.getPartitionId + ":" + e.mkString(",")
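A minimal sketch of that suggestion in full, assuming the same `myRDD` from the original question is in scope:

    import org.apache.spark.TaskContext

    // Print each partition's ID together with the records it holds. Partitions
    // are processed by parallel tasks, so the output order is non-deterministic.
    myRDD.foreachPartition { iter =>
      println(TaskContext.getPartitionId() + ": " + iter.mkString(","))
    }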

Re: How default partitioning in spark is deployed

2021-03-16 Thread Renganathan Mutthiah
That's a very good idea, thanks for sharing, German! On Tue, Mar 16, 2021 at 7:08 PM German Schiavon wrote: > Hi all, > > I guess you could do something like this too: > > [image: Captura de pantalla 2021-03-16 a las 14.35.46.png] > > On Tue, 16 Mar 2021 at 13:31, Renganathan Mutthiah < > rengana

Re: How default partitioning in spark is deployed

2021-03-16 Thread German Schiavon
Hi all, I guess you could do something like this too: [image: Captura de pantalla 2021-03-16 a las 14.35.46.png] On Tue, 16 Mar 2021 at 13:31, Renganathan Mutthiah wrote: > Hi Attila, > > Thanks for looking into this! > > I actually found the issue and it turned out to be that the print > stat

Re: How default partitioning in spark is deployed

2021-03-16 Thread Renganathan Mutthiah
Hi Attila, Thanks for looking into this! I actually found the issue, and it turned out that the print statements misled me. The records are indeed stored in different partitions. What happened is that since the foreachPartition method is run in parallel by different threads, they all printed the f

Re: How default partitioning in spark is deployed

2021-03-16 Thread Attila Zsolt Piros
Hi! This is weird. The code of foreachPartition leads to ParallelCollectionRDD

Re: How default partitioning in spark is deployed

2021-03-16 Thread Renganathan Mutthiah
Hi Mich, Thanks for taking the time to look into my query. Yes, when we increase the number of objects, all partitions start receiving data. I was actually trying to understand what happens in my particular case. Thanks! On Tue, Mar 16, 2021 at 2:10 PM Mich Talebzadeh wrote: > Hi, > > Well as i

Re: How default partitioning in spark is deployed

2021-03-16 Thread Mich Talebzadeh
Hi, Well, as it appears you have 5 entries in your data and 12 cores. The theory is that you run multiple tasks in parallel across multiple cores on a desktop, which applies to your case. The statistics are not there to give a meaningful interpretation of why Spark decided to put all data in one partiti

How default partitioning in spark is deployed

2021-03-15 Thread Renganathan Mutthiah
Hi, I have a question with respect to default partitioning in RDD. *case class Animal(id:Int, name:String) val myRDD = session.sparkContext.parallelize( (Array( Animal(1, "Lion"), Animal(2,"Elephant"), Animal(3,"Jaguar"), Animal(4,"Tiger"), Animal(5, "Chetah") ) ))Console println myRDD.getNu
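A sketch of the setup being asked about, with an added check of how the records land in partitions (the 12-partition figure is an assumption based on the 12 cores mentioned later in the thread; `session` is the SparkSession from the snippet above):

    case class Animal(id: Int, name: String)

    val myRDD = session.sparkContext.parallelize(Array(
      Animal(1, "Lion"), Animal(2, "Elephant"), Animal(3, "Jaguar"),
      Animal(4, "Tiger"), Animal(5, "Chetah")))

    // With no explicit numSlices, parallelize uses spark.default.parallelism,
    // which on a local master defaults to the number of cores (e.g. 12).
    println(myRDD.getNumPartitions)

    // Per-partition record counts: with 5 records spread over 12 partitions,
    // at least 7 partitions are empty.
    println(myRDD.glom().map(_.length).collect().mkString(","))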

Partitioning in spark while reading from RDBMS via JDBC

2017-03-31 Thread Devender Yadav
Hi All, I am running Spark in cluster mode and reading data from an RDBMS via JDBC. As per the Spark docs, these partitioning parameters describe how to partition the table when reading in parallel from multiple
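For reference, a sketch of how those parameters fit together (the URL, table, and column names are placeholders, not taken from the original mail; `spark` is assumed to be a SparkSession):

    // Spark issues numPartitions parallel JDBC queries, each covering a slice
    // of [lowerBound, upperBound] on partitionColumn. The bounds only set the
    // stride; rows outside the range still end up in the first/last partition.
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/mydb")  // placeholder URL
      .option("dbtable", "my_table")                        // placeholder table
      .option("partitionColumn", "id")                      // a numeric column
      .option("lowerBound", "1")
      .option("upperBound", "1000000")
      .option("numPartitions", "8")
      .load()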

Re: Partitioning in spark

2016-06-24 Thread Darshan Singh
Thanks, but the whole point is not to set it explicitly; it should be derived from its parent RDDs. Thanks On Fri, Jun 24, 2016 at 6:09 AM, ayan guha wrote: > You can change parallelism like the following: > > conf = SparkConf() > conf.set('spark.sql.shuffle.partitions',10) > sc = SparkContext(co

Re: Partitioning in spark

2016-06-23 Thread ayan guha
You can change parallelism like the following: conf = SparkConf() conf.set('spark.sql.shuffle.partitions', 10) sc = SparkContext(conf=conf) On Fri, Jun 24, 2016 at 6:46 AM, Darshan Singh wrote: > Hi, > > My default parallelism is 100. Now I join 2 dataframes with 20 partitions > each, joined dataf
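The reply above uses the PySpark API; a rough Scala equivalent for the Spark 1.6-era API used in this thread (the app name is a placeholder) would be:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // spark.sql.shuffle.partitions controls how many partitions DataFrame
    // joins and aggregations produce (default 200).
    val conf = new SparkConf().setAppName("shuffle-partitions-demo")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    sqlContext.setConf("spark.sql.shuffle.partitions", "20")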

Partitioning in spark

2016-06-23 Thread Darshan Singh
Hi, My default parallelism is 100. Now I join 2 dataframes with 20 partitions each, and the joined dataframe has 100 partitions. I want to know the way to keep it at 20 (other than repartition and coalesce). Also, when I join these 2 dataframes I am using 4 columns as the join columns. The dataframes a

Re: Question about data frame partitioning in Spark 1.3.0

2015-10-14 Thread Michael Armbrust
Caching the partitioned_df <- this one, but you have to do the partitioning using something like sql("SELECT * FROM ... CLUSTER BY a"), as there is no such operation exposed on DataFrames. 2) Here is the JIRA: https://issues.apache.org/jira/browse/SPARK-5354
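A sketch of that suggestion against the 1.3-era API (the table and column names are placeholders; the variable names follow the thread):

    // Repartition by the join/group-by key with CLUSTER BY, then cache the
    // result so later queries reuse the shuffled layout instead of re-reading
    // shuffle blocks.
    unpartitioned_df.registerTempTable("events")             // placeholder table name
    val partitioned_df = sqlContext.sql("SELECT * FROM events CLUSTER BY user_id")
    partitioned_df.cache()
    partitioned_df.count()                                   // materialize the cache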

Re: Question about data frame partitioning in Spark 1.3.0

2015-10-14 Thread Cesar Flores
Thanks, Michael, for your input. By 1) do you mean: - Caching the partitioned_rdd - Caching the partitioned_df - *Or* just caching unpartitioned_df without the need to create the partitioned_rdd variable? Can you expand a little bit more on 2)? Thanks! On Wed, Oct 14, 2015 at 12:11

Re: Question about data frame partitioning in Spark 1.3.0

2015-10-14 Thread Michael Armbrust
This won't help, for two reasons: 1) It's all still just creating lineage since you aren't caching the partitioned data. It will still fetch the shuffled blocks for each query. 2) The query optimizer is not aware of RDD-level partitioning since it's mostly a black box. 1) could be fixed by addin

Question about data frame partitioning in Spark 1.3.0

2015-10-14 Thread Cesar Flores
My current version of Spark is 1.3.0 and my question is the following: I have large data frames where the main field is a user id. I need to do many group-bys and joins using that field. Will the performance increase if, before doing any group by or join operation, I first convert to an RDD to partitio

Re: Partitioning in spark streaming

2015-08-12 Thread Tathagata Das
r for the blocks >>>> created during the batchInterval. The blocks generated during the >>>> batchInterval are partitions of the RDD. >>>> >>>> Now if you want to repartition based on a key, a shuffle is needed. >>>> >>>> On Wed, Aug 12, 2015 at 4:36 AM, Mohit Anchlia >>>> wrote: >>>> >>>>> How does partitioning in spark work when it comes to streaming? What's >>>>> the best way to partition a time series data grouped by a certain tag like >>>>> categories of product video, music etc. >>>>> >>>> >>>> >>> >> >

Re: Partitioning in spark streaming

2015-08-11 Thread Mohit Anchlia
generated during the > batchInterval are partitions of the RDD. > > Now if you want to repartition based on a key, a shuffle is needed. > > On Wed, Aug 12, 2015 at 4:36 AM, Mohit Anchlia > wrote: > >> How does partitioning in spark work when it comes to streaming? What's &g

Re: Partitioning in spark streaming

2015-08-11 Thread Hemant Bhanawat
created on the driver for the blocks created during the batchInterval. The blocks generated during the batchInterval are partitions of the RDD. Now if you want to repartition based on a key, a shuffle is needed. On Wed, Aug 12, 2015 at 4:36 AM, Mohit Anchlia wrote: > How does partitioning in sp
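A minimal sketch of the key-based repartitioning described above (the record type, the tag field, and the partition count are illustrative assumptions, not from the original thread):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.streaming.dstream.DStream

    // Hypothetical record type for illustration.
    case class Event(category: String, payload: String)

    // Key each record by its tag, then shuffle so records with the same tag
    // land in the same partition of every batch's RDD.
    def partitionByTag(events: DStream[Event]): DStream[(String, Event)] =
      events
        .map(e => (e.category, e))
        .transform(_.partitionBy(new HashPartitioner(8)))   // 8 is an arbitrary choice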

Re: Partitioning in spark streaming

2015-08-11 Thread Mohit Anchlia
partition too much. > > Best > Ayan > > On Wed, Aug 12, 2015 at 9:06 AM, Mohit Anchlia > wrote: > >> How does partitioning in spark work when it comes to streaming? What's >> the best way to partition a time series data grouped by a certain tag like >> cate

Re: Partitioning in spark streaming

2015-08-11 Thread ayan guha
partition scheme should not skew data into one partition too much. Best Ayan On Wed, Aug 12, 2015 at 9:06 AM, Mohit Anchlia wrote: > How does partitioning in spark work when it comes to streaming? What's the > best way to partition a time series data grouped by a certain tag like >

Partitioning in spark streaming

2015-08-11 Thread Mohit Anchlia
How does partitioning in Spark work when it comes to streaming? What's the best way to partition time series data grouped by a certain tag, such as product categories (video, music, etc.)?