Hi Team,

I am reading data from SQL Server tables through PySpark and storing the
data into S3 in Parquet file format.

Some tables have a lot of data, so the files I get in S3 for those tables
run to several GBs.

I need help with the following: I want to assign 128 MB to each partition.
How can we do this?
Thanks
Deepak
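A minimal sketch of one way to hit a ~128 MB target (shown in Scala; the same
calls exist in PySpark), assuming you can estimate the table size up front.
The JDBC options, the 10 GB size estimate, and the S3 path are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ParquetSizing").getOrCreate()

// Placeholder JDBC read from SQL Server; connection details are illustrative.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://host:1433;databaseName=mydb")
  .option("dbtable", "dbo.my_table")
  .option("user", "user")
  .option("password", "password")
  .load()

// Aim for roughly 128 MB per output partition by deriving the partition
// count from an estimated input size (Spark cannot know the final Parquet
// size up front).
val estimatedSizeBytes = 10L * 1024 * 1024 * 1024      // assume ~10 GB of data
val targetPartitionBytes = 128L * 1024 * 1024
val numPartitions = math.max(1, (estimatedSizeBytes / targetPartitionBytes).toInt)

df.repartition(numPartitions)
  .write
  .mode("overwrite")
  .parquet("s3a://my-bucket/parquet/my_table/")        // placeholder bucket/path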
On Thu, Oct 26, 2017 at 10:05 PM, Noorul Islam Kamal Malmiyoda <
noo...@noorul.com> wrote:

> Hi all,
>
> I have the following spark configuration
>
> spark.app.name=Test
> spark.cassandra.connection.host=127.0.0.1
> spark.cassandra.connection.keep_alive_ms=5000
> spark.cassandra.connection.port=1
> spark.cassandra.connection.timeout_ms=30000
> spark.cleaner.ttl=3600
> spark.default.parallelism=4
> spark.master=local[2]
> spark.ui.enabled=false
> spark.ui.showConsoleProgress=false
>
> Because I am setting spark.default.parallelism to 4, I was expecting
> only 4 spark partitions. But it looks like it is not the case.
>
> When I do the following
>
> df.foreachPartition { partition =>
>   val groupedPartition = ...
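For DataFrame operations the number of partitions after a shuffle comes from
spark.sql.shuffle.partitions (200 by default), not spark.default.parallelism,
and the partition count of the initial read is decided by the data source
(for the Cassandra connector, its split sizing). A small sketch for checking
this, assuming the spark-cassandra-connector is on the classpath; keyspace,
table, and column names are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Test")
  .master("local[2]")
  .config("spark.default.parallelism", "4")
  .config("spark.sql.shuffle.partitions", "4")   // governs DataFrame shuffles
  .getOrCreate()

// Read-side partition count is decided by the connector, not by parallelism.
val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "test_ks", "table" -> "test_table"))
  .load()

println(s"partitions after read:    ${df.rdd.getNumPartitions}")
println(s"partitions after groupBy: ${df.groupBy("some_column").count().rdd.getNumPartitions}")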
On Fri, Oct 13, 2017 at 3:45 AM, Tushar Adeshara <
tushar_adesh...@persistent.com> wrote:

> You can also try coalesce as it will avoid full shuffle.
>
> Regards,
>
> Tushar Adeshara
> Technical Specialist – Analytics Practice
> Cell: +91-81490 04192
> Persistent Systems Ltd. | Partners in Innovation | www.persistentsys.com
>
> From: KhajaAsmath Mohammed
> Sent: 13 October 2017 09:35
> To: user @spark
> Subject: Spark - Partitions
Use repartition

On 13-Oct-2017 9:35 AM, "KhajaAsmath Mohammed" wrote:
> Hi,
>
> I am reading a hive query and writing the data back into hive after doing
> some transformations.
>
> I have changed the setting spark.sql.shuffle.partitions to 2000 and since
> then the job completes fast, but the main problem is that I am getting
> 2000 files for each partition, and the size of each file is 10 MB.
>
> Is there a way ...
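The repartition/coalesce advice above works by reducing the number of tasks
that write output. A minimal sketch, assuming a Hive-enabled SparkSession;
the database, table, query, and the factor of 20 are placeholders chosen
only for illustration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("HiveWriteFewerFiles")
  .config("spark.sql.shuffle.partitions", "2000")  // keep wide shuffles for the heavy work
  .enableHiveSupport()
  .getOrCreate()

val transformed = spark.sql("SELECT * FROM source_db.source_table")  // placeholder transformations

// coalesce (no full shuffle) caps the number of write tasks, so each task
// writes far fewer, larger files than the 2000 produced by the shuffle.
transformed
  .coalesce(20)
  .write
  .mode("overwrite")
  .saveAsTable("target_db.target_table")           // placeholder target table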
... when it is decompressed it becomes around 10 GB. How do I increase
partitions for the below code so that my Spark job runs faster and does not
hang for a long time because of reading 10 GB files through shuffle in 12
partitions? Please guide.

DataFrame df =
hiveContext.read().format("orc").load("/hdfs/path/to/orc/files/");
df.select().groupby(..)

--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-increase-Spark-partitions-for-the-DataFrame-tp24980.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
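One way to spread that 10 GB over more tasks is to repartition right after
the read. A sketch against the Spark 1.x HiveContext API used above; the
path, the count of 200, and the grouping column are placeholders:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("IncreasePartitions"))
val hiveContext = new HiveContext(sc)

val df = hiveContext.read.format("orc").load("/hdfs/path/to/orc/files/")

// Spread the ~10 GB across more tasks before the shuffle-heavy aggregation.
val repartitioned = df.repartition(200)

val aggregated = repartitioned.groupBy("some_column").count()
aggregated.show()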
---
September 24, 2015 at 2:43 AM
To: Anfernee Xu
Cc: "user@spark.apache.org"
Subject: Re: Custom Hadoop InputSplit, Spark partitions, spark
executors/task and Yarn containers

Hi Anfernee,

That's correct that each InputSplit will map to exactly one Spark partition.

On YARN, each Spark executor maps to a single YARN container. Each
executor can run multiple tasks over its lifetime, both in parallel and
sequentially.

If you enable dynamic allocation, after the stage including ...
Hi Spark experts,

I'm coming across these terminologies and have some confusion; could you
please help me understand them better?

For instance, I have implemented a Hadoop InputFormat to load my external
data in Spark; in turn my custom InputFormat will create a bunch of
InputSplits. My question is ...
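As the reply above says, each InputSplit becomes one Spark partition (and
one task). A small sketch of how to see that, using the stock
TextInputFormat as a stand-in for a custom InputFormat; the path is a
placeholder:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("SplitsToPartitions"))

// One InputSplit from the InputFormat -> one partition -> one task.
val rdd = sc.newAPIHadoopFile(
  "/hdfs/path/to/input",            // placeholder path
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text])

println(s"InputSplits / partitions: ${rdd.partitions.length}")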
Oh, if that is the case then you can try tuning
"spark.cassandra.input.split.size":

spark.cassandra.input.split.size    approx number of Cassandra partitions in a Spark partition    10

Hope this helps.

Thanks
Ankur
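A sketch of where that setting goes, assuming the spark-cassandra-connector
is available; the keyspace/table names and the value 1000 are illustrative
only:

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("CassandraSplitTuning")
  .set("spark.cassandra.connection.host", "127.0.0.1")
  // Fewer Cassandra partitions per Spark partition => more Spark partitions.
  .set("spark.cassandra.input.split.size", "1000")

val sc = new SparkContext(conf)

val rdd = sc.cassandraTable("my_keyspace", "my_table")   // placeholder names
println(s"Spark partitions: ${rdd.partitions.length}")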
On Thu, Sep 3, 2015 at 12:22 PM, Alaa Zubaidi (PDF)
wrote:
Thanks Ankur,
But I grabbed some keys from the Spark results and ran "nodetool -h <host>
getendpoints <keyspace> <table> <key>", and it showed the data is coming
from at least 2 nodes?
Regards,
Alaa
On Thu, Sep 3, 2015 at 12:06 PM, Ankur Srivastava <
ankur.srivast...@gmail.com> wrote:
Hi Alaa,

Partitioning when using CassandraRDD depends on your partition key in the
Cassandra table.

If you see only 1 partition in the RDD, it means all the rows you have
selected have the same partition_key in C*.

Thanks
Ankur

On Thu, Sep 3, 2015 at 11:54 AM, Alaa Zubaidi (PDF) wrote:
Hi,

I am testing Spark and Cassandra: Spark 1.4, Cassandra 2.1.7, Cassandra
Spark connector 1.4, running in standalone mode.

I am getting 4000 rows from Cassandra (4 MB row), where the row keys are
random.

.. sc.cassandraTable[RES](keyspace,res_name).where(res_where).cache

I am expecting that it ...
I'm not aware of any such mechanism.
On Mon, Nov 17, 2014 at 2:55 PM, Pala M Muthaia wrote:
Hi Daniel,

Yes, that should work also. However, is it possible to set it up so that each
RDD has exactly one partition, without repartitioning (and thus incurring
extra cost)? Is there a mechanism similar to MR where we can ensure each
partition is assigned some amount of data by size, by setting some ...
On Thu, Nov 13, 2014 at 3:24 PM, Pala M Muthaia wrote:
Thanks for the responses, Daniel and Rishi.

No, I don't want separate RDDs because each of these partitions is being
processed the same way (in my case, each partition corresponds to HBase
keys belonging to one region server, and I will do HBase lookups). After
that I have aggregations too, hence all ...
I believe Rishi is correct. I wouldn't rely on that though - all it would
take is for one file to exceed the block size and you'd be setting yourself
up for pain. Also, if your files are small - small enough to fit in a
single record - you could use SparkContext.wholeTextFiles.
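A quick sketch of the wholeTextFiles route, assuming each file is small
enough to be held as a single record; the path and the per-file line count
are only for illustration:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("WholeFiles"))

// Each file becomes a single (path, content) record, so per-file work
// never straddles file boundaries.
val filesRdd = sc.wholeTextFiles("/hdfs/path/to/small/files/")

val linesPerFile = filesRdd.map { case (path, content) =>
  (path, content.split("\n").length)
}
linesPerFile.collect().foreach(println)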
If your data is in HDFS and you are reading it with textFile and each file
is less than the block size, my understanding is it would always have one
partition per file.

On Thursday, November 13, 2014, Daniel Siegmann wrote:
Would it make sense to read each file in as a separate RDD? This way you
would be guaranteed the data is partitioned as you expected.
Possibly you could then repartition each of those RDDs into a single
partition and then union them. I think that would achieve what you expect.
But it would be easy ...
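A sketch of that approach (not code from the thread): read each file as its
own RDD, force it into a single partition, then union them so partition i
corresponds to file i; the paths are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("OnePartitionPerFile"))

val files = Seq(
  "/hdfs/data/file-001.txt",
  "/hdfs/data/file-002.txt",
  "/hdfs/data/file-003.txt")

val combined = files
  .map(path => sc.textFile(path).coalesce(1))  // one RDD per file, forced to one partition
  .reduce(_ union _)                           // union keeps the partitions side by side

println(s"partitions: ${combined.getNumPartitions}")  // equals files.size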
Hi,

I have a set of input files for a Spark program, with each file
corresponding to a logical data partition. What is the API/mechanism to
assign each input file (or a set of files) to a Spark partition when
initializing RDDs?

When I create a Spark RDD pointing to the directory of files, my
understanding ...