Hi Team,

I am reading data from SQL Server tables through PySpark and storing the
data into S3 in Parquet file format.

Some tables have a lot of data, so the files I get in S3 for those tables
run to several GBs.

I need help with the following: I want to assign 128 MB to each partition.
How can we do this?
Thanks
Deepak
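A minimal sketch of one way to hit a ~128 MB target (shown in Scala; the same
calls exist in PySpark), assuming you can estimate the table size up front.
The JDBC options, the 10 GB size estimate, and the S3 path are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ParquetSizing").getOrCreate()

// Placeholder JDBC read from SQL Server; connection details are illustrative.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://host:1433;databaseName=mydb")
  .option("dbtable", "dbo.my_table")
  .option("user", "user")
  .option("password", "password")
  .load()

// Aim for roughly 128 MB per output partition by deriving the partition
// count from an estimated input size (Spark cannot know the final Parquet
// size up front).
val estimatedSizeBytes = 10L * 1024 * 1024 * 1024      // assume ~10 GB of data
val targetPartitionBytes = 128L * 1024 * 1024
val numPartitions = math.max(1, (estimatedSizeBytes / targetPartitionBytes).toInt)

df.repartition(numPartitions)
  .write
  .mode("overwrite")
  .parquet("s3a://my-bucket/parquet/my_table/")        // placeholder bucket/path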
On Thu, Oct 26, 2017 at 10:05 PM, Noorul Islam Kamal Malmiyoda <
noo...@noorul.com> wrote:

> Hi all,
>
> I have the following spark configuration
>
> spark.app.name=Test
> spark.cassandra.connection.host=127.0.0.1
> spark.cassandra.connection.keep_alive_ms=5000
> spark.cassandra.connection.port=1
> spark.cassandra.connection.timeout_ms=30000
> spark.cleaner.ttl=3600
> spark.default.parallelism=4
> spark.master=local[2]
> spark.ui.enabled=false
> spark.ui.showConsoleProgress=false
>
> Because I am setting spark.default.parallelism to 4, I was expecting
> only 4 spark partitions. But it looks like it is not the case.
>
> When I do the following
>
> df.foreachPartition { partition =>
>   val groupedPartition = ...
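For DataFrame operations the number of partitions after a shuffle comes from
spark.sql.shuffle.partitions (200 by default), not spark.default.parallelism,
and the partition count of the initial read is decided by the data source
(for the Cassandra connector, its split sizing). A small sketch for checking
this, assuming the spark-cassandra-connector is on the classpath; keyspace,
table, and column names are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Test")
  .master("local[2]")
  .config("spark.default.parallelism", "4")
  .config("spark.sql.shuffle.partitions", "4")   // governs DataFrame shuffles
  .getOrCreate()

// Read-side partition count is decided by the connector, not by parallelism.
val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "test_ks", "table" -> "test_table"))
  .load()

println(s"partitions after read:    ${df.rdd.getNumPartitions}")
println(s"partitions after groupBy: ${df.groupBy("some_column").count().rdd.getNumPartitions}")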
On Fri, Oct 13, 2017 at 3:45 AM, Tushar Adeshara <
tushar_adesh...@persistent.com> wrote:

> You can also try coalesce as it will avoid full shuffle.
>
> Regards,
>
> Tushar Adeshara
> Technical Specialist – Analytics Practice
> Cell: +91-81490 04192
> Persistent Systems Ltd. | Partners in Innovation | www.persistentsys.com
>
> From: KhajaAsmath Mohammed
> Sent: 13 October 2017 09:35
> To: user @spark
> Subject: Spark - Partitions
Use repartition

On 13-Oct-2017 9:35 AM, "KhajaAsmath Mohammed" wrote:
> Hi,
>
> I am reading a hive query and writing the data back into hive after doing
> some transformations.
>
> I have changed the setting spark.sql.shuffle.partitions to 2000 and since
> then the job completes fast, but the main problem is that I am getting
> 2000 files for each partition, and the size of each file is 10 MB.
>
> Is there a way ...
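The repartition/coalesce advice above works by reducing the number of tasks
that write output. A minimal sketch, assuming a Hive-enabled SparkSession;
the database, table, query, and the factor of 20 are placeholders chosen
only for illustration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("HiveWriteFewerFiles")
  .config("spark.sql.shuffle.partitions", "2000")  // keep wide shuffles for the heavy work
  .enableHiveSupport()
  .getOrCreate()

val transformed = spark.sql("SELECT * FROM source_db.source_table")  // placeholder transformations

// coalesce (no full shuffle) caps the number of write tasks, so each task
// writes far fewer, larger files than the 2000 produced by the shuffle.
transformed
  .coalesce(20)
  .write
  .mode("overwrite")
  .saveAsTable("target_db.target_table")           // placeholder target table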
... when it is decompressed it becomes around 10 GB. How do I increase
partitions for the below code so that my Spark job runs faster and does not
hang for a long time because of reading 10 GB files through shuffle in 12
partitions? Please guide.

DataFrame df =
hiveContext.read().format("orc").load("/hdfs/path/to/orc/files/");
df.select().groupby(..)

--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-increase-Spark-partitions-for-the-DataFrame-tp24980.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
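One way to spread that 10 GB over more tasks is to repartition right after
the read. A sketch against the Spark 1.x HiveContext API used above; the
path, the count of 200, and the grouping column are placeholders:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("IncreasePartitions"))
val hiveContext = new HiveContext(sc)

val df = hiveContext.read.format("orc").load("/hdfs/path/to/orc/files/")

// Spread the ~10 GB across more tasks before the shuffle-heavy aggregation.
val repartitioned = df.repartition(200)

val aggregated = repartitioned.groupBy("some_column").count()
aggregated.show()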
---
September 24, 2015 at 2:43 AM
To: Anfernee Xu
Cc: "user@spark.apache.org"
Subject: Re: Custom Hadoop InputSplit, Spark partitions, spark
executors/task and Yarn containers

Hi Anfernee,

That's correct that each InputSplit will map to exactly one Spark partition.

On YARN, each Spark executor maps to a single YARN container. Each
executor can run multiple tasks over its lifetime, both in parallel and
sequentially.

If you enable dynamic allocation, after the stage including ...
Hi Spark experts,

I'm coming across these terminologies and have some confusion; could you
please help me understand them better?

For instance, I have implemented a Hadoop InputFormat to load my external
data in Spark; in turn my custom InputFormat will create a bunch of
InputSplits. My question is ...
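As the reply above says, each InputSplit becomes one Spark partition (and
one task). A small sketch of how to see that, using the stock
TextInputFormat as a stand-in for a custom InputFormat; the path is a
placeholder:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("SplitsToPartitions"))

// One InputSplit from the InputFormat -> one partition -> one task.
val rdd = sc.newAPIHadoopFile(
  "/hdfs/path/to/input",            // placeholder path
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text])

println(s"InputSplits / partitions: ${rdd.partitions.length}")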
Oh, if that is the case then you can try tuning
"spark.cassandra.input.split.size":

spark.cassandra.input.split.size    approx number of Cassandra partitions in a Spark partition    10

Hope this helps.

Thanks
Ankur
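A sketch of where that setting goes, assuming the spark-cassandra-connector
is available; the keyspace/table names and the value 1000 are illustrative
only:

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("CassandraSplitTuning")
  .set("spark.cassandra.connection.host", "127.0.0.1")
  // Fewer Cassandra partitions per Spark partition => more Spark partitions.
  .set("spark.cassandra.input.split.size", "1000")

val sc = new SparkContext(conf)

val rdd = sc.cassandraTable("my_keyspace", "my_table")   // placeholder names
println(s"Spark partitions: ${rdd.partitions.length}")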
On Thu, Sep 3, 2015 at 12:22 PM, Alaa Zubaidi (PDF)
wrote:
Thanks Ankur,
But I grabbed some keys from the Spark results and ran "nodetool -h <host>
getendpoints <keyspace> <table> <key>", and it showed the data is coming
from at least 2 nodes?
Regards,
Alaa
On Thu, Sep 3, 2015 at 12:06 PM, Ankur Srivastava <
ankur.srivast...@gmail.com> wrote:
Hi Alaa,

Partitioning when using CassandraRDD depends on your partition key in the
Cassandra table.

If you see only 1 partition in the RDD, it means all the rows you have
selected have the same partition_key in C*.

Thanks
Ankur

On Thu, Sep 3, 2015 at 11:54 AM, Alaa Zubaidi (PDF) wrote:
Hi,

I am testing Spark and Cassandra: Spark 1.4, Cassandra 2.1.7, Cassandra
Spark connector 1.4, running in standalone mode.

I am getting 4000 rows from Cassandra (4 MB row), where the row keys are
random.

.. sc.cassandraTable[RES](keyspace,res_name).where(res_where).cache

I am expecting that it ...
I'm not aware of any such mechanism.
On Mon, Nov 17, 2014 at 2:55 PM, Pala M Muthaia wrote:
Hi Daniel,

Yes, that should work also. However, is it possible to set it up so that each
RDD has exactly one partition, without repartitioning (and thus incurring
extra cost)? Is there a mechanism similar to MR where we can ensure each
partition is assigned some amount of data by size, by setting some ...
On Thu, Nov 13, 2014 at 3:24 PM, Pala M Muthaia wrote:
Thanks for the responses, Daniel and Rishi.

No, I don't want separate RDDs because each of these partitions is being
processed the same way (in my case, each partition corresponds to HBase
keys belonging to one region server, and I will do HBase lookups). After
that I have aggregations too, hence all ...
I believe Rishi is correct. I wouldn't rely on that though - all it would
take is for one file to exceed the block size and you'd be setting yourself
up for pain. Also, if your files are small - small enough to fit in a
single record - you could use SparkContext.wholeTextFiles.
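A quick sketch of the wholeTextFiles route, assuming each file is small
enough to be held as a single record; the path and the per-file line count
are only for illustration:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("WholeFiles"))

// Each file becomes a single (path, content) record, so per-file work
// never straddles file boundaries.
val filesRdd = sc.wholeTextFiles("/hdfs/path/to/small/files/")

val linesPerFile = filesRdd.map { case (path, content) =>
  (path, content.split("\n").length)
}
linesPerFile.collect().foreach(println)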
If your data is in HDFS and you are reading it with textFile and each file
is less than the block size, my understanding is it would always have one
partition per file.

On Thursday, November 13, 2014, Daniel Siegmann wrote:
Would it make sense to read each file in as a separate RDD? This way you
would be guaranteed the data is partitioned as you expected.
Possibly you could then repartition each of those RDDs into a single
partition and then union them. I think that would achieve what you expect.
But it would be easy ...
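A sketch of that approach (not code from the thread): read each file as its
own RDD, force it into a single partition, then union them so partition i
corresponds to file i; the paths are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("OnePartitionPerFile"))

val files = Seq(
  "/hdfs/data/file-001.txt",
  "/hdfs/data/file-002.txt",
  "/hdfs/data/file-003.txt")

val combined = files
  .map(path => sc.textFile(path).coalesce(1))  // one RDD per file, forced to one partition
  .reduce(_ union _)                           // union keeps the partitions side by side

println(s"partitions: ${combined.getNumPartitions}")  // equals files.size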
Hi,

I have a set of input files for a Spark program, with each file
corresponding to a logical data partition. What is the API/mechanism to
assign each input file (or a set of files) to a Spark partition when
initializing RDDs?

When I create a Spark RDD pointing to the directory of files, my
understanding ...