Setting spark.sql.shuffle.partitions = 2000 solved my issue. I am able to
join two 1-billion-row tables in 3 minutes.
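In Spark 1.3 this can be set on the SQLContext before running the join; a
minimal sketch, assuming a context named sqlContext and illustrative table
names and join key:

    // Raise the shuffle partition count so each post-shuffle partition
    // (and the hash table built for it) stays small enough to avoid
    // long GC pauses.
    sqlContext.setConf("spark.sql.shuffle.partitions", "2000")

    val joined = sqlContext.sql(
      "SELECT a.*, b.* FROM table_a a JOIN table_b b ON a.id = b.id")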
By skewed, did you mean it's not distributed uniformly across partitions?
All of my columns are strings and almost all the same size, e.g.
id1,field11,field12
id2,field21,field22
Or at least tell us how many partitions you are using.
Yong
> Date: Tue, 22 Sep 2015 02:06:15 -0700
> From: belevts...@gmail.com
> To: user@spark.apache.org
> Subject: Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables
>
> Could it be that your data is skewed? Do you have variable-length column
> types?
Could it be that your data is skewed? Do you have variable-length column
types?
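One quick way to check for that kind of skew is to count rows per join key.
A sketch, assuming a DataFrame named df and a join column named id:

    import org.apache.spark.sql.functions.desc

    // A handful of keys with outsized counts means the matching shuffle
    // partitions (and their GC pressure) will be outsized too.
    df.groupBy("id").count().orderBy(desc("count")).show(20)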
Did you find a solution to this? I am running into the same issue.
If your data is evenly distributed (i.e. no skewed data points in your join
keys), it can also help to increase spark.sql.shuffle.partitions (the default
is 200).
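The same knob can also be turned from SQL itself, or set cluster-wide in
conf/spark-defaults.conf; a sketch with an illustrative value:

    // Equivalent to sqlContext.setConf("spark.sql.shuffle.partitions", "2000")
    sqlContext.sql("SET spark.sql.shuffle.partitions=2000")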
On Mon, May 4, 2015 at 8:03 AM, Richard Marscher
wrote:
> In regards to the large GC pauses, assuming you allocated all 100GB of
> memory per worker [...]
In regards to the large GC pauses, assuming you allocated all 100GB of
memory per worker, you may consider running with less memory on your Worker
nodes, or splitting up the available memory on the Worker nodes amongst
several worker instances. The JVM's garbage collection starts to become
very slow on very large heaps.
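On a standalone cluster the split can be expressed in conf/spark-env.sh on
each Worker machine. A sketch with illustrative numbers (4 x 25 GB instead
of 1 x 100 GB; adjust to your hardware):

    # conf/spark-env.sh -- run four smaller worker JVMs per machine so
    # each heap stays small enough for reasonable GC pause times.
    export SPARK_WORKER_INSTANCES=4
    export SPARK_WORKER_MEMORY=25g
    export SPARK_WORKER_CORES=4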
Could you be more specific about how this is done?
The DataFrame class doesn't have that method.
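A sketch of what this might look like: the DataFrame itself has no
partitionBy in 1.3, but the underlying RDD of key/row pairs does. The names
here (df, sqlContext, the key in column 0, and the count 2000) are assumed:

    import org.apache.spark.HashPartitioner

    // Key each row by the join column (assumed to be column 0),
    // hash-partition the pairs, then rebuild a DataFrame with the
    // original schema.
    val keyed = df.rdd.map(row => (row.getString(0), row))
    val repartitioned = keyed.partitionBy(new HashPartitioner(2000)).values
    val df2 = sqlContext.createDataFrame(repartitioned, df.schema)

When only the partition count matters, df.repartition(2000) does a plain
shuffle without a custom key.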
On Sun, May 3, 2015 at 11:07 PM, ayan guha wrote:
> You can use a custom partitioner to redistribute the data using partitionBy.
> On 4 May 2015 15:37, "Nick Travers" wrote:
>
>> I'm currently trying to join two large tables (on the order of 1B rows
>> each) using Spark SQL (1.3.0) [...]
You can use a custom partitioner to redistribute the data using partitionBy.
On 4 May 2015 15:37, "Nick Travers" wrote:
> I'm currently trying to join two large tables (on the order of 1B rows
> each) using Spark SQL (1.3.0) and am running into long GC pauses which
> bring the job to a halt.
>
> I'm reading in both tables using a HiveContext [...]
I'm currently trying to join two large tables (on the order of 1B rows each)
using Spark SQL (1.3.0) and am running into long GC pauses which bring the
job to a halt.
I'm reading in both tables using a HiveContext, with the underlying files
stored as Parquet files. I'm using something along the lines of Hiv[...]
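A minimal sketch of this kind of HiveContext setup, with assumed paths,
table names, and join key (not the original poster's code):

    import org.apache.spark.sql.hive.HiveContext

    val hc = new HiveContext(sc)
    // Load the Parquet-backed tables and register them for SQL.
    val a = hc.parquetFile("/path/to/table_a")
    val b = hc.parquetFile("/path/to/table_b")
    a.registerTempTable("a")
    b.registerTempTable("b")
    val joined = hc.sql("SELECT * FROM a JOIN b ON a.id = b.id")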