…constraints, the database approach turned out to be the only thing we could
get working (without paying double / treble for nodes that have more disk
space to hold the temp files).

Regards,
Ashic.
From: gourav.sengu...@gmail.com
Date: Thu, 11 Aug 2016 21:52:07 +0100
Subject: Re: Spark join and large temp files
The point is that if you have skewed data, then one single reducer will
finally take a very long time, and you do not even need to try this; just
search in Google, skewed data is a known problem in joins, even in Spark.
Therefore, instead of using join, in case the use case permits, just write a…
> …a straight flatmap of the
> big dataset (doing lookups in redis). This avoids shuffling, and prevents
> the /tmp fill-up issue.
>
> -Ashic.
>
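[Editor's note: a minimal sketch of the "straight flatmap with redis lookups" approach quoted above. It assumes the small table has already been loaded into redis as an id -> number mapping and that the Jedis client is on the classpath; the host, port, path and key scheme are illustrative, not from the thread.]

import redis.clients.jedis.Jedis

// The big (id, name) dataset; sqlContext as used elsewhere in this thread.
val big = sqlContext.read.parquet("path_to_big_table")

val joined = big.rdd.mapPartitions { rows =>
  val jedis = new Jedis("redis-host", 6379)          // one connection per partition
  val matched = rows.flatMap { row =>
    val id = row.getString(0)
    // Keep only rows whose id exists in redis; the stored value is the joined column.
    Option(jedis.get(id)).map(number => (id, row.getString(1), number))
  }.toList                                           // materialise before closing the connection
  jedis.close()
  matched.iterator
}

[Materialising each partition with toList keeps the sketch simple; for very large partitions a proper cleanup hook would be preferable, but the shape of the idea — no shuffle, per-partition lookups — is the same.]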
Subject: Re: Spark join and large temp files
From: bteeu...@gmail.com
Date: Thu, 11 Aug 2016 22:24:42 +0200
CC: user@spark.apache.org
To: gourav.sengu...@gmail.com
Hmm, hashing will probably send all of the records with the same key to the
same partition / machine.
I’d try it out, and hope that if you have a few super-large keys bigger than
the RAM available on one node, they spill to disk. Maybe play with persist()
and using a different Storage Level.
When you read both 'a' and 'b', can you try repartitioning both by column 'id'?
If you .cache() and .count() to force a shuffle, it'll push the records that
will be joined to the same executors.
So:

a = spark.read.parquet('path_to_table_a').repartition('id').cache()
a.count()
b = spark.read.parquet('path_to_table_b').repartition('id').cache()
b.count()
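[Editor's note: a rough Scala equivalent of the same idea, also folding in the persist() / Storage Level suggestion above; the paths, the MEMORY_AND_DISK level, and the final join line are assumptions for illustration, not from the thread.]

import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel

// Repartition both inputs by the join key so matching ids land in the same
// partitions, and let cached partitions spill to disk instead of failing.
val a = spark.read.parquet("path_to_table_a")
  .repartition(col("id")).persist(StorageLevel.MEMORY_AND_DISK)
val b = spark.read.parquet("path_to_table_b")
  .repartition(col("id")).persist(StorageLevel.MEMORY_AND_DISK)

a.count()   // force the shuffle and materialise the cache
b.count()

val joined = a.join(b, Seq("id"))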
Date: Wed, 10 Aug 2016 08:22:24 +0200
Subject: Re: Spark join and large temp files
To: as...@live.com
Hi Ashic,
I think this approach should solve your problem, i.e., by broadcasting the
small RDD. However, you should do it properly.
IMO, you should try
val smallRDDBroadcasted: Map[Int, YourType] = …
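[Editor's note: the snippet above is cut off; what follows is a hedged sketch of the broadcast-the-small-side idea, where a and b are the two parquet DataFrames from the original question. The names are assumed, and String keys are used since the thread's ids are Strings.]

// Collect the small dataset to the driver as a plain Map, broadcast it,
// then walk the big dataset without shuffling either side.
val smallMap: Map[String, String] =
  b.rdd.map(r => r.getString(0) -> r.getString(1)).collectAsMap().toMap

val smallBroadcast = sc.broadcast(smallMap)

val joined = a.rdd.flatMap { row =>
  val id = row.getString(0)
  smallBroadcast.value.get(id).map(number => (id, row.getString(1), number))
}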
Date: … +0100
Subject: Re: Spark join and large temp files
To: as...@live.com
CC: mich.talebza...@gmail.com; samkiller@gmail.com; deepakmc...@gmail.com;
user@spark.apache.org
In case of skewed data the joins will mess things up. Try to write a UDF with
the lookup on the broadcast variable, and then let me know…
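[Editor's note: a hedged sketch of what "a UDF with the lookup on broadcast variable" could look like, building on the broadcast Map sketch above; the names are assumptions, not from the thread.]

import org.apache.spark.sql.functions.{col, udf}

// Wrap the broadcast lookup in a UDF and use it instead of a join;
// rows with no match come back as null and are filtered out.
val lookupNumber = udf((id: String) => smallBroadcast.value.get(id))

val joined = a
  .withColumn("number", lookupNumber(col("id")))
  .filter(col("number").isNotNull)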
…the node gets killed. And then everything dies.

-Ashic.
From: mich.talebza...@gmail.com
Date: Tue, 9 Aug 2016 17:25:23 +0100
Subject: Re: Spark join and large temp files
To: as...@live.com
CC: samkiller@gmail.com; deepakmc...@gmail.com; user@spark.apache.org
Hi Sam,
What is your Spark hardware spec: no. of nodes, RAM per node, and disks…
Hi Sam,
Yup. It seems it stalls when broadcasting. CPU goes to 100%, but there's no
progress. The Spark UI doesn't even show up.

-Ashic.
From: samkiller@gmail.com
Date: Tue, 9 Aug 2016 16:21:27 +0200
Subject: Re: Spark join and large temp files
To: as...@live.com
CC: deepakmc...
…clever broadcasting, or maybe falling back to chunking up the input,
producing interim output, and unioning them for the final output. Might even
try using Spark Streaming pointing to the parquet and seeing if that helps.

-Ashic.
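[Editor's note: an illustrative sketch, not from the thread, of the "chunking up the input, producing interim output, and unioning them" idea, where a and b are the two parquet DataFrames; the chunk count and paths are assumed.]

import org.apache.spark.sql.functions.{col, hash, lit, pmod}

val numChunks = 10   // assumed; tune so each chunk's shuffle fits on local disk

// Join one hash-range of ids at a time and write each interim result out,
// so no single join has to spill the whole 1.5TB worth of temp files at once.
(0 until numChunks).foreach { i =>
  a.filter(pmod(hash(col("id")), lit(numChunks)) === i)
    .join(b, Seq("id"))
    .write.parquet(s"interim_output/chunk_$i")
}

// Reading the interim directories back gives the unioned final output.
spark.read.parquet("interim_output/chunk_*").write.parquet("final_output")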
From: deepakmc...@gmail.com
Date: Tue, 9 Aug 2016 17:31:19 +0530
Subject: Re: Spark join and large temp files
To: as...@live.com
Hi Ashic,
Did you find the resolution to this issue?
From: Ashic Mahtab
Sent: Monday, August 8, 2016 2:53 PM
To: Deepak Sharma
Cc: Apache Spark
Subject: RE: Spark join and large temp files
Hi Deepak,
Thanks for the response.
Registering the temp tables didn't help. Here's what I have:
val a = sqlContext.read.parquet(...).select("id", "name")
val b = sqlContext.read.parquet(...).select("id", "number")
a.registerTempTable("a")
b.registerTempTable("b")
val results = sqlContext.sql("SELECT x.id, x.name, y.number FROM a x join b y on x.id=y.id")
results.write.parquet(...)
Is there something I'm missing?
Cheers,
Ashic.
From: deepakmc...@gmail.com
Register your dataframes as temp tables and then try the join on the temp
table.
This should resolve your issue.
Thanks
Deepak
On Mon, Aug 8, 2016 at 11:47 PM, Ashic Mahtab wrote:
> Hello,
> We have two parquet inputs of the following form:
>
> a: id:String, Name:String (1.5TB)
> b: id:String, …