Hi all,

I've got something that I think should be straightforward, but it isn't, so
clearly I'm not getting it.

I have an 8-node Spark 0.9.0 cluster that also runs HDFS.  Each worker has
16 GB of memory and 8 cores.

In HDFS I have a CSV file of 110M lines with 9 columns (e.g., [key,a,b,c...]).
I have another file of 25K lines containing some number of keys that may
appear in my CSV file.  (Yes, I know I should use an RDBMS or Shark or
something.  I'll get to that, but this is a toy problem I'm using to build
some intuition with Spark.)

Working on each file individually, Spark has no problem manipulating them.
But when I try to join, or union and then filter, I can't get the join of
the two files to complete.  The code is along the lines of:

val fileA = sc.textFile("hdfs://.../fileA_110M.csv")
  .map{_.split(",")}      // split each CSV row into its columns
  .keyBy{_(0)}            // key each row by its first column
val fileB = sc.textFile("hdfs://.../fileB_25k.csv")
  .keyBy{x => x}          // each line is just a key; key it by itself

And trying things like fileA.join(fileB) gives me a heap OOM.  Trying

(fileA ++ fileB.map{case (k, v) => (k, Array(v))})
  .groupBy{_._1}          // group the unioned pairs by key
  .filter{case (_, pairs) => pairs.exists{_._2.length == 1}}  // keep keys seen in fileB

just causes Spark to freeze.  (In all of these cases I use a final .count
to force evaluation.)
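
For reference, the variant I was about to try next uses join's explicit
partition-count argument; the 64 below is an arbitrary guess on my part:

// Same join, but requesting more (and therefore smaller) shuffle
// partitions, hoping each one fits in the heap.  64 is a guess.
fileA.join(fileB, 64).count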

I suspect I'm missing something fundamental about bringing the keyed data
together into the same partitions so it can be joined efficiently, but I've
given up for now.  If anyone can shed some light (beyond "No, really.  Use
Shark.") on what I'm not understanding, it would be most helpful.


