Hi all, I've got something that I think should be straightforward, but it isn't, so clearly I'm not getting it.
I have an 8-node Spark 0.9.0 cluster also running HDFS. Each worker has 16g of memory and 8 cores. In HDFS I have a CSV file of 110M lines with 9 columns (e.g., [key,a,b,c...]). I have another file of 25K lines containing some number of keys which might be in my CSV file. (Yes, I know I should use an RDBMS or Shark or something. I'll get to that, but this is a toy problem that I'm using to get some intuition with Spark.)

Working on each file individually, Spark has no problem manipulating them. If I try a join or a union+filter, though, I can't seem to get the combination of the two files to complete. The code is along the lines of:

val fileA = sc.textFile("hdfs://.../fileA_110M.csv").map{_.split(",")}.keyBy{_(0)}
val fileB = sc.textFile("hdfs://.../fileB_25k.csv").keyBy{x => x}

Trying things like

fileA.join(fileB)

gives me a heap OOM, and trying

(fileA ++ fileB.map{case (k, v) => (k, Array(v))}).groupBy{_._1}.filter{case (k, xs) => xs.exists{_._2.length == 1}}

just causes Spark to freeze. (In all the cases I'm trying, I just use a final .count to force the results.)

I suspect I'm missing something fundamental about bringing the keyed data together into the same partitions so it can be joined efficiently, but I've given up for now. If anyone can shed some light (beyond "No, really. Use Shark.") on what I'm not understanding, it would be most helpful.
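In case it helps to see where my head is at, here is roughly what I was planning to try next, based on my (possibly wrong) reading of the docs: explicitly hash-partitioning both sides before the join, and, alternatively, broadcasting the small key set and filtering the big file locally. The partition count (64) is just a guess I haven't tuned, and none of this is tested, so treat it as a sketch of my thinking rather than something that works:

import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._  // pair-RDD implicits (already in scope in spark-shell)

// Idea 1: co-partition both RDDs with the same partitioner before joining,
// so matching keys end up in the same partitions (64 partitions is a guess).
val part = new HashPartitioner(64)
val bigKeyed = sc.textFile("hdfs://.../fileA_110M.csv")
  .map(_.split(","))
  .keyBy(_(0))
  .partitionBy(part)
val smallKeyed = sc.textFile("hdfs://.../fileB_25k.csv")
  .keyBy(x => x)
  .partitionBy(part)
println(bigKeyed.join(smallKeyed).count)

// Idea 2: since fileB is only 25K keys, collect it, broadcast it as a Set,
// and filter fileA on the workers, avoiding the shuffle entirely.
val keySet = sc.broadcast(sc.textFile("hdfs://.../fileB_25k.csv").collect.toSet)
val matched = sc.textFile("hdfs://.../fileA_110M.csv")
  .map(_.split(","))
  .filter(row => keySet.value.contains(row(0)))
println(matched.count)

No idea if either of these is actually the right way to do it, which is really what I'm asking.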