Hi, I'm trying to perform an ETL using Spark, but as soon as I start performing joins, performance degrades a lot. Let me explain what I'm doing and what I've found out so far.
First of all, I'm reading avro files that are on a Cloudera cluster, using commands like this:

/val tab1 = sc.hadoopFile("hdfs:///path/to/file", classOf[org.apache.avro.mapred.AvroInputFormat[GenericRecord]], classOf[org.apache.avro.mapred.AvroWrapper[GenericRecord]], classOf[org.apache.hadoop.io.NullWritable], 10)/

After this, I apply some filter functions to the data (to reproduce the "where" clauses of the original query) and then one map per table, in order to translate the RDD elements into (key, record) format. Let's say I'm doing this:

/val elabTab1 = tab1.filter(...).map(....)/

It is important to note that if I do something like /elabTab1.first/ or /elabTab1.count/, the task completes in a short time, roughly around Impala's time. Now I need to do the following:

/val joined = elabTab1.leftOuterJoin(elabTab2)/

Then I tried something like /joined.count/ to test performance, and it degraded dramatically: a count on a single table takes about 4 seconds, while the count on the joined table takes 12 minutes. I think there's a problem with the configuration, but what might it be? Some more information:

1] Spark is running on YARN on a Cloudera cluster.
2] I'm starting spark-shell with a command like /spark-shell --executor-cores 5 --executor-memory 10G/, which gives the shell approximately 10 vcores and 25 GB of memory.
3] The task seems stalled for a long time after the map tasks, with the following message in the console: /Asked to send map output locations for shuffle ... to .../
4] If I open the stderr of the executors, I can read plenty of messages like the following: /Thread ... spilling in-memory map of ... MB to disk/, where the MBs are in the order of 300-400.
5] I tried raising the number of executors, but the situation didn't seem to change much. I also tried changing the number of splits of the avro files (currently set to 10), but that didn't seem to change much either.
6] The tables aren't particularly big; the bigger one should be a few GBs.

Regards,
Luca

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Performance-issue-with-Spark-join-tp24458.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
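P.S. To make the pipeline concrete, here is a stripped-down sketch of what I'm running in spark-shell. The paths, the "id" join key, and the "status" filter field are invented placeholders, not my real schema; the structure (read avro, filter, key the records, leftOuterJoin) is the real one:

```scala
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
import org.apache.hadoop.io.NullWritable

// Paste into spark-shell; `sc` is the shell's SparkContext.
// Each RDD element is (AvroWrapper[GenericRecord], NullWritable).
def readAvro(path: String) =
  sc.hadoopFile(path,
    classOf[AvroInputFormat[GenericRecord]],
    classOf[AvroWrapper[GenericRecord]],
    classOf[NullWritable],
    10) // 10 splits, as in my real job

// Filter reproduces a "where" clause, map produces (key, record) pairs.
val elabTab1 = readAvro("hdfs:///path/to/table1")
  .filter { case (w, _) => w.datum.get("status").toString == "OPEN" }
  .map    { case (w, _) => (w.datum.get("id").toString, w.datum) }

val elabTab2 = readAvro("hdfs:///path/to/table2")
  .map { case (w, _) => (w.datum.get("id").toString, w.datum) }

// This is the step whose count goes from seconds to ~12 minutes:
// the join shuffles both keyed RDDs across the network.
val joined = elabTab1.leftOuterJoin(elabTab2)
joined.count
```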