Hi,

I'm trying to run an ETL job with Spark, but as soon as I start performing
joins, performance degrades badly. Let me explain what I'm doing and what
I've found out so far.

First of all, I'm reading Avro files stored on a Cloudera cluster, using
commands like this:

    import org.apache.avro.generic.GenericRecord
    import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
    import org.apache.hadoop.io.NullWritable

    val tab1 = sc.hadoopFile("hdfs:///path/to/file",
      classOf[AvroInputFormat[GenericRecord]],
      classOf[AvroWrapper[GenericRecord]],
      classOf[NullWritable], 10)

After this, I apply some filter functions to the data (to reproduce the
"where" clauses of the original query) and then use one map per table to
translate the RDD elements into (key, record) pairs. Let's say I'm doing
this:

    val elabTab1 = tab1.filter(...).map(....)
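
To make that concrete, the shaping step looks roughly like the sketch below.
The field names ("id", "status") are made up for illustration; the real key
comes from the join columns of the original query:

    // Hypothetical sketch of the shaping step.
    // Note: with hadoopFile the record sits in the key position,
    // wrapped in an AvroWrapper, so it has to be unwrapped first.
    val elabTab1 = tab1
      .map { case (wrapper, _) => wrapper.datum }  // AvroWrapper -> GenericRecord
      .filter(rec => rec.get("status") != null)    // stand-in for the "where" clause
      .map(rec => (rec.get("id").toString, rec))   // (key, record) pairs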

It is important to note that if I run something like elabTab1.first or
elabTab1.count, the task completes in a short time, roughly comparable to
Impala's. Now I need to do the following:

    val joined = elabTab1.leftOuterJoin(elabTab2)

Then I tried joined.count to test performance, and it degraded dramatically
(a count on a single table takes about 4 seconds, while the count on the
joined table takes about 12 minutes). I think there's a problem with the
configuration, but what might it be?
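
If it helps, the shuffle boundary can be confirmed with standard RDD
introspection (nothing here is specific to my job):

    // Print the lineage; the stage break in the output marks the
    // shuffle introduced by the join.
    println(joined.toDebugString)
    println(joined.partitions.length)  // number of post-join partitions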

I'll give you some more information:
1] Spark is running on YARN on a Cloudera cluster
2] I'm starting spark-shell with a command like spark-shell --executor-cores
5 --executor-memory 10G, which gives the shell approximately 10 vcores and
25 GB of memory
3] The task appears stalled for a long time after the map tasks, with the
following message in the console: "Asked to send map output locations for
shuffle ... to ..."
4] If I open the stderr of the executors, I see plenty of messages like the
following: "Thread ... spilling in-memory map of ... MB to disk", where the
MBs are on the order of 300-400
5] I tried raising the number of executors, but the situation didn't change
much. I also tried changing the number of splits of the Avro files
(currently set to 10), but that didn't seem to help either; a sketch of
these attempts follows this list
6] The tables aren't particularly big; the larger one should be a few GBs
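
For reference, the tuning attempts from point 5 looked roughly like the
sketch below (the split and partition counts are just values I experimented
with, not recommendations; the executor count was raised with the
--num-executors flag at shell startup):

    // More input splits when reading (the last argument; originally 10) ...
    val tab1 = sc.hadoopFile("hdfs:///path/to/file",
      classOf[AvroInputFormat[GenericRecord]],
      classOf[AvroWrapper[GenericRecord]],
      classOf[NullWritable], 40)

    // ... and an explicit number of partitions for the join's shuffle.
    val joined = elabTab1.leftOuterJoin(elabTab2, 200)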

Regards,
Luca


