It might be a memory issue. Try adding .persist(StorageLevel.MEMORY_AND_DISK) so that, if the result doesn't fit in memory, the partitions that don't fit are spilled to disk:
from pyspark import StorageLevel

cm_go.registerTempTable("x")
ko.registerTempTable("y")
joined_df = sqlCtx.sql("select * from x FULL OUTER JOIN y ON field1=field2")
joined_df.persist(StorageLevel.MEMORY_AND_DISK)  # partitions that don't fit in memory spill to disk
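For reference, here is a minimal sketch of the same thing using the DataFrame API instead of temp tables, assuming cm_go and ko are DataFrames and that field1 belongs to cm_go and field2 to ko (names taken from your snippet):

from pyspark import StorageLevel

# Full outer join via the DataFrame API, persisted so partitions can spill to disk
joined_df = cm_go.join(ko, cm_go["field1"] == ko["field2"], "full_outer") \
                 .persist(StorageLevel.MEMORY_AND_DISK)
joined_df.count()  # run an action so the persisted data is actually materialized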
Your error message doesn't make clear what is really happening.
Is your container being killed by YARN, or does it actually run out of memory (OOM)?
When I run a Spark job on big data, here is what I normally do:
1) Enable GC output. You need to monitor the GC output on the executors to
understand the GC pressure. If