Hello, I am new to Spark and have a basic question about its memory requirements.
I need to join several large data sets, and the join is not a straightforward one. The logic is roughly: first join T1 to T2 on column A; then, for all the T1 records that found no match in that join, join them to T2 on column B; then on column C, and so on. I was using Hive, but it requires multiple scans of T1, which turns out to be slow.

It seems that if I load T1 and T2 into memory using Spark, I could improve the performance. However, T1 and T2 together are around 800 GB. Does that mean I need 800 GB of memory (which I don't have)? Or can Spark do something like streaming, and if so, would performance suffer as a result?

Thanks,
JT
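P.S. To make the logic concrete, here is a rough sketch of what I have in mind, written against Spark's DataFrame API. The column names, the "t2_" prefix, and the use of left_anti joins are just illustrative assumptions on my part, not tested code:

import org.apache.spark.sql.DataFrame

// Sketch only: "keys" would be Seq("A", "B", "C"); t2's columns are
// prefixed up front so the joined output has no ambiguous column names.
def cascadingJoin(t1: DataFrame, t2Raw: DataFrame, keys: Seq[String]): DataFrame = {
  val t2 = t2Raw.columns.foldLeft(t2Raw)((df, c) => df.withColumnRenamed(c, s"t2_$c"))

  var unmatched = t1
  val matchedPerKey = keys.map { key =>
    // Rows of T1 (still unmatched so far) that find a match on this key.
    val hit = unmatched.join(t2, unmatched(key) === t2(s"t2_$key"), "inner")
    // Rows that found no match are carried forward and tried on the next key.
    unmatched = unmatched.join(t2, unmatched(key) === t2(s"t2_$key"), "left_anti")
    hit
  }
  // Each per-key result has the same schema (t1 columns ++ t2_* columns),
  // so the partial results can simply be unioned together.
  matchedPerKey.reduce(_ union _)
}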