Hi all, I am writing and executing a Spark batch program that only uses Spark SQL, but it is taking a long time and finally failing with a GC overhead error.
Here is the program (a rough code sketch is at the end of this mail):

1. Read 3 files (one medium-sized and two small) and register them as DataFrames.
2. Fire a SQL query with complex aggregation and windowing, and register the result as a DataFrame.
3. Repeat step 2 almost 50 times, so ~50 SQL queries in total.
4. All the SQL queries are sequential, i.e. each step requires the previous step's result.
5. Finally, save the final DataFrame. (This is the only action called.)

Notes:

1. I haven't persisted the intermediate DataFrames, as I assumed Spark would optimize the multiple SQL queries into one physical execution plan.
2. Executor memory and driver memory are both set to 4gb, which should be plenty since the data size is only in the MBs.

Questions:

1. Will Spark optimize multiple SQL queries into one single physical plan?
2. In the DAG I can see a lot of file reads and a lot of stages. Why, when I only called a single action?
3. Will every SQL query execute and have its intermediate result stored in memory?
4. What is causing the OOM and GC overhead here?
5. What optimizations could be applied here?

Spark version: 1.5.x

Thanks in advance,
Rabin
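P.S. For reference, here is a rough sketch of the shape of the job. All paths, table names, and the two sample queries are placeholders standing in for the real ones (the actual job has ~50 statements), and I use HiveContext only because window functions in Spark 1.5 SQL require it:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object BatchJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("BatchJob"))
    // HiveContext rather than SQLContext, since window functions in
    // Spark 1.5 SQL need it.
    val sqlContext = new HiveContext(sc)

    // Step 1: read the three input files and register them as temp tables
    // (paths and formats are placeholders).
    sqlContext.read.json("/data/medium.json").registerTempTable("medium")
    sqlContext.read.json("/data/small1.json").registerTempTable("small1")
    sqlContext.read.json("/data/small2.json").registerTempTable("small2")

    // Steps 2-4: ~50 sequential SQL statements, each building on the
    // result of the step before. These two queries are simplified
    // stand-ins for the real aggregation/windowing SQL.
    val queries = Seq(
      """SELECT m.key, SUM(m.value) AS value
        |FROM medium m JOIN small1 s ON m.key = s.key
        |GROUP BY m.key""".stripMargin,
      """SELECT key, value,
        |       RANK() OVER (ORDER BY value DESC) AS rnk
        |FROM step1""".stripMargin
      // ... ~48 more of the same shape ...
    )

    var last = ""
    queries.zipWithIndex.foreach { case (q, i) =>
      val df = sqlContext.sql(q)
      last = s"step${i + 1}"
      // Note: no persist/cache anywhere -- each temp table is only a
      // logical plan layered over the ones before it.
      df.registerTempTable(last)
    }

    // Step 5: save the final DataFrame -- the only action in the job.
    sqlContext.table(last).write.parquet("/data/out")
  }
}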
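And this is roughly how I submit it (class and jar names are placeholders):

spark-submit \
  --class com.example.BatchJob \
  --driver-memory 4g \
  --executor-memory 4g \
  target/batch-job.jar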