Hi all,

I am writing and running a Spark batch program that uses only Spark SQL,
but it is taking a long time and eventually fails with a GC overhead
limit exceeded error.

Here is the program:

1. Read three files (one medium-sized, two small) and register them as
DataFrames.

2. Fire a SQL query with complex aggregation and windowing, and register
the result as a DataFrame.

3. Repeat step 2 almost 50 times, so roughly 50 SQL queries in total.

4. All the SQL queries are sequential, i.e. each step needs the previous
step's result.

5. Finally, save the final DataFrame (this is the only action called).
A rough sketch of this pattern is below.
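
To make the structure concrete, here is a stripped-down sketch of the
job; the paths, table names, and SQL are placeholders, not my real
queries. (As I understand it, window functions in SQL on 1.5.x need a
HiveContext rather than a plain SQLContext.)

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("BatchJob"))
    val sqlContext = new HiveContext(sc)  // HiveContext for window functions in 1.5.x

    // Step 1: read the three files and register them as temp tables.
    val medium = sqlContext.read.parquet("medium_file")    // placeholder paths
    val small1 = sqlContext.read.parquet("small_file_1")
    val small2 = sqlContext.read.parquet("small_file_2")
    medium.registerTempTable("medium")
    small1.registerTempTable("small1")
    small2.registerTempTable("small2")

    // Steps 2-3: ~50 sequential queries, each reading the previous result.
    var current = sqlContext.sql(
      """SELECT m.key, m.value,
        |       SUM(m.value) OVER (PARTITION BY m.key) AS total
        |FROM medium m JOIN small1 s ON m.key = s.key""".stripMargin)
    current.registerTempTable("step_1")
    // ... repeated ~50 times; step_n selects from step_(n-1) ...

    // Step 5: the only action in the whole job.
    current.write.parquet("final_output")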

Notes:

1. I haven't persisted the intermediate DataFrames (what that would look
like is sketched below), as I assumed Spark would optimize the multiple
SQL queries into a single physical execution plan.
2. Executor memory and driver memory are each set to 4 GB, which should
be generous given that the data size is only in the MB range.
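
For completeness, persisting one of the intermediate results would have
looked roughly like this (step names and the SQL are illustrative; the
memory in note 2 is set via spark-submit's --driver-memory 4g and
--executor-memory 4g). As far as I know, 1.5.x has no
DataFrame.checkpoint(), so a save-and-reload round trip is the usual way
to cut a long lineage:

    import org.apache.spark.storage.StorageLevel

    val step10 = sqlContext.sql(
      "SELECT key, COUNT(*) AS cnt FROM step_9 GROUP BY key")
    step10.persist(StorageLevel.MEMORY_AND_DISK)  // materialize this cut point
    step10.registerTempTable("step_10")

    // Alternative: save and re-read to truncate the lineage completely.
    step10.write.parquet("tmp/step_10")
    sqlContext.read.parquet("tmp/step_10").registerTempTable("step_10")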

Questions:

1. Will Spark optimize the multiple SQL queries into one single physical
plan?
2. In the DAG I can see a lot of file reads and a lot of stages. Why,
when I only called one action? (See the snippet below for how I am
looking at the plan.)
3. Will every SQL query execute separately, with its intermediate result
stored in memory?
4. What is causing the OOM and GC overhead here?
5. What optimizations could be applied?
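
For reference, the plan I mention in question 2 comes from an explain()
call on the last DataFrame (finalDF here is just a placeholder name):

    // Prints the parsed, analyzed, and optimized logical plans plus the
    // physical plan; this is where the repeated file scans show up.
    finalDF.explain(true)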

Spark version: 1.5.x


Thanks in advance.
Rabin
