Hi all,

There are three jobs that all share the same first RDD. Can the first RDD be
computed only once, with the subsequent operations then running in parallel?

(image attachment: http://apache-spark-user-list.1001560.n3.nabble.com/file/t11001/1600954813075-image.png)

My code is as follows:

import threading

from pyspark import StorageLevel

# `spark` is assumed to be an existing SparkSession.

sqls = ["""
    INSERT OVERWRITE TABLE `spark_input_test3` PARTITION (dt='20200917')
    SELECT id, plan_id, status, retry_times, start_time, schedule_datekey
    FROM temp_table WHERE status = 3
    """,
    """
    INSERT OVERWRITE TABLE `spark_input_test4` PARTITION (dt='20200917')
    SELECT id, cur_inst_id, status, update_time, schedule_time, task_name
    FROM temp_table WHERE schedule_time > '2020-09-01 00:00:00'
    """]

def multi_thread():
    sql = """SELECT id, plan_id, status, retry_times, start_time,
                    schedule_datekey, task_name, update_time, schedule_time,
                    cur_inst_id, scheduler_id
             FROM table
             WHERE dt < '20200801'"""
    # Cache the shared source DataFrame so it can (ideally) be computed once.
    data = spark.sql(sql).persist(StorageLevel.MEMORY_AND_DISK)
    threads = []
    for i in range(2):
        try:
            t = threading.Thread(target=insert_overwrite_thread,
                                 args=(sqls[i], data))
            t.start()
            threads.append(t)
        except Exception as x:
            print(x)
    for t in threads:
        t.join()

def insert_overwrite_thread(sql, data):
    # Each thread registers the shared DataFrame as `temp_table` and runs
    # its own INSERT OVERWRITE against it.
    data.createOrReplaceTempView('temp_table')
    spark.sql(sql)



Since Spark is lazy, persist() alone does not help here: when the jobs are
submitted in parallel, nothing has populated the cache yet, so the first RDD
is still computed multiple times.
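
The only workaround I have come up with so far is to trigger an eager action
on the cached DataFrame before starting the threads, so the data is already
materialized when the two INSERTs run in parallel. A minimal sketch, reusing
the `sqls` and `insert_overwrite_thread` definitions above (the `count()`
call is just one way to force materialization, not part of my original code):

def multi_thread_materialized():
    sql = """SELECT id, plan_id, status, retry_times, start_time,
                    schedule_datekey, task_name, update_time, schedule_time,
                    cur_inst_id, scheduler_id
             FROM table
             WHERE dt < '20200801'"""
    data = spark.sql(sql).persist(StorageLevel.MEMORY_AND_DISK)
    data.count()  # eager action: computes the query once and fills the cache
    data.createOrReplaceTempView('temp_table')  # register the view up front
    threads = [threading.Thread(target=insert_overwrite_thread,
                                args=(s, data))
               for s in sqls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()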
I would like to ask whether there are better ways to do this. Thanks!

Cheers,
Gang Li


