Hi all,

There are three jobs that all share the same first RDD. Can the first RDD be
computed only once, with the subsequent operations then running in parallel?

(image attachment: http://apache-spark-user-list.1001560.n3.nabble.com/file/t11001/1600954813075-image.png)

My code is as follows:

import threading

from pyspark import StorageLevel

# `spark` is assumed to be an existing SparkSession.

sqls = ["""
    INSERT OVERWRITE TABLE `spark_input_test3` PARTITION (dt='20200917')
    SELECT id, plan_id, status, retry_times, start_time, schedule_datekey
    FROM temp_table WHERE status = 3
    """,
    """
    INSERT OVERWRITE TABLE `spark_input_test4` PARTITION (dt='20200917')
    SELECT id, cur_inst_id, status, update_time, schedule_time, task_name
    FROM temp_table WHERE schedule_time > '2020-09-01 00:00:00'
    """]

def multi_thread():
    sql = """SELECT id, plan_id, status, retry_times, start_time,
                    schedule_datekey, task_name, update_time, schedule_time,
                    cur_inst_id, scheduler_id
             FROM table
             WHERE dt < '20200801'"""
    # Cache the shared source DataFrame so it can (ideally) be computed once.
    data = spark.sql(sql).persist(StorageLevel.MEMORY_AND_DISK)
    threads = []
    for i in range(2):
        try:
            t = threading.Thread(target=insert_overwrite_thread,
                                 args=(sqls[i], data))
            t.start()
            threads.append(t)
        except Exception as x:
            print(x)
    for t in threads:
        t.join()

def insert_overwrite_thread(sql, data):
    # Each thread registers the shared DataFrame as `temp_table` and runs
    # its own INSERT OVERWRITE against it.
    data.createOrReplaceTempView('temp_table')
    spark.sql(sql)



Since Spark is lazy, persist() alone does not help here: when the jobs are
submitted in parallel, nothing has populated the cache yet, so the first RDD
is still computed multiple times.
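
The only workaround I have come up with so far is to trigger an eager action
on the cached DataFrame before starting the threads, so the data is already
materialized when the two INSERTs run in parallel. A minimal sketch, reusing
the `sqls` and `insert_overwrite_thread` definitions above (the `count()`
call is just one way to force materialization, not part of my original code):

def multi_thread_materialized():
    sql = """SELECT id, plan_id, status, retry_times, start_time,
                    schedule_datekey, task_name, update_time, schedule_time,
                    cur_inst_id, scheduler_id
             FROM table
             WHERE dt < '20200801'"""
    data = spark.sql(sql).persist(StorageLevel.MEMORY_AND_DISK)
    data.count()  # eager action: computes the query once and fills the cache
    data.createOrReplaceTempView('temp_table')  # register the view up front
    threads = [threading.Thread(target=insert_overwrite_thread,
                                args=(s, data))
               for s in sqls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()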
I would like to ask whether there are better ways to do this. Thanks!

Cheers,
Gang Li


